Intel(R) C++ Compiler for Linux* Systems User's Guide
Contents
1. F32vec4 R = rcp_nr(F32vec4 A);    _mm_sub_ps, _mm_add_ps, _mm_mul_ps, _mm_rcp_ps
   2 doubles   F64vec2 R = rcp_nr(F64vec2 A);    _mm_sub_pd, _mm_add_pd, _mm_mul_pd, _mm_rcp_pd
   1 float     F32vec1 R = rcp_nr(F32vec1 A);    _mm_sub_ss, _mm_add_ss, _mm_mul_ss, _mm_rcp_ss

   Reciprocal Square Root (Newton-Raphson)
   4 floats    F32vec4 R = rsqrt_nr(F32vec4 A);  _mm_sub_pd, _mm_mul_pd, _mm_rsqrt_ps
   2 doubles   F64vec2 R = rsqrt_nr(F64vec2 A);  _mm_sub_pd, _mm_mul_pd, _mm_rsqrt_pd
   1 float     F32vec1 R = rsqrt_nr(F32vec1 A);  _mm_sub_ss, _mm_mul_ss, _mm_rsqrt_ss

   Horizontal Add
   1 float     float f = add_horizontal(F32vec4 A);    _mm_add_ss, _mm_shuffle_ss
   1 double    double d = add_horizontal(F64vec2 A);   _mm_add_sd, _mm_shuffle_sd

   Minimum and Maximum Operators

   Compute the minimums of the two double-precision floating-point values of A and B.
       F64vec2 R = simd_min(F64vec2 A, F64vec2 B)
       R0 := min(A0, B0);  R1 := min(A1, B1);
   Corresponding intrinsic: _mm_min_pd

   Compute the minimums of the four single-precision floating-point values of A and B.
       F32vec4 R = simd_min(F32vec4 A, F32vec4 B)
       R0 := min(A0, B0);  R1 := min(A1, B1);  R2 := min(A2, B2);  R3 := min(A3, B3);
   Corresponding intrinsic: _mm_min_ps

   Compute the minimum of the lowest single-precision floating-point values of A and B.
       F32vec1 R = simd_min(F32vec1 A, F32vec1 B)
       R0 := min(A0, B0);
   Corresponding intrinsic: _mm_min_ss

   Compute the maxim
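The rcp_nr and rsqrt_nr rows in excerpt 1 pair an approximate reciprocal with subtract, add, and multiply steps. As a rough scalar illustration of that idea, the sketch below applies one Newton-Raphson refinement step to a coarse estimate. It is not the class library's implementation: the function names are hypothetical, and the caller is assumed to supply the starting estimate that the rcp or rsqrt intrinsic would otherwise provide.

```c
/* Hypothetical helper: one Newton-Raphson step toward 1/a.
 * x1 = (x0 + x0) - (a * x0) * x0, built only from add, sub, and mul. */
float nr_recip_step(float a, float x0)
{
    return (x0 + x0) - (a * x0) * x0;
}

/* Hypothetical helper: one Newton-Raphson step toward 1/sqrt(a).
 * x1 = 0.5 * x0 * (3 - a * x0 * x0), built only from sub and mul. */
float nr_rsqrt_step(float a, float x0)
{
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}
```

Each step roughly squares the relative error of the estimate, which is why a single refinement after a low-precision hardware approximation is often enough for single precision.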
2. pL = p + 4*L;
   pH = p + 4*U;
   qL = q + 4*L;
   qH = q + 4*U;
   if (pH < qL || pL > qH) {
       /* loop without data dependence */
       for (i = L; i <= U; i++) {

   Vectorization Examples

   This section contains a few simple examples of some common issues in vector programming.

   Argument Aliasing: A Vector Copy

   The following loop, a vector copy operation, vectorizes because the compiler can prove that dest[i] and src[i] are distinct. Without the restrict qualifier, it does so through multi-version code guarded by a run-time overlap test.

   Vectorizable Copy Due To Unproven Distinction

       void vec_copy(float *dest, float *src, int len)
       {
           int i;
           for (i = 0; i < len; i++) {
               dest[i] = src[i];
           }
       }

   The restrict keyword in the following example indicates that the pointers refer to distinct objects. Therefore, the compiler allows vectorization without generation of multi-version code.

   Using restrict to Prove Vectorizable Distinction

       void vec_copy(float *restrict dest, float *restrict src, int len)
       {
           int i;
           for (i = 0; i < len; i++) {
               dest[i] = src[i];
           }
       }

   Data Alignment

   A 16-byte or greater data structure or array should be aligned so that the base address of each structure or array element is a multiple of sixteen. The Misaligned Data Crossing 16-Byte Boundary figure shows the effect of a data cache unit (DCU) split due to misaligned data. The code loads the misaligned data acros
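The restrict variant in excerpt 2 can be written as a complete ISO C99 function. The sketch below is one such version, under a few assumptions: the name vec_copy_restrict is hypothetical, the length is taken as size_t rather than int, and src is marked const. The caller remains responsible for passing arrays that do not overlap, since restrict is a promise and the language does not check it.

```c
#include <stddef.h>

/* Copy len floats from src to dest.
 * The restrict qualifiers promise that dest and src do not overlap, so a
 * compiler may vectorize the loop without a run-time overlap test. */
void vec_copy_restrict(float *restrict dest, const float *restrict src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dest[i] = src[i];
    }
}
```

Calling this function with overlapping arrays is undefined behavior, which is the trade-off for dropping the multi-version code.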
3. ipo_function_reorder        Interprocedural Optimizer, function reorder
   ilo_constant_propagation    Intermediate Language Scalar Optimizer, constant propagation
   ilo_copy_propagation        Intermediate Language Scalar Optimizer, copy propagation
   ecg_software_pipelining     Code Generator, software pipelining

   Reports are generated for every optimizer whose name begins with the specified prefix. For example, if -opt_report_phase ilo_co is specified, reports from both constant propagation and copy propagation are generated.

   The Availability of Report Generation

   The -opt_report_help option lists the logical names of optimizers available for report generation.

   Timing Your Application

   How fast your application executes is one indication of performance. When timing the speed of applications, consider the following circumstances:

   • Run program timings when other users are not active. Your timing results can be affected by one or more CPU-intensive processes running at the same time.
   • Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times of a previous version of the same program. Use the same system (processor model, amount of memory, version of the operating system, and so on) if possible.
   • If you do need to change systems, you should measure the time using the same version of the program on both syste
4. Intrinsic / Description

   unsigned __int64 _InterlockedExchangeU64(volatile unsigned __int64 *Target, unsigned __int64 value)
       Same as InterlockedExchange64, for unsigned quantities.

   unsigned __int64 _InterlockedCompareExchange64_rel(volatile unsigned __int64 *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
       Maps to the cmpxchg.rel instruction with appropriate setup. Atomically compare and exchange the value specified by the first argument (a 64-bit pointer).

   unsigned __int64 _InterlockedCompareExchange64_acq(volatile unsigned __int64 *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
       Maps to the cmpxchg.acq instruction with appropriate setup. Atomically compare and exchange the value specified by the first argument (a 64-bit pointer).

   (signature not legible in this excerpt)
       Same as the previous intrinsic, for signed quantities.

   __int64 _InterlockedExchangeAdd64(volatile __int64 *addend, __int64 increment)
       Use compare and exchange to do an atomic add of the increment value to the addend. Maps to a loop with the cmpxchg instruction to guarantee atomicity.

   __int64 _InterlockedAdd64(volatile __int64 *addend, __int64 increment)
       Same as the previous intrinsic, but returns the new value, not the original value. See Note.

   Note: _InterlockedSub64 is provided as a macro definition based on _InterlockedAdd64.
       #define _InterlockedSub64(target, incr) _InterlockedAdd64((target), (-(incr)))
   Uses cmpxchg to do an atomic sub of the incr value to the
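Excerpt 4 describes the atomic add as a loop around a compare-and-exchange. The same pattern can be sketched portably with C11 <stdatomic.h>. This is an illustration of the technique, not the Intel intrinsics themselves: the function names exchange_add64, add64, and sub64 are hypothetical, and the default sequentially consistent ordering is assumed in place of the .rel and .acq variants.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically add increment to *addend using a
 * compare-and-exchange loop. Returns the ORIGINAL value. */
int64_t exchange_add64(_Atomic int64_t *addend, int64_t increment)
{
    int64_t old = atomic_load(addend);
    /* On failure the call stores the current value into old, so each
     * retry recomputes old + increment from fresh data. */
    while (!atomic_compare_exchange_weak(addend, &old, old + increment)) {
        /* retry */
    }
    return old;
}

/* Hypothetical helper: same operation, but returns the NEW value. */
int64_t add64(_Atomic int64_t *addend, int64_t increment)
{
    return exchange_add64(addend, increment) + increment;
}

/* Subtraction expressed as addition of the negated amount, mirroring the
 * macro in the excerpt's note. */
int64_t sub64(_Atomic int64_t *target, int64_t incr)
{
    return add64(target, -incr);
}
```

The split between "returns the original value" and "returns the new value" follows the distinction the excerpt draws between the ExchangeAdd and Add forms.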
5. Note: There is no intrinsic for move operations. To move data from one register to another, a simple assignment, A = B, suffices, where A and B are the source and target registers for the move operation.

   The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the emmintrin.h header file.

   Floating-point Load Operations for Streaming SIMD Extensions 2

   The following load operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

   __m128d _mm_load_pd(double const *dp)
       (uses MOVAPD) Loads two DP FP values. The address p must be 16-byte aligned.
       r0 := p[0];  r1 := p[1]

   __m128d _mm_load1_pd(double const *dp)
       (uses MOVSD + shuffling) Loads a single DP FP value, copying to both elements. The address p need not be 16-byte aligned.
       r0 := *p;  r1 := *p

   __m128d _mm_loadr_pd(double const *dp)
       (uses MOVAPD + shuffling) Loads two DP FP values in reverse order. The address p must be 16-byte aligned.
       r0 := p[1];  r1 := p[0]

   __m128d _mm_loadu_pd(double const *dp)
       (uses MOVUPD) Loads two DP FP values. The address p need not be 16-byte aligned.
       r0 := p[0];  r1 := p[1]

   __m128d _mm_load_sd(double const *dp)
       (uses MOVSD) Loads a DP FP value. The upper DP FP is set to zero. The address p need not be 16-byte aligned.
       r0 := *p;  r1 := 0.0

   __m128d _mm_loadh_pd(__m128d a, double const *dp)
6. • Produces executable output with filename a.out.
   • Invokes options specified in a configuration file first. See Configuration Files.
   • The location of shared objects is specified by the LD_LIBRARY_PATH environment variable.
   • Sets 8 bytes as the strictest alignment constraint for structures.
   • Displays error and warning messages.
   • Performs standard optimizations using the default -O2 option. See Setting Optimization Levels.
   • On operating systems that support characters in Unicode (multi-byte) format, the compiler will process file names containing these characters.

   If the compiler does not recognize a command-line option, that option is ignored and a warning is displayed. See Diagnostic Messages for detailed descriptions about system messages.

   Compilation Phases

   To produce an executable file, the compiler performs by default the compile and link phases. When invoked, the compiler driver determines which compilation phases to perform based on the file name extension and the compilation options specified in the command line.

   The compiler passes object files and any unrecognized file name to the linker. The linker then determines whether the file is an object file (.o) or a library (.a). The compiler driver handles all types of input files correctly, thus it can be used to invoke any phase of compilation.

   The relationship of the compiler to system-specific programming support tools is presented in this diagram:
7. -c99[-] option can be used as -c99 (enable c99 support) or -c99- (disable c99 support).
   • Indicates that the value n in [ ] can be omitted or have various values.
   • Used for option's version; for example, option -x{K|W|N|B|P} has these versions: -xK, -xW, -xN, -xB, and -xP.
   • Indicates that option must include one of the fixed values for n.
   • Indicate option's required argument(s). Arguments are separated by comma if more than one are required.

   Some compiler options are only available on certain systems. In the following table these options are indicated with labels as follows:

   Label    Meaning
   i32      Option available on IA-32-based systems
   i32em    Option available on Intel Extended Memory 64 Technology (Intel EM64T) systems
   i64      Option available on Itanium-based systems

   • If no label is present, the option is available on all supported systems.
   • If "only" appears in the label, that option is only available on the identified system.

   Compiler Options Quick Reference

   The compiler options listed in the following table are new to this release.

   Option / Description / Default
   -cxxlib-gcc[=GCC-root-dir]
       Specifies the top-level location of the gcc binaries and libraries. Default: OFF
   -debug [no]inline_debug_info
       Produces enhanced so

   Further options listed in this excerpt (descriptions cut off): -debug [no]variable_locations, -debug extended, -export, -export_dir dir, -fabi-version, -finline-functions, -fno-exceptions, -fno-implicit-inline-templates
8. Option / Description / Default

   -rcd (i32 only)
       Disables changing of the FPU rounding control. Enables fast float-to-int conversions. Default: OFF
   -reserve-kernel-regs (i64 only)
       Reserves registers f12-f15 and f32-f127 for use by the kernel. These will not be used by the compiler. Default: OFF
   -[no]restrict
       Enables/disables pointer disambiguation with the restrict qualifier. Default: OFF
   -S
       Generates assemblable files with .s suffix, then stops the compilation. Default: OFF
   -scalar_rep[-]
       Enables/disables scalar replacement performed during loop transformations. Default: OFF
   -shared
       Produce a shared object. Default: OFF
   -shared-libcxa
       Link Intel libcxa C++ library dynamically. Default: OFF
   -sox[-] (i32, i32em)
       Enables/disables the saving of compiler options and version information in the executable file.
   -static
       Prevents linking with shared libraries. Default: OFF
   -static-libcxa
       Link Intel libcxa C++ library statically. Default: OFF
   -std=gnu89
       ISO C90 plus GNU extensions. Includes some C99 features. Default: ON
   -std=gnu++98
       Same as -std=gnu89. Default: OFF
   -strict_ansi
       Strict ANSI conformance dialect. Default: OFF
   -syntax
       Checks the syntax of a program and stops the compilation process after the C or C++ source files and preprocessed source files have been parsed. Generates no code and produces no output files. Warnings and messages appear on stderr. Default: OFF
   T fi
9. 0) {
       p = (16 - p) / 4;
       for (i = 0; i < p; i++) {
           a[i] = a[i] + 1.0f;
       }
   }
   /* loop with a aligned (will be vectorized accordingly) */
   for (i = p; i < 100; i++) {
       a[i] = a[i] + 1.0f;
   }

   #pragma novector

   Syntax: #pragma novector

   Definition: The novector loop pragma specifies that the loop should never be vectorized, even if it is legal to do so. In this example, suppose you know the trip count (ub - lb) is too low to make vectorization worthwhile. You can use #pragma novector to tell the compiler not to vectorize, even if the loop is considered vectorizable.

   Example:

       void foo(int lb, int ub)
       {
           #pragma novector
           for (j = lb; j < ub; j++) {
               a[j] = a[j] + b[j];
           }
       }

   #pragma vector nontemporal

   Syntax: #pragma vector nontemporal

   Definition: #pragma vector nontemporal results in streaming stores on Pentium 4 based systems. An example loop (float type) together with the generated assembly is shown in the example. For large N, significant performance improvements result on Pentium 4 systems over a non-streaming implementation.

   Example:

       #pragma vector nontemporal
       for (i = 0; i < N; i++)
           a[i] = 1;

       .B1.2:
           movntps XMMWORD PTR _a[eax], xmm0
           movntps XMMWORD PTR _a[eax+16], xmm0
           add     eax, 32
           cmp     eax, 4096
           jl      .B1.2

   Dynamic Dependence Testing Example

       float *p, *q;
       for (i = L; i <= U; i++) {
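The fragment at the top of excerpt 9 peels scalar iterations until the array pointer reaches a 16-byte boundary, then runs the remaining loop from an aligned address. A self-contained sketch of that peeling idea follows. The function name add_one_peeled, the size_t length parameter, and the clamp for short arrays are assumptions added for this example; the pointer is assumed to be at least 4-byte aligned, as any valid float pointer is on typical targets.

```c
#include <stddef.h>
#include <stdint.h>

/* Add 1.0f to each of the n elements of a.
 * Returns the number of peeled (prologue) iterations. */
size_t add_one_peeled(float *a, size_t n)
{
    size_t p = 0;
    uintptr_t misalign = (uintptr_t)a & 0x0f;   /* byte offset inside a 16-byte block */

    if (misalign != 0) {
        /* elements needed to reach the next 16-byte boundary */
        p = (size_t)((16 - misalign) / sizeof(float));
        if (p > n) {
            p = n;                              /* array ends before the boundary */
        }
    }
    for (size_t i = 0; i < p; i++) {            /* peeled scalar prologue */
        a[i] = a[i] + 1.0f;
    }
    for (size_t i = p; i < n; i++) {            /* &a[p] is 16-byte aligned here */
        a[i] = a[i] + 1.0f;
    }
    return p;
}
```

Whether a given compiler vectorizes the second loop with aligned accesses depends on what it can infer; the sketch only shows the index arithmetic.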
10. Exported Templates

    The Intel C++ Compiler supports exported templates using the following options:

    Option              Description
    -export             Enable recognition of exported templates. Supported in C++ mode only.
    -export_dir dir     Specifies a directory name to be placed on the exported template search path. The directories are used to find the definitions of exported templates and are searched in the order in which they are specified on the command line. The current directory is always the first entry on the search path.

    Exported templates are templates declared with the export keyword. Exporting a class template is equivalent to exporting each of its static data members and each of its non-inline member functions. An exported template is unique because its definition does not need to be present in a translation unit that uses that template. For example, the following C++ program consists of two separate translation units:

    // file1.cpp
    #include <stdio.h>
    static void trace() { printf("File 1\n"); }
    export template<class T> T const& min(T const&, T const&);
    int main() {
        trace();
        return min(2, 3);
    }

    // file2.cpp
    #include <stdio.h>
    static void trace() { printf("File 2\n"); }
    export template<class T> T const& min(T const &a, T const &b) {
        trace();
        return a < b ? a : b;
    }

    Note that these two files are separate tran
11. [Table residue: a cross-reference table whose cells read N/A. The only legible row labels are stream_si128, clflush, lfence, _mfence, stream_si32, and pause.]

    Intel C++ Class Libraries

    The Intel C++ Class Libraries enable Single-Instruction, Multiple-Data (SIMD) operations. The principle of SIMD operations is to exploit microprocessor architecture through parallel processing. The effect of parallel processing is increased data throughput using fewer clock cycles. The objective is to improve application performance of complex and computation-intensive audio, video, and graphical data bit streams.

    Hardware and Software Requirements

    You must have the Intel C++ Compiler version 4.0 or higher installed on your system to use the class libraries. The Intel C++ Class Libraries are functions abstracted from the instruction extensions available on Intel processors as specified in the table that follows.

    Processor Requirements for Use of C
12. Invoking the Compiler with icc or icpc

    You can invoke the Intel C++ Compiler on the command line with either icc or icpc.

    • When you invoke the compiler with icc, the compiler builds C source files using C libraries and C include files. If you use icc with a C++ source file, it will be compiled as a C++ file. Use icc to link C object files.
    • When you invoke the compiler with icpc, the compiler builds C++ source files using C++ libraries and C++ include files. If you use icpc with a C source file, it will be compiled as a C++ file. Use icpc to link C++ object files.

    Command-line Syntax

    When you invoke the Intel C++ Compiler with icc or icpc, use the following syntax:

        prompt>{icc|icpc} [options] file1 [file2 ...]

    Argument          Description
    options           Indicates one or more command-line options. The compiler recognizes one or more letters preceded by a hyphen (-). This includes linker options. See the Options Quick Reference.
    file1, file2 ...  Indicates one or more files to be processed by the compilation system. You can specify more than one file. Use a space as a delimiter for multiple files.

    Example:

        prompt>icpc -prec_div -axP -Bstatic my_source1.cpp my_source2.cpp

    Invoking the Compiler from the Command Line with make

    To run make from the command line using Intel C++ Compiler, make sure that /usr/bin is in your path. If you use a C shell, you can edit yo
13. [Table residue: element-count and bit-width columns (8 x 8, 4 x 16, 2 x 32, with "high" and "low" variants); the row labels are not legible.]

    Add the eight signed 8-bit values in m1 to the eight signed 8-bit values in m2 using saturating arithmetic.

    __m64 _m_paddsw(__m64 m1, __m64 m2)
        Add the four signed 16-bit values in m1 to the four signed 16-bit values in m2 using saturating arithmetic.

    __m64 _m_paddusb(__m64 m1, __m64 m2)
        Add the eight unsigned 8-bit values in m1 to the eight unsigned 8-bit values in m2 using saturating arithmetic.

    __m64 _m_paddusw(__m64 m1, __m64 m2)
        Add the four unsigned 16-bit values in m1 to the four unsigned 16-bit values in m2 using saturating arithmetic.

    __m64 _m_psubb(__m64 m1, __m64 m2)
        Subtract the eight 8-bit values in m2 from the eight 8-bit values in m1.

    __m64 _m_psubw(__m64 m1, __m64 m2)
        Subtract the four 16-bit values in m2 from the four 16-bit values in m1.

    __m64 _m_psubd(__m64 m1, __m64 m2)
        Subtract the two 32-bit values in m2 from the two 32-bit values in m1.

    __m64 _m_psubsb(__m64 m1, __m64 m2)
        Subtract the eight signed 8-bit values in m2 from the eight signed 8-bit values in m1 using saturating arithmetic.

    __m64 _m_psubsw(__m64 m1, __m64 m2)
        Subtract the four signed 16-bit values in m2 from the four signed 16-bit values in m1 using saturating arithmetic.

    __m64 _m_psubusb(__m64 m1, __m64 m2)
        Subtract the eight unsigned 8-bit values in m2 from the eight unsigned 8
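Several intrinsics in excerpt 13 use saturating arithmetic, in which a result that would overflow is clamped to the nearest representable value instead of wrapping around. The scalar sketch below models a single lane of that behavior. The function names are hypothetical, and this is a reference model of the arithmetic, not a use of the MMX intrinsics themselves.

```c
#include <stdint.h>

/* One signed 16-bit lane of a saturating add: clamp to [INT16_MIN, INT16_MAX]. */
int16_t adds_i16(int16_t x, int16_t y)
{
    int32_t s = (int32_t)x + (int32_t)y;   /* widen so the sum cannot overflow */
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* One signed 16-bit lane of a saturating subtract. */
int16_t subs_i16(int16_t x, int16_t y)
{
    int32_t d = (int32_t)x - (int32_t)y;
    if (d > INT16_MAX) return INT16_MAX;
    if (d < INT16_MIN) return INT16_MIN;
    return (int16_t)d;
}

/* One unsigned 8-bit lane of a saturating add: clamp to UINT8_MAX. */
uint8_t adds_u8(uint8_t x, uint8_t y)
{
    unsigned s = (unsigned)x + (unsigned)y;
    return (uint8_t)(s > UINT8_MAX ? UINT8_MAX : s);
}
```

A packed intrinsic applies the same rule independently to every lane of its operands.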
14. Convert the lower double-precision floating-point value of A to a 32-bit integer with truncation.
        int F64vec2ToInt(F64vec2 A);
        r := (int)A0;

    Convert the two least significant single-precision floating-point values of A (out of four) to two double-precision floating-point values.
        F64vec2 F32vec4ToF64vec2(F32vec4 A);
        r0 := (double)A0;
        r1 := (double)A1;

    Convert the two double-precision floating-point values of A to two single-precision floating-point values.
        F32vec4 F64vec2ToF32vec4(F64vec2 A);
        r0 := (float)A0;
        r1 := (float)A1;

    Convert the signed int in B to a double-precision floating-point value and pass the upper double-precision value from A through to the result.
        F64vec2 InttoF64vec2(F64vec2 A, int B);
        r0 := (double)B;
        r1 := A1;

    Convert the lower floating-point value of A to a 32-bit integer with truncation.
        int F32vec4ToInt(F32vec4 A);
        r := (int)A0;

    Convert the two lower floating-point values of A to two 32-bit integers with truncation, returning the integers in packed form.
        Is32vec2 F32vec4ToIs32vec2(F32vec4 A);
        r0 := (int)A0;
        r1 := (int)A1;

    Convert the 32-bit integer value B to a floating-point value; the upper three floating-point values are passed through from A.
        F32vec4 IntToF32vec4(F32vec4 A, int B);
        r0 := (float)B;
        r1 := A1;
        r2 := A2;
        r3 := A3;

    Convert the two 32-bit integer values in packed form in B to two floating-point values; the upper two floa
15. Returns      Example Syntax                       Intrinsic

    (syntax rows for cmpeq not legible)               _mm_cmpeq_ps, _mm_cmpeq_pd, _mm_cmpeq_ss

    4 floats     F32vec4 R = cmpneq(F32vec4 A);       _mm_cmpneq_ps
    2 doubles    F64vec2 R = cmpneq(F64vec2 A);       _mm_cmpneq_pd
    1 float      F32vec1 R = cmpneq(F32vec1 A);       _mm_cmpneq_ss

    4 floats     F32vec4 R = cmplt(F32vec4 A);        _mm_cmplt_ps
    2 doubles    F64vec2 R = cmplt(F64vec2 A);        _mm_cmplt_pd
    1 float      F32vec1 R = cmplt(F32vec1 A);        _mm_cmplt_ss

    Compare for Less Than or Equal
    4 floats     F32vec4 R = cmple(F32vec4 A);        _mm_cmple_ps
    2 doubles    F64vec2 R = cmple(F64vec2 A);        _mm_cmple_pd
    1 float      F32vec1 R = cmple(F32vec1 A);        _mm_cmple_ss

    Compare for Greater Than
    4 floats     F32vec4 R = cmpgt(F32vec4 A);        _mm_cmpgt_ps
    2 doubles    F64vec2 R = cmpgt(F64vec2 A);        _mm_cmpgt_pd
    1 float      F32vec1 R = cmpgt(F32vec1 A);        _mm_cmpgt_ss

    Compare for Greater Than or Equal To
    4 floats     F32vec4 R = cmpge(F32vec4 A);        _mm_cmpge_ps
    2 doubles    F64vec2 R = cmpge(F64vec2 A);        _mm_cmpge_pd
    1 float      F32vec1 R = cmpge(F32vec1 A);        _mm_cmpge_ss

    Compare for Not Less Than
    4 floats     F32vec4 R = cmpnlt(F32vec4 A);       _mm_cmpnlt_ps
    2 doubles    F64vec2 R = cmpnlt(F64vec2 A);       _mm_cmpnlt_pd
    1 float      F32vec1 R = cmpnlt(F32vec1 A);       _mm_cmpnlt_ss

    Compare for Not Less Than or Equal
    4 floats     F32vec4 R = cmpnle(F32vec4 A);
    2 doubles    F64vec2 R = cmpnle(F64vec2 A);
    1 float      F32vec1 R = cm
16. Response Files

    Use response files to specify options used during particular compilations. Response files are invoked as an option on the command line. Options in a response file are inserted in the command line at the point where the response file is invoked.

    Sample Response Files

        # response file: response1.txt
        # compile with these options
        -axP -pch
        # end of response1 file

        # response file: response2.txt
        # compile with these options
        -mp1 -strict_ansi
        # end of response2 file

    Use response files to decrease the time spent entering command-line options, and to ensure consistency by automating command-line entries. Use individual response files to maintain options for specific projects, which avoids editing the configuration file when changing projects.

    Any number of options or file names can be placed on a line in the response file. Several response files can be referenced in the same command line. The syntax for using response files is as follows:

        prompt>icpc @response1.txt source1.cpp @response2.txt source2.cpp

    Note: An "at" sign (@) must precede the name of the response file on the command line.

    Include Files

    Include directories are searched in the default system areas and whatever is specified by the -Idirectory option. For multiple search directories, multiple -Idirectory commands must be used. The compiler searches directories for include files in
17. The -O3 option enables the -O2 option and adds more aggressive optimizations; for example, loop transformation and prefetching. -O3 optimizes for maximum speed, but may not improve performance for some programs.

    IA-32 Applications

    In conjunction with the vectorization options, -ax{K|W|N|B|P} and -x{K|W|N|B|P}, the -O3 option causes the compiler to perform more aggressive data dependency analysis than the default -O2. This may result in longer compilation times.

    Itanium-based Applications

    The -ivdep_parallel option asserts there is no loop-carried dependency in the loop where an IVDEP directive is specified. This is useful for sparse matrix applications.

    Loop Transformations

    The loop transformation techniques include:
    • Loop normalization
    • Loop reversal
    • Loop interchange and permutation
    • Loop distribution
    • Loop fusion
    • Scalar replacement
    • Absence of loop-carried memory dependency with IVDEP directive
    • Runtime data dependencies checking (Itanium-based systems only)

    The loop transformations listed above are supported by data dependence. The loop transformation techniques also include:
    • Induction variable elimination
    • Constant propagation
    • Copy propagation
    • Forward substitution
    • Dead code elimination

    In addition to the loop transformations listed for both IA-32 and Itanium architectures above, the Itanium architecture enables implementation of collapsing techniques.

    Scalar Replacemen
18. The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

    Intrinsic Name   Alternate Name       Operation                            Corresponding Instruction
    _m_pextrw        _mm_extract_pi16     Extract one of four words            PEXTRW
    _m_pinsrw        _mm_insert_pi16      Insert a word                        PINSRW
    _m_pmaxsw        _mm_max_pi16         Compute the maximum                  PMAXSW
    _m_pmaxub        _mm_max_pu8          Compute the maximum                  PMAXUB
    _m_pminsw        _mm_min_pi16         Compute the minimum                  PMINSW
    _m_pminub        _mm_min_pu8          Compute the minimum                  PMINUB
    _m_pmovmskb      _mm_movemask_pi8     Create an eight-bit mask             PMOVMSKB
    _m_pmulhuw       _mm_mulhi_pu16       Multiply, return high bits           PMULHUW
    _m_pshufw        _mm_shuffle_pi16     Return a combination of four words   PSHUFW
    _m_maskmovq      _mm_maskmove_si64    Conditional Store                    MASKMOVQ
    _m_pavgb         _mm_avg_pu8          Compute rounded average              PAVGB
    _m_pavgw         _mm_avg_pu16         Compute rounded average              PAVGW
    _m_psadbw        _mm_sad_pu8          Compute sum of absolute differences  PSADBW

    For these intrinsics you need to empty the multimedia state for the mmx register. See The EMMS Instruction: Why You Need It and When to Use It topic for more details.

    int _m_pextrw(__m64 a, int n)
        Extracts one of the four words of a. The sel
19. This intrinsic extracts a single-precision floating-point value from the first vector element of an __m128. It does so in the most efficient manner possible in the context used. This intrinsic doesn't map to any specific SSE instruction.

    Miscellaneous Intrinsics Using Streaming SIMD Extensions

    The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

    Intrinsic Name      Operation              Corresponding Instruction
    _mm_shuffle_ps      Shuffle
    _mm_unpackhi_ps     Unpack High
    _mm_unpacklo_ps     Unpack Low
    _mm_loadh_pi        Load High
    _mm_storeh_pi       Store High
    _mm_movehl_ps       Move High to Low
    _mm_movelh_ps       Move Low to High
    _mm_loadl_pi        Load Low
    _mm_storel_pi       Store Low
    _mm_movemask_ps     Create four-bit mask   MOVMSKPS

    __m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8)
        Selects four specific SP FP values from a and b, based on the mask imm8. The mask must be an immediate. See Macro Function for Shuffle Using Streaming SIMD Extensions for a description of the shuffle semantics.

    __m128 _mm_unpackhi_ps(__m128 a, __m128 b)
        Selects and interleaves the upper two SP FP values from a and b.
        r0 := a2;  r1 := b2;  r2 := a3;  r3 := b3

    __m128 _mm_unpacklo_ps(__m128 a, __m128 b)
        Selects and interleaves the lower two SP FP values from a and b.
        r0 := a0;  r1 := b0;  r2 := a1;  r3 := b1

    __m128 _mm_loadh_pi(__m128, __m64 const *p)
        Sets the upper two SP FP values with 64 bits
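The element mappings given for the two unpack operations in excerpt 19 can be checked with a scalar model. The sketch below represents each four-element vector as a plain float array, with index 0 as the lowest element. The function names unpackhi4 and unpacklo4 are hypothetical, and the result array is assumed not to overlap either input.

```c
/* Scalar model: r0 := a2; r1 := b2; r2 := a3; r3 := b3 */
void unpackhi4(float r[4], const float a[4], const float b[4])
{
    r[0] = a[2];
    r[1] = b[2];
    r[2] = a[3];
    r[3] = b[3];
}

/* Scalar model: r0 := a0; r1 := b0; r2 := a1; r3 := b1 */
void unpacklo4(float r[4], const float a[4], const float b[4])
{
    r[0] = a[0];
    r[1] = b[0];
    r[2] = a[1];
    r[3] = b[1];
}
```

Applying both operations to the same pair of inputs yields the full element-wise interleave of a and b, split across two results.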
20. _mm_set_ss
        B0 := f0;  B1 := 0;  B2 := 0;  B3 := 0
        Initializes the lowest value of B with f0 and the other values with 0.

    Example / Intrinsic / Returns
    F32vec1 B(int I);    _mm_cvtsi32_ss
        B0 := f0
        Initializes the lowest value of B with f0; other values are undefined.

    Arithmetic Operators

    The following table lists the arithmetic operators of the Fvec classes and generic syntax. The operators have been divided into standard and advanced operations, which are described in more detail later in this section.

    Fvec Arithmetic Operators

    Category   Operation                 Operators   Generic Syntax
    Standard   Addition                  +  +=       R = A + B;  R += A;
               Subtraction               -  -=       R = A - B;  R -= A;
               Multiplication            *  *=       R = A * B;  R *= A;
               Division                  /  /=       R = A / B;  R /= A;
    Advanced   Square Root               sqrt        R = sqrt(A);
               Reciprocal                rcp         R = rcp(A);
               (Newton-Raphson)          rcp_nr      R = rcp_nr(A);
               Reciprocal Square Root    rsqrt       R = rsqrt(A);
               (Newton-Raphson)          rsqrt_nr    R = rsqrt_nr(A);

    Standard Arithmetic Operator Usage

    The following two tables show the return values for each class of the standard arithmetic operators, which use the syntax styles described earlier in the Return Value Notation section.

    Standard Arithmetic Return Value Mapping
    (columns: F32vec4, F64vec2, F32vec1)

    Arithmetic with Assignment Return Value Mapping
    R Operat
21. _mm_sub_pi8

    The following table lists addition and subtraction return values for combinations of classes when the right-side operands are of different signedness. The two operands must be the same size; otherwise you must explicitly indicate the typecasting.

    Addition and Subtraction Operator Overloading

    Return Value   Available Operators   Right Side Operands
    I64vec2 R      Add, Sub              I[s|u]64vec2 A    I[s|u]64vec2 B
    I32vec4 R      Add, Sub              I[s|u]32vec4 A    I[s|u]32vec4 B
    I32vec2 R      Add, Sub              I[s|u]32vec2 A    I[s|u]32vec2 B
    I16vec8 R      Add, Sub              I[s|u]16vec8 A    I[s|u]16vec8 B
    I16vec4 R      Add, Sub              I[s|u]16vec4 A    I[s|u]16vec4 B
    I8vec8 R       Add, Sub              I[s|u]8vec8 A     I[s|u]8vec8 B
    I8vec16 R      Add, Sub              I[s|u]8vec16 A    I[s|u]8vec16 B

    The following table shows the return data type values for operands of the addition and subtraction operators with assignment. The left-side operand determines the size and signedness of the return value. The right-side operand must be the same size as the left operand; otherwise you must use an explicit typecast.

    Addition and Subtraction with Assignment

    Return Value (Left Side)   Add   Sub   Right Side
    I[x]32vec4 R               +=    -=    I[s|u]32vec4 A
    I[x]32vec2 R               +=    -=    I[s|u]32vec2 A
    I[x]16vec8 R               +=    -=    I[s|u]16vec8 A
    I[x]16vec4 R               +=    -=    I[s|u]16vec4 A
    I[x]8vec16 R               +=    -=    I[s|u]8vec16 A
    I[x]8vec8 R                +=    -=    I[s|u]8vec8 A

    Multiplication Operators

    The multiplication operators can only accept and return
22. (a7 * b7)[31:16]

    __m128i _mm_mullo_epi16(__m128i a, __m128i b)
        Multiplies the 8 signed or unsigned 16-bit integers from a by the 8 signed or unsigned 16-bit integers from b. Packs the lower 16 bits of the 8 signed or unsigned 32-bit results.
        r0 := (a0 * b0)[15:0]
        r1 := (a1 * b1)[15:0]
        ...
        r7 := (a7 * b7)[15:0]

    __m64 _mm_mul_su32(__m64 a, __m64 b)
        Multiplies the lower 32-bit integer from a by the lower 32-bit integer from b, and returns the 64-bit integer result.
        r := a0 * b0

    __m128i _mm_mul_epu32(__m128i a, __m128i b)
        Multiplies 2 unsigned 32-bit integers from a by 2 unsigned 32-bit integers from b. Packs the 2 unsigned 64-bit integer results.
        r0 := a0 * b0
        r1 := a2 * b2

    __m128i _mm_sad_epu8(__m128i a, __m128i b)
        Computes the absolute difference of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b. Sums the upper 8 differences and lower 8 differences, and packs the resulting 2 unsigned 16-bit integers into the upper and lower 64-bit elements.
        r0 := abs(a0 - b0) + abs(a1 - b1) + ... + abs(a7 - b7)
        r1 := 0x0;  r2 := 0x0;  r3 := 0x0
        r4 := abs(a8 - b8) + abs(a9 - b9) + ... + abs(a15 - b15)
        r5 := 0x0;  r6 := 0x0;  r7 := 0x0

    __m128i _mm_sub_epi8(__m128i a, __m128i b)
        Subtracts the 16 signed or unsigned 8-bit integers of b from the 16 signed or unsigned 8-bit integers of a.
        r0 := a0 - b0
        r1 := a1 - b1
        ...
        r15 := a15 - b15
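The result layout described for _mm_sad_epu8 in excerpt 22 places two sums in elements r0 and r4 and zeros elsewhere. A scalar reference model makes that layout concrete. The function name sad_u8x16 and the array-based representation (8 result elements of 16 bits, 16 input elements of 8 bits, index 0 lowest) are assumptions for this sketch; it does not call the intrinsic.

```c
#include <stdint.h>

/* Scalar model of a 16-lane sum of absolute differences.
 * r[0] receives the sum over lanes 0..7, r[4] the sum over lanes 8..15,
 * and every other element of r is zero. */
void sad_u8x16(uint16_t r[8], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        r[i] = 0;
    }
    for (int i = 0; i < 16; i++) {
        int d = (int)a[i] - (int)b[i];
        /* lanes 0..7 map to r[0], lanes 8..15 map to r[4] */
        r[(i / 8) * 4] += (uint16_t)(d < 0 ? -d : d);
    }
}
```

Each sum is at most 8 x 255 = 2040, so it always fits in a 16-bit element.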
23. b2) ? 0xffffffff : 0x0
    r3 := !(a3 <= b3) ? 0xffffffff : 0x0

    __m128 _mm_cmpngt_ss(__m128 a, __m128 b)
        Compare for not-greater-than.
        r0 := !(a0 > b0) ? 0xffffffff : 0x0
        r1 := a1;  r2 := a2;  r3 := a3

    __m128 _mm_cmpngt_ps(__m128 a, __m128 b)
        Compare for not-greater-than.
        r0 := !(a0 > b0) ? 0xffffffff : 0x0
        r1 := !(a1 > b1) ? 0xffffffff : 0x0
        r2 := !(a2 > b2) ? 0xffffffff : 0x0
        r3 := !(a3 > b3) ? 0xffffffff : 0x0

    __m128 _mm_cmpnge_ss(__m128 a, __m128 b)
        Compare for not-greater-than-or-equal.
        r0 := !(a0 >= b0) ? 0xffffffff : 0x0
        r1 := a1;  r2 := a2;  r3 := a3

    __m128 _mm_cmpnge_ps(__m128 a, __m128 b)
        Compare for not-greater-than-or-equal.
        r0 := !(a0 >= b0) ? 0xffffffff : 0x0
        r1 := !(a1 >= b1) ? 0xffffffff : 0x0
        r2 := !(a2 >= b2) ? 0xffffffff : 0x0
        r3 := !(a3 >= b3) ? 0xffffffff : 0x0

    __m128 _mm_cmpord_ss(__m128 a, __m128 b)
        Compare for ordered.
        r0 := (a0 ord? b0) ? 0xffffffff : 0x0
        r1 := a1;  r2 := a2;  r3 := a3

    __m128 _mm_cmpord_ps(__m128 a, __m128 b)
        Compare for ordered.
        r0 := (a0 ord? b0) ? 0xffffffff : 0x0
        r1 := (a1 ord? b1) ? 0xffffffff : 0x0
        r2 := (a2 ord? b2) ? 0xfffffff
24. c[i] = a[i] + b[i];
        #pragma distribute point
        /* Distribution will start here, ignoring all loop-carried dependency */
        sub(a, n);
        c[i] = a[i] + b[i];

    Loop Unrolling Support

    unroll Directive

    The unroll directive (unroll(n) | nounroll) tells the compiler how many times to unroll a counted loop. The syntax for this directive is:

        #pragma unroll
        #pragma unroll(n)
        #pragma nounroll

    where n is an integer constant from 0 through 255.

    The unroll directive must precede the for statement for each for loop it affects. If n is specified, the optimizer unrolls the loop n times. If n is omitted, or if it is outside the allowed range, the optimizer assigns the number of times to unroll the loop.

    The unroll directive overrides any setting of loop unrolling from the command line.

    The directive can be applied only for the innermost nested loop. If applied to the outer loops, it is ignored. The compiler generates correct code by comparing n and the loop count.

    Example of unroll Directive

        #pragma unroll(4)
        for (i = 1; i < m; i++) {

    Prefetching Support

    prefetch Directive

    The prefetch and noprefetch directives assert that the data prefetches are generated or not generated for some memory references. This affects the heuristics used in the compiler. The syntax for this directive is:

        #pragma noprefetch
        #pragma prefetch
        #pragma prefetch a,b

    If the expressi
25. -g
        Generates symbolic debugging information and line numbers in the object code for use by the source-level debuggers. Turns off -O2 and makes -O0 the default unless -O1, -O2, or -O3 is explicitly specified in the command line together with -g.
    -fp
        Disable using the EBP register as general purpose register.

    Option               Effect on -fp
    -O1, -O2, or -O3     Disables -fp
    -O0                  Enables -fp

    Combining Optimization and Debugging

    The -O0 option turns off all optimizations so you can debug your program before any optimization is attempted. To get the debug information, use the -g option.

    The compiler lets you generate code to support symbolic debugging while -O1, -O2, or -O3 is specified on the command line along with -g, which produces symbolic debug information in the object file.

    Note that if you specify the -O1, -O2, or -O3 option with the -g option, some of the debugging information returned may be inaccurate as a side effect of optimization.

    It is best to make your optimization and/or debugging choices explicit:

    • If you need to debug your program excluding any optimization effect, use the -O0 option, which turns off all the optimizations.
    • If you need to debug your program with optimization enabled, then you can specify the -O1, -O2, or -O3 option on the command line along with -g.

    Note: The -g option slows down the program when -O1, -O2, or -O3 is not specified. In this case -g turns
26. r := (a0 != b0) ? 0x1 : 0x0. int _mm_ucomieq_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned. r := (a0 == b0) ? 0x1 : 0x0. int _mm_ucomilt_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned. r := (a0 < b0) ? 0x1 : 0x0. int _mm_ucomile_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned. r := (a0 <= b0) ? 0x1 : 0x0. int _mm_ucomigt_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned. r := (a0 > b0) ? 0x1 : 0x0. int _mm_ucomige_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned. r := (a0 >= b0) ? 0x1 : 0x0. int _mm_ucomineq_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned. r := (a0 != b0) ? 0x1 : 0x0. Floating-point Conversion Operations for Streaming SIMD Extensions 2: Each conversion intrinsic takes one data type and performs a conversion to a different type. Some conversions
27. (uses MOVHPD) Loads a DP FP value as the upper DP FP value of the result. The lower DP FP value is passed through from a. The address p need not be 16-byte aligned. r0 := a0; r1 := *p. __m128d _mm_loadl_pd(__m128d a, double const *p) (uses MOVLPD): Loads a DP FP value as the lower DP FP value of the result. The upper DP FP value is passed through from a. The address p need not be 16-byte aligned. r0 := *p; r1 := a1. Floating-point Set Operations for Streaming SIMD Extensions 2: The following set operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file. __m128d _mm_set_sd(double w) (composite): Sets the lower DP FP value to w and sets the upper DP FP value to zero. r0 := w; r1 := 0.0. __m128d _mm_set1_pd(double w) (composite): Sets the 2 DP FP values to w. r0 := w; r1 := w. __m128d _mm_set_pd(double w, double x) (composite): Sets the lower DP FP value to x and sets the upper DP FP value to w. r0 := x; r1 := w. __m128d _mm_setr_pd(double w, double x) (composite): Sets the lower DP FP value to w and sets the upper DP FP value to x. r0 := w; r1 := x. __m128d _mm_setzero_pd(void) (uses XORPD): Sets the 2 DP FP values to zero. r0 := 0.0; r1 := 0.0. __m128d _mm_move_sd(__m128d a, __m128d b) (uses MOVSD): Sets the lower DP FP value to the lower DP FP value of
28. -w1: Display warnings and errors. Deprecated and Unsupported Compiler Options. Deprecated Options: Occasionally, compiler options are marked as deprecated. Deprecated options are still supported in the current release but are planned to be unsupported in future releases. The following options are deprecated in this release of the Intel C++ Compiler: • -Qansi. Deprecated options are not limited to this list. Unsupported Options: Some Intel C++ Compiler options are no longer supported. If you use an unsupported option, the compiler issues a warning, ignores the option, then proceeds with compilation. This version of the Intel C++ Compiler no longer supports the following compiler options: -axi, -axM, -xi, -xM, -0f_check, -fdiv_check. Unsupported options are not limited to this list. Getting Started: You can invoke the compiler from a system command prompt, or you can use the compiler with the Eclipse Integrated Development Environment. Getting Help: Documentation conventions are described in How to Use This Document. If you are using the compiler from the command line, you can execute icc -help for a summary of command-line options. If you need additional help in using the Intel C++ Compiler, see Product Web Site and Support. Default Behavior of the Compiler: If you do not specify any options when you invoke the Intel C++ Compiler, the compiler uses the following default settings
29. -xN. Option Use / Option. Category: Run time. Generate traceback information: -traceback. Category: Libraries (Library Options). Maximize speed across entire program: -fast. Enable interprocedural optimization for single-file compilation: -ip. Link with static libraries: -static. Link Intel libcxa C++ library statically: -static-libcxa. Link with dynamic libraries: -i_dynamic. Use no C++ libraries: -no_cpprt. Use no system libraries: -nodefaultlibs. gcc compatibility options: -cxxlib-icc, -cxxlib-gcc, -fabi-version, -gcc-version. Process OpenMP directives: -openmp, -openmp_stubs. Additional libraries. Search directory for libraries. Archiver options. Standard and Managed Makefiles: When you create a new Intel C++ project in Eclipse/CDT, you can select either Standard Make C++ Project or Managed Make C++ Project. • Select Standard Make C++ Project if your project already includes a makefile. • Use Managed Make C++ Project to build a makefile using Intel compiler-specific options assigned from property pages. Exporting Makefiles: If you created a Managed Make C++ Project, you can use Eclipse to build a makefile that includes Intel compiler options. Se
30. A2; R5 := B2; R6 := A3; R7 := B3. Corresponding intrinsic: _mm_unpacklo_epi16. Interleave the two 16-bit values from the low half of A with the two 16-bit values from the low half of B. I16vec4 unpack_low(I16vec4 A, I16vec4 B); Is16vec4 unpack_low(Is16vec4 A, Is16vec4 B); Iu16vec4 unpack_low(Iu16vec4 A, Iu16vec4 B). R0 := A0; R1 := B0; R2 := A1; R3 := B1. Corresponding intrinsic: _mm_unpacklo_pi16. Interleave the eight 8-bit values from the low half of A with the eight 8-bit values from the low half of B. I8vec16 unpack_low(I8vec16 A, I8vec16 B); Is8vec16 unpack_low(Is8vec16 A, Is8vec16 B); Iu8vec16 unpack_low(Iu8vec16 A, Iu8vec16 B). R0 := A0; R1 := B0; R2 := A1; R3 := B1; R4 := A2; R5 := B2; R6 := A3; R7 := B3; R8 := A4; R9 := B4; R10 := A5; R11 := B5; R12 := A6; R13 := B6; R14 := A7; R15 := B7. Corresponding intrinsic: _mm_unpacklo_epi8. Interleave the four 8-bit values from the low half of A with the four 8-bit values from the low half of B. I8vec8 unpack_low(I8vec8 A, I8vec8 B); Is8vec8 unpack_low(Is8vec8 A, Is8vec8 B); Iu8vec8 unpack_low(Iu8vec8 A, Iu8vec8 B). R0 := A0; R1 := B0; R2 := A1; R3 := B1; R4 := A2; R5 := B2; R6 := A3; R7 := B3. Corresponding intrinsic: _mm_unpacklo_pi8. Pack Operators: Pack the eight 32-bit values found in A and B into eight 16-bit values with signed saturation. Is16vec8 pack_sat(Is32vec4 A, Is32vec4 B). Corresponding intrinsic: _mm_packs_epi32. Pack the fou
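The R0 := A0, R1 := B0, R2 := A1, R3 := B1 pattern in snippet 30 is an interleave of the low halves of two vectors. The scalar sketch below models only that element ordering on plain arrays. It is not the class library or the intrinsic itself, and the name unpack_low_model is a hypothetical helper introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of unpack_low for n 16-bit elements.
   The low n/2 elements of a and b are interleaved into r:
   r[0] = a[0], r[1] = b[0], r[2] = a[1], r[3] = b[1], and so on. */
void unpack_low_model(int16_t *r, const int16_t *a, const int16_t *b, int n)
{
    int i;

    assert(n > 0 && n % 2 == 0);

    for (i = 0; i < n / 2; i++) {
        r[2 * i]     = a[i];  /* even result slots come from A */
        r[2 * i + 1] = b[i];  /* odd result slots come from B  */
    }
}
```

With n = 4 this reproduces the four-element mapping listed for the I16vec4 variants; the upper halves of a and b are never read.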
31. CRITICAL constructs, ORDERED constructs, ATOMIC directives, etc. are successfully handled. The default is -openmp_report1. OpenMP Directives and Clauses. OpenMP Directives (Directive Name: Description). Directives listed: parallel, for, sections, single, parallel for, parallel sections, master, critical[lock], barrier, atomic, flush, ordered, threadprivate. parallel: Defines a parallel region. for: Identifies an iterative work-sharing construct that specifies a region in which the iterations of the associated loop should be executed in parallel. sections: Identifies a non-iterative work-sharing construct that specifies a set of constructs that are to be divided among threads in a team. single: Identifies a construct that specifies that the associated structured block is executed by only one thread in the team. parallel for: A shortcut for a parallel region that contains a single for directive. The parallel or for OpenMP directive must be immediately followed by a for statement. If you place another statement or an OpenMP directive between the parallel or for directive and the for statement, the Intel C++ Compiler issues a syntax error. parallel sections: Provides a shortcut form for specifying a parallel region containing a single sections directive. master: Identifies a construct that specifies a structured block that is executed by the master thread of the team. critical[lock]: Identifies a construct that restricts executio
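Snippet 31 states that the parallel for shortcut must be immediately followed by a for statement. A minimal sketch of that shape follows. The function scale_array and its parameters are assumptions made for illustration. When OpenMP processing is not enabled, the pragma is expected to be ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Hypothetical helper: dst[i] = k * src[i] for i in [0, n).
   Each iteration writes a distinct element, so the iterations are
   independent and can be divided among the threads of a team. */
void scale_array(float *dst, const float *src, int n, float k)
{
    int i;

    assert(n >= 0);

    /* Nothing may appear between the directive and the for statement. */
#pragma omp parallel for
    for (i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

The loop variable i is declared outside the loop to stay close to the style of the guide's own examples; OpenMP treats the iteration variable of the associated loop as private.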
32. Examples are included of optimizations supported by Intel extended directives and library routines that enhance and/or help analyze performance. Compiler Directives: This section discusses the language extended directives used in: Software Pipelining; Loop Count and Loop Distribution; Loop Unrolling; Prefetching; Vectorization. Pipelining for Itanium-based Applications: The swp and noswp directives indicate preference for a loop to get software-pipelined or not. The swp directive does not help data dependence, but overrides heuristics based on profile counts or lopsided control flow. The syntax for this directive is: #pragma swp; #pragma noswp. Example of swp Directive: #pragma swp for (i = 0; i < m; i++) if (a[i] == 0) The software pipelining optimization triggered by the swp directive applies instruction scheduling to certain innermost loops, allowing instructions within a loop to be split into different stages and so increasing instruction-level parallelism. This can reduce the impact of long-latency operations, resulting in faster loop execution. Loops chosen for software pipelining are always innermost loops that do not contain procedure calls that are not inlined. Because the optimizer no longer considers fully unrolled loops as innermost loops, fully unrolling loops can allow an additional loop to become the innermost loop. You can request and view the optimization report to s
33. Expression Operands: http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Extended-Asm.html#Extended%20Asm; Controlling Names Used in Assembler Code: Yes, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Asm-Labels.html#Asm%20Labels; Variables in Specified Registers: Yes, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Explicit-Reg-Vars.html#Explicit%20Reg%20Vars; Alternate Keywords: Yes, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Alternate-Keywords.html#Alternate%20Keywords; Incomplete enum Types: Yes, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Incomplete-Enums.html#Incomplete%20Enums; Function Names as Strings: Yes, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Function-Names.html#Function%20Names; Getting the Return or Frame Address of a Function: Yes, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Return-Address.html#Return%20Address; Using Vector Instructions Through Built-in Functions: No, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Vector-Extensions.html#Vector%20Extensions. (Columns: gcc Language Extension; Intel Support; GNU Description and Examples.) Other built-in functions provided by GCC: Yes, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Other-Builtins.html#Other%20Builtins; Built-in Functions Specific to Particular Target Machines: No, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Target-Builtins.html#Target%20Builtins; Pragmas Accepted by GCC: No, http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Pragmas.html#Pragmas; Unname
34. Browsing the Frames: The coverage tool creates frames that facilitate browsing through the code to identify uncovered code. The top frame displays the list of uncovered functions, while the bottom frame displays the list of covered functions. For uncovered functions, the total number of basic blocks of each function is also displayed. For covered functions, both the total number of blocks and the number of covered blocks, as well as their ratio (that is, the coverage rate), are displayed. For example, 66.67(4/6) indicates that four out of the six blocks of the corresponding function were covered. The block coverage rate of that function is thus 66.67%. These lists can be sorted based on the coverage rate, number of blocks, or function names. Function names are linked to the position in the source view where the function body starts. So, with one click, the user can see the least-covered function in the list, and with another click the browser displays the body of the function. The user can then scroll down in the source view and browse through the function body. Individual Module Source View: Within the individual module source views, the tool provides the list of uncovered functions as well as the list of covered functions. The lists are reported in two distinct frames that provide easy navigation of the source code. The lists can be sorted based on: • the number of blocks within uncovered functions
35. Intel Workqueuing Model: The workqueuing model lets you parallelize control structures that are beyond the scope of those supported by the OpenMP model, while attempting to fit into the framework defined by OpenMP. In particular, the workqueuing model is a flexible mechanism for specifying units of work that are not pre-computed at the start of the worksharing construct. For single, for, and sections constructs, all work units that can be executed are known at the time the construct begins execution. The workqueuing pragmas taskq and task relax this restriction by specifying an environment (the taskq) and the units of work (the tasks) separately. Workqueuing Constructs: taskq Pragma. The taskq pragma specifies the environment within which the enclosed units of work (tasks) are to be executed. From among all the threads that encounter a taskq pragma, one is chosen to execute it initially. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread, and then the code inside the taskq block is executed single-threaded. All the other threads wait for work to be enqueued on the conceptual queue. The task pragma specifies a unit of work, potentially executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is conceptually enqueued on the queue associated with the taskq. The conceptual queue is disbanded when all work enqueued on it finishes and when the end of
36. Issue command: rm PROF_DIR/*.dyn. Make sure that there are no unrelated .dyn files present. 4. Issue command: myApp < data1. Invocation of this command runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR. 5. Issue command: profmerge -prof_dpi Test1.dpi. At this step, the profmerge tool merges all the .dyn files into one file (Test1.dpi) that represents the total profile information of the application on Test1. 6. Issue command: rm PROF_DIR/*.dyn. Make sure that there are no unrelated .dyn files present. 7. Issue command: myApp < data2. This command runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR. 8. Issue command: profmerge -prof_dpi Test2.dpi. At this step, the profmerge tool merges all the .dyn files into one file (Test2.dpi) that represents the total profile information of the application on Test2. 9. Issue command: rm PROF_DIR/*.dyn. Make sure that there are no unrelated .dyn files present. 10. Issue command: myApp < data3. This command runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR. 11. Issue command: profmerge -pro
37. (Flattened intrinsic availability table; every cell value that survives reads N/A and can no longer be aligned with its row.) Intrinsics listed: max_epi16, max_epu8, min_epi16, min_epu8, mullo_epi16, mul_su32, mul_epu32, sad_epu8, sub_epi8, sub_epi16, sub_epi32, sub_si64, sub_epi64, subs_epi8, subs_epi16, subs_epu8, subs_epu16, and_si128, andnot_si128, or_si128, xor_si128, sll_epi16, slli_epi32, sll_epi32, slli_epi64, sll_epi64, srai_epi16
38. -Qwe tag: Change severity of diagnostics L1 through LN to error. Default: OFF. -wnn (Windows: -Qwn tag): Print a maximum of n errors. Default: OFF. -Wp64 (Windows: -Wp64): Print diagnostics for 64-bit porting. Default: OFF. Change severity of diagnostics L1 through LN to remark. Default: OFF. Change severity of diagnostics L1 through LN to warning. Default: OFF. (Columns: Linux; Windows; Description; Linux Default.) -X (Windows: -X): Remove standard directories from include file search path. -x{K|W|N|B|P} (Windows: -Qx{K|W|N|B|P}): Generates specialized code for processor-specific codes K, W, N, B, and P while also generating generic IA-32 code. Default: OFF. • K: Intel Pentium III and compatible Intel processors. • W: Intel Pentium 4 and compatible Intel processors. • N: Intel Pentium 4 and compatible Intel processors. • B: Intel Pentium M and compatible Intel processors. • P: Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). -Zp[1|2|4|8|16]: Packs structures on 1-, 2-, 4-, 8-, or 16-byte boundaries. Compiler Options Quick Reference: Default Compiler Options. Some compiler options are only available on certain systems. In the following table, these options are indicated with labels as follows (Label: Meaning). i32: Option available on IA-32-based systems. i32em: Option available on Intel Extended Memory 64 Technology (Intel EM64T) systems. i64: Option availab
39. Application Development Cycle (figure): Phase I: Translation; Phase II: Linking; Phase III: Execution. Building Applications from the Command Line. Invoking the Compiler: The ways to invoke the Intel C++ Compiler are as follows: • Invoke directly: Running Compiler from the Command Line. • Use system make file: Running from the Command Line with make. Invoking the Compiler from the Command Line: There are two necessary steps to invoke the Intel C++ Compiler from the command line: 1. set the environment; 2. invoke the compiler using icc or icpc. Set the Environment Variables: Before you can operate the compiler, you must set the environment variables to specify locations for the various components. The Intel C++ Compiler installation includes shell scripts that you can use to set environment variables. With the default compiler installation, these scripts are: • /opt/intel_cc_80/bin/iccvars.sh • /opt/intel_cc_80/bin/iccvars.csh. To run an environment script, enter one of the following on the command line: prompt> source /opt/intel_cc_80/bin/iccvars.sh or prompt> source /opt/intel_cc_80/bin/iccvars.csh. If you want the script to run automatically when you start Linux, add the same command to the end of your startup file. Sample .bash_profile entry for iccvars.sh: # set environment vars for Intel C++ compiler source /opt/intel_cc_80/bin/iccvars.sh
40. When using these options, only the preprocessing phase of compilation is activated. -E: When you specify the -E option, the compiler's preprocessor expands your source module and writes the result to stdout. The preprocessed source contains #line directives, which the compiler uses to determine the source file and line number. For example, to preprocess two source files and write them to stdout, enter the following command: prompt> icpc -E prog1.cpp prog2.cpp. -P: When you specify the -P option, the preprocessor expands your source module and directs the output to a .i file instead of stdout. Unlike the -E option, the output from -P does not include #line number directives. By default, the preprocessor creates the name of the output file using the prefix of the source file name with a .i extension. You can change this by using the -ofile option. For example, the following command creates two files named prog1.i and prog2.i, which you can use as input to another compilation: prompt> icpc -P prog1.cpp prog2.cpp. Note: When you use the -P option, any existing files with the same name and extension are overwritten. -EP: Using the -EP option directs the preprocessor to not include #line directives in the output. -EP is equivalent to -E -P. prompt> icpc -EP prog1.cpp prog2.cpp. Preserving Comments in Preprocessed Source Output: Use the -C option to preserve comments in your preprocessed source output
41. _mm_subs_epi16: PSUBSW, Subtraction; _mm_subs_epu8: PSUBUSB, Subtraction; _mm_subs_epu16: PSUBUSW, Subtraction. __m128i _mm_add_epi8(__m128i a, __m128i b): Adds the 16 signed or unsigned 8-bit integers in a to the 16 signed or unsigned 8-bit integers in b. r0 := a0 + b0; r1 := a1 + b1; ... r15 := a15 + b15. __m128i _mm_add_epi16(__m128i a, __m128i b): Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit integers in b. r0 := a0 + b0; r1 := a1 + b1; ... r7 := a7 + b7. __m128i _mm_add_epi32(__m128i a, __m128i b): Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit integers in b. r0 := a0 + b0; r1 := a1 + b1; r2 := a2 + b2; r3 := a3 + b3. __m64 _mm_add_si64(__m64 a, __m64 b): Adds the signed or unsigned 64-bit integer a to the signed or unsigned 64-bit integer b. r := a + b. __m128i _mm_add_epi64(__m128i a, __m128i b): Adds the 2 signed or unsigned 64-bit integers in a to the 2 signed or unsigned 64-bit integers in b. r0 := a0 + b0; r1 := a1 + b1. __m128i _mm_adds_epi8(__m128i a, __m128i b): Adds the 16 signed 8-bit integers in a to the 16 signed 8-bit integers in b using saturating arithmetic. r0 := SignedSaturate(a0 + b0); r1 := SignedSaturate(a1 + b1); ... r15 := SignedSaturate(a15 + b15). __m128i _mm_adds_epi16(__m128i a, __m128i b): Adds the 8 signed 16-bit integers in a to the 8 signed 16-bit integers in b using satu
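Snippet 41 distinguishes plain addition from saturating addition: SignedSaturate clamps each sum to the range of the element type instead of letting it wrap. The scalar sketch below models that per-element behaviour for sixteen signed 8-bit values. It is an illustrative model on plain arrays, not the intrinsic, and the names signed_saturate8 and adds_epi8_model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wider intermediate value to the signed 8-bit range. */
static int8_t signed_saturate8(int value)
{
    if (value > INT8_MAX)
        return INT8_MAX;   /*  127 */
    if (value < INT8_MIN)
        return INT8_MIN;   /* -128 */
    return (int8_t)value;
}

/* Hypothetical scalar model: r[i] = SignedSaturate(a[i] + b[i]), i = 0..15.
   The sum is formed in int, so it cannot overflow before clamping. */
void adds_epi8_model(int8_t r[16], const int8_t a[16], const int8_t b[16])
{
    int i;

    assert(r != NULL && a != NULL && b != NULL);

    for (i = 0; i < 16; i++)
        r[i] = signed_saturate8((int)a[i] + (int)b[i]);
}
```

For example, 100 + 100 yields 127 under this model, whereas a wrapping 8-bit add of the same bit patterns would yield -56.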
42. according to the compiler's default file naming conventions. (Columns: Option; Description; Default.) -parallel: Detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops. Default: OFF. -par_report{0|1|2|3}: Controls the auto-parallelizer's diagnostic levels 0, 1, 2, or 3 as follows: • -par_report0: no diagnostic information is displayed. • -par_report1: indicates loops successfully auto-parallelized (default). • -par_report2: loops successfully and unsuccessfully auto-parallelized. • -par_report3: same as 2, plus additional information about any proven or assumed dependences inhibiting auto-parallelization. Default: OFF. -par_threshold[n]: Sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel, n = 0 to 100. This option is used for loops whose computation work volume cannot be determined at compile time. Default: n = 100. Default: OFF. -pc32 (i32, i32em): Set internal FPU precision to 24-bit significand. Default: OFF. -pc64 (i32, i32em): Set internal FPU precision to 53-bit significand. Default: OFF. -pc80 (i32, i32em): Set internal FPU precision to 64-bit significand. Default: ON. -pch: Automatic processing for precompiled headers. Default: OFF. -pch_dir dirname: Directs the compiler to find OF
43. (bytes) of b are subtracted from the unsigned data elements (bytes) of a, and the results of the subtraction are then each independently shifted to the right by one position. The high-order bits of each element are filled with the borrow bits of the subtraction. __m64 _m64_pavgsub2(__m64 a, __m64 b): The unsigned data elements (double bytes) of b are subtracted from the unsigned data elements (double bytes) of a, and the results of the subtraction are then each independently shifted to the right by one position. The high-order bits of each element are filled with the borrow bits of the subtraction. __m64 _m64_pmpy2l(__m64 a, __m64 b): Two signed 16-bit data elements of a, starting with the most significant data element, are multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results are returned as shown in Figure 9. (Fig. 9) __m64 _m64_pmpy2r(__m64 a, __m64 b): Two signed 16-bit data elements of a, starting with the least significant data element, are multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results are returned as shown in Figure 10. __m64 _m64_pmpyshr2(__m64 a, __m64 b, const int count): The four signed 16-bit data elements of a are multiplied by the corresponding signed 16-bit data elements of b, yielding four 32-bit products. Each product is then shifted to the right count bits, and the least significant 16 bits of each shifte
44. const int pos, const int len): The sign-extended value v (either all 1s or all 0s) is deposited into a 64-bit field of all zeros at an arbitrary bit position, and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len. __int64 _m64_extr(__int64 r, const int pos, const int len): A field is extracted from the 64-bit value r and is returned right-justified and sign-extended. The extracted field begins at position pos and extends len bits to the left. The sign is taken from the most significant bit of the extracted field. __int64 _m64_extru(__int64 r, const int pos, const int len): A field is extracted from the 64-bit value r and is returned right-justified and zero-extended. The extracted field begins at position pos and extends len bits to the left. __int64 _m64_xmal(__int64 a, __int64 b, __int64 c): The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit signed result. The 64-bit value c is zero-extended and added to the product. The least significant 64 bits of the sum are then returned. __int64 _m64_xmalu(__int64 a, __int64 b, __int64 c): The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the product. The least significant 64 bits of the sum are
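The extract operations in snippet 44 can be modelled with ordinary shifts and masks. The sketch below is a portable scalar model of the described behaviour: a field of len bits starting at bit pos and extending toward the most significant bit, returned right-justified and either zero-extended or sign-extended. It is not the Itanium intrinsic; the names extru_model and extr_model and the use of the stdint types are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the zero-extending extract: returns bits
   [pos, pos + len) of r, right-justified, upper bits cleared. */
uint64_t extru_model(uint64_t r, int pos, int len)
{
    uint64_t field;

    assert(pos >= 0 && len >= 1 && pos + len <= 64);

    field = r >> pos;
    if (len < 64)
        field &= (UINT64_C(1) << len) - 1;  /* keep only the low len bits */
    return field;
}

/* Hypothetical model of the sign-extending extract: same field, but the
   sign is taken from the most significant bit of the extracted field. */
int64_t extr_model(uint64_t r, int pos, int len)
{
    uint64_t field = extru_model(r, pos, len);
    uint64_t sign = UINT64_C(1) << (len - 1);

    /* (field XOR sign) - sign propagates the field's top bit upward.
       The final cast assumes two's complement conversion. */
    return (int64_t)((field ^ sign) - sign);
}
```

Extracting the 4-bit field at position 4 from 0xF0 gives 15 with the zero-extending model and -1 with the sign-extending one, since the field's top bit is set.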
45. (default) Issues a "LOOP AUTO-PARALLELIZED" message for parallel loops. -par_report2: indicates successfully auto-parallelized loops as well as unsuccessful loops. -par_report3: same as 2, plus additional information about any proven or assumed dependencies inhibiting auto-parallelization (reasons for not parallelizing). Example of Parallelization Diagnostics Report: This example shows output generated by -par_report3. prompt> icpc -c -parallel -par_report3 prog.cpp. Sample Output: program prog; procedure: prog; serial loop: line 5: not a parallel candidate due to statement at line 6; serial loop: line 9: flow data dependence from line 10 to line 10, due to "a"; 12 Lines Compiled. where the program prog.cpp is as follows. Sample prog.cpp: /* Assumed side effects */ for (i = 1; i < 10000; i++) { a[i] = foo(i); } /* Actual dependence */ for (i = 1; i < 10000; i++) { a[i] = a[i-1] + i; } Troubleshooting Tips: • Use -par_threshold0 to see if the compiler assumed there was not enough computational work. • Use -par_report3 to view diagnostics. • Use -ipo[value] to eliminate assumed side effects due to function calls. Parallelization with OpenMP: The Intel C++ Compiler supports the OpenMP C++ version 2.0 API specification. OpenMP provides symmetric multiprocessing (SMP) with the following major features: • Relieves the user from having to d
46. float x, float y). FABS. Description: The fabs function returns the absolute value of x. Calling interface: double fabs(double x); long double fabsl(long double x); float fabsf(float x). FDIM. Description: The fdim function returns the positive difference value, x - y (for x > y) or +0 (for x <= y). errno: ERANGE, for values too large. Calling interface: double fdim(double x, double y); long double fdiml(long double x, long double y); float fdimf(float x, float y). FINITE. Description: The finite function returns 1 if x is not a NaN or +/- infinity. Otherwise 0 is returned. Calling interface: int finite(double x); int finitel(long double x); int finitef(float x). FMA. Description: The fma functions return (x*y) + z. Calling interface: double fma(double x, double y, double z); long double fmal(long double x, long double y, long double z); float fmaf(float x, float y, float z). FMAX. Description: The fmax function returns the maximum numeric value of its arguments. Calling interface: double fmax(double x, double y); long double fmaxl(long double x, long double y); float fmaxf(float x, float y). FMIN. Description: The fmin function returns the minimum numeric value of its arguments. Calling interface: double fmin(double x, double y); long double fminl(long double x, long double y); float fminf(float x, float y). FPCLASSIFY. Description: The fp
47. linked sequentially. -opt_report: Generates an optimization report directed to stderr, unless -opt_report_file is specified. Default: OFF. -opt_report_filefilename: Specifies the filename for the optimization report. It is not necessary to invoke -opt_report when this option is specified. Default: OFF. -opt_report_levellevel: Specifies the verbosity level of the output. Valid level arguments. If a level is not specified, min is used by default. Default: OFF. -opt_report_phasename: Specifies the compilation name for which reports are generated. The option can be used multiple times in the same compilation to get output from multiple phases. Valid name arguments: • ipo: Interprocedural Optimizer • hlo: High Level Optimizer • ilo: Intermediate Language Scalar Optimizer • ecg: Code Generator • omp: OpenMP • all: All phases. -opt_report_routinesubstring: Specifies a routine substring. Reports from all routines with names that include substring as part of the name are generated. By default, reports for all routines are generated. -opt_report_help: Displays all possible settings for -opt_report_phase. No compilation is performed. -Os: Enable speed optimizations, but disable some optimizations which increase code size for small speed benefit. -p: Same as -qp. -P: Stops the compilation process after C or C++ source files have been preprocessed and writes the results to files named
48. profile information is generated by an instrumented application when it terminates by calling the standard exit() function. The functions described in this section may be necessary in assuring that profile information is generated in the following situations: • when the instrumented application exits using a non-standard exit routine; • when the instrumented application is a non-terminating application where exit() is never called; • when you want control of when the profile information is generated. This section includes descriptions of the functions and environment variable that comprise Profile Information Generation Support. The functions are available by inserting #include <pgouser.h> at the top of any source file where the functions may be used. The compiler sets a define for _PGO_INSTRUMENT when you compile with either -prof_gen or -prof_genx. Dumping Profile Information: void _PGOPTI_Prof_Dump(void). Description: This function dumps the profile information collected by the instrumented application. The profile information is recorded in a .dyn file. Recommended Usage: Insert a single call to this function in the body of the function which terminates your application. Normally, _PGOPTI_Prof_Dump should be called just once. It is also possible to use this function in conjunction with _PGOPTI_Prof_Reset() to generate multiple .dyn files (presumably from multiple sets of input data).
49. radians, for x in the interval [-1, +1]. errno: EDOM, for |x| > 1. Calling interface: double asin(double x); long double asinl(long double x); float asinf(float x). ASIND. Description: The asind function returns the principal value of the inverse sine of x in the range [-90, 90] degrees, for x in the interval [-1, +1]. errno: EDOM, for |x| > 1. Calling interface: double asind(double x); long double asindl(long double x); float asindf(float x). ATAN. Description: The atan function returns the principal value of the inverse tangent of x in the range [-pi/2, +pi/2] radians. Calling interface: double atan(double x); long double atanl(long double x); float atanf(float x). ATAN2. Description: The atan2 function returns the principal value of the inverse tangent of y/x in the range [-pi, +pi] radians. errno: EDOM, for x = 0 and y = 0. Calling interface: double atan2(double y, double x); long double atan2l(long double y, long double x); float atan2f(float y, float x). ATAND. Description: The atand function returns the principal value of the inverse tangent of x in the range [-90, 90] degrees. Calling interface: double atand(double x); long double atandl(long double x); float atandf(float x). ATAN2D. Description: The atan2d function returns the principal value of the inverse tangent of y/x in the range [-180
50. [Figure: Eclipse workbench screenshot showing the C/C++ Projects, Navigator, Tasks, C-Build, Properties, and Console views]
Setting Properties
The Intel C++ Compiler integration with Eclipse/CDT lets you specify compiler, linker, and archiver options. Follow these steps to set options for your project:
1. Select your project in the C/C++ Projects view.
2. From the Eclipse toolbar, select Project > Properties > C/C++ Build.
3. Under Configuration settings, click an option category for C Compiler or Linker. In the example that follows, the options in the Floating Point category are displayed.
[Figure: project Properties dialog, C/C++ Build page, with the C Compiler Floating Point category selected]
4. Check the option(s) you want to add to your project compilations, then open other categories if necessary.
5. Click OK to complete your selections.
To reset properties to their de
51. the objects have been created using -ipo -c, then the objects will not contain a valid object but only the intermediate representation (IR) for that object file. For example:
    prompt>icpc -ipo -c a.cpp b.cpp
will produce a.o and b.o that only contain IR to be used in a link-time compilation. The library manager will not allow these to be inserted in a library. In this case you must use the Intel library tool, xild -ar. This program will invoke the compiler on the IR saved in the object file and generate a valid object that can be inserted in a library.
    prompt>xild -lib cru user.a a.o b.o
See Creating a Multifile IPO Executable Using xild.
Using -ip with -Qoption Specifiers
You can adjust the Intel C++ Compiler's optimization for a particular application by experimenting with memory and interprocedural optimizations. Enter the -Qoption option with the applicable keywords to select particular inline expansions and loop optimizations. The option must be entered with an -ip or -ipo specification, as follows:
    -ip -Qoption,tool,opts
where tool is C++ (c) and opts are -Qoption specifiers (see below). Also refer to Criteria for Inline Function Expansion to see how these specifiers may affect the inlining heuristics of the compiler.
-Qoption Specifiers
If you specify -ip or -ipo without any -Qoption qualification, the compiler does the following:
• Expands functions in line
• Propagates constant arguments
• Passes arguments in re
52. (uses MOVDQU) Stores 128-bit value. Address p need not be 16-byte aligned.
    *p := a
void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p)
(uses MASKMOVDQU) Conditionally store byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. Address p need not be 16-byte aligned.
    if (n0[7]) p[0] := d0
    if (n1[7]) p[1] := d1
    ...
    if (n15[7]) p[15] := d15
void _mm_storel_epi64(__m128i *p, __m128i q)
(uses MOVQ) Stores the lower 64 bits of q at the address p.
    *p[63:0] := q0
Macro Function for Shuffle
The Streaming SIMD Extensions 2 (SSE2) provide a macro function to help create constants that describe shuffle operations. The macro takes two small integers (in the range of 0 to 1) and combines them into a 2-bit immediate value used by the SHUFPD instruction. See the following example.
Shuffle Function Macro
    _MM_SHUFFLE2(x, y) expands to the value of (x<<1) | y
You can view the two integers as selectors: y chooses which element of the first input operand goes into the low half of the result, and x chooses which element of the second input operand goes into the high half.
[Figure: View of Original and Result Words with Shuffle Function Macro, showing m3 = _mm_shuffle_pd(m1, m2, _MM_SHUFFLE2(1,0))]
Cacheability Support Operations for Streaming SIMD Extensions 2
The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the emmintrin
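To make the selector arithmetic concrete, here is a small scalar model of the shuffle macro and of a SHUFPD-style shuffle. It is a sketch in plain C++ rather than real intrinsic code: Vec2d, shuffle_selector, and shuffle_model are hypothetical names, and the model only mirrors the documented element selection.

```cpp
#include <array>

// Two double-precision lanes; index 0 is the low element.
using Vec2d = std::array<double, 2>;

// Same arithmetic as the shuffle macro described above: (x << 1) | y.
constexpr int shuffle_selector(int x, int y) { return (x << 1) | y; }

// Scalar model of a SHUFPD-style shuffle: bit 0 of the selector picks the
// element of a that becomes the low result, and bit 1 picks the element
// of b that becomes the high result.
inline Vec2d shuffle_model(const Vec2d& a, const Vec2d& b, int selector) {
    return { a[selector & 1], b[(selector >> 1) & 1] };
}
```

With a = {1.0, 2.0} and b = {3.0, 4.0}, the selector built from (1, 0) keeps the low element of a and takes the high element of b.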
53. you must use an explicit typecast.
Multiplication with Assignment
Return Value | Left Side | Mul | Right Side
I[x]16vec8 | I[x]16vec8 | *= | I[s|u]16vec8 A;
I[x]16vec4 | I[x]16vec4 | *= | I[s|u]16vec4 A;
Shift Operators
The right shift argument can be any integer or Ivec value, and is implicitly converted to a M64 data type. The first or left operand of a << can be of any type except I[s|u]8vec[8|16].
Example Syntax Usage for Shift Operators
Automatic size and sign conversion:
    Is16vec4 A, C;
    Iu32vec2 B;
    C = A;
A&B returns I16vec4, which must be cast to Iu16vec4 to ensure logical shift, not arithmetic shift:
    Is16vec4 A, C;
    Iu16vec4 B, R;
    R = (Iu16vec4)(A & B) >> C;
A&B returns I16vec4, which must be cast to Is16vec4 to ensure arithmetic shift, not logical shift:
    R = (Is16vec4)(A & B) >> C;
Shift Operators with Corresponding Intrinsics
Operation | Symbols | Syntax Usage | Intrinsic
Shift Left
Shift Right: _mm_srl_si64, _mm_srli_si64, _mm_srl_pi32, _mm_srli_pi32, _mm_srl_pi16, _mm_srli_pi16, _mm_sra_pi32, _mm_srai_pi32, _mm_sra_pi16, _mm_srai_pi16
Right shift operations with signed data types use arithmetic shifts. All unsigned and intermediate classes correspond to logical shifts. The following table shows how the return type is determined by the first argument type.
54. 0 and y < 0
errno: EDOM, for x < 0 and y is a non-integer
errno: ERANGE, for overflow and underflow conditions
Calling interface:
    double pow(double x, double y);
    long double powl(long double x, long double y);
    float powf(float x, float y);
SCALB
Description: The scalb function returns x*2^y, where y is a floating-point value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
    double scalb(double x, double y);
    long double scalbl(long double x, long double y);
    float scalbf(float x, float y);
SCALBN
Description: The scalbn function returns x*2^n, where n is an integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
    double scalbn(double x, int n);
    long double scalbnl(long double x, int n);
    float scalbnf(float x, int n);
SCALBLN
Description: The scalbln function returns x*2^n, where n is a long integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
    double scalbln(double x, long int n);
    long double scalblnl(long double x, long int n);
    float scalblnf(float x, long int n);
SQRT
Description: The sqrt function returns the correctly rounded square root.
errno: EDOM, for x < 0
Calling interface:
    double sqrt(double x);
    long double sqrtl(long double x);
    float sqrtf(float x);
Special Functions
The Intel Math library supports the following special functions:
ANNUITY
Description: The annu
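The scaling functions multiply x by a power of two. The short sketch below shows that relationship with the standard scalbn from <cmath>, next to the same computation written with pow. The wrapper names are hypothetical and exist only for the comparison.

```cpp
#include <cmath>

// Scale x by 2^n with the standard library routine.
inline double scale_by_power_of_two(double x, int n) {
    return std::scalbn(x, n);
}

// The same scaling written out with pow, for comparison only.
inline double scale_with_pow(double x, int n) {
    return x * std::pow(2.0, n);
}
```

scalbn adjusts the exponent directly, so it is usually the better choice when the factor is known to be a power of two.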
55. 180 degrees.
errno: EDOM, for x = 0 and y = 0
Calling interface:
    double atan2d(double x, double y);
    long double atan2dl(long double x, long double y);
    float atan2df(float x, float y);
COS
Description: The cos function returns the cosine of x measured in radians. This function may be inlined with the Itanium compiler.
Calling interface:
    double cos(double x);
    long double cosl(long double x);
    float cosf(float x);
COSD
Description: The cosd function returns the cosine of x measured in degrees.
Calling interface:
    double cosd(double x);
    long double cosdl(long double x);
    float cosdf(float x);
COT
Description: The cot function returns the cotangent of x measured in radians.
errno: ERANGE, for overflow conditions at x = 0
Calling interface:
    double cot(double x);
    long double cotl(long double x);
    float cotf(float x);
COTD
Description: The cotd function returns the cotangent of x measured in degrees.
errno: ERANGE, for overflow conditions at x = 0
Calling interface:
    double cotd(double x);
    long double cotdl(long double x);
    float cotdf(float x);
SIN
Description: The sin function returns the sine of x measured in radians. This function may be inlined with the Itanium compiler.
Calling interface:
    double sin(double x);
    long double sinl(long double x);
    float sinf(float x);
SINCOS
Description: The sin
56. 2 intrinsics for Class Libraries; Header file for SSE2 intrinsics (used by emmintrin.h); Principal header file for SSE2 intrinsics; IEEE 754 version of standard float.h; SSE intrinsics for Class Libraries; Standard header file; MMX(TM) instructions intrinsics for Class Libraries; Standard header file
File (column, in extracted order): mathf.h, mathimf.h, mmintrin.h, omp.h, omp_lib.h, pgouser.h, pmmintrin.h, proto.h, sse2mmx.h, stdarg.h, stdbool.h, stddef.h, syslimits.h, varargs.h, xarg.h, xmm_func.h, xmm_utils.h, xmmintrin.h
Description (column, in extracted order): Principal header file for legacy Intel Math Library; Principal header file for current Intel Math Library; Intrinsics for MMX instructions; Principal header file OpenMP; Header file for OpenMP; For use in the instrumentation compilation phase of profile-guided optimizations; Principal header file SSE3 intrinsics; Principal header file for Streaming SIMD Extensions 2 intrinsics; Replacement header for standard stdarg.h; Defines _Bool keyword; Standard header file; Replacement header for standard varargs.h; Header file used by stdarg.h and varargs.h; Header file for Streaming SIMD Extensions; Utilities for Streaming SIMD Extensions; Principal header file for Streaming SIMD Extensions intrinsics
lib Files
Library (column): libguide.a, libguide.so, libompstub.a, libsvml.a, libirc.a, libimf.a, libimf.so
Description (column): For OpenMP implementati
57. 4 matrix of single-precision, floating-point values.
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3)
The arguments row0, row1, row2, and row3 are __m128 values whose elements form the corresponding rows of a 4-by-4 matrix. The matrix transposition is returned in arguments row0, row1, row2, and row3, where row0 now holds column 0 of the original matrix, row1 now holds column 1 of the original matrix, and so on.
The transposition function of this macro is illustrated in the "Matrix Transposition Using the _MM_TRANSPOSE4_PS" figure.
[Figure: Matrix Transposition Using _MM_TRANSPOSE4_PS Macro]
Streaming SIMD Extensions 2
This section describes the C++ language-level features supporting the Intel Pentium 4 processor Streaming SIMD Extensions 2 (SSE2) in the Intel C++ Compiler, which are divided into two categories:
• Floating-Point Intrinsics: describes the arithmetic, logical, compare, conversion, memory, and initialization intrinsics for the double-precision floating-point data type (__m128d).
• Integer Intrinsics: describes the arithmetic, logical, compare, conversion, memory, and initialization intrinsics for the extended-precision integer data ty
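The effect of the transpose macro can be modeled without intrinsics. The sketch below is a plain C++ stand-in, not the macro itself: Row4, Matrix4, and transpose4 are hypothetical names, and each Row4 plays the role of one __m128 argument.

```cpp
#include <array>
#include <utility>

using Row4 = std::array<float, 4>;      // stands in for one __m128 row
using Matrix4 = std::array<Row4, 4>;    // row0..row3

// In-place 4-by-4 transpose: afterwards row r holds what used to be
// column r, mirroring the result described for the macro.
inline void transpose4(Matrix4& m) {
    for (int r = 0; r < 4; ++r) {
        for (int c = r + 1; c < 4; ++c) {
            std::swap(m[r][c], m[c][r]);
        }
    }
}
```

Only the elements above the diagonal are visited, so each pair is swapped exactly once.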
58. 8-bit integer values to b.
    r0 := b
    r1 := b
    ...
    r7 := b
__m64 _mm_setr_pi32(int i1, int i0)
(composite) Sets the 2 signed 32-bit integer values in reverse order.
    r0 := i0
    r1 := i1
__m64 _mm_setr_pi16(short s3, short s2, short s1, short s0)
(composite) Sets the 4 signed 16-bit integer values in reverse order.
    r0 := w0
    r1 := w1
    r2 := w2
    r3 := w3
__m64 _mm_setr_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)
(composite) Sets the 8 signed 8-bit integer values in reverse order.
    r0 := b0
    r1 := b1
    ...
    r7 := b7
MMX(TM) Technology Intrinsics on Itanium Architecture
MMX(TM) technology intrinsics provide access to the MMX technology instruction set on Itanium-based systems. To provide source compatibility with the IA-32 architecture, these intrinsics are equivalent both in name and functionality to the set of IA-32-based MMX intrinsics.
Some intrinsics have more than one name. When one intrinsic has two names, both names generate the same instructions, but the first is preferred as it conforms to a newer naming standard.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Data Types
The C data type __m64 is used when using MMX technology intrinsics. It can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.
The __m64 data type is not a basic ANSI C data type. Therefore, observe the following usage restrictions:
• Us
59. [Table residue: cell values (runs of "N/A" and ">" markers) whose row and column alignment was lost in extraction]
Intrinsic names that survive from the table: comile_sd, comigt_sd, comige_sd, comineq_sd, ucomieq_sd, ucomilt_sd, ucomile_sd, ucomigt_sd, ucomige_sd, ucomineq_sd, cvtepi32_pd, cvtpd_epi32, cvttpd_epi32, cvtepi32_ps, cvtps_epi32, cvttps_epi32, cvtpd_ps, cvtps_pd, cvtsd_ss, cvtss_sd, cvtsd_si32, cvttsd_si32, cvtsi32_sd, cvtpd_pi32, cvttpd_pi32, cvtpi32_pd, unpackhi_pd, unpacklo_pd, shuffle_pd, load_pd
60. Declarations.html#Mixed%20Declarations
Declaring Attributes of Functions | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Function-Attributes.html#Function%20Attributes
Attribute Syntax | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Attribute-Syntax.html#Attribute%20Syntax
gcc Language Extension | Intel Support | GNU Description and Examples
Prototypes and Old-Style Function Definitions | | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Function-Prototypes.html#Function%20Prototypes
C++ Style Comments | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/C---Comments.html#C++%20Comments
Dollar Signs in Identifier Names | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Dollar-Signs.html#Dollar%20Signs
ESC Character in Constants | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Character-Escapes.html#Character%20Escapes
Specifying Attributes of Variables | | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Variable-Attributes.html#Variable%20Attributes
Specifying Attributes of Types | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Type-Attributes.html#Type%20Attributes
Inquiring on Alignment of Types or Variables | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Alignment.html#Alignment
Inline Function is As Fast As a Macro | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Inline.html#Inline
Assembler Instructions with C | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc
61. Intel Debugger (IDB) Manual (idb_debugger_manual.htm) for complete information on using the Intel Debugger. You can also use the GNU Debugger (gdb) to debug programs compiled with the Intel C++ Compiler.
Preparing for Debugging
Use the -g option to direct the compiler to generate code to support symbolic debugging. For example:
    prompt>icpc -g prog.cpp
The compiler does not support the generation of debugging information in assembly files. If you specify the -g option, the resulting object file will contain debugging information, but the assembly file will not.
Note: The -g option changes the default optimization from -O2 to -O0.
Parsing for Syntax Only
Use the -syntax option to stop processing source files after they have been parsed for C++ language errors. This option provides a method to quickly check whether sources are syntactically and semantically correct. The compiler creates no output file. In the following example, the compiler checks prog.cpp and displays diagnostic information to the standard error output:
    prompt>icpc -syntax prog.cpp
Optimizations and Debugging
This topic describes the command-line options that you can use to debug your compilation and to display and check compilation errors. The options that enable you to get debug information while optimizing are as follows:
Option | Description
-O0 | Disables optimizations. Enables the -fp option
62. Itanium-based systems: Disables software pipelining and global code scheduling.
-O2, -O | ON by default. Optimizes for code speed. This is the generally recommended optimization level. Itanium-based systems: Enables software pipelining.
-O3 | Enables -O2 optimizations and more aggressive optimizations such as loop and memory access transformations. The -O3 optimizations may slow down code in some cases compared to -O2 optimizations. Recommended for applications that have loops that heavily use floating-point calculations and process large data sets. IA-32 systems: In conjunction with -ax{K|W|N|B|P} and -x{K|W|N|B|P} options, this option causes the compiler to perform more aggressive data dependency analysis than for -O2. This may result in longer compilation times.
Option | Effect
-fast | The -fast option enhances execution speed across the entire program by including the following options that can improve run-time performance:
    -O3 (maximum speed and high-level optimizations)
    -ipo (enables interprocedural optimizations across files)
    -static (prevents linking with shared libraries)
    -xP (specific optimization for Intel Pentium 4 processor with Streaming SIMD Extensions 3)
The -fast option does not include -xP when compiling on Itanium-based systems. To override one of the options set by -fast, specify that option after the -fast option on the command line. To target -fast optimizations fo
63. NOT | ANDNPS
_mm_or_ps | Bitwise OR | ORPS
_mm_xor_ps | Bitwise Exclusive OR | XORPS
__m128 _mm_and_ps(__m128 a, __m128 b)
Computes the bitwise AND of the four SP FP values of a and b.
    r0 := a0 & b0
    r1 := a1 & b1
    r2 := a2 & b2
    r3 := a3 & b3
__m128 _mm_andnot_ps(__m128 a, __m128 b)
Computes the bitwise AND-NOT of the four SP FP values of a and b.
    r0 := ~a0 & b0
    r1 := ~a1 & b1
    r2 := ~a2 & b2
    r3 := ~a3 & b3
__m128 _mm_or_ps(__m128 a, __m128 b)
Computes the bitwise OR of the four SP FP values of a and b.
    r0 := a0 | b0
    r1 := a1 | b1
    r2 := a2 | b2
    r3 := a3 | b3
__m128 _mm_xor_ps(__m128 a, __m128 b)
Computes bitwise XOR (exclusive-or) of the four SP FP values of a and b.
    r0 := a0 ^ b0
    r1 := a1 ^ b1
    r2 := a2 ^ b2
    r3 := a3 ^ b3
Comparisons for Streaming SIMD Extensions
Each comparison intrinsic performs a comparison of a and b. For the packed form, the four SP FP values of a and b are compared, and a 128-bit mask is returned. For the scalar form, the lower SP FP values of a and b are compared, and a 32-bit mask is returned; the upper three SP FP values are passed through from a. The mask is set to 0xffffffff for each element where the comparison is true and 0x0 where the comparison is false.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.
Intrinsic Name | Comparison | Corresponding
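The all-ones or all-zeros mask produced by a comparison is what makes the logical operations useful for branch-free selection. The following per-element model is a sketch in portable C++, assuming a 32-bit IEEE 754 float; cmplt_mask, select_by_mask, and min_via_mask are hypothetical names, not intrinsics.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Per-element model of a "less than" comparison: all ones when the
// comparison is true, all zeros when it is false.
inline std::uint32_t cmplt_mask(float a, float b) {
    return (a < b) ? 0xffffffffu : 0x0u;
}

// Branch-free select built from AND, AND-NOT and OR on the bit patterns:
// yields x where the mask is set and y where it is clear.
inline float select_by_mask(std::uint32_t mask, float x, float y) {
    std::uint32_t xb, yb;
    std::memcpy(&xb, &x, sizeof xb);
    std::memcpy(&yb, &y, sizeof yb);
    const std::uint32_t rb = (mask & xb) | (~mask & yb);
    float r;
    std::memcpy(&r, &rb, sizeof r);
    return r;
}

// Minimum of two values expressed as compare-then-select.
inline float min_via_mask(float a, float b) {
    return select_by_mask(cmplt_mask(a, b), a, b);
}
```

The packed intrinsics apply the same pattern to four elements at once, which is why the mask uses the full element width rather than a single bit.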
64. Pentium 4 processors.
-mcpu=cpu | Optimize for a specific cpu. For IA-32, cpu values are: pentium (Optimize for Pentium processor); pentiumpro (Optimize for Pentium Pro, Pentium II and Pentium III processors); pentium4 (Optimize for Pentium 4 processor, Default). The only option available on Intel EM64T systems is -mcpu=pentium4. For Itanium-based systems, cpu values are: itanium (Optimize for Itanium processor); itanium2 (Optimize for Itanium 2 processor, Default). | ON: pentium4 on IA-32, itanium2 on Itanium-based systems
-MD | Preprocess and compile. Generate output file (.d extension) containing dependency information. | OFF
-MFfile | Generate makefile dependency information in file. Must specify -M or -MM. | OFF
-MG | Similar to -M, but treats missing header files as generated files. | OFF
-MM | Similar to -M, but does not include system header files. | OFF
Option | Description | Default
-MMD | Similar to -MD, but does not include system header files. | OFF
-mp | Favors conformance to the ANSI C and IEEE 754 standards for floating-point arithmetic. | OFF
-mp1 | Improve floating-point precision (speed impact is less than -mp). | OFF
-MP | Add a phony target for each dependency. | OFF
-mrelax | Pass -relax to the linker. | ON (i64 only)
-mno-relax | Do not pass -relax to the linker. | OFF (i64 only)
-MQtarge
65. See Compilation with Real Object Files for more information.
When you use the -ipo option, the compiler attempts to detect a whole program automatically. If a whole program is detected, the interprocedural constant propagation, stack frame alignment, data layout and padding of common blocks perform more efficiently, while more dead functions get deleted. This option is safe.
Command Line for Creating an IPO Executable
The command-line options to enable IPO for compilations targeted for both the IA-32 and Itanium architectures are identical. To produce mock object files containing intermediate representation (IR), compile your source files with -ipo as follows:
    prompt>icpc -ipo -c a.cpp b.cpp c.cpp
This produces a.o, b.o, and c.o object files. These files contain Intel compiler IR corresponding to the compiled source files a.cpp, b.cpp, and c.cpp. Using -c to stop compilation after generating .o files is required. You can now optimize interprocedurally by adding -ipo to your link command line. The following example produces an executable named app:
    prompt>icpc -oapp -ipo a.o b.o c.o
This command invokes the compiler on the objects containing IR and creates a new list of object(s) to be linked. The command then calls GCC ld to link the specified object files and produce app, as specified by the -o option. IPO is applied only to the object files that contain IR; otherwise the object
66. [Table: intrinsic prototypes and descriptions. The Intrinsic column was scrambled in extraction; the names that survive are __ptrd, __invalat, __invala, __invala_gr, __stf8, __thash, __itri, __ptcga, and __tpa, with parameters pagesz, whichGeneralReg, whichFloatReg, whichTransReg, dst, pa, and va.]
Description (column, in extracted order):
Map the stf8 instruction.
Map the stf.spill instruction.
Executes a memory fence instruction. Maps to the mf instruction.
Executes a memory fence, acceptance form instruction. Maps to the mf.a instruction.
Enables memory synchronization. Maps to the sync.i instruction.
Generates a translation hash entry address. Maps to the thash r = r instruction.
Generates a translation hash entry tag. Maps to the ttag r = r instruction.
Insert an entry into the data translation cache. (Map itc.d instruction.)
Insert an entry into the instruction translation cache. (Map itc.i.)
Map the itr.d instruction.
Map the itr.i instruction.
Map the ptc.e instruction.
Purges the local translation cache. Maps to the ptc.l r, r instruction.
Purges the global translation cache. Maps to the ptc.g r, r instruction.
Purges the global translation cache and ALAT. M
67. Shift | Automatic | Explicit | Casting required to ensure arithmetic shift.
Compare | Automatic | Explicit | Explicit casting is required for signed classes for the less-than or greater-than operations.
Conditional Select | Automatic | Explicit | Explicit casting is required for signed classes for less-than or greater-than operations.
Data Declaration and Initialization
The following table shows literal examples of constructor declarations and data type initialization for all class sizes. All values are initialized with the most significant element on the left and the least significant to the right.
Declaration and Initialization Data Types for Ivec Classes
Operation | Class | Syntax
Declaration | M128 | I128vec1 A; Iu8vec16 A;
Declaration | M64 | I64vec1 A; Iu8vec8 A;
__m128 Initialization | M128 | I128vec1 A(__m128 m); Iu16vec8(__m128 m);
__m64 Initialization | M64 | I64vec1 A(__m64 m); Iu8vec8 A(__m64 m);
__int64 Initialization | M64 | I64vec1 A = __int64 m; Iu8vec8 A = __int64 m;
int i Initialization | M64 | I64vec1 A = int i; Iu8vec8 A = int i;
Operation | Class | Syntax
int initialization | I32vec2 | I32vec2 A(int A1, int A0); Is32vec2 A(signed int A1, signed int A0); Iu32vec2 A(unsigned int A1, unsigned int A0);
int Initialization | I32vec4 | I32vec4 A(short A3, short A2, short A1, short A0); Is32vec4 A(signed short A3, ..., signed short A0); Iu32vec4 A(unsigned short
68. a hello.c source file to the helloworld project:
1. From the Eclipse File menu, select New > File. Enter hello.c in the File name text box of the New File dialog. Click Finish to add the file to the helloworld project.
[Figure: New File dialog with parent folder helloworld and file name hello.c]
Your Eclipse Preference settings in Window > Preferences > Workbench let you specify Perform build automatically on resource modification. If this preference is checked, Eclipse/CDT will attempt a build when hello.c is created. Since hello.c does not yet include code, errors are indicated in the Tasks view and C-Build view near the bottom of the screen. This is expected behavior, not a true error.
2. Select Window > Show View > C/C++ Projects to view the project files.
[Figure: Eclipse workbench showing the helloworld project files]
3. In the Editor view, add your code for hello.c. If you close hello.c in the Editor view, you can open it by double-clicking on hello.c in the Navigator view.
[Figure: Eclipse workbench with hello.c open in the Editor view]
69. a0 b0 r1 a1 b1
__m128d _mm_sub_sd(__m128d a, __m128d b)
Subtracts the lower DP FP value of b from a. The upper DP FP value is passed through from a.
    r0 := a0 - b0
    r1 := a1
__m128d _mm_sub_pd(__m128d a, __m128d b)
Subtracts the two DP FP values of b from a.
    r0 := a0 - b0
    r1 := a1 - b1
__m128d _mm_mul_sd(__m128d a, __m128d b)
Multiplies the lower DP FP values of a and b. The upper DP FP value is passed through from a.
    r0 := a0 * b0
    r1 := a1
__m128d _mm_mul_pd(__m128d a, __m128d b)
Multiplies the two DP FP values of a and b.
    r0 := a0 * b0
    r1 := a1 * b1
__m128d _mm_div_sd(__m128d a, __m128d b)
Divides the lower DP FP values of a and b. The upper DP FP value is passed through from a.
    r0 := a0 / b0
    r1 := a1
__m128d _mm_div_pd(__m128d a, __m128d b)
Divides the two DP FP values of a and b.
    r0 := a0 / b0
    r1 := a1 / b1
__m128d _mm_sqrt_sd(__m128d a, __m128d b)
Computes the square root of the lower DP FP value of b. The upper DP FP value is passed through from a.
    r0 := sqrt(b0)
    r1 := a1
__m128d _mm_sqrt_pd(__m128d a)
Computes the square roots of the two DP FP values of a.
    r0 := sqrt(a0)
    r1 := sqrt(a1)
__m128d _mm_min_sd(__m128d a, __m128d b)
Computes the minimum of the lower DP FP values of a and b. The upper DP FP value is passed through from a.
    r0 := min(a0, b0)
    r1 := a1
__m128d _mm_min_pd(__m128d a, __m128d b)
Computes the minima of the
70. across four words.
    p[0] := a0
    p[1] := a0
    p[2] := a0
    p[3] := a0
void _mm_store_ps(float *p, __m128 a)
Stores four SP FP values. The address must be 16-byte-aligned.
    p[0] := a0
    p[1] := a1
    p[2] := a2
    p[3] := a3
void _mm_storeu_ps(float *p, __m128 a)
Stores four SP FP values. The address need not be 16-byte-aligned.
    p[0] := a0
    p[1] := a1
    p[2] := a2
    p[3] := a3
void _mm_storer_ps(float *p, __m128 a)
Stores four SP FP values in reverse order. The address must be 16-byte-aligned.
    p[0] := a3
    p[1] := a2
    p[2] := a1
    p[3] := a0
__m128 _mm_move_ss(__m128 a, __m128 b)
Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from a.
    r0 := b0
    r1 := a1
    r2 := a2
    r3 := a3
Cacheability Support Using Streaming SIMD Extensions
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.
void _mm_pause(void)
The execution of the next instruction is delayed an implementation-specific amount of time. The instruction does not modify the architectural state. This intrinsic provides especially significant performance gain.
PAUSE Intrinsic
The PAUSE intrinsic is used in spin-wait loops with the processors implementing dynamic execution (especially out-of-order execution). In the spin-wait loop, PAUSE improves the speed at which the code detects the release of the lock. For dynamic scheduling, the PAUSE in
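A spin-wait loop of the kind described above might look like the following sketch. The SpinLock class and the SPIN_HINT macro are hypothetical names introduced for illustration. The guard assumes that _mm_pause is available from xmmintrin.h on SSE-capable x86 targets and falls back to a no-op elsewhere.

```cpp
#include <atomic>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>            // _mm_pause
#define SPIN_HINT() _mm_pause()
#else
#define SPIN_HINT() ((void)0)     // no-op on targets without PAUSE
#endif

// Hypothetical minimal spin lock, shown only to place the PAUSE hint.
class SpinLock {
public:
    void lock() {
        // Spin until the flag was previously clear.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            SPIN_HINT();   // tell the processor this is a spin-wait loop
        }
    }
    // Single attempt; returns true when the lock was acquired.
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

The hint sits inside the retry loop only, so an uncontended lock() pays nothing for it.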
71. alignment cannot be determined.
Loop Unaligned Due to Unknown Variable Value at Compile Time
    void f(int lb)
    {
        int i;
        float z2[N], a2[N], y2[N], x2[N];
        for (i = lb; i < N; i++)
        {
            a2[i] = a2[i] * x2[i] + y2[i];
        }
    }
If you know that lb is a multiple of 4, you can align the loop with #pragma vector aligned as shown in the example that follows:
Alignment Due to Assertion of Variable as Multiple of 4
    void f(int lb)
    {
        int i;
        float z2[N], a2[N], y2[N], x2[N];
        assert(lb % 4 == 0);
        #pragma vector aligned
        for (i = lb; i < N; i++)
        {
            a2[i] = a2[i] * x2[i] + y2[i];
        }
    }
Loop Interchange and Subscripts: Matrix Multiply
Matrix multiplication is commonly written as shown in the following example:
Typical Matrix Multiplication
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            for (k = 0; k < N; k++)
            {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
The use of b[k][j] is not a stride-1 reference and therefore will not normally be vectorizable. If the loops are interchanged, however, all the references will become stride-1 as shown in the "Matrix Multiplication With Stride-1" example.
Caution: Interchanging is not always possible because of dependencies, which can lead to different results.
Matrix Multiplication With Stride-1
    for (i = 0; i < N; i++)
    {
        for (k = 0; k < N; k++)
        {
            for (j = 0; j < N; j++)
            {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
Auto-parallelization
The auto-parallelization feature of
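The two loop orders above compute the same product. The difference is only in how the innermost loop walks memory. A self-contained sketch that lets the two be compared directly is given below. The size kN and the function names are illustrative choices, not part of the guide.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kN = 4;   // illustrative size
using Matrix = std::array<std::array<double, kN>, kN>;

// i-j-k order: the innermost loop walks b down a column (stride kN).
inline void multiply_ijk(const Matrix& a, const Matrix& b, Matrix& c) {
    for (std::size_t i = 0; i < kN; ++i)
        for (std::size_t j = 0; j < kN; ++j)
            for (std::size_t k = 0; k < kN; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

// i-k-j order: the innermost loop walks c and b along a row (stride 1),
// and a[i][k] is invariant in the inner loop.
inline void multiply_ikj(const Matrix& a, const Matrix& b, Matrix& c) {
    for (std::size_t i = 0; i < kN; ++i)
        for (std::size_t k = 0; k < kN; ++k)
            for (std::size_t j = 0; j < kN; ++j)
                c[i][j] += a[i][k] * b[k][j];
}
```

Both versions accumulate each c[i][j] over k in ascending order, so here the interchange does not change the result. That is the property the caution about dependencies asks you to check in general.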
72. an input for the test prioritization tool execution command. Each line of the DPI list file should include one, and only one, .dpi file name. The name can optionally be followed by the duration of the execution time for a corresponding test in the dd:hh:mm:ss format. For example:
    Test1.dpi 00:00:60:35
informs that Test1 lasted 0 days, 0 hours, 60 minutes and 35 seconds. The execution time is optional. However, if it is not provided, then the tool will not prioritize the test for minimizing execution time. It will prioritize to minimize the number of tests only.
Usage Model
The chart that follows presents the Test-prioritization Tool usage model.
[Figure: usage model chart. Step 1: Compile with -prof_genx to produce instrumented executables; keep the static profile information (.spi) for coverage analysis. Steps 2.1 to 2.n: Run the instrumented executables on Test_1 through Test_n and merge the dynamic profile information (.dyn files) into Test_1.dpi through Test_n.dpi. Step 3: Run Test Prioritizer.]
Here are the steps for a simple example (myApp.c) for IA-32 systems:
1. Set PROF_DIR=/myApp/prof_dir
2. Issue command
    prompt>icpc -prof_genx myApp.c
This command compiles the program and generates an instrumented binary as well as the corresponding static profile information pgopti.spi.
3
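For scripts that generate or check a DPI list file, the optional duration field can be converted to seconds as sketched below. This is an illustration only: the function name is hypothetical, and the colon separators are an assumption about the dd hh mm ss layout, since the tool's exact parsing rules are not given here.

```cpp
#include <cstdio>
#include <string>

// Converts a "dd:hh:mm:ss" duration to seconds. Returns -1 when the text
// does not consist of exactly four colon-separated non-negative integers.
// The colon separator is an assumption made for this sketch.
inline long duration_to_seconds(const std::string& text) {
    int d = 0, h = 0, m = 0, s = 0;
    char extra = 0;
    // The trailing %c only matches if unexpected characters follow.
    const int fields = std::sscanf(text.c_str(), "%d:%d:%d:%d%c",
                                   &d, &h, &m, &s, &extra);
    if (fields != 4 || d < 0 || h < 0 || m < 0 || s < 0) {
        return -1;
    }
    return ((d * 24L + h) * 60L + m) * 60L + s;
}
```

The fields are not range-checked beyond being non-negative, because the guide's own example uses 60 in the minutes position.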
73. and saturation.
Do not intermix the M64 and M128 data types. You will get unexpected behavior if you do.
The signedness is indicated by the s and u in the class names:
    Is64vec2 Iu64vec2
    Is32vec4 Iu32vec4
    Is16vec8 Iu16vec8
    Is8vec16 Iu8vec16
    Is32vec2 Iu32vec2
    Is16vec4 Iu16vec4
    Is8vec8 Iu8vec8
Terms, Conventions, and Syntax
The following are special terms and syntax used in this chapter to describe functionality of the classes with respect to their associated operations.
Ivec Class Syntax Conventions
The name of each class denotes the data type, signedness, bit size, and number of elements, using the following generic format:
    <type><signedness><bits>vec<elements>
    { F | I } { s | u } { 64 | 32 | 16 | 8 } vec { 8 | 4 | 2 | 1 }
where
type indicates floating point (F) or integer (I)
signedness indicates signed (s) or unsigned (u). For the Ivec class, leaving this field blank indicates an intermediate class. There are no unsigned Fvec classes, therefore for the Fvec classes, this field is blank.
bits specifies the number of bits per element
elements specifies the number of elements
Special Terms and Conventions
The following terms are used to define the functionality and characteristics of the classes and operations defined in this manual.
• Nearest Common Ancestor: This is the intermediate or parent class of two classes of the same size. For example, the n
74. are commonly used together into a struct, and forcing the struct to be allocated at the beginning of a cache line, you can effectively guarantee that each object is loaded into the cache as soon as any one is accessed, resulting in a significant performance benefit.
The syntax of this extended attribute is as follows:
    align(n)
where n is an integral power of 2, less than or equal to 32. The value specified is the requested alignment.
Caution: In this release, __declspec(align(8)) does not function correctly. Use __declspec(align(16)) instead.
Note: If a value is specified that is less than the alignment of the affected data type, it has no effect. In other words, data is aligned to the maximum of its own alignment or the alignment specified with __declspec(align).
You can request alignments for individual variables, whether of static or automatic storage duration. (Global and static variables have static storage duration; local variables have automatic storage duration by default.) You cannot adjust the alignment of a parameter, nor a field of a struct or class. You can, however, increase the alignment of a struct (or union or class), in which case every object of that type is affected.
As an example, suppose that a function uses local variables i and j as subscripts into a 2-dimensional array. They might be declared as follows:
    int i, j;
These variables are commonly used together. But they can fall in differ
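The idea of grouping two subscripts into one aligned struct can be illustrated with the standard C++11 alignas specifier. It is used here as a portable stand-in for the __declspec(align(16)) attribute, not as the guide's own syntax. The IndexPair type and the is_aligned helper are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Standard C++11 alignas, used here as a portable stand-in for the
// __declspec(align(16)) attribute described above.
struct alignas(16) IndexPair {
    int i;
    int j;
};

// True when the object starts at an address that is a multiple of boundary.
inline bool is_aligned(const void* p, std::size_t boundary) {
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}
```

Because the alignment is attached to the type, every IndexPair object is affected, whether it has static or automatic storage duration. That matches the behavior described for raising the alignment of a struct.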
75. assume that the program adheres to the rules defined in the ISO C Standard (option -ansi_alias). If your program adheres to these rules, then this option will allow the compiler to optimize more aggressively. If it doesn't adhere to these rules, then it can cause the compiler to generate incorrect code. Option -auto_ilp32 (default: OFF): Specifies that the application cannot exceed a 32-bit address space, which allows the compiler to use 32-bit pointers whenever possible. To use this option, you must also specify -ipo[value]. Using the -auto_ilp32 option on programs that can exceed 32-bit address space (2**32) may cause unpredictable results during program execution. This option has no effect on Intel EM64T systems unless the -axP or -xP option is also used. Next option (default: OFF): Generates specialized code for processor-specific codes K, W, N, B, and P while also generating generic IA-32 code. • K: Intel Pentium III and compatible Intel processors • W: Intel Pentium 4 and compatible Intel processors • N: Intel Pentium 4 and compatible Intel processors • B: Intel Pentium M and compatible Intel processors • P: Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). Only the -axW and -axP options are available on Intel EM64T. -C: Places comments in preprocessed source output. Next option: Stops the compilation process after an object file has been generated. The compiler gene
76. b4) + (a5 * b5) r3 := (a6 * b6) + (a7 * b7). __m128i _mm_max_epi16(__m128i a, __m128i b): Computes the pairwise maxima of the 8 signed 16-bit integers from a and the 8 signed 16-bit integers from b. r0 := max(a0, b0); r1 := max(a1, b1); ...; r7 := max(a7, b7). __m128i _mm_max_epu8(__m128i a, __m128i b): Computes the pairwise maxima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b. r0 := max(a0, b0); r1 := max(a1, b1); ...; r15 := max(a15, b15). __m128i _mm_min_epi16(__m128i a, __m128i b): Computes the pairwise minima of the 8 signed 16-bit integers from a and the 8 signed 16-bit integers from b. r0 := min(a0, b0); r1 := min(a1, b1); ...; r7 := min(a7, b7). __m128i _mm_min_epu8(__m128i a, __m128i b): Computes the pairwise minima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b. r0 := min(a0, b0); r1 := min(a1, b1); ...; r15 := min(a15, b15). __m128i _mm_mulhi_epi16(__m128i a, __m128i b): Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Packs the upper 16 bits of the 8 signed 32-bit results. r0 := (a0 * b0)[31:16]; r1 := (a1 * b1)[31:16]; ...; r7 := (a7 * b7)[31:16]. __m128i _mm_mulhi_epu16(__m128i a, __m128i b): Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers from b. Packs the upper 16 bits of the 8 unsigned 32-bit results. r0 := (a0 * b0)[31:16]; r1 := (a1 * b1)[31:16]; ...
77. be 16-byte aligned. r0 := p[0]; r1 := p[1]; r2 := p[2]; r3 := p[3]. __m128 _mm_loadr_ps(float *p): Loads four SP FP values in reverse order. The address must be 16-byte aligned. r0 := p[3]; r1 := p[2]; r2 := p[1]; r3 := p[0]. Set Operations for Streaming SIMD Extensions: See summary table in Summary of Memory and Initialization topic. The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file. __m128 _mm_set_ss(float w): Sets the low word of an SP FP value to w and clears the upper three words. r0 := w; r1 := r2 := r3 := 0.0. __m128 _mm_set_ps1(float w): Sets the four SP FP values to w. r0 := r1 := r2 := r3 := w. __m128 _mm_set_ps(float z, float y, float x, float w): Sets the four SP FP values to the four inputs. r0 := w; r1 := x; r2 := y; r3 := z. __m128 _mm_setr_ps(float z, float y, float x, float w): Sets the four SP FP values to the four inputs in reverse order. r0 := z; r1 := y; r2 := x; r3 := w. __m128 _mm_setzero_ps(void): Clears the four SP FP values. r0 := r1 := r2 := r3 := 0.0. Store Operations for Streaming SIMD Extensions: See summary table in Summary of Memory and Initialization topic. The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file. void _mm_store_ss(float *p, __m128 a): Stores the lower SP FP value. *p := a0. void _mm_store_ps1(float *p, __m128 a): Stores the lower SP FP value
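The argument order of _mm_set_ps and _mm_setr_ps is easy to get backwards, so a short sketch may help. It assumes an x86 target with SSE; the helper names `lanes_of`, `make_set`, and `make_setr` are made up for illustration, and an unaligned store is used so the output array needs no special alignment.

```c
#include <xmmintrin.h>

/* Copies the four lanes of v to out[0..3]; out[0] receives lane r0. */
void lanes_of(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

/* _mm_set_ps places its LAST argument in r0. */
__m128 make_set(void)
{
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   /* r0 = 1, r3 = 4 */
}

/* _mm_setr_ps places its FIRST argument in r0. */
__m128 make_setr(void)
{
    return _mm_setr_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* r0 = 4, r3 = 1 */
}
```

Storing both vectors to memory shows the two lane orders side by side: the same four arguments end up mirrored.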
78. benefit from the optimizations listed in tables that follow. IA-32 and Itanium-based applications (Optimization: Affected Aspect of Program): Inline function expansion: calls, jumps, branches, and loops. Interprocedural constant propagation: arguments, global variables, and return values. Monitoring module-level static variables: further optimizations and loop-invariant code. Dead code elimination: code size. Propagation of function characteristics: call deletion and call movement. Multifile optimization: the same aspects as -ip, but across multiple files. IA-32 applications only (Optimization: Affected Aspect of Program): Passing arguments in registers: calls and register usage. Loop-invariant code motion: further optimizations and loop-invariant code. Inline function expansion is one of the main optimizations performed by the interprocedural optimizer. For function calls that the compiler believes are frequently executed, the compiler might decide to replace the instructions of the call with code for the function itself. With -ip, the compiler performs inline function expansion for calls to procedures defined within the current source file. However, when you use -ipo to specify multifile IPO, the compiler performs inline function expansion for calls to procedures defined in separate files. To disable the IPO optimizations, use the -O0 option. The -ip and -ipo options can in
79. bit integers in a and the 16 signed or unsigned 8-bit integers in b for equality. r0 := (a0 == b0) ? 0xff : 0x0; r1 := (a1 == b1) ? 0xff : 0x0; ...; r15 := (a15 == b15) ? 0xff : 0x0. __m128i _mm_cmpeq_epi16(__m128i a, __m128i b): Compares the 8 signed or unsigned 16-bit integers in a and the 8 signed or unsigned 16-bit integers in b for equality. r0 := (a0 == b0) ? 0xffff : 0x0; r1 := (a1 == b1) ? 0xffff : 0x0; ...; r7 := (a7 == b7) ? 0xffff : 0x0. __m128i _mm_cmpeq_epi32(__m128i a, __m128i b): Compares the 4 signed or unsigned 32-bit integers in a and the 4 signed or unsigned 32-bit integers in b for equality. r0 := (a0 == b0) ? 0xffffffff : 0x0; r1 := (a1 == b1) ? 0xffffffff : 0x0; r2 := (a2 == b2) ? 0xffffffff : 0x0; r3 := (a3 == b3) ? 0xffffffff : 0x0. __m128i _mm_cmpgt_epi8(__m128i a, __m128i b): Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for greater than. r0 := (a0 > b0) ? 0xff : 0x0; r1 := (a1 > b1) ? 0xff : 0x0; ...; r15 := (a15 > b15) ? 0xff : 0x0. __m128i _mm_cmpgt_epi16(__m128i a, __m128i b): Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for greater than. r0 := (a0 > b0) ? 0xffff : 0x0; r1 := (a1 > b1) ? 0xffff : 0x0; ...; r7 := (a7 > b7) ? 0xffff : 0x0. __m128i _mm_cmpgt_epi32(__m128i a, __m128i b): Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for greater than
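The compare intrinsics above return an all-ones or all-zeros mask per element rather than a single boolean. A small sketch of consuming such a mask, assuming an x86 target with SSE2 (the function name `count_equal_epi32` is hypothetical):

```c
#include <assert.h>
#include <emmintrin.h>

/* Counts how many of the four 32-bit lanes of a and b are equal. */
int count_equal_epi32(const int a[4], const int b[4])
{
    assert(a && b);

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Each lane becomes 0xffffffff where equal, 0x0 otherwise. */
    __m128i mask = _mm_cmpeq_epi32(va, vb);

    /* _mm_movemask_ps gathers the top bit of each 32-bit lane into bits 0..3. */
    int bits = _mm_movemask_ps(_mm_castsi128_ps(mask));

    int n = 0;
    while (bits) {
        n += bits & 1;
        bits >>= 1;
    }
    return n;
}
```

The same mask could instead feed a bitwise AND/ANDNOT pair to select between two vectors without branching; counting is used here only because it is easy to check.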
80. branches and some of the code. Unrolling enables you to aggressively schedule (or pipeline) the loop to hide latencies if you have enough free registers to keep variables live. The Intel Pentium 4 and Intel Xeon(TM) processors can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive and the probable number of iterations is known, unroll inner loops for: • Pentium 4 processors, until they have a maximum of 16 iterations • Pentium III or Pentium II processors, until they have a maximum of 4 iterations. A potential limitation is that excessive unrolling, or unrolling of very large loops, can lead to increased code size. For more information on how to optimize with -unroll[n], refer to the Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual. Absence of Loop-carried Memory Dependency: For Itanium-based applications, the -ivdep_parallel option indicates there is absolutely no loop-carried memory dependency in the loop where the IVDEP directive is specified. This technique is useful for some sparse matrix applications. For example, the following loop requires -ivdep_parallel in addition to the directive IVDEP to indicate there are no loop-carried dependencies: #pragma ivdep
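To make the unrolling discussion above concrete, here is a hand-unrolled reduction next to its simple form. This is only an illustrative sketch (the compiler normally does this for you), and the function names are hypothetical. Note that the unrolled version adds the elements in a different order, so for general floating-point data the result can differ in the last bits.

```c
#include <stddef.h>

/* Simple loop: one add and one loop branch per element. */
float sum_simple(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by four: one loop branch per four elements, and four
   independent accumulators that give the scheduler room to hide latency
   (at the cost of four live registers). */
float sum_unrolled4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* remainder iterations */
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```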
81. c
#include <stdio.h>
#include <mathimf.h>
int main()
{
  float _Complex c32in, c32out;
  double _Complex c64in, c64out;
  double pi_by_four = 3.141592653589793238 / 4.0;
  c64in = 1.0 + __I__ * pi_by_four;
  /* Create the double precision complex number 1 + (pi/4) * i,
     where __I__ is the imaginary unit. */
  c32in = (float _Complex) c64in;
  /* Create the float complex value from the double complex value. */
  c64out = cexp(c64in);
  c32out = cexpf(c32in);
  /* Call the complex exponential:
     cexp(z) = cexp(x+iy) = e^(x + i y) = e^x * (cos(y) + i sin(y)) */
  printf("When z = %7.7f + %7.7f i, cexpf(z) = %7.7f + %7.7f i \n",
         crealf(c32in), cimagf(c32in), crealf(c32out), cimagf(c32out));
  printf("When z = %12.12f + %12.12f i, cexp(z) = %12.12f + %12.12f i \n",
         creal(c64in), cimag(c64in), creal(c64out), cimag(c64out));
  return 0;
}
prompt> icc complex_math.c
The output of a.out will look like this:
When z = 1.0000000 + 0.7853982 i, cexpf(z) = 1.9221154 + 1.9221156 i
When z = 1.000000000000 + 0.785398163397 i, cexp(z) = 1.922115514080 + 1.922115514080 i
Note: _Complex data types are supported in C but not in C++ programs.
Exception Conditions: If you call a math function using argument(s) that may produce undefined results, an error number is assigned to the system variable errno. Math function errors are usually domain errors or
82. cmpngt A B Not Greater Than or Equal To cmpnge R cmpnge A B Less Than cmplt R cmplt A B Less Than or Equal To cmple R cmple A B Not Less Than cmpnlt R cmpnlt A B Not Less Than or Equal To cmpnle R cmpnle A B Compare Operators The mask is set to Oxf fffffff for each floating point value where the comparison is true and 0x00000000 where the comparison is false The following table shows the return values for each class of the compare operators which use the syntax described earlier in the Return Value Notation section 424 Reference Compare Operator Return Value Mapping R AO For Any B If True If False F32vec4 F64vec2 F32vec1 Operators Bl Oxffffffff Ox0000000 X X X B2 ox fffffff 0x0000000 X x N A B2 B3 oxffffffff 0x0000000 X N A N A B3 B3 oxffffffff 0x0000000 X N A N A B3 The following table shows examples for arithmetic operators and intrinsics 425 Intel C Compiler for Linux Systems User s Guide Compare Operations for Fvec Classes Compare for Equality Compare for Inequality 4 floats F32vec4 Returns Example Syntax Usage F32vec4 A F6 4vec2 A 4 floats F32vec4 R cmpeq 2 doubles F64vec2 R cmpegq 1 1 float F32vecl R cmpeq 2 doubles F64vec2 1 float F32vecl Compare for Less Than 4 floats F32vec4 R cmpl 2 doubles F64vec2 R cmpl 1 float
83. computation in order to minimize the execution time of a single job. In this mode, the worker threads actively wait for more parallel work, without yielding to other threads. Note: Avoid over-allocating system resources. This occurs if either too many threads have been specified, or if too few processors are available at run time. If system resources are over-allocated, this mode will cause poor performance. The throughput mode should be used instead if this occurs. Throughput: In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads will yield to other threads while waiting for more parallel work. The throughput mode is designed to make the program aware of its environment (that is, the system load) and to adjust its resource usage to produce efficient execution in a dynamic environment. Throughput mode is the default. OpenMP Environment Variables: This topic describes the OpenMP environment variables (with the OMP_ prefix) and Intel-specific environment variables (with the KMP_ prefix). Standard Environment Variables (Variable: Description; Default): OMP_SCHEDULE: Sets the runtime schedule type and chunk size; default STATIC, no chunk size specified. OMP_NUM_THREADS: Sets the number of threads to us
84. count ... r7 := a7 >> count. __m128i _mm_srai_epi32(__m128i a, int count): Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit. r0 := a0 >> count; r1 := a1 >> count; r2 := a2 >> count; r3 := a3 >> count. __m128i _mm_sra_epi32(__m128i a, __m128i count): Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit. r0 := a0 >> count; r1 := a1 >> count; r2 := a2 >> count; r3 := a3 >> count. __m128i _mm_srli_si128(__m128i a, int imm): Shifts the 128-bit value in a right by imm bytes while shifting in zeros. imm must be an immediate. r := srl(a, imm*8). __m128i _mm_srli_epi16(__m128i a, int count): Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros. r0 := srl(a0, count); r1 := srl(a1, count); ...; r7 := srl(a7, count). __m128i _mm_srl_epi16(__m128i a, __m128i count): Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros. r0 := srl(a0, count); r1 := srl(a1, count); ...; r7 := srl(a7, count). __m128i _mm_srli_epi32(__m128i a, int count): Shifts the 4 signed or unsigned 32-bit integers in a right by count bits while shifting in zeros. r0 := srl(a0, count); r1 := srl(a1, count); r2 := srl(a2, count); r3 := srl(a3, count). __m128i _mm_srl_epi32(__m128i a, __m12
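The difference between the arithmetic shifts (_mm_srai_*, which shift in the sign bit) and the logical shifts (_mm_srli_*, which shift in zeros) only shows up for negative inputs. A minimal sketch, assuming an x86 target with SSE2 and a hypothetical helper name `shift_right_both`:

```c
#include <assert.h>
#include <emmintrin.h>

/* Shifts x right by count bits both ways and reports lane 0 of each result. */
void shift_right_both(int x, int count, int *arith, unsigned *logical)
{
    assert(arith && logical);

    __m128i v = _mm_set1_epi32(x);   /* x replicated in all four lanes */

    /* Arithmetic: vacated high bits are copies of the sign bit. */
    *arith = _mm_cvtsi128_si32(_mm_srai_epi32(v, count));

    /* Logical: vacated high bits are zero. */
    *logical = (unsigned)_mm_cvtsi128_si32(_mm_srli_epi32(v, count));
}
```

For x = -16 and count = 2, the arithmetic result is still negative (-4), while the logical result is the large positive value 0x3FFFFFFC.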
85. data types from the I s u 16vec4 or I s u 16vec8 classes as shown in the following example Syntax Usage for Multiplication Operators Explicitly convert B to Is16vec4 Isl6vec4 A C Iu32vec2 B C A C C A Isl6vec4 B Return nearest common ancestor type 11 6vec4 Isl6vec4 A Tul6vec4 B Il6vec4 C C A B The mul_high and mul_add functions take Is16vec4 data only Isl6vec4 A B C D C mul_high A B D mul_add A B 397 Intel C Compiler for Linux Systems User s Guide Multiplication Operators with Corresponding Intrinsics SSSS ooq Symbols Syntax Usage Intrinsic mul_high N A R mul_add N A R l _mm_madd_pil6 _mm_madd_epil The multiplication return operators always return the nearest common ancestor as listed in the table that follows The two operands must be 16 bits in size otherwise you must explicitly indicate typecasting Multiplication Operator Overloading R Mul A B Il6 vec4 R If s uJl6vec4 A I s u l6vec4 B Il6vec8 R I s u l6vec8 A I s u 1l6vec8 B Isl6vec4 R mul_add Isl6ovec4 A Isl6vec4 Isl6vec8 mul_add Isl6vec8 A Isl6vec8 B B Is32vec2 R mul_high Isl6vec4 A Isl6 vec4 B Is32vec4 R mul_high sl6vec8 A Isl6vec8 B The following table shows the return values and data type assignments for operands of the multiplication operators with assignment All operands must be 16 bytes in size If the operands are not the right size
86. documentation. -fpic: Use the -fpic option when building shared libraries. It is required for the compilation of each object file included in the shared library. See also Linking. Managing Libraries: The LD_LIBRARY_PATH environment variable contains a colon-separated list of directories in which the linker will search for library (.a) files. If you want the linker to search additional libraries, you can add their names to LD_LIBRARY_PATH, to the command line, or to a response file (see Note). In each case, the names of these libraries are passed to the linker before the names of the Intel libraries that the driver always specifies. Note: Response files are processed at the location they appear on the command line. If libraries are specified in the response file, references from object files seen after the response file will not be resolved in those libraries. Modifying LD_LIBRARY_PATH: If you want to add a directory, /libs for example, to the LD_LIBRARY_PATH, you can do either of the following: • command line: prompt> export LD_LIBRARY_PATH=/libs:$LD_LIBRARY_PATH • startup file: export LD_LIBRARY_PATH=/libs:$LD_LIBRARY_PATH. To compile file.cpp and link it with the library mylib.a, enter the following command: prompt> icpc file.cpp mylib.a. The compiler passes file names to the linker in the following order: 1. the object file 2. any objects or libraries specifie
87. • reduction(operator : variable-list) • ordered • nowait. private: The private clause creates a private, default-constructed version for each object in variable-list for the taskq. It also implies captureprivate on each enclosed task. The original object referenced by each variable has an indeterminate value upon entry to the construct, must not be modified within the dynamic extent of the construct, and has an indeterminate value upon exit from the construct. firstprivate: The firstprivate clause creates a private, copy-constructed version for each object in variable-list for the taskq. It also implies captureprivate on each enclosed task. The original object referenced by each variable must not be modified within the dynamic extent of the construct and has an indeterminate value upon exit from the construct. lastprivate: The lastprivate clause creates a private, default-constructed version for each object in variable-list for the taskq. It also implies captureprivate on each enclosed task. The original object referenced by each variable has an indeterminate value upon entry to the construct, must not be modified within the dynamic extent of the construct, and is copy-assigned the value of the object from the last enclosed task after that task completes execution. reduction: The reduction clause performs a reduction operation with the given operator in enclosed task constructs for each object in variable-list. operator and variable-list ar
88. each parallel thread to use as its private stack. This value can be changed with kmp_set_stacksize_s() prior to the first parallel region or with the KMP_STACKSIZE environment variable. kmp_get_stacksize(): This function is provided for backwards compatibility only. Use kmp_get_stacksize_s() for compatibility across different families of Intel processors. kmp_set_stacksize_s(size): Sets to size the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can also be set via the KMP_STACKSIZE environment variable. In order for kmp_set_stacksize_s() to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program. kmp_set_stacksize(size): This function is provided for backward compatibility only; use kmp_set_stacksize_s() for compatibility across different families of Intel processors. Memory Allocation (Function: Description): kmp_malloc(size): Allocate memory block of size bytes from thread-local heap. kmp_calloc(nelem, elsize): Allocate array of nelem elements of size elsize from thread-local heap. kmp_realloc(ptr, size): Reallocate memory block at address ptr and size bytes from thread-local heap. kmp_free(ptr): Free memory block at address ptr from thread-local heap. Memory must have been previously allocated with kmp_malloc(), kmp_calloc(), or kmp_realloc().
89. file passes to link stage. Note: For the above step, you can use the xild tool instead of icpc. The two steps described above can be combined, as shown in the following: prompt> icpc -ipo -oapp a.f b.f c.f. Generating Multiple IPO Object Files: For the most part, IPO generates a single object file for the link-time compilation. This can be clumsy for very large applications, perhaps even making it impossible to use -ipo on the application. The compiler provides two ways to avoid this problem. The first way is a size-based heuristic, which automatically causes the compiler to generate multiple object files for large link-time compilations. The second way is using one of two explicit command line controls that tell the compiler to do multi-object IPO: • -ipoN, where N indicates the number of object files to generate • -ipo_separate, which tells the compiler to generate a separate IPO object file for each source file. These options are alternatives to the -ipo option; that is, they indicate an IPO compilation. Explicitly requesting a multi-object IPO compilation turns the size-based heuristic off. The number of files generated by the link-time compilation is invisible unless either the -ipo_c or -ipo_S option is used. In this case the compiler appends a number to the file name. For example, consider this example: prompt> icpc -ipo_separate -ipo_c a.o b.o c.o. Here a
90. float i
/* Call function to add all array elements */
Add20ArrayElements(array, &result);
/* Print average array element value */
printf("Average of all array values = %f\n", result/20.);
printf("The correct answer is %f\n\n\n", 9.5);
91. frame for all functions. Next row: Produce symbolic debug information in object file. The -g option changes the default optimization from -O2 to -O0. Default: OFF. Linux -H, Windows /QH: Print include file order. Default: OFF. Linux -help, Windows /help: Print help message listing. Default: OFF. Linux -Idirectory, Windows /Idirectory: Add directory to include file search path. Default: OFF. Linux -inline_debug_info, Windows /Qinline_debug_info: Preserve the source position of inlined code instead of assigning the call-site source position to inlined code. Default: OFF. Next row: Enable single-file IP optimizations (within files). Default: OFF. Linux -ip_no_inlining, Windows /Qip_no_inlining: Optimize the behavior of IP: disable full and partial inlining (requires -ip or -ipo[value]). Default: OFF. Linux -ipo[value], Windows /Qipo[value]: Enable multifile IP optimizations (between files). Default: OFF. Linux -ipo_obj, Windows /Qipo_obj: Optimize the behavior of IP: force generation of real object files (requires -ipo[value]). Default: OFF. Linux -KPIC, Windows N/A: Generate position-independent code (same as -Kpic). Default: OFF. Linux -Kpic, Windows N/A: Generate position-independent code (same as -KPIC). Default: OFF. Linux -m, Windows N/A: Instruct linker to produce map file. Default: OFF. Next row: Generate makefile dependency information. Default: OFF. Linux -mp, Windows /Op: Maintain floating-point precision (disables some optimizations). Default: OFF. Linux -mp1, Windows /Qprec: Improve floating-point precision (speed impact is less than -mp). Default: OFF.
92. [Table residue: intrinsic-to-class mapping table whose cell values (N/A entries and column marks) did not survive extraction.] Intrinsics listed: unpackhi_epi16, unpackhi_epi32, unpackhi_epi64, unpacklo_epi8, unpacklo_epi16, unpacklo_epi32, unpacklo_epi64, move_epi64, movpi64_epi64, movepi64_pi64, load_si128, loadu_si128, loadl_epi64, set_epi64, set_epi32, set_epi16, set_epi8, set1_epi64, set1_epi32, set1_epi16, set1_epi8, setr_epi64, setr_epi32, setr_epi16, setr_epi8, setzero_si128, store_si128, storeu_si128, storel_epi64, maskmoveu_si128, stream_pd
93. h header file. void _mm_stream_pd(double *p, __m128d a) (uses MOVNTPD): Stores the data in a to the address p without polluting caches. The address p must be 16-byte aligned. If the cache line containing address p is already in the cache, the cache will be updated. p[0] := a0; p[1] := a1. void _mm_stream_si128(__m128i *p, __m128i a): Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned. *p := a. void _mm_stream_si32(int *p, int a): Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. *p := a. void _mm_clflush(void const *p): Cache line containing p is flushed and invalidated from all caches in the coherency domain. void _mm_lfence(void): Guarantees that every load instruction that precedes, in program order, the load fence instruction is globally visible before any load instruction which follows the fence in program order. void _mm_mfence(void): Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order. void _mm_pause(void): The execution of the next instruction is delayed an implementation-specific amount of
94. i i 0 9 Incorrect Usage for Non-Countable Loop: i = 0; /* Iterations dependent on a[i] */ while (a[i] > 0.0) { a[i] = b[i] * c[i]; i++; } Types of Loops Vectorized: For integer loops, MMX(TM) technology and Streaming SIMD Extensions provide SIMD instructions for most arithmetic and logical operators on 32-bit, 16-bit, and 8-bit integer data types. Vectorization may proceed if the final precision of integer wrap-around arithmetic will be preserved. A 32-bit shift-right operator, for instance, is not vectorized if the final stored value is a 16-bit integer. Also, note that because the MMX(TM) instructions and Streaming SIMD Extensions instruction sets are not fully orthogonal (byte shifts, for instance, are not supported), not all integer operations can actually be vectorized. For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers, the Streaming SIMD Extensions provide SIMD instructions for the arithmetic operators +, -, *, and /. Also, the Streaming SIMD Extensions provide SIMD instructions for the binary MIN, MAX, and unary SQRT operators. SIMD versions of several other mathematical operators (like the trigonometric functions SIN, COS, TAN) are supported in software in a vector mathematical run-time library that is provided with the Intel C++ Compiler. Strip Mining and Cleanup: Strip mining, also known as loop sectioning, is a lo
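Strip mining and cleanup, introduced at the end of the excerpt above, can be sketched in plain C. This is an illustrative hand-written version under assumptions (a strip length of 4, matching four floats per SSE register, and the hypothetical name `vec_add_strip_mined`); a vectorizing compiler generates the equivalent structure automatically, with the inner strip replaced by one SIMD operation.

```c
#include <assert.h>
#include <stddef.h>

enum { STRIP = 4 };   /* assumed strip length: four floats per SSE register */

/* Computes a[i] = b[i] + c[i] for 0 <= i < n. */
void vec_add_strip_mined(float *a, const float *b, const float *c, size_t n)
{
    assert(a && b && c);

    size_t i = 0;
    size_t main_end = n - (n % STRIP);   /* largest multiple of STRIP <= n */

    /* Strip-mined main loop: each outer iteration covers one full strip. */
    for (; i < main_end; i += STRIP)
        for (size_t k = 0; k < STRIP; k++)
            a[i + k] = b[i + k] + c[i + k];

    /* Cleanup loop: the n % STRIP leftover elements, one at a time. */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

With n = 10, the main loop handles elements 0 through 7 in two strips and the cleanup loop handles elements 8 and 9.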
95. if the line contains cmp1/mod1.cpp, then only those modules with the name mod1.cpp will be selected that are in a directory named cmp1. If no component file is specified, then all files that have been compiled with -prof_genx are selected for coverage analysis. Dynamic Counters: This feature displays the dynamic execution count of each basic block of the application, providing useful information for both coverage and performance tuning. The coverage tool can be configured to generate information about dynamic execution counts. This configuration requires the -counts option. The counts information is displayed under the code after a ^ sign, precisely under the source position where the corresponding basic block begins. If more than one basic block is generated for the code at a source position (macros, for example), then the total number of such blocks and the number of the blocks that were executed are also displayed in front of the execution count. In certain situations, it may be desirable to consider all the blocks generated for a single source position as one entity. In such cases, it is necessary to assume that all blocks generated for one source position are covered when at least one of the blocks is covered. This assumption can be configured with the -nopartial option. When this option is specified, decision coverage is disabled, and the related statistics are adjusted accordingly. The code lines 11 and 12 indicate that the printf stateme
96. int64 Target int64 value Reference Description Do a compare and exchange operation atomically Maps to the cmpxchg4 instruction with appropriate setup Use compare and exchange to do an atomic add of the increment value to the addend Maps to a loop with the cmpxchg4 instruction to guarantee atomicity Same as the previous intrinsic but returns new value not the original one Map the exch8 instruction Atomically compare and exchange the pointer value specified by its first argument all arguments are pointers Atomically exchange the 32 bit quantity specified by the 1st argument Maps to the xchg4 instruction Maps to the cmpxchg4 rel instruction with appropriate setup Atomically compare and exchange the value specified by the first argument a 64 bit pointer Same as the previous intrinsic but map the cmpxchg4 acq instruction Release spin lock Increment by one the value specified by its argument Maps to the fet chadd instruction Decrement by one the value specified by its argument Maps to the fet chadd instruction Do an exchange operation atomically Maps to the xchg instruction 343 Intel C Compiler for Linux Systems User s Guide Comparand unsigned __int64 __int64 Exchange Comparand int64 int64 int64 Comparand int64 int64 int64 addend 5 Note _InterlockedCompareExchange64 volatil Destination
97. losing IEEE compliance, turning these flags on significantly increases the performance of programs with denormal floating-point values in the gradual underflow mode run on the most recent IA-32 processors. Hence, for the Intel Pentium III, Pentium 4, Pentium M, Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3), and compatible IA-32 processors, the compiler's default behavior is to turn these flags on. The compiler inserts code in the program to perform a run-time check for the processor on which the program runs to verify it is one of the afore-listed Intel processors. Examples: • Executing a program on a Pentium III processor enables FTZ, but not DAZ. • Executing a program on an Intel Pentium M processor or Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3) enables both FTZ and DAZ. These flags are only turned on by Intel processors that have been validated to support them. For non-Intel processors, you can set the flags manually with the following macros: Enable FTZ: _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON). Enable DAZ: _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON). The prototypes for these macros are in xmmintrin.h (FTZ) and pmmintrin.h (DAZ). Interprocedural Optimizations: Use -ip and -ipo to enable interprocedural optimizations (IPO), which enable the compiler to analyze your code to determine where you can
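The two macros named in the excerpt above can be wrapped as shown below. This is a minimal sketch assuming an x86 processor that supports both flags; the wrapper names `enable_ftz_daz`, `ftz_is_on`, and `daz_is_on` are hypothetical, while the `_MM_SET_*` and `_MM_GET_*` macros come from the two headers the text mentions. The setting lives in the MXCSR register, so it affects only the calling thread.

```c
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE, _MM_GET_DENORMALS_ZERO_MODE */

/* Turns on flush-to-zero (FTZ) and denormals-are-zero (DAZ) for this thread. */
void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

/* Returns 1 if FTZ is currently enabled in MXCSR. */
int ftz_is_on(void)
{
    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
}

/* Returns 1 if DAZ is currently enabled in MXCSR. */
int daz_is_on(void)
{
    return _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
}
```

Calling `enable_ftz_daz()` once at the start of each compute thread is one reasonable placement; doing it inside an inner loop would only add overhead.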
98. lower SP FP value of a; the upper 3 SP FP values are passed through. r0 := recip(sqrt(a0)); r1 := a1; r2 := a2; r3 := a3. __m128 _mm_rsqrt_ps(__m128 a): Computes the approximations of the reciprocals of the square roots of the four SP FP values of a. r0 := recip(sqrt(a0)); r1 := recip(sqrt(a1)); r2 := recip(sqrt(a2)); r3 := recip(sqrt(a3)). __m128 _mm_min_ss(__m128 a, __m128 b): Computes the minimum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a. r0 := min(a0, b0); r1 := a1; r2 := a2; r3 := a3. __m128 _mm_min_ps(__m128 a, __m128 b): Computes the minimum of the four SP FP values of a and b. r0 := min(a0, b0); r1 := min(a1, b1); r2 := min(a2, b2); r3 := min(a3, b3). __m128 _mm_max_ss(__m128 a, __m128 b): Computes the maximum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a. r0 := max(a0, b0); r1 := a1; r2 := a2; r3 := a3. __m128 _mm_max_ps(__m128 a, __m128 b): Computes the maximum of the four SP FP values of a and b. r0 := max(a0, b0); r1 := max(a1, b1); r2 := max(a2, b2); r3 := max(a3, b3). Logical Operations for Streaming SIMD Extensions: The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file. (Intrinsic Name: Operation; Corresponding Instruction) _mm_and_ps: Bitwise AND; ANDPS. _mm_andnot_ps: Logical
99. m64 b). Corresponding instructions: czx1.l (Compute Zero Index); czx1.r (Compute Zero Index); czx2.l (Compute Zero Index); czx2.r (Compute Zero Index); mix1.l (Mix); mix1.r (Mix); mix2.l (Mix); mix2.r (Mix); mix4.l (Mix); mix4.r (Mix); mux1 (Mux); mux2 (Mux); padd1.uus (Parallel add); padd2.uus (Parallel add); pavg1 (Parallel average). (Intrinsic: Corresponding Instruction) __m64 _m64_pavg2_nraz(__m64 a, __m64 b): pavg2 (Parallel average). __m64 _m64_pavgsub1(__m64 a, __m64 b): pavgsub1 (Parallel average subtract). __m64 _m64_pavgsub2(__m64 a, __m64 b): pavgsub2 (Parallel average subtract). __m64 _m64_pmpy2r(__m64 a, __m64 b): pmpy2.r (Parallel multiply). __m64 _m64_pmpy2l(__m64 a, __m64 b): pmpy2.l (Parallel multiply). __m64 _m64_pmpyshr2(__m64 a, __m64 b, const int count): pmpyshr2 (Parallel multiply and shift right). __m64 _m64_pmpyshr2u(__m64 a, __m64 b, const int count): pmpyshr2.u (Parallel multiply and shift right). __m64 _m64_pshladd2(__m64 a, const int count, __m64 b): pshladd2 (Parallel shift left and add). __m64 _m64_pshradd2(__m64 a, const int count, __m64 b): pshradd2 (Parallel shift right and add). __m64 _m64_psub1uus(__m64 a, __m64 b): psub1.uus (Parallel subtract). __m64 _m64_psub2uus(__m64 a, __m64 b): psub2.uus (Parallel subtract). __int64 _m64_czx1l(__m64 a): The 64-bit value a is scanned for a zero element from the most significant element to the least significant ele
100. m_psradi: _mm_srai_pi32; right; Arithmetic; PSRADI. _m_psrlw: _mm_srl_pi16; right; Logical; PSRLW. _m_psrlwi: _mm_srli_pi16; right; Logical; PSRLWI. _m_psrld: _mm_srl_pi32; right; Logical; PSRLD. _m_psrldi: _mm_srli_pi32; right; Logical; PSRLDI. _m_psrlq: _mm_srl_si64; right; Logical; PSRLQ. _m_psrlqi: _mm_srli_si64; right; Logical; PSRLQI. __m64 _m_psllw(__m64 m, __m64 count): Shift four 16-bit values in m left the amount specified by count while shifting in zeros. __m64 _m_psllwi(__m64 m, int count): Shift four 16-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _m_pslld(__m64 m, __m64 count): Shift two 32-bit values in m left the amount specified by count while shifting in zeros. __m64 _m_pslldi(__m64 m, int count): Shift two 32-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _m_psllq(__m64 m, __m64 count): Shift the 64-bit value in m left the amount specified by count while shifting in zeros. __m64 _m_psllqi(__m64 m, int count): Shift the 64-bit value in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant. __m64 _m_psraw(__m64 m, __m64 count): Shift four 16-bit values in m right the amount specified by count
101. normal sequential version of the loop shown, the value of data[1] read during the second iteration was written in the first iteration. For vectorization, the iterations must be done in parallel without changing the semantics of the original loop. Data Dependence Theory: Data dependence analysis involves finding the conditions under which two memory accesses may overlap. Given two references in a program, the conditions are defined by: (1) whether the referenced variables may be aliases for the same (or overlapping) regions in memory; (2) for array references, the relationship between the subscripts. For array references, the Intel C++ Compiler's data dependence analyzer is organized as a series of tests that progressively increase in power as well as in time and space costs. First, a number of simple tests are performed in a dimension-by-dimension manner, since independence in any dimension will exclude any dependence relationship. Multi-dimensional array references that may cross their declared dimension boundaries can be converted to their linearized form before the tests are applied. Some of the simple tests used are the fast GCD test, proving independence if the greatest common divisor of the coefficients of loop indices cannot evenly divide the constant term, and the extended bounds test, which tests potential overlap for the extreme values of subscript expressions. If all simple tests fail to prove i
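The fast GCD test mentioned in snippet 101 can be sketched in a few lines of C. This is only an illustration of the textbook form of the test, not the compiler's implementation; the function names and the two-reference setup (a write to X[a*i + c1] and a read from X[b*j + c2]) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of two non-negative integers (Euclid). */
static long gcd_long(long x, long y)
{
    while (y != 0) {
        long t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/*
 * GCD test for the pair of references X[a*i + c1] and X[b*j + c2].
 * A dependence needs integers i, j with  a*i - b*j = c2 - c1,
 * which has an integer solution only if gcd(a, b) divides (c2 - c1).
 * Returns true when independence is proven; false only means
 * "a dependence may exist", and further tests would be needed.
 */
static bool gcd_test_proves_independence(long a, long c1, long b, long c2)
{
    long g = gcd_long(labs(a), labs(b));
    assert(g != 0); /* both coefficients zero: not a loop-index subscript */
    return (c2 - c1) % g != 0;
}
```

For example, X[2*i] and X[2*j + 1] are proven independent (even versus odd elements), while X[i] and X[j + 1] are not, because the test ignores loop bounds and gcd(1, 1) divides everything.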
102. of interprocedural optimizations. PGO Phases: The PGO methodology requires three phases: Phase 1: Instrumentation compilation and linking with -prof_gen[x]; Phase 2: Instrumented execution by running the executable; Phase 3: Feedback compilation with -prof_use. A key factor in deciding whether you want to use PGO lies in knowing which sections of your code are the most heavily used. If the data set provided to your program is very consistent and it elicits a similar behavior on every execution, then PGO can probably help optimize your program execution. However, different data sets can elicit different algorithms to be called. This can cause the behavior of your program to vary from one execution to the next. In cases where your code behavior differs greatly between executions, PGO may not provide noticeable benefits. You have to ensure that the benefit of the profile information is worth the effort required to maintain up-to-date profiles. When using -prof_gen[x] with the x qualifier, extra source position is collected, which enables code coverage tools such as the Intel C++ Compiler Code-coverage Tool. Without such tools, -prof_genx does not provide better optimization and may slow parallel compile times. Basic PGO Options: Option, Description: -prof_gen[x]: Instructs the compiler to produce instrumented code in your object files in preparation for instrumented exe
103. performed. For example, division is never changed to multiplication by the reciprocal; the compiler performs floating-point operations in the order specified, without re-association; the compiler does not perform the constant folding optimization on floating-point values. Constant folding also eliminates any multiplication by 1, division by 1, and addition or subtraction of 0. For example, code that adds 0.0 to a number is executed exactly as written. Compile-time floating-point arithmetic is not performed, to ensure that floating-point exceptions are also maintained; floating-point operations conform to ANSI C. When assignments to type float and double are made, the precision is rounded from 80 bits (extended) down to 32 bits (float) or 64 bits (double). When you do not specify -mp, the extra bits of precision are not always rounded before the variable is reused; sets the -nolib_inline option, which disables inline functions expansion. -mp1 Option: Use the -mp1 option to improve floating-point precision. -mp1 disables fewer optimizations and has less impact on performance than -mp. -mp1 prevents the compiler from performing optimizations which change NaN comparison semantics. It also causes all values used in comparisons to be truncated to declared precision prior to use in the comparison. It also makes sure to use library routines which give better precision results compared to the x87 transc
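Why re-association matters for floating-point results can be shown with a small, compiler-independent sketch. The two functions below are hypothetical names chosen for this illustration; they simply evaluate the same three-term sum in the two possible groupings, assuming IEEE 754 double arithmetic and no value-changing optimization flags.

```c
/* Sum exactly as written in the source: (a + b) + c. */
double sum_as_written(double a, double b, double c)
{
    return (a + b) + c;
}

/* The same sum after re-association: a + (b + c). */
double sum_reassociated(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, c = 1.0, the first grouping cancels the large terms before adding 1.0 and yields 1.0, while the second absorbs 1.0 into -1e20 (it is below the rounding granularity at that magnitude) and yields 0.0. An option that forbids re-association keeps the first result.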
104. prioritization Tool to exit when it reaches a given level of basic block coverage: tselect -dpi_list tests_list -spi pgopti.spi -cutoff 85.00. If the tool is run with the cutoff value of 85.00 in the previous example, only Test 3 will be selected, as it achieves 45.65% block coverage, which corresponds to 87.50% of the total block coverage that is reached from all three tests. The Test-prioritization Tool does an initial merging of all the profile information to determine the total coverage that is obtained by running all the tests. The -nototal option enables you to skip this step. In such a case, only the absolute coverage information will be reported, as the overall coverage remains unknown. High-level Language Optimizations (HLO): High-level optimizations exploit the properties of source code constructs (for example, loops and arrays) in the applications developed in high-level programming languages, such as Fortran and C++. The high-level optimizations include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch, scalar replacement, data layout optimizations, and loop unrolling techniques. The option required to turn on the high-level optimizations is -O3. The scope of optimizations turned on by -O3 is different for IA-32 and Itanium-based applications. See Setting Optimization Levels. IA-32 and Itanium-based Applications
105. r0 := (a0 > b0) ? 0xffff : 0x0; r1 := (a1 > b1) ? 0xffff : 0x0; r2 := (a2 > b2) ? 0xffff : 0x0; r3 := (a3 > b3) ? 0xffff : 0x0. __m128i _mm_cmplt_epi8(__m128i a, __m128i b): Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for less than. r0 := (a0 < b0) ? 0xff : 0x0; r1 := (a1 < b1) ? 0xff : 0x0; ...; r15 := (a15 < b15) ? 0xff : 0x0. __m128i _mm_cmplt_epi16(__m128i a, __m128i b): Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for less than. r0 := (a0 < b0) ? 0xffff : 0x0; r1 := (a1 < b1) ? 0xffff : 0x0; ...; r7 := (a7 < b7) ? 0xffff : 0x0. __m128i _mm_cmplt_epi32(__m128i a, __m128i b): Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for less than. r0 := (a0 < b0) ? 0xffffffff : 0x0; r1 := (a1 < b1) ? 0xffffffff : 0x0; r2 := (a2 < b2) ? 0xffffffff : 0x0; r3 := (a3 < b3) ? 0xffffffff : 0x0. Integer Conversions Operations for Streaming SIMD Extensions 2: The following two conversion intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file. __m128i _mm_cvtsi32_si128(int a) (uses MOVD): Moves 32-bit integer a to the least significant 32 bits of an __m128i object. Zeroes the upper 96 bits of the __m128i object. r0 := a; r1 := 0x0; r2 := 0x0; r3 := 0x0. int _mm_cvtsi128_si32
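The per-lane "all ones or all zeros" result of the packed compares in snippet 105 is easiest to see in a scalar model. The sketch below is plain C, not the SSE2 intrinsic itself; both function names are hypothetical, and the second one shows one common use of such a mask (a branch-free per-lane select) as an assumption about typical usage rather than something the guide states.

```c
#include <stdint.h>

/* Scalar reference model of a packed signed 16-bit "less than" compare:
 * each result lane is 0xffff where a[k] < b[k], and 0x0 otherwise. */
void cmplt_epi16_model(const int16_t a[8], const int16_t b[8], uint16_t r[8])
{
    for (int k = 0; k < 8; k++)
        r[k] = (a[k] < b[k]) ? 0xffff : 0x0;
}

/* Branch-free per-lane minimum built from the compare mask:
 * take a[k] where the mask is all ones, b[k] where it is all zeros. */
void min_epi16_via_mask(const int16_t a[8], const int16_t b[8], int16_t r[8])
{
    uint16_t m[8];
    cmplt_epi16_model(a, b, m);
    for (int k = 0; k < 8; k++) {
        uint16_t ua = (uint16_t)a[k];
        uint16_t ub = (uint16_t)b[k];
        r[k] = (int16_t)(uint16_t)((ua & m[k]) | (ub & (uint16_t)~m[k]));
    }
}
```

The mask form is what makes the compare useful in vector code: the result can be combined with bitwise AND, ANDNOT and OR to select between two vectors without any per-element branch.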
106. range errors. Domain errors result from arguments that are outside the domain of the function. For example, acos is defined only for arguments between -1 and +1 inclusive. Attempting to evaluate acos(-2) or acos(3) results in a domain error, where the return value is QNaN. Range errors occur when a mathematically valid argument results in a function value that exceeds the range of representable values for the floating-point data type. Attempting to evaluate exp(1000) results in a range error, where the return value is INF. When a domain or range error occurs, the following values are assigned to errno: domain error (EDOM): errno = 33; range error (ERANGE): errno = 34. The following example shows how to read the errno value for an EDOM and ERANGE error: /* errno.c */ #include <errno.h> #include <mathimf.h> #include <stdio.h> int main(void) { double neg_one = -1.0; double zero = 0.0; /* The natural log of a negative number is considered a domain error (EDOM) */ printf("log(%e) = %e and errno(EDOM) = %d\n", neg_one, log(neg_one), errno); /* The natural log of zero is considered a range error (ERANGE) */ printf("log(%e) = %e and errno(ERANGE) = %d\n", zero, log(zero), errno); } The output of errno.c will look like this: log(-1.000000e+00) = nan and errno(EDOM) = 33; log(0.000000e+00) = -inf and errno(ERANGE) = 34. For the math functions in
107. rl r2 r3 Mpo a 3 2 1 0 289 Intel C Compiler for Linux Systems User s Guide m128 _mm_set_ss float a Sets the low word of an SP FP value to a and clears the upper three words r0 c i iS p2 f 3 2S 50220 m128 _mm_set_psl float a Sets the four SP FP values to a rO rl r2 r3 t a ml28 _mm_set_ps float a float b float c float d Sets the four SP FP values to the four inputs ro rl G2 x3 aap m128 _mm_setr_ps float a float b float c float d Sets the four SP FP values to the four inputs in reverse order rO d ri c r2 r3 a __m128 _mm_setzero_ps void Clears the four SP FP values rO rl r2 r3 0 0 void _mm_store_ss float v _ m128 a Stores the lower SP FP value ky al void _mm_store_psl float v __m128 a Stores the lower SP FP value across four words v 0 a0 v 1 a0 v 2 a0 v 3 a0 void _mm_store_ps float v __m128 a Stores four SP FP values The address must be 16 byte aligned v 0 a0 v 1 al v 2 a2 v 3 a3 290 Reference void _mm_storeu_ps float v __m128 a Stores four SP FP values The address need not be 16 byte aligned v 0 a0 v 1 al v 2 a2 v 3 a3 void _mm_storer_ps float v __m128 a Stores four SP FP values in reverse order The address must be 16 byte aligned v 0 a3 v 1 a2 v 2 al v 3 a0 m1128 _mm_move_ss __m1
108. some cases significantly increase compile time and code size auto_ilp32 for Itanium based Systems On Itanium based systems the auto_ilp32 option requires interprocedural analysis over the whole program This optimization allows the compiler to use 32 bit pointers whenever possible as long as the application does not exceed a 32 bit address space Using the auto_ilp32 option on programs that exceed 32 bit address space might cause unpredictable results during program execution 121 Intel C Compiler for Linux Systems User s Guide Because this optimization requires interprocedural analysis over the whole program you must use the auto_ilp32 option with the ipo option IPO Compilation Model For the topics in this section the term IPO generally refers to multi file IPO When you use the ipo option the compiler collects information from individual program modules of a program Using this information the compiler performs optimizations across modules In order to do this the ipo option is applied to both the compilation phase and the link phase One of the main benefits of IPO is that it enables more inlining For information on inlining and the minimum inlining criteria see Criteria for Inline Function Expansion and Controlling Inline Expansion of User Functions Inlining and other optimizations are improved by profile information For a description of how to use IPO with profile information for further optimization see
109. store dynamic information files, or whether to overwrite pgopti.dpi. Refer to your operating system documentation for instructions on how to specify environment values. Profile-guided Optimization Environment Variables: Variable, Description: PROF_DIR: Specifies the directory in which dynamic information files are created. This variable applies to all three phases of the profiling process. PROF_NO_CLOBBER: Alters the feedback compilation phase slightly. By default, during the feedback compilation phase, the compiler merges the data from all dynamic information files and creates a new pgopti.dpi file if .dyn files are newer than an existing pgopti.dpi file. When this variable is set, the compiler does not overwrite the existing pgopti.dpi file. Instead, the compiler issues a warning and you must remove the pgopti.dpi file if you want to use additional dynamic information files. Using profmerge to Relocate the Source Files: The compiler uses the full path to the source file to look up profile summary information. By default, this prevents you from: using the profile summary file (.dpi) if you move your application sources; sharing the profile summary file with another user who is building identical application sources that are located in a different directory. Source Relocation: To enable the movement of application sources, as well as the sharing of profile summary files, use profmerge with the -src_old and -src_new option
110. such as _mm_cvtpd_ps, result in a loss of precision. The rounding mode used in such cases is determined by the value in the MXCSR register. The default rounding mode is round-to-nearest. Note that the rounding mode used by the C and C++ languages when performing a type conversion is to truncate. The _mm_cvttpd_epi32 and _mm_cvttsd_si32 intrinsics use the truncate rounding mode regardless of the mode specified by the MXCSR register. The conversion-operation intrinsics for Streaming SIMD Extensions 2 (SSE2) are listed in the following table, followed by detailed descriptions. The prototypes for SSE2 intrinsics are in the emmintrin.h header file. Intrinsic Name, Corresponding Instruction, Return Type, Parameters: _mm_cvtpd_ps, CVTPD2PS, __m128, (__m128d a); _mm_cvtps_pd, CVTPS2PD, __m128d, (__m128 a); _mm_cvtepi32_pd, CVTDQ2PD, __m128d, (__m128i a); _mm_cvtpd_epi32, CVTPD2DQ, __m128i, (__m128d a); _mm_cvtsd_si32, CVTSD2SI, int, (__m128d a); _mm_cvtsd_ss, CVTSD2SS, __m128, (__m128 a, __m128d b); _mm_cvtsi32_sd, CVTSI2SD, __m128d, (__m128d a, int b); _mm_cvtss_sd, CVTSS2SD, __m128d, (__m128d a, __m128 b); _mm_cvttpd_epi32, CVTTPD2DQ, __m128i, (__m128d a); _mm_cvttsd_si32, CVTTSD2SI, int, (__m128d a); _mm_cvtpd_pi32, CVTPD2PI, __m64, (__m128d a); _mm_cvttpd_pi32, CVTTPD2PI, __m64, (__m128d a); _mm_cvtpi32_pd, CVTPI2PD, __m128d, (__m64 a); _mm_cvtsd_f64, None, double, (__m128d a). __m128 _mm_cvtpd_ps
111. the Intel C Compiler automatically translates serial portions of the input program into equivalent multithreaded code The auto parallelizer analyzes the dataflow of the program s loops and generates multithreaded code for those loops which can be safely and efficiently executed in parallel This enables the potential exploitation of the parallel architecture found in symmetric multiprocessor SMP systems Automatic parallelization relieves the user from e having to deal with the details of finding loops that are good worksharing candidates e performing the dataflow analysis to verify correct parallel execution e partitioning the data for threaded code generation as is needed in programming with OpenMP directives 170 Volume II Optimizing Applications The parallel run time support provides the same run time features found in OpenMP such as handling the details of loop iteration modification thread scheduling and synchronization While OpenMP directives enable serial applications to transform into parallel applications quickly the programmer must explicitly identify specific portions of the application code that contain parallelism and add the appropriate compiler directives Auto parallelization triggered by the parallel option automatically identifies those loop structures which contain parallelism During compilation the compiler automatically attempts to decompose the code sequences into separate threads for parallel processi
112. the master thread executes. #pragma omp parallel (Begin a Parallel Construct; form a team). This is Replicated Code: each team member executes the same code. #pragma omp sections (Begin a Worksharing Construct). #pragma omp section (One unit of work). #pragma omp section (Another unit of work). Wait until both units of work complete. More Replicated Code. #pragma omp for nowait (Begin a Worksharing Construct; each iteration is a unit of work). Work is distributed among the team members. End of Worksharing Construct; nowait was specified, so threads proceed. #pragma omp critical (Begin a Critical Section). Replicated Code, but only one thread can execute it at a given time. More Replicated Code. #pragma omp barrier (Wait for all team members to arrive). More Replicated Code. End of Parallel Construct; disband team and continue serial execution. Possibly more Parallel constructs. End serial execution. Compiling with OpenMP, Directive Format, and Diagnostics: To run the Intel C++ Compiler in OpenMP mode, invoke the compiler with the -openmp option: prompt> icpc -openmp file.cpp. Before you run the multithr
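The construct sequence outlined in snippet 112 can be tried out with a small function. This is a minimal sketch, assuming any compiler that accepts standard OpenMP pragmas; the function name sum_of_squares is hypothetical. It uses a parallel region, a worksharing for loop with nowait, a critical section and a barrier, in the same order as the outline.

```c
/* Sum 1*1 + 2*2 + ... + n*n using the constructs from the outline. */
long sum_of_squares(int n)
{
    long total = 0;                 /* shared by the whole team */

    #pragma omp parallel            /* form a team */
    {
        long local = 0;             /* replicated code: one copy per thread */

        #pragma omp for nowait      /* iterations are divided among the team */
        for (int i = 1; i <= n; i++)
            local += (long)i * i;

        #pragma omp critical        /* one thread at a time updates the total */
        total += local;

        #pragma omp barrier         /* wait for all team members to arrive */
    }                               /* disband team, continue serially */

    return total;
}
```

If the file is built without OpenMP support, the pragmas are ignored and the function runs serially with the same result, which makes it convenient for checking correctness before enabling threads.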
113. the new list and produce app s Note The ipo option can reorder object files and linker arguments on the command line Therefore if your program relies on a precise order of arguments on the command line ipo can affect the behavior of your program The xild command supports the ipo ipoN and ipo_separate options Usage Rules You must use the Intel linker xild to link your application if e Your source files were compiled with the ipo option e You normally would invoke the GCC linker 1d to link your application The xild Options The additional options supported by xild may be used to examine the results of IPO These options are described in the following table qipo_fa file s Produces an assembly listing for the IPO compilation You can specify an optional name for the listing file or a directory with the backslash in which to place the file The default listing name is ipo_out s qipo_fo file o Produces an object file for the IPO compilation You can specify an optional name for the object file or a directory with the backslash in which to place the file The default object file name is ipo_out o ipo_fcode asm Adds code bytes to the assembly listing ipo_fsource asm Adds high level source code to the assembly listing 125 Intel C Compiler for Linux Systems User s Guide ipo_fverbose asm Enables and disables respectively inserting comments ipo_fnoverbose asm containing versio
114. the program aborts. F32vec4 A 3 float f. Corresponding intrinsics: none. Load and Store Operators: Loads two double-precision floating-point values, copying them into the two floating-point values of A. No assumption is made for alignment. void loadu(F64vec2 &A, double *p). Corresponding intrinsic: _mm_loadu_pd. Stores the two double-precision floating-point values of A. No assumption is made for alignment. void storeu(double *p, F64vec2 &A). Corresponding intrinsic: _mm_storeu_pd. Loads four single-precision floating-point values, copying them into the four floating-point values of A. No assumption is made for alignment. void loadu(F32vec4 &A, float *p). Corresponding intrinsic: _mm_loadu_ps. Stores the four single-precision floating-point values of A. No assumption is made for alignment. void storeu(float *p, F32vec4 &A). Corresponding intrinsic: _mm_storeu_ps. Unpack Operators for Fvec Operators: Selects and interleaves the lower double-precision floating-point values from A and B. F64vec2 R = unpack_low(F64vec2 A, F64vec2 B). Corresponding intrinsic: _mm_unpacklo_pd(a, b). Selects and interleaves the higher double-precision floating-point values from A and B. F64vec2 R = unpack_high(F64vec2 A, F64vec2 B). Corresponding intrinsic: _mm_unpackhi_pd(a, b). Selects and interleaves the lower two single-precision floating-point values from A an
115. the taskq block is reached Control Structures Many control structures exhibit the pattern of separated work iteration and work creation and are naturally parallelized with the workqueuing model Some common cases are e while loops e C iterators e recursive functions 187 Intel C Compiler for Linux Systems User s Guide while Loops If the computation in each iteration of a while loop is independent the entire loop becomes the environment for the taskq pragma and the statements in the body of the while loop become the units of work to be specified with the task pragma The conditional in the while loop and any modifications to the control variables are placed outside of the task blocks and executed sequentially to enforce the data dependencies on the control variables C Iterators C Standard Template Library STL iterators are very much like the while loops just described whereby the operations on the data stored in the STL are very distinct from the act of iterating over all the data If the operations are data independent they can be done in parallel as long as the iteration over the work is sequential This type of while loop parallelism is a generalization of the standard OpenMP worksharing for loops In the worksharing for loops the loop increment operation is the iterator and the body of the loop is the unit of work However because the for loop iteration variable frequently has a closed form solution it can b
116. time. The instruction does not modify the architectural state. This intrinsic provides especially significant performance gain. PAUSE Intrinsic: The PAUSE intrinsic is used in spin-wait loops with the processors implementing dynamic execution (especially out-of-order execution). In the spin-wait loop, PAUSE improves the speed at which the code detects the release of the lock. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin-loop. Example of loop with the PAUSE instruction: spin_loop: pause / cmp eax, A / jne spin_loop. In this example, the program spins until memory location A matches the value in register eax. The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only after the attempt to get a lock has failed: get_lock: mov eax, 1 / xchg eax, A (Try to get lock) / cmp eax, 0 (Test if successful) / jne spin_loop / critical_section: <critical_section code> / mov A, 0 (Release lock) / jmp continue / spin_loop: pause (Spin-loop hint) / cmp 0, A (Check lock availability) / jne spin_loop / jmp get_lock / continue:. Note that the first branch is predicted to fall through to the critical section in anticipation of successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE instruction. Since PAUSE is backwards compatible to all existing IA-32 processor generations, a test for processor type a CP
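The test-and-test-and-set sequence in snippet 116 maps naturally onto C11 atomics. The following is a sketch under stated assumptions, not code from the guide: the type demo_spinlock, the functions spin_lock, spin_trylock and spin_unlock, and the CPU_RELAX macro are all hypothetical names, and the PAUSE hint is issued through the _mm_pause intrinsic only when building for an x86 target.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()   /* emits the PAUSE spin-loop hint */
#else
#define CPU_RELAX() ((void)0)     /* no equivalent hint assumed elsewhere */
#endif

typedef struct {
    atomic_int locked;            /* 0 = free, 1 = held (location A above) */
} demo_spinlock;

/* One attempt to take the lock (the xchg step); true on success. */
bool spin_trylock(demo_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Test-and-test-and-set: spin on plain loads only after an attempt fails. */
void spin_lock(demo_spinlock *l)
{
    while (!spin_trylock(l)) {
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            CPU_RELAX();          /* spin-loop hint while the lock is held */
    }
}

/* Release the lock (the "mov A, 0" step). */
void spin_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Spinning on a relaxed load rather than on the exchange keeps the waiting thread from repeatedly writing the lock word, and the hint inside that inner loop is where the guide recommends placing PAUSE.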
117. to vectorize code. The __declspec(align(n)) declaration enables you to overcome hardware alignment constraints. The restrict qualifier and the pragmas address the stylistic issues due to lexical scope, data dependence, and ambiguity resolution. Language Support: Feature, Description: __declspec(align(n)): Directs the compiler to align the variable to an n-byte boundary. Address of the variable is address mod n = 0. __declspec(align(n, off)): Directs the compiler to align the variable to an n-byte boundary with offset off within each n-byte boundary. Address of the variable is address mod n = off. restrict: Permits the disambiguator flexibility in alias assumptions, which enables more vectorization. __assume_aligned(a, n): Instructs the compiler to assume that array a is aligned on an n-byte boundary; used in cases where the compiler has failed to obtain alignment information. #pragma ivdep: Instructs the compiler to ignore assumed vector dependencies. #pragma vector {aligned | unaligned | always}: Specifies how to vectorize the loop and indicates that efficiency heuristics should be ignored. #pragma novector: Specifies that the loop should never be vectorized. Multi-version Code: Multi-version code is generated by the compiler in cases where data dependence analysis fails to prove independence for a loop due to the occurrence of pointers with unknown values. This f
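The alignment condition "address mod n = 0" and the role of restrict in snippet 117 can be illustrated in portable C. This sketch uses the C11 _Alignas specifier as a stand-in for __declspec(align(16)), which is an assumption made for portability; the names is_aligned, vec_copy, demo_src and demo_dest are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the address of p satisfies "address mod n == 0". */
bool is_aligned(const void *p, size_t n)
{
    assert(n != 0);
    return ((uintptr_t)p % n) == 0;
}

/* restrict promises that dest and src never overlap, so a compiler
 * may vectorize the loop without generating a runtime overlap check. */
void vec_copy(float *restrict dest, const float *restrict src, int len)
{
    for (int i = 0; i < len; i++)
        dest[i] = src[i];
}

/* Two 16-byte aligned buffers (C11 spelling of the alignment request). */
_Alignas(16) float demo_src[8] = {0, 1, 2, 3, 4, 5, 6, 7};
_Alignas(16) float demo_dest[8];
```

A caller could check is_aligned(demo_src, 16) before choosing an aligned code path, then call vec_copy(demo_dest, demo_src, 8). Passing overlapping buffers to vec_copy would violate the restrict promise and is undefined behavior.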
118. two DP FP values of a and b. r0 := min(a0, b0); r1 := min(a1, b1). __m128d _mm_max_sd(__m128d a, __m128d b): Computes the maximum of the lower DP FP values of a and b. The upper DP FP value is passed through from a. r0 := max(a0, b0); r1 := a1. __m128d _mm_max_pd(__m128d a, __m128d b): Computes the maxima of the two DP FP values of a and b. r0 := max(a0, b0); r1 := max(a1, b1). Floating-point Logical Operations for Streaming SIMD Extensions 2: The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the emmintrin.h header file. __m128d _mm_and_pd(__m128d a, __m128d b) (uses ANDPD): Computes the bitwise AND of the two DP FP values of a and b. r0 := a0 & b0; r1 := a1 & b1. __m128d _mm_andnot_pd(__m128d a, __m128d b) (uses ANDNPD): Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a. r0 := (~a0) & b0; r1 := (~a1) & b1. __m128d _mm_or_pd(__m128d a, __m128d b) (uses ORPD): Computes the bitwise OR of the two DP FP values of a and b. r0 := a0 | b0; r1 := a1 | b1. __m128d _mm_xor_pd(__m128d a, __m128d b) (uses XORPD): Computes the bitwise XOR of the two DP FP values of a and b. r0 := a0 ^ b0; r1 := a1 ^ b1. Floating-point Comparison Operations for Streaming SIMD Extensions 2: Each comparison intrinsic performs a comparison of a and b. For the packed form, the two DP FP values of a and b are compa
119. value of an integer; Returns the absolute value of a long integer; Rotates bits left for an unsigned long integer; Rotates bits right for an unsigned long integer; Rotates bits left for an unsigned integer; Rotates bits right for an unsigned integer. Note: Passing a constant shift value in the rotate intrinsics results in higher performance. Floating-point Related: Intrinsic: double fabs(double); double log(double); float logf(float); double log10(double); float log10f(float); double exp(double); float expf(float); double pow(double, double); float powf(float, float); double sin(double); float sinf(float); double cos(double); float cosf(float); double tan(double); float tanf(float); double acos(double); float acosf(float); double acosh(double); float acoshf(float); double asin(double). Description: Returns the absolute value of a floating-point value; Returns the natural logarithm ln(x), x > 0, with double precision; Returns the natural logarithm ln(x), x > 0, with single precision; Returns the base 10 logarithm log10(x), x > 0, with double precision; Returns the base 10 logarithm log10(x), x > 0, with single precision; Returns the exponential function with double precision; Returns the exponential function with single precision; Returns the value of x to the power y with double precision; Returns the value of x to the power y w
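What "rotates bits left" and "rotates bits right" mean for an unsigned integer can be pinned down with a portable model. The functions rotl32 and rotr32 below are hypothetical names for this sketch; they are not the compiler's rotate intrinsics, only a reference for the behavior on a 32-bit value.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by shift bits; bits shifted out on the
 * left re-enter on the right. The count is reduced modulo 32. */
uint32_t rotl32(uint32_t value, int shift)
{
    unsigned s = (unsigned)shift & 31u;
    return (value << s) | (value >> ((32u - s) & 31u));
}

/* Rotate a 32-bit value right by shift bits. */
uint32_t rotr32(uint32_t value, int shift)
{
    unsigned s = (unsigned)shift & 31u;
    return (value >> s) | (value << ((32u - s) & 31u));
}
```

For instance, rotating 0x80000001 left by one bit moves the top bit around to the bottom and gives 0x00000003. Masking the second shift count avoids an undefined 32-bit shift when the rotate count is zero.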
120. which let you load and store data into memory The load and set operations are similar in that both initialize ___m128 data However the set operations take a float argument and are intended for initialization with constants whereas the load operations take a floating point argument and are intended to mimic the instructions for loading data from memory The store operation assigns the initialized data to the address 287 Intel C Compiler for Linux Systems User s Guide The intrinsics are listed in the following table Syntax and a brief description are contained the following topics The prototypes for Streaming SIMD Extensions SSE intrinsics are in the xmmintrin h header file Intrinsic Name _mm_load_ss _mm_load_ps1l _mm_load_ps _mm_loadu_ps _mm_loadr_ps mm_set_ss mm_set_psl mm_set_ps mm_setr_ps mm_setzero_ps mm_store_ss store_psl Alternate Name Operation Load the low value and clear the three high values Load one value into all four words _mm_loadl_ps Load four values address aligned Load four values address unaligned Load four values in reverse order Set the low value and clear the three high values Set all four words with the same value _mm_setl_ps Set four values address aligned Set four values in reverse order Clear all four values Store the low value mm_storel_ps Store the low value across mm_store_ps mm_storeu_ps mm_store
121. with the keyword static usually have HIDDEN visibility they cannot be referenced directly by other components or for that matter other compilation units within the same component but they might be referenced indirectly f Note Visibility applies to references as well as definitions A symbol reference s visibility attribute is an assertion that the corresponding definition will have that visibility 91 Intel C Compiler for Linux Systems User s Guide Symbol Preemption Sometimes you may need to use some of the functions or data items from a shareable object but may wish to replace others with your own definitions For example you may want to use the standard C runtime library shareable object 1ibc so but to use your own definitions of the heap management routines malloc and free In this case it is important that calls to malloc and free within libc so call your definition of the routines and not the definitions present in Libc so Your definition should override or preempt the definition within the shareable object This feature of shareable objects is called symbol preemption When the runtime loader loads a component all symbols within the component that have default visibility are subject to preemption by symbols of the same name in components that are already loaded Since the main program image is always loaded first none of the symbols it defines will be preempted The possibility of symbol preemption i
122. x int signgam float lgammaf_r float x int signgam 227 Intel C Compiler for Linux Systems User s Guide TGAMMA YO Y1 YN Description The t gamma function computes the gamma function of x errno EDOM for x 0 or negative integers Calling interface double tgamma double x long double tgammal long double x float tgammaf float x Description Computes the Bessel function of the second kind of x with order 0 errno EDOM for x lt 0 Calling interface double yO double x double yOl long double x float yOf float x Description Computes the Bessel function of the second kind of x with order 1 errno EDOM for x lt 0 Calling interface double yl double x double yll long double x float ylf float x Description Computes the Bessel function of the second kind of x with order n errno EDOM for x lt 0 Calling interface double yn int n double x double ynl int n long double x float ynf int n float x 228 Reference Nearest Integer Functions The Intel Math library supports the following nearest integer functions CEIL Description The ceil function returns the smallest integral value not less than x as a floating point number This function may be inlined with the Itanium compiler Calling interface double ceil double x long double ceill long double x float ceilf float x FLOOR Descrip
123. z float _Complex casinhf float _Complex z CATAN Description The cat an function returns the complex inverse tangent of z Calling interface double _Complex catan double _Complex z long double _Complex catanl long double _Complex z float _Complex catanf float _Complex z CATANH Description The cat anh function returns the complex inverse hyperbolic tangent of z Calling interface double _Complex catanh double _Complex z long double _Complex catanhl long double _Complex z float _Complex catanhf float _Complex z 237 Intel C Compiler for Linux Systems User s Guide CCOS Description The ccos function returns the complex cosine of z Calling interface double _Complex ccos double _Complex z long double _Complex ccosl long double _Complex z float _Complex ccosf float _Complex z CCOSH Description The ccosh function returns the complex hyperbolic cosine of z Calling interface double _Complex ccosh double _Complex z long double _Complex ccoshl long double _Complex z float _Complex ccoshf float _Complex z CEXP Description The cexp function computes e Calling interface double _Complex cexp double _Complex z long double _Complex cexpl long double _Complex z float _Complex cexpf float _Complex z CEXP2 Description The cexp function computes 2 Calling interface double _Complex cexp2 double _Complex z long double _Com
124. 12 A12 11 A11 10 A10 9 A9 8 A8 7 A7 6 A6 5 A5 4 A4 3 A3 2 A2 1 A1 0 A0 Corresponding Intrinsics none The eight 8 bit values of A are placed in the output buffer and printed in the following format default is decimal cout lt lt Is8vec8 A cout lt lt Iu8vec8 A cout lt lt hex lt lt Iu8vec8 A print in hex format instead of decimal 7 A7 6 A6 5 A5 4 A4 3 A3 2 A2 1 Al 0 A0 Corresponding Intrinsics none 405 Intel C Compiler for Linux Systems User s Guide Element Access Operators int R Is64vec2 A il unsigned int R Iu64vec2 A i int R Is32vec4 A il unsigned int R Iu32vec4 A i int R Is32vec2 A il unsigned int R Iu32vec2 A i short R Isl6vec8 A i unsigned short R Iul6vec8 A i short R Isl6vec4 A i unsigned short R Iul vec4 Afi signed char R Is8vecl6 A i unsigned char R Iu8vecl 6 A i signed char R Is8vec8 A i unsigned char R Iu8vec8 A i Access and read element i of A If DEBUG is enabled and the user tries to access an element outside of A a diagnostic message is printed and the program aborts Corresponding Intrinsics none Element Assignment Operators Is64vec2 Ali int R Is32vec4 A i int R Tu32vec4 A i unsigned int R Is32vec2 A i int R Tu32vec2 A i unsigned int R Isl6vec8 A i short R Iul6vec8 A i unsigned short R Isl6vec4 A i short R 406
125. Intrinsic Syntax. To use an intrinsic in your code, insert a line with the following syntax: data_type intrinsic_name(parameters). Where: data_type is the return data type, which can be either void, int, __m64, __m128, __m128d, __m128i, or __int64. Intrinsics that can be implemented across all IA may return other data types as well, as indicated in the intrinsic syntax definitions. intrinsic_name is the name of the intrinsic, which behaves like a function that you can use in your C code instead of inlining the actual instruction. parameters represents the parameters required by each intrinsic. Intrinsics For All IA. The intrinsics in this section function across all IA-32 and Itanium-based platforms. They are offered as a convenience to the programmer. They are grouped as follows: Integer Arithmetic Related; Floating-Point Related; String and Block Copy Related; Miscellaneous. Integer Arithmetic Related. Table columns: Intrinsic, Description. Intrinsics: int abs(int); long labs(long); unsigned long _lrotl(unsigned long value, int shift); unsigned long _lrotr(unsigned long value, int shift); unsigned int __rotl(unsigned int value, int shift); unsigned int __rotr(unsigned int value, int shift). Description: Returns the absolute
126. 28 a __m128 b Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from a. r0 := b0; r1 := a1; r2 := a2; r3 := a3. unsigned int _mm_getcsr(void): Returns the contents of the control register. void _mm_setcsr(unsigned int i): Sets the control register to the value specified. void _mm_prefetch(char const *a, int sel) (uses PREFETCH): Loads one cache line of data from address a to a location closer to the processor. The value sel specifies the type of prefetch operation: the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used for IA-32, corresponding to the type of prefetch instruction. The constants _MM_HINT_T1, _MM_HINT_NT1, _MM_HINT_NT2, and _MM_HINT_NTA should be used for Itanium-based systems. void _mm_stream_pi(__m64 *p, __m64 a) (uses MOVNTQ): Stores the data in a to the address p without polluting the caches. This intrinsic requires you to empty the multimedia state for the MMX register. See the topic The EMMS Instruction: Why You Need It and When to Use It. void _mm_stream_ps(float *p, __m128 a) (uses MOVNTPS): Stores the data in a to the address p without polluting the caches. The address must be 16-byte aligned. void _mm_sfence(void) (uses SFENCE): Guarantees that every preceding store is globally visible before any subsequent store. float _mm_cvtss_f32(__m128 a)
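The cache-control intrinsics described in this entry are usually combined. The following is a minimal sketch, not taken from the guide: the helper names stream_copy and low_lane are hypothetical, and the prefetch distance of 16 floats is an arbitrary assumption chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical helper: copy n floats with non-temporal stores.
   Assumes n is a multiple of 4 and dst is 16-byte aligned,
   as _mm_stream_ps requires. */
void stream_copy(float *dst, const float *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        if (i + 16 < n)   /* hint at data a few iterations ahead (assumed distance) */
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);
        __m128 v = _mm_loadu_ps(src + i);   /* unaligned load, src may be unaligned */
        _mm_stream_ps(dst + i, v);          /* store without polluting the caches */
    }
    _mm_sfence();   /* preceding stores become globally visible before later stores */
}

/* Hypothetical helper: return the lowest SP FP value of a 4-float vector. */
float low_lane(const float v[4])
{
    return _mm_cvtss_f32(_mm_loadu_ps(v));
}
```

Streaming stores tend to pay off only for buffers that will not be read again soon; for small or immediately reused data, ordinary stores are usually the better choice.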
127. 2vec4 B F32vec4 F64vec2 amp amp R F32vec4 A F64vec2 A amp F32vec2 B F64vec2 F64vec2 A F32vec1 B F32vec1 F32vec1 A amp F32vec1 F32vec1 A F32vec4 B F32vec4 F32vec4 A F32vec4 F32vec4 A F64vec2 B F64vec2 F32vec1 F64vec2 A F32vec2 F64vec2 A F32vec1 A F32vec1 B F32vec1 F32vec4 R R F32vecl A F32vec4 A F32vec4 B F32vec4 F64vec2 R R F32vec4 A F64vec2 A F364vec2 F64vec2 F32vec1 B R A R F64vec2 A F32vec1 A F32vec1 B F32vec1 F64vec2 F64vec2 R R B F32vec1 A andnot F64vec2 A Reference Intrinsic _mm_and_ps _mm_and_pd _mm_and_ps _mm_or_ps _mm_or_pd _mm_or_ps _mm_xor_ps _mm_xor_pd _mm_xor_ps _mm_andnot_pd 423 Intel C Compiler for Linux Systems User s Guide Compare Operators The operators described in this section compare the single precision floating point values of A and B Comparison between objects of any Fvec class return the same class being compared The following table lists the compare operators for the Fvec classes Compare Operators and Corresponding Intrinsics Compare For Operators Syntax Equality cmpeq R cmpeq A B Inequality cmpneq R cmpneq A B Greater Than cmpgt R cmpgt A B Greater Than or Equal To cmpge R cmpge A B Not Greater Than cmpngt R
128. 6 processors not provided by Intel Corporation. prompt> icpc -xW prog.cpp. If a program compiled with -x{K|W|N|B|P} is executed on a non-compatible processor, it might fail with an illegal instruction exception or display other unexpected behavior. Executing programs compiled with -xN, -xB, or -xP on unsupported processors (see table) will display the following run-time error: Fatal Error: This program was not built to run on the processor in your system. Automatic Processor-specific Optimizations (IA-32 only). The -ax{K|W|N|B|P} options direct the compiler to find opportunities to generate separate versions of functions that take advantage of features that are specific to the specified Intel processor. If the compiler finds such an opportunity, it first checks whether generating a processor-specific version of a function is likely to result in a performance gain. If this is the case, the compiler generates both a processor-specific version of a function and a generic version of the function. The generic version will run on any IA-32 processor. At run time, one of the versions is chosen to execute, depending on the Intel processor in use. In this way, the program can benefit from performance gains on more advanced Intel processors, while still working properly on older IA-32 processors. The disadvantages of using -ax{K|W|N|B|P} are: The size of the compiled bi
129. 81i count Shifts the 4 signed or unsigned 32-bit integers in a right by count bits while shifting in zeros. r0 := srl(a0, count); r1 := srl(a1, count); r2 := srl(a2, count); r3 := srl(a3, count). __m128i _mm_srli_epi64(__m128i a, int count): Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros. r0 := srl(a0, count); r1 := srl(a1, count). __m128i _mm_srl_epi64(__m128i a, __m128i count): Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros. r0 := srl(a0, count); r1 := srl(a1, count). Integer Comparison Operations for Streaming SIMD Extensions 2. The comparison intrinsics for Streaming SIMD Extensions 2 (SSE2) and descriptions for each are listed in the following table. The prototypes for SSE2 intrinsics are in the emmintrin.h header file. Table columns: Intrinsic Name, Instruction, Comparison, Elements, Size of Elements (bits). _mm_cmpeq_epi8, PCMPEQB, Equality, 16, 8; _mm_cmpeq_epi16, PCMPEQW, Equality, 8, 16; _mm_cmpeq_epi32, PCMPEQD, Equality, 4, 32; _mm_cmpgt_epi8, PCMPGTB, Greater Than, 16, 8; _mm_cmpgt_epi16, PCMPGTW, Greater Than, 8, 16; _mm_cmpgt_epi32, PCMPGTD, Greater Than, 4, 32; _mm_cmplt_epi8, PCMPGTBr, Less Than, 16, 8; _mm_cmplt_epi16, PCMPGTWr, Less Than, 8, 16; _mm_cmplt_epi32, PCMPGTDr, Less Than, 4, 32. __m128i _mm_cmpeq_epi8(__m128i a, __m128i b): Compares the 16 signed or unsigned 8
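As an illustration of the shift and comparison intrinsics listed in this entry, here is a small sketch. The function names shift_right_logical4 and count_equal_bytes16 are hypothetical, and the use of _mm_movemask_epi8 to reduce the comparison result to a bit mask is one possible choice among several, not the guide's prescribed method.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: logical right shift of four 32-bit values,
   shifting in zeros as described for the srl operation. */
void shift_right_logical4(const uint32_t in[4], int count, uint32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_srli_epi32(v, count);
    _mm_storeu_si128((__m128i *)out, r);
}

/* Hypothetical helper: count the byte positions at which a and b agree. */
int count_equal_bytes16(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);   /* each byte: 0xff if equal, else 0x0 */
    int mask = _mm_movemask_epi8(eq);      /* one bit per byte position */
    int n = 0;
    while (mask) {                         /* count the set bits */
        n += mask & 1;
        mask >>= 1;
    }
    return n;
}
```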
130. 9 ISO C90 plus GNU extensions ON Includes some C99 features std gnu 98 Same as std gnu89 OFF no traceback Generate do not generate extra OFF information in the object file that allows the display of source file traceback information at run time when a severe error occurs Options Quick Reference Guide Some compiler options are only available on certain systems In the following table these options are indicated with labels as follows Label Meaning 132 Option available on IA 32 based systems i32em Option available on Intel Extended Memory 64 Technology Intel EM64T systems 164 Option available on Itanium based systems e Ifno label is present the option is available on all supported systems e If only appears in the label that option is only available on the identified system Option Description Default A Disables all predefined macros no align Analyze and reorder memory 132 only layout for variables and arrays Aname value Associates a symbol name with OFF the specified sequence of value Equivalent to an assert preprocessing directive alias_args This option implies arguments may be aliased not aliased alias_args Intel C Compiler for Linux Systems User s Guide Option Description Default ansi Equivalent to GNU ANSI OFF ansi_alias auto_ilp32 ax K W N B P i32 i32em 10 ansi_alias directs the compiler to
131. A N A N A N A N A N A N A N A N A N A N A N A gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A mm mm Sra_epil Srai_epi32 Sra_epi32 cm cm cm cm cm li_sil28 li_epil 1_epil li_epi32 l_epi32 li_epi64 l_epi64 peq_epi8 peq_epil peq_epi32 pgt_epi8 pgt_epil pgt_epi32 cmplt_epi8 _ cmplt_epil _ cmplt_epi32 Cvtsi32_sil28 cvtsil28_si32 _ packs_epil packs_epi32 packus_epil6 extract_epil6 insert_epil6 movemask_epi8 shuff shuff le_epi32 lehi_epil shuff lelo_epil unpackhi_epi8 N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A gt gt gt
132. A3 unsigned short A0 short int Il6vec4 Il6vec4 A short A3 short A2 short Al Initialization short AO Isl6vec4 A signed short A3 signed short A0 Iul6vec4 A unsigned short A3 unsigned short AO short int I1l6vec8 Il6vec8 A short A7 short A6 short Initialization Al short AO Isl6vec8 A signed A7 Signed short A0 Iul6vec8 A unsigned short A7 unsigned short A0 char I8vec8 I8vec8 A char A7 char A6 char Al Initialization char A0 Is8vec8 A signed char A7 signed char A0 Tu8vec8 A unsigned char A7 unsigned char AO char I8vec16 I8vec16 A char A15 char A0 Initialization Is8vecl6 A signed char A15 signed char A0 Tu8vecl6 A unsigned char A15 unsigned char A0 392 Assignment Operator Reference Any Ivec object can be assigned to any other Ivec object conversion on assignment from one Ivec object to another is automatic Assignment Operator Examples Isl6vec4 A Is8vec8 B T 4vecl C A B assign Is8vec8 to Isl6vec4 B C assign I64vecl to Is8vec8 B A amp C assign M64 result of amp to Is8vec8 Logical Operators The logical operators use the symbols and intrinsics listed in the following table Bitwise Operator Symbols Syntax Usage Operation Standard w assign Standard Sr a ae oe o t p e ANDNOT andnot R A andnot B Intrinsic w assign Logical Op
133. Auto-parallelization: Threshold Control and Diagnostics. Threshold Control. The -par_threshold[n] option sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel. The value of n can be from 0 to 100. This option is used for loops whose computation work volume cannot be determined at compile time. The threshold is usually relevant when the loop trip count is unknown at compile time. The -par_threshold[n] option has the following functionality: -par_threshold100 is the default, so loops get auto-parallelized only if profitable parallel execution is almost certain. If you specify -par_threshold without designating a value for n, the compiler uses the default value n=100. The intermediate values 1 to 99 represent the percentage probability for profitable speed-up. For example, n=50 directs the compiler to parallelize only if there is a 50% probability of the code speeding up if executed in parallel. The compiler applies a heuristic that tries to balance the overhead of creating multiple threads against the amount of work available to be shared amongst the threads. Diagnostics. The -par_report{0|1|2|3} option controls the auto-parallelizer's diagnostic levels 0, 1, 2, or 3 as follows: -par_report0: no diagnostic information is displayed. -par_report1: indicates loops successfully auto-parallelized
134. Comments following preprocessing directives, however, are not preserved. Preprocessing Directive Equivalents. You can use the -A, -D, and -U options as equivalents to preprocessing directives: -A is equivalent to a #assert preprocessing directive; -D is equivalent to a #define preprocessing directive; -U is equivalent to a #undef preprocessing directive. Using -A. Use the -A option to make an assertion. Syntax: -Aname[(value)]. Arguments: name indicates an identifier for the assertion; value indicates a value for the assertion. If a value is specified, it should be quoted, along with the parentheses delimiting it. For example, to make an assertion for the identifier fruit with the associated values orange and banana, use the following command: prompt> icpc -A"fruit(orange,banana)" prog1.cpp. Using -D. Use the -D option to define a macro. Syntax: -Dname[=value]. Arguments: name is the name of the macro to define; value indicates a value to be substituted for name. If you do not enter a value, name is set to 1. The value should be quoted if it contains non-alphanumerics. For example, to define a macro called SIZE with the value 100, use the following command: prompt> icpc -DSIZE=100 prog1.cpp. The -D option can also be used to define functions. For example: prompt> icpc -D"f(x)=x" prog1.cpp. Using -U. Use the -U option to re
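On the source side, code that relies on macros supplied with -D often guards them with a default so the file still builds when the option is omitted. The sketch below assumes the SIZE and f(x) macros from the examples in this entry; the fallback definitions and the helper names configured_size and apply_f are hypothetical additions for illustration.

```c
/* SIZE is expected from the command line, for example -DSIZE=100.
   This fallback (an assumption of the sketch) keeps the file buildable
   when no -D option is given. */
#ifndef SIZE
#define SIZE 100
#endif

/* f mirrors the function-like macro from the -D"f(x)=x" example. */
#ifndef f
#define f(x) x
#endif

/* Hypothetical helper: report the value SIZE was compiled with. */
int configured_size(void)
{
    return SIZE;
}

/* Hypothetical helper: apply the function-like macro to a value. */
int apply_f(int v)
{
    return f(v);
}
```

Compiling the same file with a different -DSIZE value changes what configured_size returns without touching the source.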
135. Language Conformance. Conformance Options. Option and description: -ansi: Equivalent to GNU -ansi. -strict_ansi: Strict ANSI conformance dialect. Conformance to the C Standard. You can set the Intel C++ Compiler to accept either ANSI conformance (equivalent to GNU -ansi) with the -ansi option, or strict ANSI conformance dialect with the -strict_ansi option. The compiler is set by default to accept extensions and not be limited to the ANSI/ISO standard. Understanding the ANSI/ISO Standard C Dialect. The Intel C++ Compiler provides conformance to the ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990). This standard requires that conforming C compilers accept minimum translation limits. This compiler exceeds all of the ANSI/ISO requirements for minimum translation limits. Macros Included with the Compiler. The ANSI/ISO standard for C language requires that certain predefined macros be supplied with conforming compilers. The following table lists the macros that the Intel C++ Compiler supplies in accordance with this standard. The compiler includes predefined macros in addition to those required by the standard. Macro and value: __DATE__: The date of compilation as a string literal in the form Mmm dd yyyy. __FILE__: A string literal representing the name of the file being compiled. __LINE__: The current line number as a decimal constant. __STDC__: The name __STDC__ is defined wh
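The standard predefined macros listed in this entry can be inspected directly from code. The following is a minimal sketch; the helper function names are hypothetical and exist only to expose each macro's value.

```c
#include <stddef.h>

/* Hypothetical helper: the compilation date, in the form "Mmm dd yyyy". */
const char *build_date(void)
{
    return __DATE__;
}

/* Hypothetical helper: length of the date string, excluding the terminator. */
size_t build_date_length(void)
{
    return sizeof(__DATE__) - 1;
}

/* Hypothetical helper: name of the file being compiled. */
const char *source_file(void)
{
    return __FILE__;
}

/* Hypothetical helper: the line number at which this return statement appears. */
int current_line(void)
{
    return __LINE__;
}

/* Hypothetical helper: 1 when the compiler defines __STDC__, else 0. */
int is_standard_c(void)
{
#ifdef __STDC__
    return 1;
#else
    return 0;
#endif
}
```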
136. Default Tums on the three debug OFF options debug extended e debug inline_info e debug variable_locations qM Output macro definitions in OFF effect after preprocessing use with E Dname value Defines a macro name and OFF associates it with the specified value Equivalent to a define preprocessor directive dryrun Show driver tool commands but OFF do not execute tools dynamic linkerfilename Selects a dynamic linker OFF filename other than the default Stops the compilation process OFF after the C or C source files have been preprocessed and writes the results to stdout EP Preprocess to stdout omitting OFF line directives export Enable recognition of exported OFF templates Supported in C mode only export_dir dir Specifies a directory name for OFF the exported template search path falias Assume aliasing in program ON fabi version n Directs the compiler to selecta OFF specific ABI implementation 12 Compiler Options Quick Reference Option Description Default fast The fast option maximizes OFF speed across the entire program For Itanium based systems fast sets 03 i po and static For IA 32 and Intel EM64T systems fast sets 03 ipo static and xP Note that on IA 32 and Intel EM64T systems programs compiled with the xP option will detect non compatible processors and generate an error m
137. Example of Profile Guided Optimization. Compilation Phase. When using IPO, as each source file is compiled, the compiler stores an intermediate representation (IR) of the source code in the object file, which includes summary information used for optimization. By default, the compiler produces "mock" object files during the compilation phase of IPO. Generating mock files instead of real object files reduces the time spent in the IPO compilation phase. Each mock object file contains the IR for its corresponding source file, but no real code or data. These mock objects must be linked using the -ipo option or the xild tool. See Creating a Multifile IPO Executable with xild. Note: Failure to link mock objects with the -ipo option or xild will result in linkage errors. There are situations where mock object files cannot be used. See Compilation with Real Object Files for more information. Linkage Phase. When you invoke the linker, adding -ipo to the command line causes the compiler to be invoked a final time before the linker. The compiler performs IPO across all object files that have an IR. The compiler first analyzes all of the summary information, and then finishes compiling the pieces of the application for which it has IR. Having global information about the application while it is compiling individual pieces can improve the quality of optimization. Note: The compiler does not support multifile IPO for static libraries (.a files)
138. F and or create a file for precompiled headers in dirname prec_div 132 132em Disables the floating point OFF division to multiplication optimization Improves precision of floating point divides 24 Compiler Options Quick Reference Option Description Default Enables disables the insertion ON of software prefetching by the compiler Default prefetch prefetch 132 only prof_dir dirname Specify the directory OFF dirname to hold profile information dyn dpi prof_file filename Specify the filename for OFF profiling summary file prof_format_32 By default the Intel compiler OFF creates 64 bit profiling counters dyn and dpi This option creates 32 bit counters for compatibility with the Intel C Compiler 7 0 prof_gen x Instruments the program to OFF prepare for instrumented execution and also creates a new static profile information file spi With the x qualifier extra source position is collected which enables code coverage tools prof_use Uses dynamic feedback OFF information Qinstall dir Sets dir as root of compiler OFF installation Qlocation tool path Sets pathas the location of the OFF tool specified by tool Qoption tool list Passes an argument list to OFF another tool in the compilation sequence such as the assembler or linker Compile and link for function OFF profiling with UNIX prof tool
139. Guide mm_comile_ss mm_comigt_ss mm_comige_ss mm_comineq_ss mm_ucom mm_ucom mm_ucom mm_ucom mm_ucom ieq_ss ilt_ss ile_ss igt_ss ige_ss 372 ucomineg_ss cvt_ss2si cvt_ps2pi _ cvtt_ss2si _ cvtt_ps2pi cvt_si2ss Cvt_pi2ps cvtpil6_ps cvtpul6_ps cvtpi8_ps cvtpu8_ps Cvtpi32x2_ps _ cvtps_pil Cvtps_pi8 _move_ss _shuffle_ps _unpackhi_ps _unpacklo_ps _movehl_ps movelh_ps movemask_ps _ getcsr _mm_cvtss_si32 _mm_cvtps_pi32 _mm_cvttss_si32 _mm_cvttps_pi32 _mm_cvtsi32_ss _mm_cvtpi32_ps N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A wW ve w ve ve w ve ve ve w gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt SFlLalrl el rl el ele apyarlrasyayasyayasyayly Oy Se OW er wo ww wh ll ll w l mm_setcsr mm_loadh_pi mm_loadl_pi mm_load_ss mm_load_psl mm_load_ps mm_loadu_ps mm_loadr_ps mm_storeh_pi mm_storel_pi mm_store_ss mm_store_ps mm_store_psl mm_storeu_ps mm_storer_ps mm_set_ss mm_set_psl mm_set_ps mm_setr_ps mm_se
140. H and CPLUS_INCLUDE_PATH environment variables. How to Remove Include Directories. Use the -X option to prevent the compiler from searching the default path specified by the environment variables. You can use the -X option with the -I option to prevent the compiler from searching the default path for include files and direct it to use an alternate path. For example, to direct the compiler to search the path /alt/include instead of the default path, do the following: prompt> icpc -X -I/alt/include source.cpp. Controlling Compilation. If no errors occur during processing, you can use the output files from a particular phase as input to a subsequent compiler invocation. The following table describes the options to control the output. Table columns: Option, Input, Output. -P: input is source files; output is preprocessed files (.i files). -E: input is source files; preprocesses the source file and directs output to stdout. -EP: input is source files; preprocesses the source file, directs output to stdout, and omits line numbers. -c: input is source files and preprocessed files; compiles to object only (.o), does not link. -S: input is source files and preprocessed files; generates assemblable files with the .s suffix and stops the compilation process. Default (no option): input is source files, preprocessed files, assemblable files, object files, and libraries. -syntax: input is source files and preprocessed files; output: Emits diagnostic list of syntax error
141. Instruction _mm_cmpeq_ss Equal CMPEQSS _mm_cmpeq_ps Equal CMPEQPS _mm_cmplt_ss Less Than CMPLTSS mm_cmplt_ps Less Than CMPLTPS _mm_cmple_ss Less Than or Equal CMP LESS _mm_cmple_ps Less Than or Equal CMPLEPS _mm_cmpgt_ss Greater Than CMPLTSS mm_cmpgt_ps Greater Than CMPLTPS _mm_cmpge_ss Greater Than or Equal CMPLESS mm_cmpge_ps Greater Than or Equal CMPLEPS _mm_cmpneq_ss Not Equal CMPNEQSS _mm_cmpneq_ps Not Equal CMPNEQPS _mm_cmpnit_ss Not Less Than CMPNLTSS mm_cmpnlt_ps Not Less Than CMPNLTPS _mm_cmpnle_ss Not Less Than or Equal CMPNLESS mm_cmpnle_ps Not Less Than or Equal CMPNLEPS _Inm_cmpngt_ss Not Greater Than CMPNLTSS _mm_cmpngt_ps Not Greater Than CMPNLTPS _mm_cmpnge_ss Not Greater Than or Equal CMPNLESS mm_cmpnge_ps Not Greater Than or Equal CMPNLEPS _mm_cmpord_ss Ordered CMPORDSS _mm_cmpord_ps Ordered CMPORDPS _mm_cmpunord_ss Unordered CMPUNORDSS 273 Intel C Compiler for Linux Systems User s Guide Intrinsic Comparison Name _mm_cmpunord_ps Unordered _mm_comieq_ss Equal _mm_comilt_ps Less Than _mm_comile_ss Less Than or Equal _mm_comigt_ss Greater Than _mm_comige_ss Greater Than or Equal _mm_comineq_ss Not Equal _mm_ucomieq_ss Equal _mm_ucomilt_ss_ Less Than _mm_ucomile_ss_ Less Than or Equal _mm_ucomigt_ss_ Greater Than _mm_ucomige_ss Greater Than or Equal _mm_ucomineqg_ss Not Equal m128 _mm_cmpeq_ss __m128 a __m128 Compa
142. Intel C++ Compiler for Linux Systems User's Guide. Document Number: 253254-031. Table of Contents: Intel C++ Compiler; Disclaimer and Legal Information; Welcome to the Intel C++ Compiler; What's New in This Release; Features and Benefits; Product Web Site and Support; System Requirements; FLEXlm Electronic Licensing; Related Publications; How to Use This Document; Compiler Options Quick Reference; New Options; Options Quick Reference Guide
143. M Technology Set Intrinsics. The prototypes for MMX(TM) technology intrinsics are in the mmintrin.h header file. Table columns: Intrinsic Name, Operation, Number of Elements, Element Bit Size, Signed, Reverse Order. _mm_setzero_si64: set to zero. _mm_set_pi32: set integer values, 2 elements of 32 bits. _mm_set_pi16: set integer values, 4 elements of 16 bits. _mm_set_pi8: set integer values, 8 elements of 8 bits. Note: In the following descriptions regarding the bits of the MMX register, bit 0 is the least significant and bit 63 is the most significant. __m64 _mm_setzero_si64() (uses PXOR): Sets the 64-bit value to zero. r := 0x0. __m64 _mm_set_pi32(int i1, int i0) (composite): Sets the 2 signed 32-bit integer values. r0 := i0; r1 := i1. __m64 _mm_set_pi16(short s3, short s2, short s1, short s0) (composite): Sets the 4 signed 16-bit integer values. r0 := s0; r1 := s1; r2 := s2; r3 := s3. __m64 _mm_set_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0) (composite): Sets the 8 signed 8-bit integer values. r0 := b0; r1 := b1; ...; r7 := b7. __m64 _mm_set1_pi32(int i): Sets the 2 signed 32-bit integer values to i. r0 := i; r1 := i. __m64 _mm_set1_pi16(short s) (composite): Sets the 4 signed 16-bit integer values to s. r0 := s; r1 := s; r2 := s; r3 := s. __m64 _mm_set1_pi8(char b) (composite): Sets the 8 signed
144. N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A loadl_pd loadr_pd loadu_pd load_sd loadh_pd loadl_pd set_sd setl_pd set_pd setr_pd setzero_pd move_sd store_sd storel_pd store_pd storeu_pd storer_pd storeh_pd sStorel_pd add_epis8 add_epil add_epi32 _ add_si64 add_epi64 adds_epi8 adds_epil adds_epus _ adds_epul avg_epus avg_epul madd_epil N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A
145. 0 := (a0 > b0) ? 0xffffffffffffffff : 0x0; r1 := (a1 > b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpge_pd(__m128d a, __m128d b): Compares the two DP FP values of a and b for a greater than or equal to b. r0 := (a0 >= b0) ? 0xffffffffffffffff : 0x0; r1 := (a1 >= b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpord_pd(__m128d a, __m128d b): Compares the two DP FP values of a and b for ordered. r0 := (a0 ord b0) ? 0xffffffffffffffff : 0x0; r1 := (a1 ord b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpunord_pd(__m128d a, __m128d b): Compares the two DP FP values of a and b for unordered. r0 := (a0 unord b0) ? 0xffffffffffffffff : 0x0; r1 := (a1 unord b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpneq_pd(__m128d a, __m128d b): Compares the two DP FP values of a and b for inequality. r0 := (a0 != b0) ? 0xffffffffffffffff : 0x0; r1 := (a1 != b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpnlt_pd(__m128d a, __m128d b): Compares the two DP FP values of a and b for a not less than b. r0 := !(a0 < b0) ? 0xffffffffffffffff : 0x0; r1 := !(a1 < b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpnle_pd(__m128d a, __m128d b): Compares the two DP FP values of a and b for a not less than or equal to b. r0 := !(a0 <= b0) ? 0xffffffffffffffff : 0x0; r1 := !(a1 <= b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpngt_pd(__m128d a, __m128d b): Compares the two D
146. P FP values of a and b for a not greater than b. r0 := !(a0 > b0) ? 0xffffffffffffffff : 0x0; r1 := !(a1 > b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpnge_pd(__m128d a, __m128d b): Compares the two DP FP values of a and b for a not greater than or equal to b. r0 := !(a0 >= b0) ? 0xffffffffffffffff : 0x0; r1 := !(a1 >= b1) ? 0xffffffffffffffff : 0x0. __m128d _mm_cmpeq_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for equality. The upper DP FP value is passed through from a. r0 := (a0 == b0) ? 0xffffffffffffffff : 0x0; r1 := a1. __m128d _mm_cmplt_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a less than b. The upper DP FP value is passed through from a. r0 := (a0 < b0) ? 0xffffffffffffffff : 0x0; r1 := a1. __m128d _mm_cmple_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a less than or equal to b. The upper DP FP value is passed through from a. r0 := (a0 <= b0) ? 0xffffffffffffffff : 0x0; r1 := a1. __m128d _mm_cmpgt_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a greater than b. The upper DP FP value is passed through from a. r0 := (a0 > b0) ? 0xffffffffffffffff : 0x0; r1 := a1. __m128d _mm_cmpge_sd(__m128d a, __m128d b): Compares the lower DP FP value of a and b for a greater than or equal to b. The upper DP FP value is passed through
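The "not greater than" form is not the same as "less than or equal" when an operand is a NaN, because an unordered comparison makes the negated predicate true. The sketch below illustrates this; the helper names are hypothetical, and reducing the per-element masks to an integer with _mm_movemask_pd is an assumption made for easy inspection.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: bit i is set when a[i] >= b[i] (i = 0, 1). */
int ge_mask_pd(const double a[2], const double b[2])
{
    __m128d m = _mm_cmpge_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(m);   /* collects the sign bit of each 64-bit mask */
}

/* Hypothetical helper: bit i is set when a[i] <= b[i]. */
int le_mask_pd(const double a[2], const double b[2])
{
    __m128d m = _mm_cmple_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(m);
}

/* Hypothetical helper: bit i is set when !(a[i] > b[i]),
   which is also true when either operand is a NaN. */
int ngt_mask_pd(const double a[2], const double b[2])
{
    __m128d m = _mm_cmpngt_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(m);
}
```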
147. Iu16vec4 A[i] = unsigned short R; Is8vec16 A[i] = signed char R; Iu8vec16 A[i] = unsigned char R; Is8vec8 A[i] = signed char R; Iu8vec8 A[i] = unsigned char R; Assign R to element i of A. If DEBUG is enabled and the user tries to assign a value to an element outside of A, a diagnostic message is printed and the program aborts. Corresponding Intrinsics: none. Unpack Operators. Interleave the 64-bit value from the high half of A with the 64-bit value from the high half of B: I64vec2 unpack_high(I64vec2 A, I64vec2 B); Is64vec2 unpack_high(Is64vec2 A, Is64vec2 B); Iu64vec2 unpack_high(Iu64vec2 A, Iu64vec2 B). R0 := A1; R1 := B1. Corresponding intrinsic: _mm_unpackhi_epi64. Interleave the two 32-bit values from the high half of A with the two 32-bit values from the high half of B: I32vec4 unpack_high(I32vec4 A, I32vec4 B); Is32vec4 unpack_high(Is32vec4 A, Is32vec4 B); Iu32vec4 unpack_high(Iu32vec4 A, Iu32vec4 B). R0 := A2; R1 := B2; R2 := A3; R3 := B3. Corresponding intrinsic: _mm_unpackhi_epi32. Interleave the 32-bit value from the high half of A with the 32-bit value from the high half of B: I32vec2 unpack_high(I32vec2 A, I32vec2 B); Is32vec2 unpack_high(Is32vec2 A, Is32vec2 B); Iu32vec2 unpack_high(Iu32vec2 A, Iu32vec2 B). R0 := A1; R1 := B1. Corresponding intrinsic: _mm_unpackhi_pi32. Interleave the four 16-bit values from the high half of A with th
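The interleaving performed by the high-half unpack can be checked directly with the corresponding intrinsic. This sketch uses _mm_unpackhi_epi32 rather than the Ivec classes, since the class header is specific to the Intel compiler; the helper name unpack_high32 is hypothetical. With four 32-bit elements per operand, the high half holds elements 2 and 3, so the result is A2, B2, A3, B3.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: interleave the two high 32-bit elements of a and b.
   Result element order (index 0 first): a[2], b[2], a[3], b[3]. */
void unpack_high32(const int32_t a[4], const int32_t b[4], int32_t r[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)r, _mm_unpackhi_epi32(va, vb));
}
```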
148. SR TA64_REG_CR_ISR TA64_REG_CR_IIP TA64_REG_CR_IFA TA64_REG_CR_ITIR TA64_REG_CR_IIPA TA64_REG_CR_IFS TA64_REG_CR_IIM TA64_REG_CR_IHA TA64_REG_CR_LID TA64_REG_CR_IVR TA64_REG_CR_TPR TA64_REG_CR_EOI TA64_REG_CR_IRRO TA64_REG_CR_IRR1 TA64_REG_CR_IRR2 TA64_REG_CR_IRR3 TA64_REG_CR_ITV TA64_REG_CR_PMV TA64_REG_CR_CMCV TA64_REG_CR_LRRO TA64_REG_CR_LRR1 getReg only whichReg 4104 4112 4113 4115 4116 4117 4118 4119 4120 4121 4160 4161 4162 4163 4164 4165 4166 4167 4168 4169 4170 4176 4177 Reference 351 Intel C Compiler for Linux Systems User s Guide Indirect Registers for getIndReg and setindReg whichReg DR_RESERVED 9007 get IndReg only Multimedia Additions The prototypes for these intrinsics are in the ia64intrin h header file Intrinsic int64 _m64_czx1l1 __m64 a int64 _m64_czxlir __m64 a int64 _m64_czx21 __m64 a int64 _m64_czx2r __m64 a m64 _m64 mixll __m64 a _ m64 b m64 _m64 mixlr __m64 a _ m64 b m64 _m64_mix21 __ m64 a _ m64 b m64 _m64 mix2r __m64 a _ m64 b m64 _m64_mix41 _ m64 a _ m64 b m64 _m64_mix4r _ m64 a _ m64 b m64 _m64_mux1 _ m64 a const int n m64 _m64_mux2 _ m64 a const int n m64 _m64_paddluus __m64 a __m64 b m64 _m64_padd2uus _ m64 a __m64 b m64 _m64_pavgl_nraz __m64 a _
149. Shift Operator Overloading Operation R Right Shift Left Shift A Logical I64vec1 gt gt gt gt lt lt lt lt 164vecl A I64vec1 B Logical I32vec2 gt gt lt lt lt lt I32vec2 A 132vec2 B Arithmetic Is32vec2 gt gt lt lt lt lt Is32vec2 A I slu N vec N B Logical Tu32vec2 gt gt gt gt lt lt lt lt Iu32vec2 A Logical Il6vec4 e gt gt lt lt lt lt Il6vec4 A Arithmetic Is16vec4 gt gt gt gt lt lt lt lt Islovec4 A Logical Iul6vec4 gt gt lt lt lt lt Iulovec4 A Comparison Operators The equality and inequality comparison operands can have mixed signedness but they must be of the same size The comparison operators for less than and greater than must be of the same sign and size Example of Syntax Usage for Comparison Operator The nearest common ancestor is returned for compare for equal not equal operations Tu8vecs8 A Is8vec8 B I8vec8 C C cmpneq A B Type cast needed for different sized elements for equal not equal comparisons Iu8vec8 A C Isl6vec4 B C cmpeq A Iu8vec8 B Type cast needed for sign or size differences for less than and greater than comparisons Iul6vec4 A Isl6vec4 B C C cmpge Isl6vec4 A B o cmpgt B C 400 Reference Inequality Comparison Symbols and Corresponding Intrinsics m Compare For Ope
150. Than; _m_pcmpgtd (alternate name _mm_cmpgt_pi32): Greater Than. Table columns: Intrinsic Name, Alternate Name, Comparison, Number of Elements, Element Bit Size, Corresponding Instruction. __m64 _m_pcmpeqb(__m64 m1, __m64 m2): If the respective 8-bit values in m1 are equal to the respective 8-bit values in m2, set the respective 8-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpeqw(__m64 m1, __m64 m2): If the respective 16-bit values in m1 are equal to the respective 16-bit values in m2, set the respective 16-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpeqd(__m64 m1, __m64 m2): If the respective 32-bit values in m1 are equal to the respective 32-bit values in m2, set the respective 32-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpgtb(__m64 m1, __m64 m2): If the respective 8-bit values in m1 are greater than the respective 8-bit values in m2, set the respective 8-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpgtw(__m64 m1, __m64 m2): If the respective 16-bit values in m1 are greater than the respective 16-bit values in m2, set the respective 16-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpgtd(__m64 m1, __m64 m2): If the respective 32-bit values in m1 are greater than the respective 32-bit values in m2, set the respective 32-bit resulting values to all ones; otherwise set them to all zeros. MMX T
151. The Intel C++ Class Libraries for SIMD Operations provide a convenient interface to access the underlying instructions for processors, as specified in Processor Requirements for Use of Class Libraries. These processor-instruction extensions enable parallel processing using the single instruction, multiple data (SIMD) technique, as illustrated in the following figure.

[Figure: SIMD Data Flow]

Performing four operations with a single instruction improves efficiency by a factor of four for that particular instruction. These new processor instructions can be implemented using assembly inlining, intrinsics, or the C++ SIMD classes. Compare the coding required to add four 32-bit floating-point values using each of the available interfaces.

Comparison Between Inlining, Intrinsics and Class Libraries

Assembly Inlining:
    __m128 a, b, c;
    __asm {
        movaps xmm0, b
        movaps xmm1, c
        addps  xmm0, xmm1
        movaps a, xmm0
    }

Intrinsics:
    #include <xmmintrin.h>
    __m128 a, b, c;
    a = _mm_add_ps(b, c);

SIMD Class Libraries:
    #include <fvec.h>
    F32vec4 a, b, c;
    a = b + c;

This table shows an addition of two vectors of single-precision floating-point values using assembly inlining, intrinsics, and the libraries. You can see how much easier it is to code with the Intel C++ SIMD Class Libraries. Besides using fewer keystrokes and fewer lines of code, the notation is like the standard notation in C++, making it much easier to implement over other methods. C++ Cl
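As a complete, compilable counterpart to the intrinsics column above, the following C sketch adds four floats with a single _mm_add_ps. The wrapper name add4_ps is hypothetical; the unaligned load and store forms are used so that callers do not have to align their arrays.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps */

/* a[i] = b[i] + c[i] for i = 0..3, using one packed addition. */
void add4_ps(float *a, const float *b, const float *c) {
    __m128 vb = _mm_loadu_ps(b);            /* load four floats, no alignment required */
    __m128 vc = _mm_loadu_ps(c);
    _mm_storeu_ps(a, _mm_add_ps(vb, vc));   /* one instruction adds all four lanes */
}
```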
152. The arithmetic operations for the Streaming SIMD Extensions 2 (SSE2) are listed in the following table. The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

Intrinsic Name  Corresponding Instruction  Operation             R0 Value       R1 Value
_mm_add_sd      ADDSD                      Addition              a0 [op] b0     a1
_mm_add_pd      ADDPD                      Addition              a0 [op] b0     a1 [op] b1
_mm_sub_sd      SUBSD                      Subtraction           a0 [op] b0     a1
_mm_sub_pd      SUBPD                      Subtraction           a0 [op] b0     a1 [op] b1
_mm_mul_sd      MULSD                      Multiplication        a0 [op] b0     a1
_mm_mul_pd      MULPD                      Multiplication        a0 [op] b0     a1 [op] b1
_mm_div_sd      DIVSD                      Division              a0 [op] b0     a1
_mm_div_pd      DIVPD                      Division              a0 [op] b0     a1 [op] b1
_mm_sqrt_sd     SQRTSD                     Computes Square Root  [op] b0        a1
_mm_sqrt_pd     SQRTPD                     Computes Square Root  [op] a0        [op] a1
_mm_min_sd      MINSD                      Computes Minimum      [op](a0, b0)   a1
_mm_min_pd      MINPD                      Computes Minimum      [op](a0, b0)   [op](a1, b1)
_mm_max_sd      MAXSD                      Computes Maximum      [op](a0, b0)   a1
_mm_max_pd      MAXPD                      Computes Maximum      [op](a0, b0)   [op](a1, b1)

__m128d _mm_add_sd(__m128d a, __m128d b)
Adds the lower DP FP (double-precision, floating-point) values of a and b; the upper DP FP value is passed through from a.
r0 := a0 + b0
r1 := a1

__m128d _mm_add_pd(__m128d a, __m128d b)
Adds the two DP FP values of a and b.
r0
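The difference between the scalar (_sd) and packed (_pd) forms is easiest to see side by side. The sketch below is illustrative only; the helper name add_sd_vs_pd is hypothetical, and it relies on _mm_set_pd taking the upper element first.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_sd, _mm_add_pd */

/* Computes both forms for a = (a0, a1) and b = (b0, b1).
   sd receives the _mm_add_sd result, pd the _mm_add_pd result. */
void add_sd_vs_pd(double a0, double a1, double b0, double b1,
                  double sd[2], double pd[2]) {
    __m128d a = _mm_set_pd(a1, a0);        /* arguments are (upper, lower) */
    __m128d b = _mm_set_pd(b1, b0);
    _mm_storeu_pd(sd, _mm_add_sd(a, b));   /* sd[0] = a0 + b0, sd[1] = a1 (passed through) */
    _mm_storeu_pd(pd, _mm_add_pd(a, b));   /* pd[0] = a0 + b0, pd[1] = a1 + b1 */
}
```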
153. To force the compiler to produce real object files instead of mock ones with IPO, you must specify -ipo_obj in addition to -ipo. Use of -ipo_obj is necessary under the following conditions:

• The objects produced by the compilation phase of -ipo will be placed in a static library without the use of xiar. The compiler does not support multifile IPO for static libraries, so all static libraries are passed to the linker. Linking with a static library that contains mock object files will result in linkage errors because the objects do not contain real code or data. Specifying -ipo_obj causes the compiler to generate object files that can be used in static libraries.
• Alternatively, if you create the static library using xiar, then the resulting static library will work as a normal library.
• The objects produced by the compilation phase of -ipo might be linked without the -ipo option and without the use of xiar.
• You want to generate an assembly listing for each source file (using -S) while compiling with -ipo. If you use -ipo with -S, but without -ipo_obj, the compiler issues a warning and an empty assembly file is produced for each compiled source file.

Implementing the .il Files with Version Numbers

An IPO compilation consists of two parts: the compile phase and the link phase. In the compile phase, the compiler produces an intermediate language (IL) version of the user's code. In the link phase, the compiler reads the IL and c
154. UID test is not needed. All legacy processors will execute PAUSE as a NOP, but in processors which use the PAUSE as a hint there can be significant performance benefit.

Miscellaneous Operations for Streaming SIMD Extensions 2

The miscellaneous intrinsics for Streaming SIMD Extensions 2 (SSE2) are listed in the following table, followed by their descriptions. The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

Intrinsic             Corresponding Instruction  Operation
_mm_packs_epi16       PACKSSWB                   Packed Saturation
_mm_packs_epi32       PACKSSDW                   Packed Saturation
_mm_packus_epi16      PACKUSWB                   Packed Saturation
_mm_extract_epi16     PEXTRW                     Extraction
_mm_insert_epi16      PINSRW                     Insertion
_mm_movemask_epi8     PMOVMSKB                   Mask Creation
_mm_shuffle_epi32     PSHUFD                     Shuffle
_mm_shufflehi_epi16   PSHUFHW                    Shuffle
_mm_shufflelo_epi16   PSHUFLW                    Shuffle
_mm_unpackhi_epi8     PUNPCKHBW                  Interleave
_mm_unpackhi_epi16    PUNPCKHWD                  Interleave
_mm_unpackhi_epi32    PUNPCKHDQ                  Interleave
_mm_unpackhi_epi64    PUNPCKHQDQ                 Interleave
_mm_unpacklo_epi8     PUNPCKLBW                  Interleave
_mm_unpacklo_epi16    PUNPCKLWD                  Interleave
_mm_unpacklo_epi32    PUNPCKLDQ                  Interleave
_mm_unpacklo_epi64    PUNPCKLQDQ                 Interleave
_mm_movepi64_pi64     MOVDQ2Q                    move
_mm_movpi64_epi64     MOVQ2DQ                    move
_mm_move_epi64        MOVQ                       move

__m128i _mm_packs_epi16(__m128i a, __m128i b)
Packs the 16 signed 16-bit integers from a and b into 8 bit in
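To make "packed saturation" concrete, here is a small sketch that narrows sixteen signed 16-bit values to sixteen signed 8-bit values with _mm_packs_epi16. The wrapper name pack16_to_8 is hypothetical. Values outside the 8-bit signed range are clamped to -128 or 127 rather than wrapping.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* out[0..7] come from lo[0..7], out[8..15] from hi[0..7],
   each clamped to the range [-128, 127]. */
void pack16_to_8(const int16_t lo[8], const int16_t hi[8], int8_t out[16]) {
    __m128i a = _mm_loadu_si128((const __m128i *)lo);
    __m128i b = _mm_loadu_si128((const __m128i *)hi);
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi16(a, b));
}
```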
155. V Display compiler version OFF information V Show driver tool commands and execute tools 27 Intel C Compiler for Linux Systems User s Guide Option Description Default vec_report n i32 i32em W Wbrief Wcheck W 28 Controls the amount of vectorizer diagnostic information e n 0 no diagnostic information e n 1 indicates vectorized loops DEFAULT e n 2 indicates vectorized non vectorized loops e n 3 indicates vectorized non vectorized loops and prohibiting data dependence information e n 4 indicates non vectorized loops e n 5 indicates non vectorized loops and prohibiting data Disable all warnings Enable a mode in which a shorter form of the diagnostic output is used When enabled the original source line is not displayed and the error message text is not wrapped when too long to fit on a single line Performs compile time code checking for code that exhibits non portable behavior represents a possible unintended code sequence or possibly affects operation of the program because of a quiet change in the ANSI C Standard Control diagnostics e n 0 displays errors same as w e n 1 displays warnings and errors DEFAULT e n 2 displays remarks warnings and errors OFF OFF Wall Enable all warnings OFF OFF OFF w1 Compiler Options Quick Reference Option Description Default wdL1 L2 Disables diagnostics L1 OFF
156. When the master thread encounters a parallel construct, it creates a team of threads, with the master thread becoming the master of the team. The program statements enclosed by the parallel construct are executed in parallel by each thread in the team. These statements include routines called from within the enclosed statements.

The statements enclosed lexically within a construct define the static extent of the construct. The dynamic extent includes the static extent as well as the routines called from within the construct. When the #pragma omp parallel directive reaches completion, the threads in the team synchronize, the team is dissolved, and only the master thread continues execution. The other threads in the team enter a wait state. You can specify any number of parallel constructs in a single program. As a result, thread teams can be created and dissolved many times during program execution.

Using Orphaned Directives

In routines called from within parallel constructs, you can also use directives. Directives that are not in the lexical extent of the parallel construct, but are in the dynamic extent, are called orphaned directives. Orphaned directives allow you to execute major portions of your program in parallel with only minimal changes to the sequential version of the program. Using this functionality, you can code parallel constructs at the top levels of your program and use directives to control execution in any of the called routines. For ex
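The fork/join behavior described above can be observed with a very small sketch. The function name count_team_members is hypothetical. Each thread in the team executes the enclosed block once, so the counter ends up equal to the team size; when the code is compiled without OpenMP support the pragmas are ignored and the block simply runs once.

```c
/* Returns the number of threads that executed the parallel region. */
int count_team_members(void) {
    int members = 0;
#pragma omp parallel
    {
        /* Every thread in the team runs this block once. */
#pragma omp atomic
        members++;
    }
    /* The team has synchronized and been dissolved here;
       only the master thread continues past this point. */
    return members;
}
```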
157. __m128d a)
Converts the two DP FP values of a to SP FP values.
r0 := (float) a0
r1 := (float) a1
r2 := 0.0
r3 := 0.0

__m128d _mm_cvtps_pd(__m128 a)
Converts the lower two SP FP values of a to DP FP values.
r0 := (double) a0
r1 := (double) a1

__m128d _mm_cvtepi32_pd(__m128i a)
Converts the lower two signed 32-bit integer values of a to DP FP values.
r0 := (double) a0
r1 := (double) a1

__m128i _mm_cvtpd_epi32(__m128d a)
Converts the two DP FP values of a to 32-bit signed integer values.
r0 := (int) a0
r1 := (int) a1
r2 := 0x0
r3 := 0x0

int _mm_cvtsd_si32(__m128d a)
Converts the lower DP FP value of a to a 32-bit signed integer value.
r := (int) a0

__m128 _mm_cvtsd_ss(__m128 a, __m128d b)
Converts the lower DP FP value of b to an SP FP value. The upper SP FP values in a are passed through.
r0 := (float) b0
r1 := a1
r2 := a2
r3 := a3

__m128d _mm_cvtsi32_sd(__m128d a, int b)
Converts the signed integer value in b to a DP FP value. The upper DP FP value in a is passed through.
r0 := (double) b
r1 := a1

__m128d _mm_cvtss_sd(__m128d a, __m128 b)
Converts the lower SP FP value of b to a DP FP value. The upper DP FP value in a is passed through.
r0 := (double) b0
r1 := a1

__m128i _mm_cvttpd_epi32(__m128d a)
Converts the two DP FP values of a to 32-bit signed integers using truncate.
r0 := (int) a0
r1 := (int) a1
r2 := 0x0
r3 := 0
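The listing writes (int) for both _mm_cvtpd_epi32 and _mm_cvttpd_epi32, so the practical difference is easy to miss. The sketch below is a hedged illustration (the helper name cvt_round_vs_trunc is hypothetical): the non-truncating form follows the current SSE rounding mode, which is round-to-nearest unless the program has changed it, while the "tt" form always truncates toward zero.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Converts (a0, a1) to 32-bit integers with both intrinsics.
   Elements 2 and 3 of each result are zero. */
void cvt_round_vs_trunc(double a0, double a1, int rounded[4], int truncated[4]) {
    __m128d a = _mm_set_pd(a1, a0);   /* arguments are (upper, lower) */
    _mm_storeu_si128((__m128i *)rounded,   _mm_cvtpd_epi32(a));   /* current rounding mode */
    _mm_storeu_si128((__m128i *)truncated, _mm_cvttpd_epi32(a));  /* toward zero */
}
```

With the default rounding mode, 2.7 converts to 3 with the first form and to 2 with the truncating form.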
158. __m128i a)
uses MOVD
Moves the least significant 32 bits of a to a 32-bit integer.
r := a0

__m128 _mm_cvtepi32_ps(__m128i a)
Converts the 4 signed 32-bit integer values of a to SP FP values.
r0 := (float) a0
r1 := (float) a1
r2 := (float) a2
r3 := (float) a3

__m128i _mm_cvtps_epi32(__m128 a)
Converts the 4 SP FP values of a to signed 32-bit integer values.
r0 := (int) a0
r1 := (int) a1
r2 := (int) a2
r3 := (int) a3

__m128i _mm_cvttps_epi32(__m128 a)
Converts the 4 SP FP values of a to signed 32-bit integer values using truncate.
r0 := (int) a0
r1 := (int) a1
r2 := (int) a2
r3 := (int) a3

Integer Memory and Initialization Operations for Streaming SIMD Extensions 2

The integer load, set, and store intrinsics and their respective instructions provide memory and initialization operations for the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

• Load Operations
• Set Operations
• Store Operations

Integer Load Operations for Streaming SIMD Extensions 2

The following load operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

__m128i _mm_load_si128(__m128i const *p)
uses MOVDQA
Loads 128-bit value. Address p must be 16-byte aligned.
r := *p

__m128i _mm
159. _loadu_si128(__m128i const *p)
uses MOVDQU
Loads 128-bit value. Address p need not be 16-byte aligned.
r := *p

__m128i _mm_loadl_epi64(__m128i const *p)
uses MOVQ
Loads the lower 64 bits of the value pointed to by p into the lower 64 bits of the result, zeroing the upper 64 bits of the result.
r0 := *p[63:0]
r1 := 0x0

Integer Set Operations for SSE2

The following set operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

__m128i _mm_set_epi64(__m64 q1, __m64 q0)
Sets the 2 64-bit integer values.
r0 := q0
r1 := q1

__m128i _mm_set_epi32(int i3, int i2, int i1, int i0)
Sets the 4 signed 32-bit integer values.
r0 := i0
r1 := i1
r2 := i2
r3 := i3

__m128i _mm_set_epi16(short w7, short w6, short w5, short w4, short w3, short w2, short w1, short w0)
Sets the 8 signed 16-bit integer values.
r0 := w0
r1 := w1
...
r7 := w7

__m128i _mm_set_epi8(char b15, char b14, char b13, char b12, char b11, char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)
Sets the 16 signed 8-bit integer values.
r0 := b0
r1 := b1
...
r15 := b15

__m128i _mm_set1_epi64(__m64 q)
Sets the 2 64-bit integer values to q.
r0 := q
r1 := q

__m128i _mm_set1_epi32(int i)
Sets the 4 signed 32-bit integer values to i.
r0 := i
r1 := i
r2 := i
r3 := i

__m128i _mm_set1_epi16(short w)
Sets the 8 signed 16-bit integer v
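Two details in the load and set descriptions above are easy to get wrong: _mm_loadl_epi64 reads only 64 bits and zeroes the rest, and _mm_set_epi32 lists its arguments from the highest element down to the lowest. The following sketch checks both; the helper names load_low_half and set_order are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Loads only src[0] and src[1]; out[2] and out[3] come back as zero. */
void load_low_half(const int32_t src[4], int32_t out[4]) {
    __m128i v = _mm_loadl_epi64((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, v);
}

/* _mm_set_epi32(i3, i2, i1, i0): the last argument lands in element 0. */
void set_order(int32_t out[4]) {
    _mm_storeu_si128((__m128i *)out, _mm_set_epi32(3, 2, 1, 0));
}
```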
160. __m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not greater than or equal to b. The upper DP FP value is passed through from a.
r0 := !(a0 >= b0) ? 0xffffffffffffffff : 0x0
r1 := a1

int _mm_comieq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.
r := (a0 == b0) ? 0x1 : 0x0

int _mm_comilt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.
r := (a0 < b0) ? 0x1 : 0x0

int _mm_comile_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 <= b0) ? 0x1 : 0x0

int _mm_comigt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.
r := (a0 > b0) ? 0x1 : 0x0

int _mm_comige_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 >= b0) ? 0x1 : 0x0

int _mm_comineq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.
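Unlike the _mm_cmp*_sd forms, which produce an all-ones or all-zeros mask in a vector, the _mm_comi*_sd forms return a plain int that can be used directly in a C condition. A minimal sketch follows; the wrapper names low_lt, low_eq and low_ge are hypothetical, and only ordinary (non-NaN) inputs are considered.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Each wrapper compares the lower double of two vectors and returns 0 or 1. */
int low_lt(double a0, double b0) {
    return _mm_comilt_sd(_mm_set_sd(a0), _mm_set_sd(b0));
}

int low_eq(double a0, double b0) {
    return _mm_comieq_sd(_mm_set_sd(a0), _mm_set_sd(b0));
}

int low_ge(double a0, double b0) {
    return _mm_comige_sd(_mm_set_sd(a0), _mm_set_sd(b0));
}
```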
161. a compiler distributed or system library, and have long long, double, or long double types in your interface, you will get the wrong answer due to the difference in alignment. Any code built with -align cannot make calls to libraries that use these types in their interfaces unless they are built with -align (in which case they will not work without -align).

Math Libraries

The Intel math library, libimf.a, contains optimized versions of math functions found in the standard C run-time library. The functions in libimf.a are optimized for program execution speed on Intel Pentium III and Pentium 4 processors. The Itanium compiler also includes a libimf.a designed to optimize performance on Itanium-based systems. The Intel math library is linked by default. See Managing Libraries and Intel Math Library.

Intel Shared Libraries

By default, the Intel C++ Compiler links Intel-provided C libraries dynamically. The GNU and Linux system libraries are also linked dynamically.

Options for Shared Libraries

Option       Description
-i_dynamic   Use the -i_dynamic option to link Intel-provided C libraries dynamically (default). This has the advantage of reducing the size of the application binary, but it also requires the libraries to be on the systems where the application runs.
-shared      The -shared option instructs the compiler to build a Dynamic Shared Object (DSO) instead of an executable. For more details, refer to the ld man page.
162. a0 - b0)
r1 := UnsignedSaturate(a1 - b1)
...
r15 := UnsignedSaturate(a15 - b15)

__m128i _mm_subs_epu16(__m128i a, __m128i b)
Subtracts the 8 unsigned 16-bit integers of b from the 8 unsigned 16-bit integers of a using saturating arithmetic.
r0 := UnsignedSaturate(a0 - b0)
r1 := UnsignedSaturate(a1 - b1)
...
r7 := UnsignedSaturate(a7 - b7)

Integer Logical Operations for Streaming SIMD Extensions 2

The following four logical-operation intrinsics and their respective instructions are functional as part of Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

__m128i _mm_and_si128(__m128i a, __m128i b)
uses PAND
Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
r := a & b

__m128i _mm_andnot_si128(__m128i a, __m128i b)
uses PANDN
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.
r := (~a) & b

__m128i _mm_or_si128(__m128i a, __m128i b)
uses POR
Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.
r := a | b

__m128i _mm_xor_si128(__m128i a, __m128i b)
uses PXOR
Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.
r := a ^ b

Integer Shift Operations for Streaming SIMD Extensions 2

The shift-operation intrinsics for Streaming SIMD Extensions 2 (SSE2) and the description for each are listed in the following table. The pr
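Two points above benefit from a worked example: unsigned saturating subtraction clamps at zero instead of wrapping around, and _mm_andnot_si128 inverts its first operand, not its second. The sketch below uses hypothetical wrapper names (sub_saturate_u16, clear_bits).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* r[i] = (a[i] > b[i]) ? a[i] - b[i] : 0, for eight unsigned 16-bit lanes. */
void sub_saturate_u16(const uint16_t a[8], const uint16_t b[8], uint16_t r[8]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)r, _mm_subs_epu16(va, vb));
}

/* Clears in value every bit that is set in mask: r = (~mask) & value. */
void clear_bits(const uint32_t mask[4], const uint32_t value[4], uint32_t r[4]) {
    __m128i vm = _mm_loadu_si128((const __m128i *)mask);
    __m128i vv = _mm_loadu_si128((const __m128i *)value);
    _mm_storeu_si128((__m128i *)r, _mm_andnot_si128(vm, vv));
}
```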
163. (continuation of the SSE arithmetic table; the _ss forms operate on element 0 and pass a1, a2, a3 through, the _ps forms operate on all four elements)

Intrinsic      Instruction  Operation               R0             R1 to R3
_mm_sqrt_ss    SQRTSS       Square Root             [op] a0        a1, a2, a3
_mm_sqrt_ps    SQRTPS       Square Root             [op] a0        [op] a1, [op] a2, [op] a3
_mm_rcp_ss     RCPSS        Reciprocal              [op] a0        a1, a2, a3
_mm_rcp_ps     RCPPS        Reciprocal              [op] a0        [op] a1, [op] a2, [op] a3
_mm_rsqrt_ss   RSQRTSS      Reciprocal Square Root  [op] a0        a1, a2, a3
_mm_rsqrt_ps   RSQRTPS      Reciprocal Square Root  [op] a0        [op] a1, [op] a2, [op] a3
_mm_min_ss     MINSS        Computes Minimum        [op](a0, b0)   a1, a2, a3
_mm_min_ps     MINPS        Computes Minimum        [op](a0, b0)   [op](a1, b1), [op](a2, b2), [op](a3, b3)
_mm_max_ss     MAXSS        Computes Maximum        [op](a0, b0)   a1, a2, a3
_mm_max_ps     MAXPS        Computes Maximum        [op](a0, b0)   [op](a1, b1), [op](a2, b2), [op](a3, b3)

__m128 _mm_add_ss(__m128 a, __m128 b)
Adds the lower SP FP (single-precision, floating-point) values of a and b; the upper 3 SP FP values are passed through from a.
r0 := a0 + b0
r1 := a1; r2 := a2; r3 := a3

__m128 _mm_add_ps(__m128 a, __m128 b)
Adds the four SP FP values of a and b.
r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3

__m128 _mm_sub_ss(__m128 a, __m128 b)
Subtracts the lower SP FP values of a and b. The upper 3 SP FP values are passed through from a.
r0 := a0 - b0
r1 := a1; r2 := a2; r3 := a3

__m128 _mm_sub_ps(__m128 a, __m128 b)
Subtracts the four SP FP values of a and b.
r0 := a0 - b0
r1 := a1 - b1
r2 := a2 - b2
r3 := a3 - b3

__m128 _mm_mul_ss(__m128 a, __m128 b)
Multiplies the lower SP FP valu
164. ables dynamic adjustment of the number of threads used to execute a parallel region. If dynamic_threads is TRUE, dynamic threads are enabled. If dynamic_threads is FALSE, dynamic threads are disabled. Dynamic threads are disabled by default.

omp_get_dynamic()              Returns TRUE if dynamic thread adjustment is enabled, otherwise returns FALSE.
omp_set_nested(nested)         Enables or disables nested parallelism. If nested is TRUE, nested parallelism is enabled. If nested is FALSE, nested parallelism is disabled. Nested parallelism is disabled by default.
omp_get_nested()               Returns TRUE if nested parallelism is enabled, otherwise returns FALSE.

Function                       Description
omp_init_lock(lock)            Initializes the lock associated with lock for use in subsequent calls.
omp_destroy_lock(lock)         Causes the lock associated with lock to become undefined.
omp_set_lock(lock)             Forces the executing thread to wait until the lock associated with lock is available. The thread is granted ownership of the lock when it becomes available.
omp_unset_lock(lock)           Releases the executing thread from ownership of the lock associated with lock. The behavior is undefined if the executing thread does not own the lock associated with lock.
omp_test_lock(lock)            Attempts to set the lock associated with lock. If successful, returns TRUE, otherwise returns FALSE.
omp_init_nest_lock(lock)       Initializes the nested lock associated with lock for use in the subsequent calls.
omp_destroy_nest_lock(lock)    Causes the nested lock associated with lock to become undefined.
omp_set_nest_lock(lock)        Forces the execut
165. agnostic. The following is an example of a warning message:

    tantst.cpp(3): warning #328: Local variable "increment" never used

The compiler can also display internal error messages on the standard error. If your compilation produces any internal errors, contact your Intel representative. Internal error messages are in the following form:

    FATAL COMPILER ERROR: message

Suppressing Warning Messages with lint Comments

The UNIX lint program attempts to detect features of a C or C++ program that are likely to be bugs, non-portable, or wasteful. The compiler recognizes three lint-specific comments:

1. /*ARGSUSED*/
2. /*NOTREACHED*/
3. /*VARARGS*/

Like the lint program, the compiler suppresses warnings about certain conditions when you place these comments at specific points in the source.

Suppressing Warning Messages or Enabling Remarks

Use the -w or -wn option to suppress warning messages or to enable remarks during the preprocessing and compilation phases. You can enter the option with one of the following arguments:

Option  Description
-w0     Display only errors (same as -w)
-w1     Display warnings and errors (DEFAULT)
-w2     Display remarks, warnings, and errors

For some compilations, you might not want warnings for known and benign characteristics, such as the K&R C constructs in your code. For example, the following command compiles newpr
166. al Optimizations Tool used for Interprocedural Optimizations 205 Intel C Compiler for Linux Systems User s Guide include Files File emmintrin h float h fvec h ia 4intrin h ia 64regs h iso646 h ivec h limits h mathimf h mmintrin h omp h pgouser h proto h sse2mmx h stdarg h stdbool h stddef h syslimits h varargs h xarg h xmmintrin h 206 Description Principal header file for SSE2 intrinsics IEEE 754 version of standard float h SSE intrinsics for Class Libraries Standard header file Standard header file MMX TM instructions intrinsics for Class Libraries Standard header file Principal header file for current Intel Math Library Intrinsics for MMX instructions Principal header file OpenMP For use in the instrumentation compilation phase of profile guided optimizations Principal header file for Streaming SIMD Extensions 2 intrinsics Replacement header for standard stdarg h Defines _Bool keyword Standard header file Replacement header for standard varargs h Header file used by stdargs hand varargs h Principal header file for Streaming SIMD Extensions intrinsics Reference lib Files File Description libcprts a C standard language library libcxa so C language library indicating I O data location libirc a Intel specific library optimizations libm a Math library libguide a OpenMP library libguide so Shared OpenMP library ibmofl a Multiple Obje
167. alues to w.
r0 := w
r1 := w
...
r7 := w

__m128i _mm_set1_epi8(char b)
Sets the 16 signed 8-bit integer values to b.
r0 := b
r1 := b
...
r15 := b

__m128i _mm_setr_epi64(__m64 q0, __m64 q1)
Sets the 2 64-bit integer values in reverse order.
r0 := q0
r1 := q1

__m128i _mm_setr_epi32(int i0, int i1, int i2, int i3)
Sets the 4 signed 32-bit integer values in reverse order.
r0 := i0
r1 := i1
r2 := i2
r3 := i3

__m128i _mm_setr_epi16(short w0, short w1, short w2, short w3, short w4, short w5, short w6, short w7)
Sets the 8 signed 16-bit integer values in reverse order.
r0 := w0
r1 := w1
...
r7 := w7

__m128i _mm_setr_epi8(char b15, char b14, char b13, char b12, char b11, char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)
Sets the 16 signed 8-bit integer values in reverse order.
r0 := b0
r1 := b1
...
r15 := b15

__m128i _mm_setzero_si128()
Sets the 128-bit value to zero.
r := 0x0

Integer Store Operations for Streaming SIMD Extensions 2

The following store operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

void _mm_store_si128(__m128i *p, __m128i b)
uses MOVDQA
Stores 128-bit value. Address p must be 16-byte aligned.
*p := b

void _mm_storeu_si128(__m128i *p, __m128i b)
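The relationship between the "set" and "setr" families, and the alignment requirement of the aligned store, can be checked with a short sketch. The helper names setr_matches_set and store_aligned are hypothetical, and the aligned buffer uses the C11 _Alignas specifier, which is an assumption about the compiler mode in use.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Returns 1 when _mm_setr_epi32(0,1,2,3) and _mm_set_epi32(3,2,1,0)
   produce the same vector, that is, setr takes element 0 first. */
int setr_matches_set(void) {
    __m128i fwd = _mm_setr_epi32(0, 1, 2, 3);
    __m128i rev = _mm_set_epi32(3, 2, 1, 0);
    /* cmpeq yields all ones per equal element; movemask gathers one bit per byte. */
    return _mm_movemask_epi8(_mm_cmpeq_epi32(fwd, rev)) == 0xFFFF;
}

/* Uses the aligned store into a 16-byte-aligned buffer, then copies out. */
void store_aligned(int32_t out[4]) {
    _Alignas(16) int32_t buf[4];
    _mm_store_si128((__m128i *)buf, _mm_setr_epi32(0, 1, 2, 3));
    for (int i = 0; i < 4; i++) {
        out[i] = buf[i];
    }
}
```

When the destination alignment is not known, _mm_storeu_si128 is the safer choice.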
168. ample:

    /* Declarations assumed for illustration */
    void some_work(int i);
    void phase1(void);
    int n;

    int main(void)
    {
    #pragma omp parallel
        {
            phase1();
        }
    }

    void phase1(void)
    {
        int i;
    #pragma omp for private(i)
        for (i = 0; i < n; i++) {
            some_work(i);
        }
    }

This is an orphaned directive because the parallel region is not lexically present.

Data Environment Directive

A data environment directive controls the data environment during the execution of parallel constructs. You can control the data environment within parallel and worksharing constructs. Using directives and data environment clauses on directives, you can:

• Privatize scope variables by using the THREADPRIVATE directive
• Control data scope attributes by using the THREADPRIVATE directive's clauses

The data scope attribute clauses are: COPYIN, DEFAULT, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SHARED.

You can use several directive clauses to control the data scope attributes of variables for the duration of the construct in which you specify them. If you do not specify a data scope attribute clause on a directive, the default is SHARED for those variables affected by the directive.

Pseudo Code of the Parallel Processing Model

A sample pseudo program using some of the more common OpenMP directives is shown in the code example that follows. This example also indicates the difference between serial regions and parallel regions.

main Begin serial execution Only
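As a hedged illustration of the data scope attribute clauses listed above, the following sketch combines private, firstprivate and reduction on one worksharing loop. The function name sum_of_squares and the variable scale are hypothetical. The result is the same whether or not OpenMP is enabled, because the reduction clause gives each thread its own partial sum and combines them when the loop ends.

```c
/* Sum of i*i for i = 1..n, with explicit data scope clauses. */
long sum_of_squares(int n) {
    long sum = 0;   /* reduction(+): per-thread copies, added together at the end */
    int scale = 1;  /* firstprivate: per-thread copy initialized from this value */
    int i;          /* private: each thread has its own loop index */
#pragma omp parallel for private(i) firstprivate(scale) reduction(+:sum)
    for (i = 1; i <= n; i++) {
        sum += (long)scale * i * i;
    }
    return sum;
}
```

Without the reduction clause, sum would be shared by default and the concurrent updates would race.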
169. aps to the ptc ga r rinstruction Purges the translation register Maps to the ptr i r rinstruction Purges the translation register Maps to the ptr d r x instruction Map the tpa instruction Invalidates ALAT Maps to the invala instruction Same as void __invalat void whichGeneralReg 0 127 347 Intel C Compiler for Linux Systems User s Guide A Intrinsic void __invala_fr const int whichFloatReg m void _ break const int mmm void _ nop const int void _ debugbreak void _ t void __fc __int64 fe void __sum int mask m OAS void _ rum int mask mmm int64 _ReturnAddress void void __lfetch int lfhint void y fo void __lfetch_fault int lfhint void y void __lfetch_excl int lfhint void y void __lfetch_fault_excl int lfhint void y SEEN unsigned int _ CacheSize unsigned int cacheLevel aa void memory_barrier void N void __ssm int mask A void __rsm int mask 348 Description whichFloatReg 0 127 Generates a break instruction with an immediate Generate a nop instruction Generates a Debug Break Instruction fault Flushes a cache line associated with the address given by the argument Maps to the fc instruction Sets the user mask bits of PSR Maps to the sum imm24 instruction Resets the user mask Get the caller s address Generate the 1fetch lfhint instruction The value of the first argu
170. arithm of the absolute value of gamma. The sign of the gamma function is returned in the integer signgam.
Calling interface:
double gamma_r(double x, int *signgam);
double gammal_r(long double x, int *signgam);
float gammaf_r(float x, int *signgam);

J0
Description: Computes the Bessel function (of the first kind) of x with order 0.
Calling interface:
double j0(double x);
double j0l(long double x);
float j0f(float x);

J1
Description: Computes the Bessel function (of the first kind) of x with order 1.
Calling interface:
double j1(double x);
double j1l(long double x);
float j1f(float x);

JN
Description: Computes the Bessel function (of the first kind) of x with order n.
Calling interface:
double jn(int n, double x);
double jnl(int n, long double x);
float jnf(int n, float x);

LGAMMA
Description: The lgamma function returns the value of the logarithm of the absolute value of gamma.
errno: ERANGE, for overflow conditions, x = 0 or negative integers.
Calling interface:
double lgamma(double x);
long double lgammal(long double x);
float lgammaf(float x);

LGAMMA_R
Description: The lgamma_r function returns the value of the logarithm of the absolute value of gamma. The sign of the gamma function is returned in the integer signgam.
errno: ERANGE, for overflow conditions, x = 0 or negative integers.
Calling interface:
double lgamma_r(double x, int *signgam);
long double lgamma_r double
171. asses Part 1 Intrinsic unpack_high _mm_unpackhi_ x epi64 epi32 epil6 epis8 pi32 unpack_low _mm_unpacklo_ x epi64 epi32 epil epi8 pisz pack_sat _mm_packs_ x N A epi32 epil6 N A pi32 packu_sat _mm_packus_ x N A N A epil N A N A sat_add _mm_adds_ x N A N A epil epi8 N A sat_sub _mm_subs_ x N A N A epil epi8 N A Packing and Unpacking Operators Corresponding Intrinsics and Classes Part 2 Operators Corresponding 116vec4 I8vec8 F64vec2 F32vec4 F32vec1 Intrinsic pil6 pig pa Operators Corresponding l64vec2 I32vec4 l16vec8 I8veci6 I32vec2 unpack_high _mm_unpackhi_ x unpack_low _mm_unpacklo_ x pil6 pig pa pack_sat _mm_packs_ x pil6 N A N A packu_sat _mm_packus_ x pul6 N A N A sat_add _mm_adds_ x pits pig pa sat_sub _mm_subs_ x pil6 pig pil6 Conversions Operators Corresponding Intrinsics and Classes Operators Corresponding Intrinsic F64vec2ToInt _mm_cvttsd_si32 F32vec4ToF64vec2 _mm_cvtps_pd F64vec2ToF32vec4 _mm_cvtpd_ps Int ToF64vec2 _mm_cvtsi32_sd F32vec4ToInt _mm_cvtt_ss2si F32vec4Tols32vec2 _mm_cvttps_pi32 Int ToF32vec4 _mm_cvtsi32_ss Ts32vec2ToF32vec4 _mm_cvtpi32_ps 438 Reference Programming Example This sample program uses the F32vec4 class to average the elements of a 20 element floating point array Include Streaming SIMD Extension Class Definitions include lt fvec h gt Shuffle any 2 single precisi
172. asses and SIMD Operations

The use of C++ classes for SIMD operations is based on the concept of operating on arrays, or vectors, of data in parallel. Consider the addition of two vectors, A and B, where each vector contains four elements. Using the integer vector (Ivec) class, the elements A[i] and B[i] from each array are summed as shown in the following example.

Typical Method of Adding Elements Using a Loop

    short a[4], b[4], c[4];
    for (i = 0; i < 4; i++)   /* needs four iterations */
        c[i] = a[i] + b[i];   /* returns c[0], c[1], c[2], c[3] */

The following example shows the same results using one operation with Ivec Classes.

SIMD Method of Adding Elements Using Ivec Classes

    Is16vec4 ivecA, ivecB, ivecC;   /* needs one iteration */
    ivecC = ivecA + ivecB;          /* returns ivecC0, ivecC1, ivecC2, ivecC3 */

Available Classes

The Intel C++ SIMD classes provide parallelism, which is not easily implemented using typical mechanisms of C++. The following table shows how the Intel C++ SIMD classes use the classes and libraries.

SIMD Vector Classes

Instruction Set: MMX(TM) technology (available for IA-32 and Itanium-based systems)

Class     Signedness   Data Type  Size  Elements  Header File
I64vec1   unspecified  __m64      64    1         ivec.h
I32vec2   unspecified  int        32    2         ivec.h
Is32vec2  signed       int        32    2         ivec.h
Iu32vec2  unsigned     int        32    2         ivec.h
I16vec4   unspecifie
173. asured in intermediate language statements due to inlining. The number n is a positive integer. The default value for n is 2000.

The following command activates procedural and interprocedural optimizations on source.cpp and sets the maximum increase in the number of intermediate language statements to five for each function:

    prompt> icpc -ip -Qoption,c,-ip_ninl_max_stats=5 source.cpp

Controlling Inline Expansion of User Functions

The compiler enables you to control the amount of inline function expansion with the options shown in the following summary.

Option            Description
-ip_no_inlining   This option is only useful if -ip is also specified. In this case, -ip_no_inlining disables inlining that would result from the -ip interprocedural optimizations, but has no effect on other interprocedural optimizations.
-ip_no_pinlining  Disables partial inlining; can be used if -ip or -ipo[value] is also specified.

Criteria for Inline Function Expansion

Once the criteria are met, the compiler picks the routines whose inline expansion will provide the greatest benefit to program performance. The inlining heuristics used by the compiler differ based on whether or not you use profile-guided optimizations (-prof_use). When you use profile-guided optimizations with -ip or -ipo[value], the compiler uses the following heuristics:

• The default heuristic focuses o
174. atReg void __stfd void dst const int whichFloatReg fo void __stfe void dst i const int whichFloatReg 346 Description Copy a value in an indexed register The index is the 2nd argument the register file is the first argument Gets TEB address The TEB address is kept in r13 and maps to the move r tp instruction Executes the serialize instruction Maps to the srlz i instruction Serializes the data Maps to the srlz d instruction Map the fet chadd4 acq instruction Map the fet chadd4 rel instruction Map the fet chadd8 acq instruction Map the fet chadd8 rel instruction Flushes the write buffers Maps to the fwb instruction Map the 1dfs instruction Load a single precision value to the specified register Map the 1dfd instruction Load a double precision value to the specified register Map the 1dfe instruction Load an extended precision value to the specified register Map the 1d 8 instruction Map the 1df i11 instruction Map the sfts instruction Map the st fd instruction Map the st fe instruction ee Intrinsic void __stf_spill void _ mf void void _ mfa void void __synci void void __ttag __int64 void __itcd __int64 void __itci __int64 void __itrd __int64 void __ptce __int64 void __ptcl __int64 __int64 pagesz void __ptcg __int64 __int64 pagesz void int64 pagesz void __ptri __int64
175. atement 20Exprs

gcc Language Extension | Intel Support | GNU Description and Examples
Locally Declared Labels | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Local-Labels.html#Local%20Labels
Labels as Values | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Labels-as-Values.html#Labels%20as%20Values
Nested Functions | No | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Nested-Functions.html#Nested%20Functions
Constructing Function Calls | No | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Constructing-Calls.html#Constructing%20Calls
Naming an Expression's Type | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Naming-Types.html#Naming%20Types
Referring to a Type with typeof | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Typeof.html#Typeof
Generalized Lvalues | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Lvalues.html#Lvalues
Conditionals with Omitted Operands | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Conditionals.html#Conditionals
Double-Word Integers | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Long-Long.html#Long%20Long
Complex Numbers | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Complex.html#Complex
Hex Floats | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Hex-Floats.html#Hex%20Floats
Arrays of Length Zero | Yes | http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Zero-Length.html#Zero%20Length
Arrays of Variable Length | Yes | http gcc gnu org
176. ators Corresponding 1128vec1 I64vec2 I32vec4 I16vec8 I8vec16 Intrinsic oe 1 32 epi epi64 epi i116 N A epi64 epi32 epil6 N A N A epi32 epil NA N A epi32 epile N A epi6 4 epi32 epil N A epi64 epi32 epil6 N A 434 Reference Shift Operators Corresponding Intrinsics and Classes Part 2 Operators Corresponding I64vec1 Intrinsic gt gt gt I32vec2 116vec4 I8vec8 Comparison Operators Corresponding Intrinsics and Classes Part 1 cmpeq _mm_cmpeq_ x epi32 epil6 epi8 pi32 pil cmpneq _mm_cmpeq_ x _mm_andnot_l y cmpgt _mm_cmpgt_ x epi32 epil epis8 pi32 pi cmpge _mm_cmpge_ x _mm_andnot_l y _mm_cmplt_ x _mm_cmple_ x Operators Corresponding I32vec4 l16vec8 I8vec16 I32vec2 116vec4 Intrinsic _mm_andnot_ yl cmpngt _mm_cmpngt_ x epi32 epil epis8 pi32 pi cmpnge _mm_cmpnge_ x N A N A N A N A N A cmnpnlt _mm_cmpnlt_ x N A N A N A N A N A cmpnle _mm_cmpnie_ x N A N A N A N A N A Note that_mm_andnot_ y intrinsics do not apply to the fvec classes 435 Intel C Compiler for Linux Systems User s Guide Comparison Operators Corresponding Intrinsics and Classes Part 2 Operators Corresponding F64vec2 F32vec4 F32vec1 Intrinsic cmpeq _mm_cmpeq_ x pd ps ss cmpneq _mm_cmpeq_ x pd ps ss _mm_andnot_l y cmpgt _mm_cmpgt_ x pd ps ss cmpge _m
177. ax Usage Intrinsic F32vecl R F32vecl A F32vecl _mm_div_ss B F32vecl R F32vecl A Advanced Arithmetic Operator Usage The following table shows the return values classes of the advanced arithmetic operators which use the syntax styles described earlier in the Return Value Notation section Advanced Arithmetic Return Value Mapping R Operators A F32vec4 F64vec2 F32vec1 RO sqrt rep rsqrt rcp_nr rsqrt_nr AO R1 sqrt rcp rsqrt rcp_nr rsqrt_nr Al N A Ee a rsqrt_nr A2 N A N A sqrt rcp rsqrt rcp_nr rsqrt_nr A3 N A N A f add_horizontal A0 N A N A Al A2 A3 d add_horizontal A0 N A N A Al This table shows examples for advanced arithmetic operators Advanced Arithmetic Operations for Fvec Classes Returns Example Syntax Usage Intrinsic Square Root 4 floats F32vec4 R sqrt F32vec4 A _mm_sqrt_ps 2 doubles F64vec2 R sqrt F64vec2 A _mm_sqrt_pd 1 float F32vecl R sqrt F32vecl A _mm_sqrt_ss Reciprocal 4 floats F32vec4 R rep F32vec4 A _mm_rcp_ps 2 doubles F64vec2 R rep F64vec2 A _mm_rcp_pd 1 float F32vecl R rep F32vecl A _mm_rcp_ss Reciprocal Square Root 4 floats F32vec4 R rsqrt F32vec4 A _mm_rsqrt_ps 420 Reference Returns Example Syntax Usage Intrinsic 2 doubles F64vec2 R rsqrt F64vec2 A mm_rsqrt_pd 1 float F32vecl R rsqrt F32vecl A mm_rsqrt_ss Reciprocal Newton Raphson
178. b. The upper DP FP value is passed through from a.
    r0 := b0
    r1 := a1

Floating-point Store Operations for Streaming SIMD Extensions 2

The following store operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2 (SSE2). The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

void _mm_store_sd(double *dp, __m128d a)
(uses MOVSD) Stores the lower DP FP value of a. The address dp need not be 16-byte aligned.
    *dp := a0

void _mm_store1_pd(double *dp, __m128d a)
(uses MOVAPD + shuffling) Stores the lower DP FP value of a twice. The address dp must be 16-byte aligned.
    dp[0] := a0
    dp[1] := a0

void _mm_store_pd(double *dp, __m128d a)
(uses MOVAPD) Stores two DP FP values. The address dp must be 16-byte aligned.
    dp[0] := a0
    dp[1] := a1

void _mm_storeu_pd(double *dp, __m128d a)
(uses MOVUPD) Stores two DP FP values. The address dp need not be 16-byte aligned.
    dp[0] := a0
    dp[1] := a1

void _mm_storer_pd(double *dp, __m128d a)
(uses MOVAPD + shuffling) Stores two DP FP values in reverse order. The address dp must be 16-byte aligned.
    dp[0] := a1
    dp[1] := a0

void _mm_storeh_pd(double *dp, __m128d a)
(uses MOVHPD) Stores the upper DP FP value of a.
    *dp := a1

void _mm_storel_pd(double *dp, __m128d a)
(uses MOVLPD) Stores the lower DP FP value of a.
    *dp := a0

Integer Arithmetic Operations for Streaming SIMD Extensions 2

The integer arithmeti
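The store semantics listed above can be checked with a small sketch. It assumes an x86 target with SSE2 and a compiler that provides emmintrin.h; the struct StoreDemo and the function run_store_demo are hypothetical names introduced only for this illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical holder for the values written by each store intrinsic.
struct StoreDemo {
    double lower;        // written by _mm_store_sd
    double twice[2];     // written by _mm_store1_pd
    double reversed[2];  // written by _mm_storer_pd
    double upper;        // written by _mm_storeh_pd
};

StoreDemo run_store_demo() {
    // _mm_store1_pd and _mm_storer_pd require 16-byte aligned destinations,
    // so aligned scratch buffers are used and copied out afterwards.
    alignas(16) double twice[2];
    alignas(16) double reversed[2];
    StoreDemo out{};

    const __m128d a = _mm_set_pd(2.0, 1.0);  // a1 = 2.0, a0 = 1.0

    _mm_store_sd(&out.lower, a);   // *dp := a0
    _mm_store1_pd(twice, a);       // dp[0] := a0, dp[1] := a0
    _mm_storer_pd(reversed, a);    // dp[0] := a1, dp[1] := a0
    _mm_storeh_pd(&out.upper, a);  // *dp := a1

    out.twice[0] = twice[0];
    out.twice[1] = twice[1];
    out.reversed[0] = reversed[0];
    out.reversed[1] = reversed[1];
    return out;
}
```

Calling run_store_demo() should leave lower equal to 1.0, both entries of twice equal to 1.0, reversed equal to {2.0, 1.0}, and upper equal to 2.0.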
179. bit values in m1 using saturating arithmetic.

__m64 _m_psubusw(__m64 m1, __m64 m2)
Subtract the four unsigned 16-bit values in m2 from the four unsigned 16-bit values in m1 using saturating arithmetic.

__m64 _m_pmaddwd(__m64 m1, __m64 m2)
Multiply four 16-bit values in m1 by four 16-bit values in m2, producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results.

__m64 _m_pmulhw(__m64 m1, __m64 m2)
Multiply four signed 16-bit values in m1 by four signed 16-bit values in m2 and produce the high 16 bits of the four results.

__m64 _m_pmullw(__m64 m1, __m64 m2)
Multiply four 16-bit values in m1 by four 16-bit values in m2 and produce the low 16 bits of the four results.

MMX(TM) Technology Shift Intrinsics

The prototypes for MMX(TM) technology intrinsics are in the mmintrin.h header file.

Intrinsic Name   Alternate Name   Shift Direction   Shift Type   Corresponding Instruction
_m_psllw         _mm_sll_pi16     left              Logical      PSLLW
_m_psllwi        _mm_slli_pi16    left              Logical      PSLLWI
_m_pslld         _mm_sll_pi32     left              Logical      PSLLD
_m_pslldi        _mm_slli_pi32    left              Logical      PSLLDI
_m_psllq         _mm_sll_si64     left              Logical      PSLLQ
_m_psllqi        _mm_slli_si64    left              Logical      PSLLQI
_m_psraw         _mm_sra_pi16     right             Arithmetic   PSRAW
_m_psrawi        _mm_srai_pi16    right             Arithmetic   PSRAWI
_m_psrad         _mm_sra_pi32     right             Arithmetic   PSRAD
180. c _mm_max_pil6 Compute the element wise minimum of the respective signed integer words in A and B Isl6vec4 simd_min Isl6vec4 A Isl6vec4 B Corresponding intrinsic _mm_min_pil6 Compute the element wise maximum of the respective unsigned bytes in A and B Tu8vec8 simd_max Iu8vec8 A Iu8vec8 B Corresponding intrinsic _mm_max_pu8 Compute the element wise minimum of the respective unsigned bytes in A and B Tu8vec8 simd_min Iu8vec8 A Iu8vec8 B Corresponding intrinsic _mm_min_pu8 Create an 8 bit mask from the most significant bits of the bytes in A 412 Reference int move_mask I8vec8 A Corresponding intrinsic _mm_movemask_pi8 Conditionally store byte elements of A to address p The high bit of each byte in the selector B determines whether the corresponding byte in A will be stored void mask_move I8vec8 A I8vec8 B signed char p Corresponding intrinsic _mm_maskmove_si64 Store the data in A to the address p without polluting the caches A can be any Ivec type void store_nta __m64 p M64 A Corresponding intrinsic __mm_stream_pi Compute the element wise average of the respective unsigned 8 bit integers in A and B Tu8vec8 simd_avg Iu8vec8 A Iu8vec8 B Corresponding intrinsic _mm_avg_pu8 Compute the element wise average of the respective unsigned 16 bit integers in A and B Tul6 vec4 simd_avg Iul6vec4 A Iul6 vec4 B Corresponding intrinsic _mm_avg_pu16 Conversions Between Fvec and Ivec
181. c operations for Streaming SIMD Extensions 2 (SSE2) are listed in the following table, followed by their descriptions. The packed arithmetic intrinsics for SSE2 are listed in the Floating-point Arithmetic Operations topic. The prototypes for SSE2 intrinsics are in the emmintrin.h header file.

Intrinsic          Instruction   Operation
_mm_add_epi8       PADDB         Addition
_mm_add_epi16      PADDW         Addition
_mm_add_epi32      PADDD         Addition
_mm_add_si64       PADDQ         Addition
_mm_add_epi64      PADDQ         Addition
_mm_adds_epi8      PADDSB        Addition
_mm_adds_epi16     PADDSW        Addition
_mm_adds_epu8      PADDUSB       Addition
_mm_adds_epu16     PADDUSW       Addition
_mm_avg_epu8       PAVGB         Computes Average
_mm_avg_epu16      PAVGW         Computes Average
_mm_madd_epi16     PMADDWD       Multiplication/Addition
_mm_max_epi16      PMAXSW        Computes Maxima
_mm_max_epu8       PMAXUB        Computes Maxima
_mm_min_epi16      PMINSW        Computes Minima
_mm_min_epu8       PMINUB        Computes Minima
_mm_mulhi_epi16    PMULHW        Multiplication
_mm_mulhi_epu16    PMULHUW       Multiplication
_mm_mullo_epi16    PMULLW        Multiplication
_mm_mul_su32       PMULUDQ       Multiplication
_mm_mul_epu32      PMULUDQ       Multiplication
_mm_sad_epu8       PSADBW        Computes Difference/Adds
_mm_sub_epi8       PSUBB         Subtraction
_mm_sub_epi16      PSUBW         Subtraction
_mm_sub_epi32      PSUBD         Subtraction
_mm_sub_si64       PSUBQ         Subtraction
_mm_sub_epi64      PSUBQ         Subtraction
_mm_subs_epi8      PSUBSB        Subtraction
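The table lists both plain and saturating additions (for example _mm_add_epi8 and _mm_adds_epu8). The following is a minimal sketch of the difference for unsigned bytes, assuming an x86 target with SSE2. The helper names wrapping_add_u8 and saturating_add_u8 are hypothetical; each helper broadcasts its arguments to all sixteen lanes and reads back only the lowest one.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: add two bytes with _mm_add_epi8 (PADDB).
// The result wraps around modulo 256.
std::uint8_t wrapping_add_u8(std::uint8_t x, std::uint8_t y) {
    const __m128i vx = _mm_set1_epi8(static_cast<char>(x));
    const __m128i vy = _mm_set1_epi8(static_cast<char>(y));
    const __m128i r = _mm_add_epi8(vx, vy);
    // Extract the low 32 bits and keep only the lowest byte lane.
    return static_cast<std::uint8_t>(_mm_cvtsi128_si32(r) & 0xFF);
}

// Hypothetical helper: add two bytes with _mm_adds_epu8 (PADDUSB).
// The result is clamped to the unsigned 8-bit maximum of 255.
std::uint8_t saturating_add_u8(std::uint8_t x, std::uint8_t y) {
    const __m128i vx = _mm_set1_epi8(static_cast<char>(x));
    const __m128i vy = _mm_set1_epi8(static_cast<char>(y));
    const __m128i r = _mm_adds_epu8(vx, vy);
    return static_cast<std::uint8_t>(_mm_cvtsi128_si32(r) & 0xFF);
}
```

With inputs 200 and 100, the wrapping helper yields 44 (300 modulo 256), while the saturating helper yields 255.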
182. ch intrinsic has two key properties:
• the function performed is guaranteed to be atomic
• associated with each intrinsic are certain memory barrier properties that restrict the movement of memory references to visible data across the intrinsic operation by either the compiler or the processor

For the following intrinsics, <type> is either a 32-bit or 64-bit integer.

Atomic Fetch-and-op Operations
<type> __sync_fetch_and_add(<type> *ptr, <type> val)
<type> __sync_fetch_and_and(<type> *ptr, <type> val)
<type> __sync_fetch_and_nand(<type> *ptr, <type> val)
<type> __sync_fetch_and_or(<type> *ptr, <type> val)
<type> __sync_fetch_and_sub(<type> *ptr, <type> val)
<type> __sync_fetch_and_xor(<type> *ptr, <type> val)

Atomic Op-and-fetch Operations
<type> __sync_add_and_fetch(<type> *ptr, <type> val)
<type> __sync_sub_and_fetch(<type> *ptr, <type> val)
<type> __sync_or_and_fetch(<type> *ptr, <type> val)
<type> __sync_and_and_fetch(<type> *ptr, <type> val)
<type> __sync_nand_and_fetch(<type> *ptr, <type> val)
<type> __sync_xor_and_fetch(<type> *ptr, <type> val)

Atomic Compare-and-swap Operations
<type> __sync_val_compare_and_swap(<type> *ptr, <type> old_val
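The difference between the fetch-and-op and op-and-fetch forms is which value is returned. The sketch below uses the __sync builtins as implemented by GCC-compatible compilers, which is an assumption about the toolchain rather than a statement about the Itanium intrinsics described here. The struct AtomicDemo and the function run_atomic_demo are hypothetical names, and the demo is single-threaded, so it shows return values only, not atomicity.

```cpp
#include <cstdint>

// Hypothetical record of the values returned by each builtin.
struct AtomicDemo {
    std::int32_t fetch_and_add_result;  // value before the addition
    std::int32_t add_and_fetch_result;  // value after the addition
    std::int32_t cas_result;            // value seen by compare-and-swap
    std::int32_t final_value;           // counter contents at the end
};

AtomicDemo run_atomic_demo() {
    std::int32_t counter = 10;
    AtomicDemo d{};

    // Fetch-and-op: returns the old value (10); counter becomes 15.
    d.fetch_and_add_result = __sync_fetch_and_add(&counter, 5);

    // Op-and-fetch: returns the new value (20); counter becomes 20.
    d.add_and_fetch_result = __sync_add_and_fetch(&counter, 5);

    // Compare-and-swap: counter equals 20, so it is replaced by 1.
    // The builtin returns the value that was in memory before the swap.
    d.cas_result = __sync_val_compare_and_swap(&counter, 20, 1);

    d.final_value = counter;
    return d;
}
```

In this run the four fields should come out as 10, 20, 20, and 1.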
183. cisions about function inlining, thereby increasing the effectiveness of interprocedural optimizations.

Instrumented Program

Profile-guided optimization creates an instrumented program from your source code and special code from the compiler. Each time this instrumented code is executed, the instrumented program generates a dynamic information file. When you compile a second time, the dynamic information files are merged into a summary file. Using the profile information in this file, the compiler attempts to optimize the execution of the most heavily travelled paths in the program.

Unlike other optimizations, such as those used strictly for size or speed, the results of IPO and PGO vary. This is due to each program having a different profile and different opportunities for optimizations. The guidelines provided here help you determine if you can benefit by using IPO and PGO.

Profile-guided Optimizations Methodology

PGO works best for code with many frequently executed branches that are difficult to predict at compile time. An example is code that is heavy with error-checking in which the error conditions are false most of the time. The cold error-handling code can be placed such that the branch is rarely mispredicted. Eliminating the interleaving of hot and cold code improves instruction cache behavior. For example, the use of PGO often enables the compiler to make better decisions about function inlining, thereby increasing the effectiveness
184. classify function returns the value of the number classification macro appropriate to the value of its argument.
Calling interface:
    double fpclassify(double x);
    long double fpclassifyl(long double x);
    float fpclassifyf(float x);

ISFINITE
Description: The isfinite function returns 1 if x is not a NaN or +/- infinity. Otherwise 0 is returned.
Calling interface:
    int isfinite(double x);
    int isfinitel(long double x);
    int isfinitef(float x);

ISGREATER
Description: The isgreater function returns 1 if x is greater than y. This function does not raise the invalid floating-point exception.
Calling interface:
    int isgreater(double x, double y);
    int isgreaterl(long double x, long double y);
    int isgreaterf(float x, float y);

ISGREATEREQUAL
Description: The isgreaterequal function returns 1 if x is greater than or equal to y. This function does not raise the invalid floating-point exception.
Calling interface:
    int isgreaterequal(double x, double y);
    int isgreaterequall(long double x, long double y);
    int isgreaterequalf(float x, float y);

ISINF
Description: The isinf function returns a non-zero value if and only if its argument has an infinite value.
Calling interface:
    int isinf(double x);
    int isinfl(long double x);
    int isinff(float x);

ISLESS
Description: The isless function returns 1 if x is less than y. This function does not raise t
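The behavior of these classification and comparison functions on special values can be illustrated with the standard C++ <cmath> equivalents. This sketch uses std::isfinite, std::isinf, std::isgreater, std::isgreaterequal, and std::fpclassify rather than the suffixed Intel Math Library entry points listed above, which is a substitution made for portability. The struct ClassifyDemo and the function classify_demo are hypothetical names.

```cpp
#include <cmath>
#include <limits>

// Hypothetical record of a few classification results.
struct ClassifyDemo {
    bool one_is_finite;
    bool inf_is_finite;
    bool inf_is_inf;
    bool nan_greater_than_one;
    bool two_ge_two;
    bool nan_classified_as_nan;
};

ClassifyDemo classify_demo() {
    const double inf = std::numeric_limits<double>::infinity();
    const double nan = std::numeric_limits<double>::quiet_NaN();

    ClassifyDemo d{};
    d.one_is_finite = std::isfinite(1.0);            // finite value
    d.inf_is_finite = std::isfinite(inf);            // infinity is not finite
    d.inf_is_inf = std::isinf(inf);                  // infinite value
    // Comparisons involving NaN are false; isgreater reports this
    // without raising the invalid floating-point exception.
    d.nan_greater_than_one = std::isgreater(nan, 1.0);
    d.two_ge_two = std::isgreaterequal(2.0, 2.0);
    d.nan_classified_as_nan = (std::fpclassify(nan) == FP_NAN);
    return d;
}
```

The quiet comparison functions are the usual choice when NaN inputs are possible and a floating-point exception would be unwelcome.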
185. code might contain unconditional use of features that are not supported on other processors.

Option   Specific Optimization for...
-xK      Intel Pentium III and compatible Intel processors.
-xW      Intel Pentium 4 and compatible Intel processors.
-xN      Intel Pentium 4 and compatible Intel processors. Programs compiled with this option will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.
-xB      Intel Pentium M and compatible Intel processors. Programs compiled with this option will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.
-xP      Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). Programs compiled with this option will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.

To execute a program on x86 processors not provided by Intel Corporation, do not specify the -x{K|W|N|B|P} option.

Example
The following invocation compiles prog.cpp for Intel Pentium 4 and compatible processors. The resulting binary might not execute correctly on Pentium, Pentium Pro, Pentium II, Pentium III, or Pentium with MMX technology processors, or on x8
186. cos function returns both the sine and cosine of x, measured in radians. This function may be inlined with the Itanium compiler.
Calling interface:
    void sincos(double x, double *sinval, double *cosval);
    void sincosl(long double x, long double *sinval, long double *cosval);
    void sincosf(float x, float *sinval, float *cosval);

SINCOSD
Description: The sincosd function returns both the sine and cosine of x, measured in degrees.
Calling interface:
    void sincosd(double x, double *sinval, double *cosval);
    void sincosdl(long double x, long double *sinval, long double *cosval);
    void sincosdf(float x, float *sinval, float *cosval);

SIND
Description: The sind function computes the sine of x, measured in degrees.
Calling interface:
    double sind(double x);
    long double sindl(long double x);
    float sindf(float x);

TAN
Description: The tan function returns the tangent of x, measured in radians.
Calling interface:
    double tan(double x);
    long double tanl(long double x);
    float tanf(float x);

TAND
Description: The tand function returns the tangent of x, measured in degrees.
errno: ERANGE, for overflow conditions
Calling interface:
    double tand(double x);
    long double tandl(long double x);
    float tandf(float x);

Hyperbolic Functions

The Intel Math library supports the following hyperbolic functions:

ACOSH
Description: The acosh function returns the inverse hyperbolic cosine of x.
errno: EDOM, for x < 1
Calling interface
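The degree-based functions above are specific to the Intel Math Library. As a rough portable approximation of what sincosd computes, the sketch below converts degrees to radians and calls the standard std::sin and std::cos. The names SinCos and sincos_degrees_sketch are hypothetical, and the result may differ in the last bits from a library routine that handles degree arguments directly.

```cpp
#include <cmath>

// Hypothetical pair of results, standing in for the two output pointers
// that the sincosd calling interface uses.
struct SinCos {
    double sin_value;
    double cos_value;
};

// Portable approximation: convert the angle to radians, then evaluate
// sine and cosine with the standard library.
SinCos sincos_degrees_sketch(double degrees) {
    const double kPi = 3.14159265358979323846;
    const double radians = degrees * (kPi / 180.0);
    return SinCos{std::sin(radians), std::cos(radians)};
}
```

For an input of 90 degrees the sine is 1 and the cosine is close to, but not exactly, 0, because pi/2 is not exactly representable in binary floating point.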
187. create the library file from the object files:
    prompt> ar rc my_lib.a my_source1.o my_source2.o my_source3.o
3. compile and link your project with your new library:
    prompt> icpc main.cpp my_lib.a

If your library file and source files are in different directories, use the -Ldir option to indicate where your library is located:
    prompt> icpc -L/cpp/libs main.cpp my_lib.a

If you are using Interprocedural Optimization, see Creating a Library from IPO Objects using xiar.

Shared Libraries

Shared libraries, also referred to as dynamic libraries or Dynamic Shared Objects (DSO), are linked differently than static libraries. At compile time, the linker ensures that all the necessary symbols are either linked into the executable or can be linked at runtime from the shared library. Executables compiled from shared libraries are smaller, but the shared libraries must be included with the executable to function correctly. When multiple programs use the same shared library, only one copy of the library is required in memory.

To build a shared library:
1. use the -fPIC and -c options to generate object files from the source files:
    prompt> icpc -fPIC -c my_source1.cpp my_source2.cpp my_source3.cpp
2. use the -shared option to create the library file from the object files:
    prompt> icpc -shared -o my_lib.so my_source1.o my_source2.o my_source3.o
3. compile and link your project with yo
188. cs none The lowest single precision floating point value of A is placed in the output buffer and printed cout lt lt F32vec1 A Corresponding intrinsics none Element Access Operations double d F64vec2 Afint i Read one of the two double precision floating point values of A without modifying the corresponding floating point value Permitted values of i are 0 and 1 For example If DEBUG is enabled and i is not one of the permitted values 0 or 1 a diagnostic message is printed and the program aborts double d F64vec2 A 1 Corresponding intrinsics none 430 Reference Read one of the four single precision floating point values of A without modifying the corresponding floating point value Permitted values of i are 0 1 2 and 3 For example float f F32vec4 Afint i If DEBUG is enabled and i is not one of the permitted values 0 3 a diagnostic message is printed and the program aborts float f F32vec4 A 2 Corresponding intrinsics none Element Assignment Operations F6 4vec4 Afint i double d Modify one of the two double precision floating point values of A Permitted values of int i are 0 and 1 For example F32vec4 A 1 double d F32vec4 A int i float f Modify one of the four single precision floating point values of A Permitted values of int i are 0 1 2 and 3 For example If DEBUG is enabled and int i is not one of the permitted values 0 3 a diagnostic message is printed and
189. ct Format Library, used by the Intel assembler
libmofl.so      Shared Multiple Object Format Library, used by the Intel assembler
libunwinder.a   Unwinder library
libintrins.a    Intrinsic functions library

Diagnostics and Messages

This section describes the various messages that the compiler produces. These messages include the sign-on message and diagnostic messages for remarks, warnings, or errors. The compiler always displays any diagnostic message, along with the erroneous source line, on the standard output. This section also describes how to control the severity of diagnostic messages.

Diagnostic Messages

Option   Description
-w0      Display errors (same as -w)
-w1      Display warnings and errors (DEFAULT)
-w2      Display remarks, warnings, and errors

Language Diagnostics

These messages describe diagnostics that are reported during the processing of the source file. These diagnostics have the following format:

    filename (linenum): type [#nn]: message

filename   Indicates the name of the source file currently being processed.
linenum    Indicates the source line where the compiler detects the condition.
type       Indicates the severity of the diagnostic message: warning, remark, error, or catastrophic error.
[#nn]      The number assigned to the error (or warning) message. Hard errors or catastrophes are not assigned a number.
message    Describes the di
190. ctual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products, including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright, or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.

This User's Guide, as well as the software described in it, is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The software described in this User's Guide may contain software defects which may cause the product to deviate from publi
191. cution
-prof_use   Instructs the compiler to produce a profile-optimized executable and merges available dynamic information (.dyn) files into a pgopti.dpi file.

In cases where your code behavior differs greatly between executions, you have to ensure that the benefit of the profile information is worth the effort required to maintain up-to-date profiles. In the basic profile-guided optimization, the following options are used in the phases of the PGO.

Generating Instrumented Code

The -prof_gen[x] option instruments the program for profiling to get the execution count of each basic block. It is used in Phase 1 of the PGO to instruct the compiler to produce instrumented code in your object files in preparation for instrumented execution. Parallel make is automatically supported for -prof_genx compilations.

Generating a Profile-optimized Executable

The -prof_use option is used in Phase 3 of the PGO to instruct the compiler to produce a profile-optimized executable; it also merges available dynamic information (.dyn) files into a pgopti.dpi file.

Note: The dynamic information files are produced in Phase 2 when you run the instrumented executable.

If you perform multiple executions of the instrumented program, -prof_use merges the dynamic information files again and overwrites the previous pgopti.dpi file.

Disabling Function Splitting (Itanium Compiler only)

-fnsplit- disables function splitting. Function splitt
192. cxa statically, while still allowing the standard libraries to be linked in by the default behavior. This option is placed in the linker command line corresponding to its location on the user command line. This option is used to control the linking behavior of any library being passed in via the command line.

-Bdynamic   This option is placed in the linker command line corresponding to its location on the user command line. This option is used to control the linking behavior of any library being passed in via the command line.

Suppressing Linking

Use the -c option to suppress linking. For example, entering the following command produces the object files file1.o and file2.o:
    prompt> icpc -c file1.cpp file2.cpp

Note: The preceding command does not link these files to produce an executable file.

Debugging

This section describes the basic command line options that you can use as tools to debug your compilation and to display and check compilation errors. This section includes topics on:
• Preparing for Debugging
• Parsing for Syntax Only
• Optimizations and Debugging
• Options for Debug Information

Debuggers

The Intel Debugger is included with the Intel C++ Compiler, but installation is optional. The Intel Debugger (idb) includes an environment script, idbvars.sh, which you need to run before executing idb:
    prompt> source /opt/intel_idb_80/bin/idbvars.sh
See the
193. d B.
    F32vec4 R = unpack_low(F32vec4 A, F32vec4 B);
Corresponding intrinsic: _mm_unpacklo_ps(a, b)

Selects and interleaves the higher two single-precision floating-point values from A and B.
    F32vec4 R = unpack_high(F32vec4 A, F32vec4 B);
Corresponding intrinsic: _mm_unpackhi_ps(a, b)

Move Mask Operator

Creates a 2-bit mask from the most significant bits of the two double-precision floating-point values of A, as follows:
    int i = move_mask(F64vec2 A)
    i := sign(a1)<<1 | sign(a0)<<0
Corresponding intrinsic: _mm_movemask_pd

Creates a 4-bit mask from the most significant bits of the four single-precision floating-point values of A, as follows:
    int i = move_mask(F32vec4 A)
    i := sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)<<0
Corresponding intrinsic: _mm_movemask_ps

Classes Quick Reference

This appendix contains tables listing the class functionality and corresponding intrinsics for each class in the Intel C++ Class Libraries for SIMD Operations. The following table lists all Intel C++ Compiler intrinsics that are not implemented in the C++ SIMD classes.

Logical Operators: Corresponding Intrinsics and Classes
Operators Corresponding Intrinsic 1128vec1 l64vec l64vec2 I32vec I32vec4 l16vec 116vec8 I8vec8 I8vec16 F64vec2 F32vec4 F32vec1 amp amp _mm_and_ sil28 si64 ee fe Andnot _mm_andnot_ il28 si64 a co N A Arithmet
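The move mask formulas above can be tried directly with the corresponding intrinsics. This sketch calls _mm_movemask_pd and _mm_movemask_ps instead of the move_mask class functions, because the Fvec class headers are specific to the Intel compiler. It assumes an x86 target with SSE2, and the wrapper names sign_mask_pd and sign_mask_ps are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2; also pulls in the SSE declarations

// Hypothetical wrapper: 2-bit mask, bit 1 from a1 and bit 0 from a0.
int sign_mask_pd(double a1, double a0) {
    // _mm_set_pd takes the upper element first.
    return _mm_movemask_pd(_mm_set_pd(a1, a0));
}

// Hypothetical wrapper: 4-bit mask, bit n taken from the sign of a_n.
int sign_mask_ps(float a3, float a2, float a1, float a0) {
    // _mm_set_ps takes elements from highest (a3) to lowest (a0).
    return _mm_movemask_ps(_mm_set_ps(a3, a2, a1, a0));
}
```

For example, sign_mask_pd(-1.0, 2.0) gives 2 because only a1 is negative, and sign_mask_ps(-1.0f, 2.0f, -3.0f, 4.0f) gives 10 (binary 1010).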
194. d begins handling it. If the thread uses nested workqueuing constructs and the scope of the request becomes large after the inner construct is started, the threads from the outer construct can easily migrate to the inner construct to help finish the request.

Since the workqueuing model is designed to preserve sequential semantics, synchronization is inherent in the semantics of the taskq block. There is an implicit team barrier at the completion of the taskq block for the threads that encountered the taskq construct, to ensure that all of the tasks specified inside of the taskq block have finished execution. This taskq barrier enforces the sequential semantics of the original program. Just like the OpenMP worksharing constructs, it is assumed you are responsible for ensuring that either no dependences exist or that dependences are appropriately synchronized between the task blocks, or between code in a task block and code in the taskq block outside of the task blocks.

The syntax, semantics, and allowed clauses are designed to resemble OpenMP worksharing constructs. Most of the clauses allowed on OpenMP worksharing constructs have a reasonable meaning when applied to the workqueuing pragmas.

taskq Construct

    #pragma intel omp taskq [clause[[,]clause]...]
        structured-block

where clause can be any of the following:
• private (variable-list)
• firstprivate (variable-list)
• lastprivate (variable-list)
195. d on the command line, in a response file, or in a configuration file
3. the Intel Math Library, libimf.a

Compiling for Non-shared Libraries

This section includes information on:
• Global Symbols and Visibility Attributes
• Symbol Preemption
• Specifying Symbol Visibility Explicitly
• Other Visibility-related Command-line Options

Global Symbols and Visibility Attributes

A global symbol is one that is visible outside the compilation unit (single source file and its include files) in which it is declared. In C/C++, this means anything declared at file level without the static keyword. For example:

    int x = 5;                  // global data definition
    extern int y;               // global data reference
    int five() { return 5; }    // global function definition
    extern int four();          // global function reference

A complete program consists of a main program file and possibly one or more shareable object (.so) files that contain the definitions for data or functions referenced by the main program. Similarly, shareable objects might reference data or functions defined in other shareable objects. Shareable objects are so called because if more than one simultaneously executing process has the shareable object mapped into its virtual memory, there is only one copy of the read-only portion of the object resident in physical memory. The main program file and any shareable objects that it references are collectively called the c
196. d product form 4 16-bit results, which are returned as one 64-bit word.

__m64 _m64_pmpyshr2u(__m64 a, __m64 b, const int count)
The four unsigned 16-bit data elements of a are multiplied by the corresponding unsigned 16-bit data elements of b, yielding four 32-bit products. Each product is then shifted to the right count bits, and the least significant 16 bits of each shifted product form 4 16-bit results, which are returned as one 64-bit word.

__m64 _m64_pshladd2(__m64 a, const int count, __m64 b)
a is shifted to the left by count bits and then is added to b. The upper 32 bits of the result are forced to 0, and then bits [31:30] of b are copied to bits [62:61] of the result. The result is returned.

__m64 _m64_pshradd2(__m64 a, const int count, __m64 b)
The four signed 16-bit data elements of a are each independently shifted to the right by count bits (the high-order bits of each element are filled with the initial value of the sign bits of the data elements in a); they are then added to the four signed 16-bit data elements of b. The result is returned.

__m64 _m64_padd1uus(__m64 a, __m64 b)
a is added to b as eight separate byte-wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.

__m64 _m64_padd2uus(__m64 a, __m64 b)
a is added to b as four separate 16-bit w
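The description of _m64_pmpyshr2u can be modeled in portable scalar code, which may help when reasoning about its results on a machine without the Itanium intrinsic. The sketch below is a software model written from the prose above, not the intrinsic itself. The names Lanes16 and pmpyshr2u_model are hypothetical, and the model accepts any shift below 32 even though the hardware instruction may restrict the permitted count values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the four 16-bit elements of an __m64 value.
using Lanes16 = std::array<std::uint16_t, 4>;

// Scalar model: multiply corresponding unsigned 16-bit elements into
// 32-bit products, shift each product right by count bits, and keep the
// least significant 16 bits of each shifted product.
Lanes16 pmpyshr2u_model(const Lanes16& a, const Lanes16& b, unsigned count) {
    Lanes16 result{};
    for (std::size_t i = 0; i < result.size(); ++i) {
        // Widen before multiplying so the full 32-bit product is kept.
        const std::uint32_t product =
            static_cast<std::uint32_t>(a[i]) * static_cast<std::uint32_t>(b[i]);
        result[i] = static_cast<std::uint16_t>(product >> count);
    }
    return result;
}
```

With a shift of 16 the model returns the high half of each product, and with a shift of 0 it returns the low half, so 0xFFFF times 0xFFFF (0xFFFE0001) yields 0xFFFE and 0x0001 respectively.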
197. d short 16 4 ivec h Isl6vec4 signed short 16 4 ivec h Iul6vec4 unsigned short 16 4 ivec h I8vec8 unspecified char 8 8 ivec h Is8vec8 signed char 8 8 ivec h Iu8vec8 unsigned char 8 8 ivec h es Streaming SIMD F32vec4 signed float 32 4 fvec h Extensions available for IA 32 and Itanium based systems mM F32vec1 signed float 32 1 fvec h Streaming SIMD F64vec2 signed double 64 2 dvec h Extensions 2 available for IA 32 based systems only 384 Reference Instruction Set Class Signedness Data Size Elements Header Type File T128vecl unspecified _m128i 128 1 dvec h a I64vec2 unspecified long 64 4 dvec h int Is64vec2 signed long 64 4 dvec h int T long 32 f4 dvec h ant I32vec4 unspecified int 32 4 dvec h Is32vec4 signed int 32 4 dvec h Iu32vec4 unsigned int 32 4 dvec h Il6vec8 unspecified int 16 8 dvec h Isl6vec8 signed m Iul6vec8 unsigned po I8vecl6 unspecified a Is8vecl16 signed ee Iu8vec16 unsigned cm Most classes contain similar functionality for all data types and are represented by all available intrinsics However some capabilities do not translate from one data type to another without suffering from poor performance and are therefore excluded from individual classes f Note Intrinsics that take immediate values and cannot be expressed easily in classes are not implemented For example _mm_shuf
198. d struct union fields Yes http gcc gnu org onlinedocs gec 3 4 0 gec within structs unions Unnamed Fields html Unnamed 20Fields g Extensions to the C Language GNU C includes several non standard features not found in ISO standard C This version of the Intel C Compiler supports many of these extensions listed in the following table See http www gnu org for more information g Language Intel GNU Description and Examples Extension Support Minimum and Yes http gcc gnu org onlinedocs gec 3 4 0 gec Maximum operators in Min and Max html Min 20and 20Max C When is a Volatile No http gcc gnu org onlinedocs gec 3 4 0 gcec Object Accessed Volatiles html Volatiles Restricting Pointer Yes http gcc gnu org onlinedocs gec 3 4 0 gcc Aliasing Restricted Pointers html Restricted 20Pointers Vague Linkage Yes http gcc gnu org onlinedocs gec 3 4 0 gcc Vague Linkage html Vague 20Linkage Declarations and Definitions in One Header No http gcc gnu org onlinedocs gec 3 4 0 gcec C Interface html C 20Interface Where s the Template extern http gcc gnu org onlinedocs gec 3 4 0 gcec template Template supported Instantiation html Template 20Instantiation Extracting the function pointer from a bound pointer to member function No http gcc gnu org onlinedocs gec 3 4 0 gec Bound member functions html Bound 20member 20functions C Specific Variable Function and Type Attribut
199. documents if only one browse adapter hes been configured the selection cannot be changed O Mozilla Adapter Z Cusan Fawr tase defined pami Custom Eroi commer tongued 1 46 Volume I Building Applications Creating a New Project To create a simple helloworld project follow these steps after starting Eclipse 1 3 Select Window gt Open Perspective gt C C Development From the Eclipse File menu select New gt Project The New Project wizard opens with the Select dialog to specify the kind of project you want to create In the left column select C from the list In the right column select Managed Make C Project Click Next to proceed See also Standard and Managed Make Files s aNew Project Select a Create a new C project and let Eclipse create and manage the makefile EN Standard Make C Project C lt N Managed Make C Project Simple In the Name text box of the Managed Make C Project dialog type helloworld Check the Use Default Location box if not already checked Click Next to proceed v E Project Managed Make C Project Create a new Managed Make C project Name helloworld Use Default Location Location opt intel_cc_80 bin workspace helloworid 47 Intel C Compiler for Linux Systems User s Guide 4 From the Select a Target dialog select Linux Executable Using Intel R C C Compiler from the Platform drop down list Check the Release and Debu
200. e
__inline __m128 _mm_cvtpi32x2_ps(__m64 a, __m64 b)
Convert the two 32-bit signed integer values in a and the two 32-bit signed integer values in b to four single-precision FP values.
r0 := (float)a0; r1 := (float)a1; r2 := (float)b0; r3 := (float)b1

__inline __m64 _mm_cvtps_pi16(__m128 a)
Convert the four single-precision FP values in a to four signed 16-bit integer values.
r0 := (short)a0; r1 := (short)a1; r2 := (short)a2; r3 := (short)a3

__inline __m64 _mm_cvtps_pi8(__m128 a)
Convert the four single-precision FP values in a to the lower four signed 8-bit integer values of the result.
r0 := (char)a0; r1 := (char)a1; r2 := (char)a2; r3 := (char)a3

Load Operations for Streaming SIMD Extensions

See summary table in Summary of Memory and Initialization topic. The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

__m128 _mm_load_ss(float * p)
Loads an SP FP value into the low word and clears the upper three words.
r0 := *p; r1 := 0.0; r2 := 0.0; r3 := 0.0

__m128 _mm_load_ps1(float * p)
Loads a single SP FP value, copying it into all four words.
r0 := *p; r1 := *p; r2 := *p; r3 := *p

__m128 _mm_load_ps(float * p)
Loads four SP FP values. The address must be 16-byte-aligned.
r0 := p[0]; r1 := p[1]; r2 := p[2]; r3 := p[3]

__m128 _mm_loadu_ps(float * p)
Loads four SP FP values. The address need not
201. e Name / Instruction / For
_mm_cmpneq_sd / CMPNEQSD / Inequality
_mm_cmpnlt_sd / CMPNLTSD / Not Less Than
_mm_cmpnle_sd / CMPNLESD / Not Less Than or Equal
_mm_cmpngt_sd / CMPNLTSDr / Not Greater Than
_mm_cmpnge_sd / CMPNLESDr / Not Greater Than or Equal
_mm_comieq_sd / COMISD / Equality
_mm_comilt_sd / COMISD / Less Than
_mm_comile_sd / COMISD / Less Than or Equal
_mm_comigt_sd / COMISD / Greater Than
_mm_comige_sd / COMISD / Greater Than or Equal
_mm_comineq_sd / COMISD / Not Equal
_mm_ucomieq_sd / UCOMISD / Equality
_mm_ucomilt_sd / UCOMISD / Less Than
_mm_ucomile_sd / UCOMISD / Less Than or Equal
_mm_ucomigt_sd / UCOMISD / Greater Than
_mm_ucomige_sd / UCOMISD / Greater Than or Equal
_mm_ucomineq_sd / UCOMISD / Not Equal

__m128d _mm_cmpeq_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for equality.
r0 := (a0 == b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 == b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmplt_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for a less than b.
r0 := (a0 < b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 < b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmple_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for a less than or equal to b.
r0 := (a0 <= b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 <= b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpgt_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for a greater than b.
r
202. e 110
Optimizing for Specific Processors 114
Interprocedural Optimizations 121
Profile-guided Optimizations 131
High-level Language Optimizations (HLO) 150
Parallel Programming 155
Optimization Support Features 192
Reference 202
Compiler Limits 202
Key Files 203
Diagnostics and Messages 207
Intel Math Library 211
Intel C++ Intrinsics Reference 243
Intel C++ Class Libraries 382

Intel C++ Compiler User's Guide

Disclaimer and Legal Information

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intelle
203. e Setting Properties. When your project is complete, you can export your makefile and project source files to another directory, then build your project from the command line using make.

Exporting makefiles

To export your makefile:
1. Select your project in the Eclipse C/C++ Projects view.
2. From the Eclipse File menu, select Export to launch the Export Wizard.
3. On the Select dialog of the Export Wizard, select File system, then click Next.
4. On the File system dialog, check both the helloworld and Release directories in the left-hand pane. Be sure all the project sources in the right-hand pane are also checked.
Note: You may deselect some files in the right-hand pane, such as the hello.o object file and helloworld executable. However, you must also select Create directory structure for files in the Options section to successfully create the export directory. This also applies to project files in the helloworld directory.
5. Use the Browse button to target the export to an existing directory. Eclipse can also create a new directory for full paths entered in the To directory text box. If, for example, you specified cpp export as the export directory, Eclipse creates two new sub di
204. e arc tangent of the argument with single precision.
Compute inverse hyperbolic tangent of the argument with double precision.
Compute inverse hyperbolic tangent of the argument with single precision.
Computes absolute value of complex number.
Computes smallest integral value of double precision argument not less than the argument.
Computes smallest integral value of single precision argument not less than the argument.
Computes the hyperbolic cosine of double precision argument.
Computes the hyperbolic cosine of single precision argument.
Computes absolute value of single precision argument.
Computes the largest integral value of the double precision argument not greater than the argument.
Computes the largest integral value of the single precision argument not greater than the argument.
Computes the floating-point remainder of the division of the first argument by the second argument with double precision.
Computes the floating-point remainder of the division of the first argument by the second argument with single precision.
Computes the length of the hypotenuse of a right-angled triangle with double precision.
Computes the length of the hypotenuse of a right-angled triangle with single precision.

Intrinsic / Description
double rint(double): Computes the integral value represented as double using the IEEE rounding mode.
float rintf(float): Computes the integral value represented with single prec
205. e bit index of the most significant set bit of x. If x is 0, the result is undefined.
Reverses the byte order of x. Bits 0-7 are swapped with bits 24-31, and bits 8-15 are swapped with bits 16-23.
Returns the exception code.
Returns the exception information.
Enables the interrupt.
Disables the interrupt.
Intrinsic that maps to the IA-32 instruction IN. Transfer data byte from port specified by argument.
Intrinsic that maps to the IA-32 instruction IN. Transfer double word from port specified by argument.
Intrinsic that maps to the IA-32 instruction IN. Transfer word from port specified by argument.
Same as _in_byte.
Same as _in_dword.
Same as _in_word.
Intrinsic that maps to the IA-32 instruction OUT. Transfer data byte in second argument to port specified by first argument.
Intrinsic that maps to the IA-32 instruction OUT. Transfer double word in second argument to port specified by first argument.
Intrinsic that maps to the IA-32 instruction OUT. Transfer word in second argument to port specified by first argument.
Same as _out_byte.
Same as _out_dword.
Same as _out_word.
Returns the number of set bits in x.

Intrinsic:
extern __int64 _rdtsc(void)
extern __int64 _rdpmc(int p)
int _setjmp(jmp_buf)

MMX(TM) Technology Intrinsics

Support for MMX(TM) Technology

Description: Returns the current value of the processor's 64-bit time stamp c
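The byte-reversal and set-bit-count descriptions in this excerpt can be modelled in portable C++. The following is a minimal sketch, not the compiler's intrinsics: the names byte_swap32 and count_set_bits are hypothetical helpers chosen for illustration, and a 32-bit operand is assumed because the text refers to bits 0-31.

```cpp
#include <cstdint>

// Portable model of the byte reversal described above: bits 0-7 trade places
// with bits 24-31, and bits 8-15 trade places with bits 16-23.
// byte_swap32 is a hypothetical name, not an Intel intrinsic.
inline std::uint32_t byte_swap32(std::uint32_t x) {
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

// Portable model of "returns the number of set bits in x".
// count_set_bits is a hypothetical name, not an Intel intrinsic.
inline int count_set_bits(std::uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;  // clears the lowest set bit
        ++n;
    }
    return n;
}
```

A real build would normally call the compiler intrinsic instead, since it can map to a single instruction; the model is only useful for checking expected results.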
206. e capabilities of the Intel compiler code coverage tool is efficient coverage analysis of a subset of an application's modules. The subset is selected with the tool's -comp option.

You can generate the profile information for the whole application, or a subset of it, and then divide the covered modules into different components and use the coverage tool to obtain the coverage information of each individual component. If only a subset of the application modules is compiled with the -prof_genx option, then the coverage information is generated only for the modules compiled with this option, thus avoiding the overhead incurred for profile generation of other modules.

To specify the modules of interest, use the tool's -comp option. This option takes the name of a file as its argument. That file must be a text file that includes the names of the modules or directories you would like to analyze:

codecov -prj Project_Name -comp component1

Note: Each line of the component file should include one, and only one, module name.

Any module of the application whose full path name has an occurrence of any of the names in the component file will be selected for coverage analysis. For example, if a line of file component1 contains mod1.cpp, then all modules in the application that have such a name will be selected. The user can specify a particular module by giving more specific path information. For instance
207. e computed in parallel and the sequential step avoided.

Recursive Functions

Recursive functions also can be used to specify parallel iteration spaces. The mechanism is similar to specifying parallelism using the sections pragma, but is much more flexible because it allows arbitrary code to sit between the taskq and the task pragmas, and because it allows recursive nesting of the function to build a conceptual tree of taskq queues. The recursive nesting of the taskq pragmas is a conceptual extension of OpenMP worksharing constructs to behave more like nested OpenMP parallel regions. Just like nested parallel regions, each nested workqueuing construct is a new instance and is encountered by exactly one thread. However, the major difference is that nested workqueuing constructs do not cause new threads or teams to be formed, but rather re-use the threads from the team. This permits very easy multi-algorithmic parallelism in dynamic environments, such that the number of threads need not be committed at each level of parallelism, but instead only at the top level. From that point on, if a large amount of work suddenly appears at an inner level, the idle threads from the outer level can assist in getting that work finished. For example, it is very common in server environments to dedicate a thread to handle each incoming request, with a large number of threads awaiting incoming requests. For a particular request, its size may not be obvious at the time the threa
208. e defined the same as in the OpenMP Specifications.

ordered
The ordered clause performs ordered constructs in enclosed task constructs in original sequential execution order. The taskq directive to which the ordered is bound must have an ordered clause present.

nowait
The nowait clause removes the implied barrier at the end of the taskq. Threads may exit the taskq construct before completing all the task constructs queued within it.

task Construct

#pragma intel omp task [clause [clause] ...]
structured-block

where clause can be any of the following:
- private(variable-list)
- captureprivate(variable-list)

private
The private clause creates a private, default-constructed version for each object in variable-list for the task. The original object referenced by the variable has an indeterminate value upon entry to the construct, must not be modified within the dynamic extent of the construct, and has an indeterminate value upon exit from the construct.

captureprivate
The captureprivate clause creates a private, copy-constructed version for each object in variable-list for the task at the time the task is enqueued. The original object referenced by each variable retains its value but must not be modified within the dynamic extent of the task construct.

Combined parallel and taskq Construct

#pragma intel omp parallel taskq [clause [clause] ...]
structured-bl
209. e during execution. Default: Number of processors.
Enables (TRUE) or disables (FALSE) the dynamic adjustment of the number of threads.
Enables (TRUE) or disables (FALSE) nested parallelism.

Intel Extension Environment Variables

Environment Variable / Description / Default
KMP_LIBRARY: Selects the OpenMP run-time library throughput. The options for the variable value are serial, turnaround, or throughput, indicating the execution mode. The default value of throughput is used if this variable is not specified. Default: throughput (execution mode).
KMP_STACKSIZE: Sets the number of bytes to allocate for each parallel thread to use as its private stack. Use the optional suffix b, k, m, g, or t to specify bytes, kilobytes, megabytes, gigabytes, or terabytes. Default: IA-32: 2m; Itanium compiler: 4m.

OpenMP Run-time Library Routines

OpenMP provides several run-time library functions to assist you in managing your program in parallel mode. Many of these functions have corresponding environment variables that can be set as defaults. The run-time library functions enable you to dynamically change these factors to assist in controlling your program. In all cases, a call to a run-time library function overrides any corresponding environment variable. The following table specifies the interfaces to these routines. The names for the
210. e following complex functions.

CABS
Description: The cabs function returns the complex absolute value of z.
Calling interface:
double cabs(double _Complex z);
long double cabsl(long double _Complex z);
float cabsf(float _Complex z);

CACOS
Description: The cacos function returns the complex inverse cosine of z.
Calling interface:
double _Complex cacos(double _Complex z);
long double _Complex cacosl(long double _Complex z);
float _Complex cacosf(float _Complex z);

CACOSH
Description: The cacosh function returns the complex inverse hyperbolic cosine of z.
Calling interface:
double _Complex cacosh(double _Complex z);
long double _Complex cacoshl(long double _Complex z);
float _Complex cacoshf(float _Complex z);

CARG
Description: The carg function returns the value of the argument in the interval [-pi, +pi].
Calling interface:
double carg(double _Complex z);
long double cargl(long double _Complex z);
float cargf(float _Complex z);

CASIN
Description: The casin function returns the complex inverse sine of z.
Calling interface:
double _Complex casin(double _Complex z);
long double _Complex casinl(long double _Complex z);
float _Complex casinf(float _Complex z);

CASINH
Description: The casinh function returns the complex inverse hyperbolic sine of z.
Calling interface:
double _Complex casinh(double _Complex z);
long double _Complex casinhl(long double _Complex z);
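To illustrate what cabs and carg compute, here is a small sketch that uses the standard C++ <complex> facilities (std::abs and std::arg) rather than the C99 functions listed in this excerpt; the wrapper names magnitude and phase are hypothetical. It assumes the C++ library functions agree with the C99 definitions, which holds for the modulus and the principal argument.

```cpp
#include <cmath>
#include <complex>

// Modulus of z: sqrt(re*re + im*im), the quantity cabs() is documented to return.
// magnitude is a hypothetical wrapper name used only for this illustration.
inline double magnitude(const std::complex<double>& z) {
    return std::abs(z);
}

// Principal argument of z, in the interval [-pi, +pi], as documented for carg().
// phase is a hypothetical wrapper name used only for this illustration.
inline double phase(const std::complex<double>& z) {
    return std::arg(z);
}
```

For example, magnitude applied to 3 + 4i is approximately 5, and phase applied to the imaginary unit is approximately pi/2.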
211. e information file pgopti.dpi dpi.
Sets the project name.
Generates dynamic execution counts.
Treats partially covered code as fully covered code.
Sets the filename that contains the list of files of interest.
Finds the differential coverage with respect to ref_dpi_file.
Demangles both function names and their arguments.
Sets the name of the web page owner.
Sets the email address of the web page owner.
Sets the html color name or code of the uncovered blocks. Default: #ffff99.
Sets the html color name or code of the uncovered functions. Default: #ffcccc.
Sets the html color name or code of the partially covered code. Default: #fafad2.
Sets the html color name or code of the covered code. Default: #ffffff.
Sets the html color name or code of the unknown code. Default: #ffffff.

Visual Presentation of the Application's Code Coverage

Based on the profile information collected from running the instrumented binaries when testing an application, the Intel compiler creates HTML files using a code coverage tool. These HTML files indicate portions of the source code that were or were not exercised by the tests. When applied to the profile of the performance workloads, the code coverage information shows how well the training workload covers the application's critical code. High coverage of performance-critical modules is essential to taking full advantage of profile-guided optimizations.

The code coverage tool can create two levels of coverage:
- T
212. e intrinsics to add two arrays.

__declspec(cpu_specific(pentium))
void array_sum(int *r, int *a, int *b, size_t l)
{
  for (; length > 0; l--)
    *result++ = *a++ + *b++;
}

/* Implementation for a Pentium processor with MMX technology uses
   an MMX instruction intrinsic to add four elements simultaneously. */
__declspec(cpu_specific(pentium_MMX))
void array_sum(int *r, int const *a, int *b, size_t l)
{
  __m64 *mmx_result = (__m64 *)result;
  __m64 const *mmx_a = (__m64 const *)a;
  __m64 const *mmx_b = (__m64 const *)b;
  for (; length > 3; length -= 4)
    *mmx_result++ = _mm_add_pi16(*mmx_a++, *mmx_b++);
  /* The following code, which takes care of excess elements, is not
     needed if the array sizes passed are known to be multiples of four. */
  result = (unsigned short *)mmx_r;
  a = (unsigned short const *)mmx_a;
  b = (unsigned short const *)mmx_b;
  for (; length > 0; l--)
    *result++ = *a++ + *b++;
}

__declspec(cpu_dispatch(pentium, pentium_MMX))
void array_sum(int *r, int const *a, int *b, size_t l)
{
  /* Empty function body informs the compiler to generate the
     CPU-dispatch function listed in the cpu_dispatch clause. */
}

Processor-specific Runtime Checks, IA-32 Systems

The Intel C++ Compiler optimizations take effect at run time. For IA-32 systems, the compiler enhances processor-specific optimizations by inserting a code segment in the program that performs the run ti
213. e length; Number of external identifiers/file; Number of identifiers in a single block; Number of macros simultaneously defined; Number of parameters to a function call; Number of parameters per macro; Number of characters in a string; Bytes in an object; Include file nesting depth; Case labels in a switch; Members in one structure or union; Enumeration constants in one enumeration; Levels of structure nesting; Size of arrays.

Tested Values: 512, 512, 512, 512, 2048, 64K, 128K, 2048, 128K, 512, 512, 128K, 512K, 512, 32K, 32K, 8192, 320, 2 GB.

Key Files

Key Files Summary for IA-32 Compiler

The following tables list and briefly describe files that are installed for use by the IA-32 version of the compiler.

/bin Files

File / Description
codecov: Code coverage tool
iccvars.sh, iccvars.csh: Batch file to set environment variables
icc, icpc: Scripts that check for license file and call compiler driver
iccbin, icpcbin: Compiler drivers
mcpcom: Intel C++ Compiler
iccbin, icpcbin: Compiler drivers
profmerge: Utility used for Profile Guided Optimizations
proforder: Utility used for Profile Guided Optimizations
tselect: Test prioritization tool
xiar: Tool used for Interprocedural Optimizations
xild: Tool used for Interprocedural Optimizations

/include Files

File: dvec.h, emm_func.h, emmintrin.h, float.h, fvec.h, iso646.h, ivec.h, limits.h
Description: SSE
214. e mm_cmple_ x epi32 epil epis8 pi32 pile pis _mm_and_l y sil28 sil28 sil28 si6 4 si6 4 si6 4 _mm_andnot_ y si128 sil28 i128 si64 si64 si64 _mm_or_l y sil28 sil28 sil28 si64 si64 si64 436 Intrinsic Operators Corresponding lect_nl mm_cmp1 select_nle _mm_cmple Note that_mm_andnot_ y intrinsics do not apply to the fvec classes N A N A x N A N A N A N A N A N A N A N A N A N A N A Conditional Select Operators Corresponding Intrinsics and Classes Part 2 Intrinsic select_ q mm_cmpeq Operators Corresponding _mm_and_l y _mm_or_ y select_neg _mm_cmpeq _mm_andnot_ _mm_and_l y _mm_andnot _mm_or_ y x pd ps _lyl select_gt mm_cmpgt _mm_and_ y _mm_or_ y select_ge mm_cmpge _mm_andnot_ _mm_and_ y _mm_or_ y _mm_andnot_ select lt mm_cmplt _mm_and_ y _mm_andnot_ _mm_or_ y select_le mm_cmple _mm_and_ y _mm_or_ y select_ngt _mm_cmpgt _mm_andnot_ select_nge _mm_cmpge select_nlt _mm_cmplt select_nle _mm_cmple x pd ps ss F64vec2 F32vec4 F32vec1 N A N A N A N A Reference I32vec4 l16vec8 I8veci6 I32vec2 116vec4 I8vec8 lect_ngt mm_cmpgt N A lect_nge _mm_cmpge N A 437 Intel C Compiler for Linux Systems User s Guide Packing and Unpacking Operators Corresponding Intrinsics and Cl
215. e the 32-bit value from the low half of A with the 32-bit value from the low half of B.
R0 := A0; R1 := B0
Corresponding intrinsic: _mm_unpacklo_epi32

Interleave the 64-bit value from the low half of A with the 64-bit value from the low half of B.
I64vec2 unpack_low(I64vec2 A, I64vec2 B);
Is64vec2 unpack_low(Is64vec2 A, Is64vec2 B);
Iu64vec2 unpack_low(Iu64vec2 A, Iu64vec2 B);
R0 := A0; R1 := B0
Corresponding intrinsic: _mm_unpacklo_epi64

Interleave the two 32-bit values from the low half of A with the two 32-bit values from the low half of B.
I32vec4 unpack_low(I32vec4 A, I32vec4 B);
Is32vec4 unpack_low(Is32vec4 A, Is32vec4 B);
Iu32vec4 unpack_low(Iu32vec4 A, Iu32vec4 B);
R0 := A0; R1 := B0; R2 := A1; R3 := B1
Corresponding intrinsic: _mm_unpacklo_epi32

Interleave the 32-bit value from the low half of A with the 32-bit value from the low half of B.
I32vec2 unpack_low(I32vec2 A, I32vec2 B);
Is32vec2 unpack_low(Is32vec2 A, Is32vec2 B);
Iu32vec2 unpack_low(Iu32vec2 A, Iu32vec2 B);
R0 := A0; R1 := B0
Corresponding intrinsic: _mm_unpacklo_pi32

Interleave the four 16-bit values from the low half of A with the four 16-bit values from the low half of B.
I16vec8 unpack_low(I16vec8 A, I16vec8 B);
Is16vec8 unpack_low(Is16vec8 A, Is16vec8 B);
Iu16vec8 unpack_low(Iu16vec8 A, Iu16vec8 B);
R0 := A0; R1 := B0; R2 := A1; R3 := B1; R4
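The interleaving pattern for four 32-bit elements (R0 = A0, R1 = B0, R2 = A1, R3 = B1) can be checked with a scalar model. This is a sketch only: Vec4, unpack_low_model and unpack_high_model are hypothetical names, std::array stands in for the I32vec4 class, and the high-half variant is included on the assumption that it mirrors the low-half rule using elements 2 and 3.

```cpp
#include <array>
#include <cstdint>

// Stand-in for a vector of four 32-bit integers (hypothetical type alias,
// not the I32vec4 class from ivec.h).
using Vec4 = std::array<std::int32_t, 4>;

// Scalar model of unpack_low: interleave the low halves of a and b.
// R0 = A0, R1 = B0, R2 = A1, R3 = B1.
inline Vec4 unpack_low_model(const Vec4& a, const Vec4& b) {
    return Vec4{{a[0], b[0], a[1], b[1]}};
}

// Scalar model of unpack_high: interleave the high halves of a and b.
// R0 = A2, R1 = B2, R2 = A3, R3 = B3.
inline Vec4 unpack_high_model(const Vec4& a, const Vec4& b) {
    return Vec4{{a[2], b[2], a[3], b[3]}};
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the low model yields {1, 10, 2, 20} and the high model yields {3, 30, 4, 40}.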
216. e the block coverage in the case of covered functions
- the function names

This example shows the coverage source view of SAMPLE.C.

[Figure: coverage source view of SAMPLE.C]

Setting the Coloring Scheme for the Code Coverage

The tool provides a visible coloring distinction of the following coverage categories: covered code, uncovered basic blocks, uncovered functions, partially covered code, and unknown. The default colors that the tool uses for presenting the coverage information are shown in the table that follows.

This color / Means
Covered code: The portion of code colored in this color was exercised by the tests. The default color can be overridden with the -ccolor option.
Uncovered basic block: Basic blocks that are colored in this color were not exercised by any of the tests. They were, however, within functions that were executed during the tests. The default color can be overridden with the -bcolor option.
Uncovered function: Functions that are colored in this color were never called during the tests. The default color can be overridden with the -fcolor option.
Partia
217. e the new data type only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use it with other arithmetic expressions, and so on.
- Use the new data type as objects in aggregates, such as unions to access the byte elements and structures; the address of an __m64 object may be taken.
- Use new data types only with the respective intrinsics described in this documentation.

For complete details of the hardware instructions, see the Intel Architecture MMX Technology Programmer's Reference Manual. For descriptions of data types, see the Intel Architecture Software Developer's Manual, Volume 2.

Streaming SIMD Extensions

This section describes the C++ language-level features supporting the Streaming SIMD Extensions (SSE) in the Intel C++ Compiler. These topics explain the following features of the intrinsics:
- Floating Point Intrinsics
- Arithmetic Operation Intrinsics
- Logical Operation Intrinsics
- Comparison Intrinsics
- Conversion Intrinsics
- Load Operations
- Set Operations
- Store Operations
- Cacheability Support
- Integer Intrinsics
- Memory and Initialization Intrinsics
- Miscellaneous Intrinsics
- Using Streaming SIMD Extensions on Itanium Architecture

The prototypes for SSE intrinsics are in the xmmintrin.h header file.

Note: You can also use the single ia32intrin.h header file for any IA-32 intrinsics.

Floating-point Intrinsics for Streaming SIMD Extensions

You should be familiar with
218. e four 16-bit values from the high half of B.
I16vec8 unpack_high(I16vec8 A, I16vec8 B);
Is16vec8 unpack_high(Is16vec8 A, Is16vec8 B);
Iu16vec8 unpack_high(Iu16vec8 A, Iu16vec8 B);
R0 := A4; R1 := B4; R2 := A5; R3 := B5; R4 := A6; R5 := B6; R6 := A7; R7 := B7
Corresponding intrinsic: _mm_unpackhi_epi16

Interleave the two 16-bit values from the high half of A with the two 16-bit values from the high half of B.
I16vec4 unpack_high(I16vec4 A, I16vec4 B);
Is16vec4 unpack_high(Is16vec4 A, Is16vec4 B);
Iu16vec4 unpack_high(Iu16vec4 A, Iu16vec4 B);
R0 := A2; R1 := B2; R2 := A3; R3 := B3
Corresponding intrinsic: _mm_unpackhi_pi16

Interleave the four 8-bit values from the high half of A with the four 8-bit values from the high half of B.
I8vec8 unpack_high(I8vec8 A, I8vec8 B);
Is8vec8 unpack_high(Is8vec8 A, I8vec8 B);
Iu8vec8 unpack_high(Iu8vec8 A, I8vec8 B);
R0 := A4; R1 := B4; R2 := A5; R3 := B5; R4 := A6; R5 := B6; R6 := A7; R7 := B7
Corresponding intrinsic: _mm_unpackhi_pi8

Interleave the eight 8-bit values from the high half of A with the eight 8-bit values from the high half of B.
I8vec16 unpack_high(I8vec16 A, I8vec16 B);
Is8vec16 unpack_high(Is8vec16 A, I8vec16 B);
Iu8vec16 unpack_high(Iu8vec16 A, I8vec16 B);
R0 := A8; R1 := B8; R2 := A9; R3 := B9; R4 := A10; R5 := B10; R6 := A11; R7 := B11; R8 := A12; R9 := B12; R10 := A13; R11 := B13; R12 := A14; R13 := B14; R14 := A15; R15 := B15
Corresponding intrinsic: _mm_unpackhi_epi8

Interleav
219. each floating-point to integer conversion and change it back afterwards. The -rcd option disables the change to truncation of the rounding mode for all floating-point calculations, including floating-point to integer conversions. Turning on this option can improve performance, but floating-point conversions to integer will not conform to C semantics.

-fp_port Option

The -fp_port option rounds floating-point results at assignments and casts. An impact on speed may result.

-fpstkchk Option

When a function call returns a floating-point value, the return value should be placed at the top of the FP stack. If the return value is unused, the compiler pops the value off the stack to keep the FP stack in the correct state. However, if the application leaves out the function's prototype or incorrectly prototypes the function, then the return value may remain on the stack. This may result in the FP stack filling up and eventually overflowing. Generally, when the FP stack overflows, a NaN value is put into FP calculations and the program's results differ. Unfortunately, the overflow point can be far away from the point of the actual bug. The -fpstkchk option places code that would access-violate immediately after an incorrect call occurred, thus making it easier to locate these issues.

Floating-point Arithmetic Options for Itanium-based Systems

The following options enable you to control the compiler optimizations for floating-point computations o
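The remark that conversions under -rcd "will not conform to C semantics" refers to the language rule that a floating-point to integer conversion truncates toward zero. The sketch below contrasts that rule with rounding in the current rounding mode; it does not reproduce the option's behavior, and the function names are hypothetical. It assumes the default round-to-nearest mode is in effect.

```cpp
#include <cmath>

// Language-mandated conversion: the fractional part is discarded
// (truncation toward zero). Hypothetical helper name.
inline int to_int_truncating(double x) {
    return static_cast<int>(x);
}

// Conversion that honors the current rounding mode (round-to-nearest,
// ties-to-even, unless the program has changed it). Hypothetical helper name.
inline long to_int_current_mode(double x) {
    return std::lrint(x);
}
```

For 2.7 the truncating conversion gives 2, while the rounding-mode conversion gives 3; for -2.7 the results are -2 and -3. Code that depends on truncation is therefore sensitive to any option that leaves the rounding mode unchanged during conversions.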
220. eaded code, you can set the number of desired threads in the OpenMP environment variable OMP_NUM_THREADS. See OpenMP Environment Variables for further information.

-openmp Option

The -openmp option enables the parallelizer to generate multithreaded code based on the OpenMP directives. The code can be executed in parallel on both uniprocessor and multiprocessor systems. The -openmp option works with both -O0 (no optimization) and any optimization level of -O1, -O2 (default) and -O3. Specifying -O0 with -openmp helps to debug OpenMP applications.

OpenMP Directive Format and Syntax

An OpenMP directive has the form:

#pragma omp directive-name [clause, ...] newline

where:
- #pragma omp: Required for all OpenMP directives.
- directive-name: A valid OpenMP directive. Must appear after the pragma and before any clauses.
- clause: Optional. Clauses can be in any order, and repeated as necessary, unless otherwise restricted.
- newline: Required. Precedes the structured block which is enclosed by this directive.

OpenMP Diagnostics

The -openmp_report{0|1|2} option controls the OpenMP parallelizer's diagnostic levels 0, 1, or 2 as follows:
-openmp_report0: no diagnostic information is displayed.
-openmp_report1: display diagnostics indicating loops, regions, and sections successfully parallelized.
-openmp_report2: same as -openmp_report1, plus diagnostics indicating MASTER constructs, SINGLE constructs
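As an illustration of the directive form described in this excerpt, the following sketch uses the standard OpenMP combined directive parallel for as the directive-name and reduction as the clause. The function name parallel_sum is hypothetical. When the code is compiled without OpenMP support enabled, the pragma is ignored and the loop simply runs sequentially with the same result.

```cpp
#include <vector>

// Sums the elements of v. The pragma has the shape
//   #pragma omp <directive-name> <clause> <newline>
// followed by the structured block (here, the for loop).
// parallel_sum is a hypothetical name used only for this illustration.
inline double parallel_sum(const std::vector<double>& v) {
    double total = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += v[i];  // each thread accumulates a private partial sum
    }
    return total;
}
```

The number of threads used at run time can be set through OMP_NUM_THREADS, as noted at the start of this excerpt.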
221. eal with the low-level details of iteration space partitioning, data sharing, and thread scheduling and synchronization.
- Provides the benefit of the performance available from shared memory, multiprocessor systems.

The Intel C++ Compiler performs transformations to generate multithreaded code based on the user's placement of OpenMP directives in the source program, making it easy to add threading to existing software. The Intel compiler supports all of the current industry-standard OpenMP directives, except WORKSHARE, and compiles parallel programs annotated with OpenMP directives. In addition, the Intel C++ Compiler provides Intel-specific extensions to the OpenMP C++ version 2.0 specification, including run-time library routines and environment variables.

Note: As with many advanced features of compilers, you must properly understand the functionality of the OpenMP directives in order to use them effectively and avoid unwanted program behavior.

See parallelization options summary for all of the options of the OpenMP feature in the Intel C++ Compiler. For complete information on the OpenMP standard, visit the OpenMP Web site at http://www.openmp.org. For OpenMP C++ version 2.0 API specifications, see http://www.openmp.org/specs.

Parallel Processing with OpenMP

To compile with OpenMP, you need to prepare your program by annotating the code with OpenMP directives. The Intel C++ Compiler first processes the application and produces a multi
222. earest common ancestor of Iu8vec8 and Is8vec8 is I8vec8. Also, the nearest common ancestor between Iu8vec8 and I16vec4 is M64.
- Casting: Changes the data type from one class to another. When an operation uses different data types as operands, the return value of the operation must be assigned to a single data type. Therefore, one or more of the data types must be converted to a required data type. This conversion is known as a typecast. Sometimes typecasting is automatic; other times you must use special syntax to explicitly typecast it yourself.
- Operator Overloading: This is the ability to use various operators on the same user-defined data type of a given class. Once you declare a variable, you can add, subtract, multiply, and perform a range of operations. Each family of classes accepts a specified range of operators, and must comply with rules and restrictions regarding typecasting and operator overloading as defined in the header files.

The following table shows the notation used in this documentation to address typecasting, operator overloading, and other rules.

Class Syntax Notation Conventions

Class Name / Description
Any value except I128vec1 or I64vec1
I64vec1: __m64 data type
I[s|u]64vec2: two 64-bit values of any signedness
I[s|u]32vec4: four 32-bit values of any signedness
I[s|u]8vec16: sixteen 8-bit values of any signedness
I[s|u]16vec8: eight 16-bit values of any sig
223. eate real object files when used with -ipo[value]. Default: OFF.
Generates a multifile assemblable file named ipo_out.s that can be used in further link steps. Default: OFF.
-ipo_separate: Creates one object file for every source file. This option overrides -ipo[value]. Default: OFF.
-isystemdir: Add directory dir to the start of the system include path. Default: OFF.
-ivdep_parallel (i64 only): This option indicates there is absolutely no loop-carried memory dependency in the loop where the IVDEP directive is specified. Default: OFF.
-Kc++: Compile all source or unrecognized file types as C++ source files. Default: OFF.
-kernel (i64 only): Generates code for inclusion in the kernel. Prevents generation of speculation, as support may not be available when code runs. Suppresses software pipelining. Default: OFF.
-Knopic, -KNOPIC (i64 only): Use -fpic instead of this option. Default: ON for Itanium-based systems, OFF for IA-32.
-KPIC, -Kpic: Use -fpic instead of this option. Default: OFF.
-Ldirectory: Instruct linker to search directory for libraries. Default: OFF.
-M: Generates makefile dependency lines for each source file, based on the #include lines found in the source file. Default: OFF.
-march=cpu (i32 only): Generate code exclusively for a given cpu. Default: OFF. Values for cpu are:
- pentiumpro: Intel Pentium Pro processors
- pentiumii: Intel Pentium II processors
- pentiumiii: Intel Pentium III processors
- pentium: Intel
224. ec ocal exec g0 Disable generation of symbolic OFF debug information no global hoist Enables disables hoisting and OFF speculative loads of global variables ipo value Enables interprocedural OFF optimizations across files The optional value argument controls the maximum number of link time compilations or number of object files that are spawned The default for value is 1 when value is not specified for small applications It will generate two or more object files for large applications ipo_separate Creates one object file for every OFF source file This option overrides ipo value kernel Generates code for inclusion in the OFF i64 only kernel Prevents generation of speculation as support may not be available when code runs Suppresses software pipelining MP Add a phony target for each OFF dependency MQtarget Same as MT but quotes special OFF Make characters MTtarget Change the default target rule for OFF dependency generation 0s Enable speed optimizations but OFF disable some optimizations which increase code size for small speed benefit Compiler Options Quick Reference Option Description Default Qlocation gas path Specifies the GNU assembler OFF Qlocation gld path Specifies the GNU linker OFF reserve kernel regs Reserves registers f12 f15 and OFF 164 only 32 f127 for use by the kernel These will not be used by the compiler std gnu8
225. ... 9; Compiler Options Cross Reference, 30; Default Compiler Options, 37; Deprecated and Unsupported Compiler Options, 38; Volume I: Building Applications, 39; Building Applications from the Command Line, 41; Building Applications in Eclipse, 43; Compilation Options, 67; Linking, 82; Debugging, 83; Creating and Using Libraries, 87; gcc Compatibility, 95; Language Conformance, 104; Volume II: Optimizing Applications, 108; Optimization Levels, 108; Floating-point Optimizations
226. ector n must be an immediate. r := (n == 0) ? a0 : ((n == 1) ? a1 : ((n == 2) ? a2 : a3)). __m64 _m_pinsrw(__m64 a, int d, int n): Inserts word d into one of four words of a. The selector n must be an immediate. r0 := (n == 0) ? d : a0; r1 := (n == 1) ? d : a1; r2 := (n == 2) ? d : a2; r3 := (n == 3) ? d : a3. __m64 _m_pmaxsw(__m64 a, __m64 b): Computes the element-wise maximum of the words in a and b. r0 := max(a0, b0); r1 := max(a1, b1); r2 := max(a2, b2); r3 := max(a3, b3). __m64 _m_pmaxub(__m64 a, __m64 b): Computes the element-wise maximum of the unsigned bytes in a and b. r0 := max(a0, b0); r1 := max(a1, b1); ... r7 := max(a7, b7). __m64 _m_pminsw(__m64 a, __m64 b): Computes the element-wise minimum of the words in a and b. r0 := min(a0, b0); r1 := min(a1, b1); r2 := min(a2, b2); r3 := min(a3, b3). __m64 _m_pminub(__m64 a, __m64 b): Computes the element-wise minimum of the unsigned bytes in a and b. r0 := min(a0, b0); r1 := min(a1, b1); ... r7 := min(a7, b7). int _m_pmovmskb(__m64 a): Creates an 8-bit mask from the most significant bits of the bytes in a. r := sign(a7) << 7 | sign(a6) << 6 | ... | sign(a0). __m64 _m_pmulhuw(__m64 a, __m64 b): Multiplies the unsigned words in a and b, returning the upper 16 bits of the 32-bit intermediate results. r0 := hiword(a0 * b0); r1 := hiword(a1 * b1); r2 := hiword(a2 * b2); r3 := hiword(a3 * b3). __m64 _m_pshufw(__m64 a, int n): R
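The word-insert and byte-mask semantics above can be checked against a plain scalar model. The sketch below is illustrative only: `pinsrw_model` and `pmovmskb_model` are hypothetical helper names, not part of any Intel header, and they operate on ordinary arrays rather than on `__m64` values.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of _m_pinsrw (hypothetical helper): copy a into r,
   replacing word n with d. r must not alias a. */
static void pinsrw_model(const uint16_t a[4], uint16_t d, unsigned n,
                         uint16_t r[4])
{
    assert(n < 4); /* the real selector is an immediate in 0..3 */
    for (unsigned k = 0; k < 4; ++k)
        r[k] = (k == n) ? d : a[k];
}

/* Scalar model of _m_pmovmskb (hypothetical helper): bit k of the
   result is the most significant bit of byte k. */
static int pmovmskb_model(const uint8_t a[8])
{
    int mask = 0;
    for (int k = 0; k < 8; ++k)
        mask |= ((a[k] >> 7) & 1) << k;
    return mask;
}
```

A model like this is mainly useful as a test oracle when validating code that calls the real intrinsics.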
227. ed in ecx. The hints parameter will contain hints to the monitor hardware, which will be passed in edx. A non-zero value for extensions will cause a general protection fault. extern void _mm_mwait(unsigned extensions, unsigned hints): Generates the MWAIT instruction. This instruction is a hint that allows the processor to stop execution and enter an implementation-dependent optimized state until occurrence of a class of events. In future processor designs, extensions and hints parameters may be used to convey additional information to the processor. All non-zero values of extensions and hints are reserved. A non-zero value for extensions will cause a general protection fault. Intrinsics for Itanium Instructions: This section lists and describes the native intrinsics for Itanium instructions. These intrinsics cannot be used on the IA-32 architecture. The intrinsics for Itanium instructions give programmers access to Itanium instructions that cannot be generated using the standard constructs of the C and C++ languages. The prototypes for these intrinsics are in the ia64intrin.h header file. Native Intrinsics for Itanium Instructions: The prototypes for these intrinsics are in the ia64intrin.h header file. Integer Operations. Intrinsic: __int64 _m64_dep_mr(__int64 r, __int64 s, const int pos, const int len); __int64 _m64_dep_mi(const int v, __int64 s, const int p, cons
228. ed or unsigned 16-bit integers in a left by count bits while shifting in zeros. r0 := a0 << count; r1 := a1 << count; ... r7 := a7 << count. __m128i _mm_slli_epi32(__m128i a, int count): Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros. r0 := a0 << count; r1 := a1 << count; r2 := a2 << count; r3 := a3 << count. __m128i _mm_sll_epi32(__m128i a, __m128i count): Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros. r0 := a0 << count; r1 := a1 << count; r2 := a2 << count; r3 := a3 << count. __m128i _mm_slli_epi64(__m128i a, int count): Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros. r0 := a0 << count; r1 := a1 << count. __m128i _mm_sll_epi64(__m128i a, __m128i count): Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros. r0 := a0 << count; r1 := a1 << count. __m128i _mm_srai_epi16(__m128i a, int count): Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit. r0 := a0 >> count; r1 := a1 >> count; ... r7 := a7 >> count. __m128i _mm_sra_epi16(__m128i a, __m128i count): Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit. r0 := a0 >> count; r1 := a1 >>
229. ee whether software pipelining was applied, see Optimizer Report Generation. Loop Count and Loop Distribution. loop count (n) Directive: The loop count (n) directive indicates the loop count is likely to be n. The syntax for this directive is: #pragma loop count (n), where n is an integer constant. The value of loop count affects heuristics used in software pipelining, vectorization, and loop transformations. Example of loop count (n) Directive:
    #pragma loop count (10000)
    for (i = 0; i < m; i++)
    {
        // swp likely to occur in this loop
        a[i] = b[i] + 1.2;
    }
distribute point Directive: The distribute point directive indicates to the compiler a preference for performing loop distribution. The syntax for this directive is: #pragma distribute point. Loop distribution may cause large loops to be distributed into smaller ones. This may enable software pipelining for more loops. If the directive is placed inside a loop, the distribution is performed after the directive and any loop-carried dependency is ignored. If the directive is placed before a loop, the compiler will determine where to distribute, and data dependency is observed. Only one distribute directive is supported when placed inside the loop. Example of distribute point Directive:
    #pragma distribute point
    for (i = 1; i < m; i++)
    {
        // Compiler will automatically decide where to distribute. Data dependency is observed
230. egory Compiler Options Option Use Option Category Name General Show startup banner V Include debug information 9 Optimization level 00 default for Debug Config 01 02 default for Release Config 03 fast Warning level w0 w1 default w2 Optimization Provide frame pointers fp Disable prefetch insertion prefetch Enable interprocedural optimization for single file ip compilation Disable intrinsic inline expansion nolib_inline Inline function expansion Ob0 Ob1 Ob2 Optimize for Intel processor tpps tpp6 tpp7 default Parallelization parallel Precompiled Automatic Processing for Precompiled Headers pch Headers Preprocessor gcc compatibility options cxxlib icc cxxlib gec 60 Volume I Building Applications Option Use Category fabi version gcc version Additional include directories Ignore standard include path Preprocessor definitions Do not predefine _GNUC_ _GNUC_MINOR_ no gcc _GNUC_PATCHLEVEL_ macros Undefine preprocessor definitions uU Undefine all preprocessor definitions A Language Enable use of ANSI aliasing rules in optimizations ansi_alias Disable C99 support c99 Recognize the restrict keyword restrict Process OpenMP directives openmp openmp_stubs Compilation Treat warnings as errors Werror Diagnostics Allow usage messages Wcheck Enable equivalent of GNU ANSI ansi Strict ANSI c
231. elines to implement automatic processor dispatch support. Stub for cpu_dispatch must have a cpuid defined in cpu_specific elsewhere: If the cpu_dispatch stub for a function f contains the cpuid p, then a cpu_specific definition of f with cpuid p must appear somewhere in the program; otherwise an unresolved external error is reported. A cpu_specific function definition need not appear in the same translation unit as the corresponding cpu_dispatch stub, unless the cpu_specific function is declared static. The inline attribute is disabled for all cpu_specific and cpu_dispatch functions. Must have a stub for cpu_specific function: If a function f is defined as __declspec(cpu_specific(p)), then a cpu_dispatch stub must also appear for f within the program, and p must be in the cpuid list of that stub; otherwise that cpu_specific definition cannot be called nor generate an error condition. Overrides command line settings: When a cpu_dispatch stub is compiled, its body is replaced with code that determines the processor on which the program is running, then dispatches the best cpu_specific implementation available as defined by the cpuid list. The cpu_specific function optimizes to the specified Intel processor regardless of command-line option settings. Processor Dispatch Example: Here is an example of how these features can be used: #include <mmintrin.h> /* Pentium processor function does not us
232. en compiling a C++ translation unit. __STDC_HOSTED__: The integer 1. __TIME__: The time of compilation, as a string literal in the form "hh:mm:ss". C99 Support: The following C99 features are supported in this version of the Intel C++ Compiler: compound literals; restricted pointers (restrict keyword, available with -restrict; see Note); variable-length arrays; flexible array members; complex number support (_Complex keyword); hexadecimal floating-point constants; designated initializers; mixed declarations and code; macros with a variable number of arguments; inline functions (inline keyword); boolean type (_Bool keyword). Note: The -restrict option enables the recognition of the restrict keyword as defined by the ANSI standard. By qualifying a pointer with the restrict keyword, the user asserts that an object accessed via the pointer is only accessed via that pointer in the given scope. It is the user's responsibility to use the restrict keyword only when this assertion is true. In these cases, the use of restrict will have no effect on program correctness, but may allow better optimization. These features are not supported: #pragma STDC FP_CONTRACT; #pragma STDC FENV_ACCESS; #pragma STDC CX_LIMITED_RANGE; long double (128-bit representations). Conformance to the C++ Standard: The Intel C++ Compiler conforms to the ANSI/ISO standard (ISO/IEC 14882:1998) for the C++ language
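The restrict assertion described in the note can be illustrated with a small C99 function. This is a minimal sketch: `vec_scale` is a hypothetical name chosen for illustration, and whether a given compiler actually vectorizes the loop depends on its options and target.

```c
#include <assert.h>
#include <stddef.h>

/* dest and src are restrict-qualified: the caller promises that the two
   arrays do not overlap, so the compiler is free to optimize the loop
   without generating runtime overlap checks. */
static void vec_scale(float *restrict dest, const float *restrict src,
                      float factor, size_t len)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < len; ++i)
        dest[i] = src[i] * factor;
}
```

Calling `vec_scale(buf, buf, 2.0f, n)` would break the promise made by restrict; as the note says, keeping the assertion true is the caller's responsibility.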
233. ence. Description: The asm template is a C language ASCII string which specifies how to output the assembly code for an instruction. Most of the template is a fixed string; everything but the substitution directives, if any, is passed through to the assembler. The syntax for a substitution directive is a % followed by one or two characters. The supported substitution directives are specified in a subsequent section. The asm interface consists of three parts: 1. an optional output list; 2. an optional input list; 3. an optional clobber list. These are separated by colon (:) characters. If the output list is missing, but an input list is given, the input list may be preceded by two colons (::) to take the place of the missing output list. If the asm interface is omitted altogether, the asm statement is considered volatile regardless of whether a volatile keyword was specified. An output list consists of one or more output specs separated by commas. For the purposes of substitution in the asm template, each output spec is numbered. The first operand in the output list is numbered 0, the second is 1, and so on. Numbering is continuous through the output list and into the input list. The total number of operands is limited to 10 (i.e., 0-9). Similar to an output list, an input list consists of one or more input specs separated by commas. For the purposes of substitution in the asm template, each input spec is numbered, wit
234. endental instructions. -complex_limited_range[-]: This option enables the use of the basic algebraic expansions of some complex arithmetic operations. At the loss of some exponent range, the -complex_limited_range option can allow for some performance improvement in programs which utilize complex arithmetic. By default, the compiler disables this option by using -complex_limited_range-. Options for IA-32 Only: A change of the default precision control or rounding mode (for example, by using the -pc32 flag or by user intervention) may affect the results returned by some of the mathematical functions. -prec_div Option: With some optimizations, the Intel C++ Compiler changes floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A x (1/B) to improve the speed of the computation. However, for values of B greater than 2^126, the value of 1/B is "flushed" (changed) to 0. When it is important to maintain the value of 1/B, use -prec_div to disable the floating-point division-to-multiplication optimization. The result of -prec_div is greater accuracy with some loss of performance. -pcn Option: Use the -pcn option to enable floating-point significand precision control. Some floating-point algorithms are sensitive to the accuracy of the significand, or fractional part, of the floating-point value. For example, iterative operations like division and finding the square root can r
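The reciprocal hazard behind -prec_div can be reproduced in portable C. In IEEE single precision, the reciprocal of any value larger than 2^126 falls below FLT_MIN, so it is subnormal, and hardware running in flush-to-zero mode replaces it with 0. The sketch below only classifies the reciprocal; it does not enable flush-to-zero itself, and the helper names are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Computes 1.0f / b and forces the result through a float-typed object
   so that extra intermediate precision is discarded. */
static float reciprocal(float b)
{
    assert(!isnan(b));
    volatile float r = 1.0f / b;
    return r;
}

/* Nonzero if 1/b is zero or subnormal, i.e. a value that flush-to-zero
   hardware would turn into 0 before the multiplication A * (1/B). */
static int reciprocal_is_at_risk(float b)
{
    float r = reciprocal(b);
    return r == 0.0f || fpclassify(r) == FP_SUBNORMAL;
}
```

For example, `reciprocal_is_at_risk(ldexpf(1.0f, 127))` reports the hazard, while the same check for 2^126 does not, because 1/2^126 equals FLT_MIN and is still a normal number.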
235. ent cache lines, which could be detrimental to performance. You can instead declare them as follows: __declspec(align(8)) struct { int i, j; } sub; The compiler now ensures that they are allocated in the same cache line. In C++, you can omit the struct variable name (written as sub in the previous example). In C, however, it is required, and you must write references to i and j as sub.i and sub.j. If you use many functions with such subscript pairs, it is more convenient to declare and use a struct type for them, as in the following example: typedef struct __declspec(align(8)) { int i, j; } Sub; By placing the __declspec(align) after the keyword struct, you are requesting the appropriate alignment for all objects of that type. However, the allocation of parameters is unaffected by __declspec(align). If necessary, you can assign the value of a parameter to a local variable with the appropriate alignment. You can also force alignment of global variables, such as arrays: __declspec(align(16)) float array[1000]; Allocating and Freeing Aligned Memory Blocks: Use the _mm_malloc and _mm_free intrinsics to allocate and free aligned blocks of memory. These intrinsics are based on malloc and free, which are in the libirc.a library. You need to include malloc.h. The syntax for these intrinsics is as follows: void* _mm_malloc(int size, int align); void _mm_free(void *p); The _mm_malloc routine takes an extra parameter, which is the ali
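A short usage sketch for the aligned allocation intrinsics follows. It assumes an x86 target and a GCC-compatible toolchain, where including `xmmintrin.h` also declares `_mm_malloc` and `_mm_free` (the guide itself names malloc.h for the Intel compiler). The helper names `alloc_aligned_floats` and `is_aligned` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* assumption: declares _mm_malloc/_mm_free here */

/* Allocates n floats on a 16-byte boundary; returns NULL on failure.
   The block must be released with _mm_free, never with free. */
static float *alloc_aligned_floats(size_t n)
{
    assert(n > 0);
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Nonzero if p is a multiple of align bytes. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}
```

A caller would typically check the returned pointer, use the buffer with aligned SSE loads and stores, and then pass it to `_mm_free`.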
236. er. If you specify a phase-limiting option, the compiler produces a separate output file representing the output of the last phase that completes for each primary input file. Preprocessor Options: This section describes the options you can use to direct the operations of the preprocessor. Preprocessing performs such tasks as macro substitution, conditional compilation, and file inclusion. Option: Description. -Aname[(values,...)]: Associates a symbol name with the specified sequence of values. Equivalent to an #assert preprocessing directive. -A-: Causes all predefined macros and assertions to be inactive. -C: Preserves comments in preprocessed source output. -Dname[=value]: Defines the macro name and associates it with the specified value. The default (-Dname) defines a macro with a value of 1. -E: Directs the preprocessor to expand your source module and write the result to standard output. -EP: Directs the preprocessor to expand your source module and write the result to standard output. Does not include #line directives in the output. -P: Directs the preprocessor to expand your source module and store the result in a .i file in the current directory. -Uname: Suppresses any automatic definition for the specified macro name. Preprocessing Only: Use the -E, -P, or -EP option to preprocess your source files without compiling them
237. erators and Miscellaneous Exceptions A and B converted to M64 Result assigned to Iu8vec8 T64vecl A Is8vec8 B Tu8vec8 C C A amp B Corresponding _mm_and_si64 _mm_and_sil128 _mm_and_si64 _mm_and_sil28 _mm_and_si64 _mm_and_sil28 _mm_and_si64 _mm_and_sil 28 393 Intel C Compiler for Linux Systems User s Guide Same size and signedness operators return the nearest common ancestor I32vec2 R Is32vec2 A Iu32vec2 B A amp B returns M64 which is cast to Tu8vecs C Iu8vec8 A amp B C When A and B are of the same class they return the same type When A and B are of different classes the return value is the return type of the nearest common ancestor The logical operator returns values for combinations of classes listed in the following tables apply when A and B are of different classes Ivec Logical Operator Overloading Return AND OR XOR NAND A Operand B Operand T64vecl R amp A andnot I s u 64vec2 A I64vec2 R amp A andnot I s u 64vec2 A 64vec2 I32vec2 R amp A andnot I s u 32vec2 A I32vec4 R amp A andnot I s u 32vec4 A 32vec4 Il6vec4 R amp A andnot I s u 16vec4 A l vec4 Il vec8 R amp A andnot I s u 16vec8 A 16vec8 I8vec8 R amp A andnot I s u 8vec8 A 8vec8 B I8vecl6 R amp A andnot I s u 8vec16 A 8vec16 For logical operators with assignment the return value of R is always the same da
238. es Yes http gcc gnu org onlinedocs gec 3 4 0 gcec C Attributes html C 20Attributes Java Exceptions No http gcc gnu org onlinedocs gec 3 4 0 gec Java Exceptions html Java 20Exceptions 98 Volume I Building Applications g Language Intel GNU Description and Examples Extension Support Deprecated Features No http gcc gnu org onlinedocs gec 3 4 0 gec Deprecated Features html Deprecated 20Features Backwards http gcc gnu org onlinedocs gec 3 4 0 gcec Compatibility Backwards Compatibility html Backwards 20Compatibility F Note The Intel C Compiler supports gcc style inline ASM if the assembler code uses AT amp T System V 386 syntax See http www gnu org software binutils manual gas 2 9 1 html_node as_196 html for more information gcc Interoperability C compilers are interoperable if they can link object files and libraries generated by one compiler with object files and libraries generated by the second compiler and the resulting executable runs successfully The Intel C Compiler has made significant improvements towards interoperability and is highly compatible with the GNU gcc compiler This section describes features of the Intel C Compiler that provide interoperability with gcc These features include e Compiler Options for Interoperability e Predefined Macros for Interoperability See gcc Compatibility for a detailed list of compatibility features Compiler Options for I
239. es are found temporary files are stored in t mp TA32ROOT IA32 based systems points to the directory containing the bin lib include and substitute header directories IA64ROOT Itanium based systems points to the directory containing the bin lib include and substitute header directories GNU Environment Variables The Intel C Compiler supports the following GNU environment variables CPATH Path to include directory for C C compilations C_INCLUDE_PATH Path include directory for C compilations CPLUS_INCLUDE_PATH Path include directory for C compilations L P D IBRARY_PATH The value of LIBRARY_PATH is a colon separated list of directories much like ATH EPENDENCIES_OUTPUT If this variable is set its value specifies how to output dependencies r Make based on the non system header files processed by the compiler System header files are ignored in the dependency output SUNPRO_DEPENDENCIES This variable is the same as DEPENDENCIES_OUTPUT except that system header files are not ignored mol Compilation Environment Options The Intel C Compiler installation includes shell scripts that you can use to set environment variables See Invoking the Compiler from the Command Line for more information Configuration Files You can decrease the time you spend entering command line options and ensure consistency by using the configu
240. es of a and b; the upper 3 SP FP values are passed through from a. r0 := a0 * b0; r1 := a1; r2 := a2; r3 := a3. __m128 _mm_mul_ps(__m128 a, __m128 b): Multiplies the four SP FP values of a and b. r0 := a0 * b0; r1 := a1 * b1; r2 := a2 * b2; r3 := a3 * b3. __m128 _mm_div_ss(__m128 a, __m128 b): Divides the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a. r0 := a0 / b0; r1 := a1; r2 := a2; r3 := a3. __m128 _mm_div_ps(__m128 a, __m128 b): Divides the four SP FP values of a and b. r0 := a0 / b0; r1 := a1 / b1; r2 := a2 / b2; r3 := a3 / b3. __m128 _mm_sqrt_ss(__m128 a): Computes the square root of the lower SP FP value of a; the upper 3 SP FP values are passed through. r0 := sqrt(a0); r1 := a1; r2 := a2; r3 := a3. __m128 _mm_sqrt_ps(__m128 a): Computes the square roots of the four SP FP values of a. r0 := sqrt(a0); r1 := sqrt(a1); r2 := sqrt(a2); r3 := sqrt(a3). __m128 _mm_rcp_ss(__m128 a): Computes the approximation of the reciprocal of the lower SP FP value of a; the upper 3 SP FP values are passed through. r0 := recip(a0); r1 := a1; r2 := a2; r3 := a3. __m128 _mm_rcp_ps(__m128 a): Computes the approximations of reciprocals of the four SP FP values of a. r0 := recip(a0); r1 := recip(a1); r2 := recip(a2); r3 := recip(a3). __m128 _mm_rsqrt_ss(__m128 a): Computes the approximation of the reciprocal of the square root of the
241. ess(x, y); int islessequal(x, y); int islessgreater(x, y); int isnan(x); int isnormal(x); int isunordered(x, y); int signbit(x). See also: Miscellaneous Functions. Intel C++ Intrinsics Reference: The Intel Pentium 4 processor and other Intel processors have instructions to enable development of optimized multimedia applications. The instructions are implemented through extensions to previously implemented instructions. This technology uses the single instruction, multiple data (SIMD) technique. By processing data elements in parallel, applications with media-rich bit streams are able to significantly improve performance using SIMD instructions. The Intel Itanium processor also supports these instructions. The most direct way to use these instructions is to inline the assembly language instructions into your source code. However, this can be time-consuming and tedious, and assembly language inline programming is not supported on all compilers. Instead, Intel provides easy implementation through the use of API extension sets referred to as intrinsics. Intrinsics are special coding extensions that allow using the syntax of C function calls and C variables instead of hardware registers. Using these intrinsics frees programmers from having to program in assembly language and manage registers. In addition, the compiler optimizes the instruction scheduling so that executables run faster. In addition t
242. essage during execution fcode asm Produce assembly file with OFF optional code annotations Requires S ffnalias Assume aliasing within ON functions finline functions Inline any function at the OFF compiler s discretion Same as ipi fminshared Compilation is for the main OFF executable Absolute addressing can be used and non position independent code generated for symbols that are at least protected fno alias Assume no aliasing in program OFF fno common Enables the compiler to treat OFF common variables as if they were defined allowing the use of gprel addressing of common data variables fno exceptions The fno exceptions OFF option turns off exception handling table generation resulting in smaller code Any use of exception handling constructs try blocks throw statements will produce an error Exception specifications are parsed but ignored A preprocessor symbol __EXCEPTIONS is defined when this option is not used It is undefined when this option is present 13 Intel C Compiler for Linux Systems User s Guide Option Description Default fno fnalias Assume no aliasing within OFF functions but assume aliasing across calls fno implicit inline templates Do not emit code for implicit OFF instantiations of inline templates For C only Never emit code for non inline OFF templates which are instantiated i
243. essors and Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). Note: The -tpp7 option is ON by default. Example: The following invocations all result in a compiled binary optimized for Pentium 4. The same binary will also run on Pentium, Pentium Pro, Pentium II, and Pentium III processors. prompt> icpc prog.cpp; prompt> icpc -tpp7 prog.cpp; prompt> icpc -mcpu=pentium4 prog.cpp. Processor Optimization (Itanium-based Systems only): The -tpp{1|2} options optimize your application's performance for a specific Intel Itanium processor. The resulting binary will also run on the processors listed in the table. The Intel C++ Compiler includes gcc-compatible versions of the -tpp options. These options are listed in the gcc Version column. Option / gcc Version / Optimizes for: -tpp1 / -mcpu=itanium / Itanium processors; -tpp2 / -mcpu=itanium2 / Itanium 2 processors. Note: The -tpp2 option is ON by default. Example: The following invocations all result in a compiled binary optimized for the Intel Itanium 2 processor. The same binary will also run on Intel Itanium processors. prompt> icpc prog.cpp; prompt> icpc -tpp2 prog.cpp; prompt> icpc -mcpu=itanium2 prog.cpp. Processor-specific Optimization (IA-32 only): The -x{K|W|N|B|P} options target your program to run on a specific Intel processor by generating specialized and optimized code. The resulting
244. esult, zeroing the upper bits. r0 := a0; r1 := 0x0. __m128i _mm_move_epi64(__m128i a): Moves the lower 64 bits of a to the lower 64 bits of the result, zeroing the upper bits. r0 := a0; r1 := 0x0. Additional Miscellaneous Intrinsics: The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the emmintrin.h header file. __m128d _mm_unpackhi_pd(__m128d a, __m128d b) (uses UNPCKHPD): Interleaves the upper DP FP values of a and b. r0 := a1; r1 := b1. __m128d _mm_unpacklo_pd(__m128d a, __m128d b) (uses UNPCKLPD): Interleaves the lower DP FP values of a and b. r0 := a0; r1 := b0. int _mm_movemask_pd(__m128d a) (uses MOVMSKPD): Creates a two-bit mask from the sign bits of the two DP FP values of a. r := sign(a1) << 1 | sign(a0). __m128d _mm_shuffle_pd(__m128d a, __m128d b, int i) (uses SHUFPD): Selects two specific DP FP values from a and b, based on the mask i. The mask must be an immediate. See Macro Function for Shuffle for a description of the shuffle semantics. Intrinsics for Casting Support: This version of the Intel C++ Compiler supports casting between various SP, DP, and INT vector types. These intrinsics do not convert values; they just change the type. extern __m128 _mm_castpd_ps(__m128d in); extern __m128i _mm_castpd_si128(__m128d in); extern __m128d _mm_castps_pd(__m128 in); extern __m128i _mm_castps_si128(__m128 in); extern __m128 _mm_castsi128_ps(__m128i in); extern __m128d _mm_cast
245. eters e generate entry exit per threaded task e generate calls to parallel runtime routines for thread creation and synchronization Auto parallelization Enabling Options and Environment Variables To enable the auto parallelizer use the parallel option The paralle1l option detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops An example of the command using auto parallelization follows prompt gt icpe c parallel prog cpp 172 Volume II Optimizing Applications Auto parallelization Options The parallel option enables the auto parallelizer if the 02 or 03 optimization option is also on the default is O2 The parallel option detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops Option Description parallel Enables the auto parallelizer par_threshold 1 100 Controls the work threshold needed for auto parallelization Default n 100 par_report 1 2 3 Controls the diagnostic messages from the auto parallelizer Auto parallelization Environment Variables Variable Default Description Number of processors currently installed in the system while generating the executable Controls the number of threads used OMP_NUM_THRI OMP_SCHE static Specifies the type of runtime scheduling
246. etting the register between data type transitions e See the Correct Usage coding example 255 Intel C Compiler for Linux Systems User s Guide Incorrect Usage m64 x float f For more documentation on 1 Correct Usage m64 x float f _m_paddd y _mm_empty MMX TM Technology General Support Intrinsics Zz init EMMS visit the http developer intel com Web site The prototypes for MMX TM technology intrinsics are in the mmint rin h header file Intrinsic Alternate Corresponding Operation Signed Saturation Name Name Instruction _m_empty _mm_empty EMMS Empty MM 2 a state m_from_int mm_cvtsi32_si64 OVD Convert from int _m to_int _mm_cvtsi64_si32 OVD Convert from int _m_packsswb _mm_packs_pil6 PACKSSWB Pack Yes _m_packssdw _mm_packs_pi32 PACKSSDW Pack Yes _m_packuswb _mm_packs_pul6 PACKUSWB Pack No _m_punpckhbw _mm_unpackhi_pi8 PUNPCKHBW Interleave _m_punpckhwd _mm_unpackhi_pil6 PUNPCKHWD Interleave _m punpckhdq _mm_unpackhi_pi32 PUNPCKHDQ Interleave _mm_unpacklo_pi8 PUNPCKLBW Interleave _mm_unpacklo_pil6 PUNPCKLWD Interleave _m_punpckldq _mm_unpacklo_pi32 PUNPCKLDO Interleave void _m_empty void Empty the multimedia state m64 _m_from_int int i Convert the integer object i to a 64 bit___m64 object The integer value is zero extended to 64 bits int _m_to_int __m64 m Convert
247. eturns a combination of the four words of a. The selector n must be an immediate. r0 := word (n & 0x3) of a; r1 := word ((n >> 2) & 0x3) of a; r2 := word ((n >> 4) & 0x3) of a; r3 := word ((n >> 6) & 0x3) of a. void _m_maskmovq(__m64 d, __m64 n, char *p): Conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. if (sign(n0)) p[0] := d0; if (sign(n1)) p[1] := d1; ... if (sign(n7)) p[7] := d7. __m64 _m_pavgb(__m64 a, __m64 b): Computes the (rounded) averages of the unsigned bytes in a and b. t = (unsigned short)a0 + (unsigned short)b0; r0 = (t >> 1) + (t & 0x01); ... t = (unsigned short)a7 + (unsigned short)b7; r7 = (unsigned char)((t >> 1) + (t & 0x01)). __m64 _m_pavgw(__m64 a, __m64 b): Computes the (rounded) averages of the unsigned words in a and b. t = (unsigned int)a0 + (unsigned int)b0; r0 = (t >> 1) + (t & 0x01); ... t = (unsigned int)a3 + (unsigned int)b3; r3 = (unsigned short)((t >> 1) + (t & 0x01)). __m64 _m_psadbw(__m64 a, __m64 b): Computes the sum of the absolute differences of the unsigned bytes in a and b, returning the value in the lower word. The upper three words are cleared. r0 = abs(a0 - b0) + ... + abs(a7 - b7); r1 = r2 = r3 = 0. Memory and Initialization Using Streaming SIMD Extensions: This section describes the load, set, and store operations
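The shuffle selector and the rounded average are easy to get wrong by one bit, so a scalar model helps. The sketch below is illustrative: `pshufw_model` and `pavgb_lane_model` are hypothetical names, and they work on plain integers rather than `__m64` values. The average uses the PAVGB definition, (a + b + 1) >> 1, computed in a wider type so the sum cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the _m_pshufw selector (hypothetical helper): result
   word k is the word of a chosen by bits [2k+1:2k] of n.
   r must not alias a. */
static void pshufw_model(const uint16_t a[4], unsigned n, uint16_t r[4])
{
    assert(n <= 0xFF); /* the real selector is an 8-bit immediate */
    for (int k = 0; k < 4; ++k)
        r[k] = a[(n >> (2 * k)) & 0x3];
}

/* Scalar model of one byte lane of _m_pavgb (hypothetical helper):
   widen, add, round up, halve. */
static uint8_t pavgb_lane_model(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a + (unsigned)b;
    return (uint8_t)((t + 1) >> 1);
}
```

With the selector 0x1B (binary 00 01 10 11) the model reverses the four words, which is a common sanity check for shuffle code.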
248. eturns the value x - n*y for integer n such that, if y is nonzero, the result has the same sign as x and magnitude less than the magnitude of y.
errno: EDOM, for y = 0
Calling interface:
    double fmod(double x, double y);
    long double fmodl(long double x, long double y);
    float fmodf(float x, float y);
REMAINDER
Description: The remainder function returns the value of x REM y as required by the IEEE standard.
Calling interface:
    double remainder(double x, double y);
    long double remainderl(long double x, long double y);
    float remainderf(float x, float y);
REMQUO
Description: The remquo function returns the value of x REM y. In the object pointed to by quo, the function stores a value whose sign is the sign of x/y and whose magnitude is congruent modulo 2^n to the magnitude of the integral quotient of x/y, where n is an implementation-defined integer greater than or equal to 3.
Calling interface:
    double remquo(double x, double y, int *quo);
    long double remquol(long double x, long double y, int *quo);
    float remquof(float x, float y, int *quo);
Miscellaneous Functions
The Intel Math library supports the following miscellaneous functions:
COPYSIGN
Description: The copysign function returns the value with the magnitude of x and the sign of y.
Calling interface:
    double copysign(double x, double y);
    long double copysignl(long double x, long double y);
    float copysignf
249. f : 0x0
    r3 := (a3 ord b3) ? 0xffffffff : 0x0
__m128 _mm_cmpunord_ss(__m128 a, __m128 b)
Compare for unordered.
    r0 := (a0 unord b0) ? 0xffffffff : 0x0
    r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_cmpunord_ps(__m128 a, __m128 b)
Compare for unordered.
    r0 := (a0 unord b0) ? 0xffffffff : 0x0
    r1 := (a1 unord b1) ? 0xffffffff : 0x0
    r2 := (a2 unord b2) ? 0xffffffff : 0x0
    r3 := (a3 unord b3) ? 0xffffffff : 0x0
int _mm_comieq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.
    r := (a0 == b0) ? 0x1 : 0x0
int _mm_comilt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.
    r := (a0 < b0) ? 0x1 : 0x0
int _mm_comile_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.
    r := (a0 <= b0) ? 0x1 : 0x0
int _mm_comigt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.
    r := (a0 > b0) ? 0x1 : 0x0
int _mm_comige_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1
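The two families above return different things: the cmp intrinsics produce a per-element bit mask in an __m128, while the comi intrinsics return an int for the lowest element only. The following sketch, which assumes an x86 target with Streaming SIMD Extensions and the standard xmmintrin.h header, shows one way to consume each; the wrapper names lowest_is_less and count_unordered are hypothetical.

```c
#include <xmmintrin.h>

/* Scalar-style comparison of the lowest elements: returns 1 if x < y. */
int lowest_is_less(float x, float y)
{
    __m128 a = _mm_set_ss(x);   /* x in element 0, upper elements zero */
    __m128 b = _mm_set_ss(y);
    return _mm_comilt_ss(a, b);
}

/* Counts how many of the four element pairs are unordered
   (that is, at least one value of the pair is a NaN). */
int count_unordered(const float a[4], const float b[4])
{
    __m128 mask = _mm_cmpunord_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    int bits = _mm_movemask_ps(mask);   /* one sign bit per element */
    int count = 0;
    for (int i = 0; i < 4; i++)
        count += (bits >> i) & 1;
    return count;
}
```

Using _mm_movemask_ps to turn the all-ones or all-zeros elements into a 4-bit integer is a common way to branch on a packed comparison result.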
250. f more than 50 error messages are displayed during the compilation of a.cpp, compilation aborts.
    prompt> icpc -wn50 -c a.cpp
Remark Messages
These messages report common, but sometimes unconventional, use of C or C++. The compiler does not print or display remarks unless you specify level 4 for the W option, as described in Suppressing Warning Messages or Enabling Remarks. Remarks do not stop translation or linking. Remarks do not interfere with any output files. The following are some representative remark messages:
• function declared implicitly
• type qualifiers are meaningless in this declaration
• controlling expression is constant
Intel Math Library
The Intel C++ Compiler includes a mathematical software library containing highly optimized and very accurate mathematical functions. These functions are commonly used in scientific or graphic applications, as well as other programs that rely heavily on floating-point computations. Support for C99 _Complex data types is included by using the -c99 compiler option. The mathimf.h header file includes prototypes for the library functions. See Using the Intel Math Library. For a complete list of the functions available, refer to the Function List in this section.
Math Libraries for IA-32 and Itanium-based Systems
The math library linked to an application depends on the compilation or linkage options specified.
    Library     Description
    libimf.a    Default stat
251. f_dpi Test3.dpi
At this step, the profmerge tool merges all the .dyn files into one file (Test3.dpi) that represents the total profile information of the application on Test3.
12. Create a file named tests_list with three lines. The first line contains Test1.dpi, the second line contains Test2.dpi, and the third line contains Test3.dpi.
When these items are available, the Test-prioritization Tool may be launched from the command line in the PROF_DIR directory as described in the following examples. In all examples, the discussion references the same set of data.
Example 1: Minimizing the Number of Tests
    tselect -dpi_list tests_list -spi pgopti.spi
where the -spi option specifies the path to the .spi file. Here is a sample output from this run of the Test-prioritization Tool:
    Total number of tests = 3
    Total block coverage ~ 52.17
    Total function coverage ~ 50.00
    Num   %RatCvrg   %BlkCvrg   %FncCvrg   Test Name @ Options
    1     87.50      45.65      37.50      Test3.dpi
    2     100.00     52.17      50.00      Test2.dpi
In this example, the Test-prioritization Tool has provided the following information:
• By running all three tests, we achieve 52.17% block coverage and 50.00% function coverage.
• Test3 covers 45.65% of the basic blocks of the application, which is 87.50% of the total block coverage that can be achieved from all three tests.
• By adding Test2, we achieve a cumulative block coverage of 52.17%, or 100% of
252. face:
    double acosh(double x);
    long double acoshl(long double x);
    float acoshf(float x);
ASINH
Description: The asinh function returns the inverse hyperbolic sine of x.
Calling interface:
    double asinh(double x);
    long double asinhl(long double x);
    float asinhf(float x);
ATANH
Description: The atanh function returns the inverse hyperbolic tangent of x.
errno: EDOM, for |x| > 1
errno: ERANGE, for |x| = 1
Calling interface:
    double atanh(double x);
    long double atanhl(long double x);
    float atanhf(float x);
COSH
Description: The cosh function returns the hyperbolic cosine of x, (e^x + e^-x)/2.
errno: ERANGE, for overflow conditions
Calling interface:
    double cosh(double x);
    long double coshl(long double x);
    float coshf(float x);
SINH
Description: The sinh function returns the hyperbolic sine of x, (e^x - e^-x)/2.
errno: ERANGE, for overflow conditions
Calling interface:
    double sinh(double x);
    long double sinhl(long double x);
    float sinhf(float x);
SINHCOSH
Description: The sinhcosh function returns both the hyperbolic sine and hyperbolic cosine of x.
errno: ERANGE, for overflow conditions
Calling interface:
    void sinhcosh(double x, double *sinval, double *cosval);
    void sinhcoshl(long double x, long double *sinval, long double *cosval);
    void sinhcoshf(float x, float *sinval, float *cosval);
TANH
Descri
253. fault setting, click Restore Defaults. The Restore Defaults button appears on each property page, but the Restore Defaults action applies to ALL property pages.
Some properties use check boxes, while others use drop-down lists, to specify a compiler option.
[Screenshot: property page showing Show Startup Banner (-V), Include Debug Information (-g), Optimization Level: Maximize Speed (-O2), Warning Level: Warnings and Errors (-w1)]
Several options let you specify arguments. Click New to add an argument to the list. Enter a valid argument for the option, then click OK.
[Screenshot: dialog for Undefine Preprocessor Definitions (-U) with the argument __NO_MATH_INLINES]
In this example, __NO_MATH_INLINES and __SIGNED_CHARS__ are specified as arguments for the -U option.
[Screenshot: Undefine Preprocessor Definitions (-U) list containing __NO_MATH_INLINES and __SIGNED_CHARS__, with Remove, Move Up, and Move Down buttons]
If you want to specify an option that is not available from the Properties dialog, use the Command Line category. Enter the command line options in the Additional Options text box just as you would enter them on the command line.
[Screenshot: Additional Options text box containing -march=pentium4]
For a complete list of options listed on the Properties page, see Properties for Supported Options.
Properties for Supported Options
The options listed in the following tables are supported under the corresponding Option Cat
254. fff
__m128 _mm_cmpge_ss(__m128 a, __m128 b)
Compare for greater-than-or-equal.
    r0 := (a0 >= b0) ? 0xffffffff : 0x0
    r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_cmpge_ps(__m128 a, __m128 b)
Compare for greater-than-or-equal.
    r0 := (a0 >= b0) ? 0xffffffff : 0x0
    r1 := (a1 >= b1) ? 0xffffffff : 0x0
    r2 := (a2 >= b2) ? 0xffffffff : 0x0
    r3 := (a3 >= b3) ? 0xffffffff : 0x0
__m128 _mm_cmpneq_ss(__m128 a, __m128 b)
Compare for inequality.
    r0 := (a0 != b0) ? 0xffffffff : 0x0
    r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_cmpneq_ps(__m128 a, __m128 b)
Compare for inequality.
    r0 := (a0 != b0) ? 0xffffffff : 0x0
    r1 := (a1 != b1) ? 0xffffffff : 0x0
    r2 := (a2 != b2) ? 0xffffffff : 0x0
    r3 := (a3 != b3) ? 0xffffffff : 0x0
__m128 _mm_cmpnlt_ss(__m128 a, __m128 b)
Compare for not-less-than.
    r0 := !(a0 < b0) ? 0xffffffff : 0x0
    r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_cmpnlt_ps(__m128 a, __m128 b)
Compare for not-less-than.
    r0 := !(a0 < b0) ? 0xffffffff : 0x0
    r1 := !(a1 < b1) ? 0xffffffff : 0x0
    r2 := !(a2 < b2) ? 0xffffffff : 0x0
    r3 := !(a3 < b3) ? 0xffffffff : 0x0
__m128 _mm_cmpnle_ss(__m128 a, __m128 b)
Compare for not-less-than-or-equal.
    r0 := !(a0 <= b0) ? 0xffffffff : 0x0
    r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_cmpnle_ps(__m128 a, __m128 b)
Compare for not-less-than-or-equal.
    r0 := !(a0 <= b0) ? 0xffffffff : 0x0
    r1 := !(a1 <= b1) ? 0xffffffff : 0x0
    r2 := !(a2 <=
255. [Screenshot: Eclipse C/C++ editor showing the hello.c source file]
When your code is complete, save your file using File > Save, then proceed to Building a Project.
Building a Project
You can build your project by selecting Rebuild All from the Eclipse Project menu.
[Screenshot: Project menu with the entries Open Project, Close Project, Rebuild Project, Create Make Target, Build Make Target, Properties]
See the Build results in the C-Build view.
[Screenshot: C-Build view below the hello.c editor]
The final step is Running a Project.
Running a Project
After Building a Project, you can run your project by following these steps:
1. Select Run > Run As > C Local Application. When the C Local Application dialog appears, click OK.
[Screenshot: C Local Application dialog, "Choose a local application to run", listing helloworld]
2. Choose a configuration to run.
[Screenshot: configuration list showing GDB Debugger and GDB Server]
After the executable runs, the output of hello.c appears in the Console view.
[Screenshot: Eclipse Console view]
256. fle_ps _mm_shuffle_pi16 _mm_extract_pi16 _mm_insert_pi16
Access to Classes Using Header Files
The required class header files are installed in the include directory with the Intel C++ Compiler. To enable the classes, use the #include directive in your program file as shown in the table that follows.
Include Directives for Enabling Classes
    Instruction Set Extension        Include Directive
    MMX Technology                   #include <ivec.h>
    Streaming SIMD Extensions        #include <fvec.h>
    Streaming SIMD Extensions 2      #include <dvec.h>
Each succeeding file from the top down includes the preceding class. You only need to include fvec.h if you want to use both the Ivec and Fvec classes. Similarly, to use all the classes, including those for the Streaming SIMD Extensions 2, you need only include the dvec.h file.
Usage Precautions
When using the C++ classes, you should follow some general guidelines. More detailed usage rules for each class are listed in Integer Vector Classes and Floating-point Vector Classes.
Clear MMX Registers
If you use both the Ivec and Fvec classes at the same time, your program could mix MMX instructions, called by Ivec classes, with Intel x87 architecture floating-point instructions, called by Fvec classes. Floating-point instructions exist in the following Fvec functions:
• fvec constructors
• debug
257. floating point computations. In general, the options discussed here let you decide between performance and accuracy. To achieve greater performance, it may be necessary to sacrifice some degree of floating-point accuracy. See also Floating-point Arithmetic Options for Itanium-based Systems.
Options for IA-32 and Itanium-based Systems
-mp Option
The -mp option restricts some optimizations to maintain declared precision and to ensure that floating-point arithmetic conforms more closely to the ANSI and IEEE standards. Floating-point intermediate results are kept in full 10-byte internal precision. All spills and reloads of the x87 floating-point registers use this internal format to prevent accidental loss of precision. For most programs, specifying this option adversely affects performance. If you are not sure whether your application needs this option, try compiling and running your program both with and without it to evaluate the effects on performance versus precision. Alternatives to -mp include -xN (for the Intel Pentium 4 processor or newer) and -mp1.
The -mp option has the following effects:
• user variables declared as floating-point types are not assigned to registers
• whenever an expression is spilled (moved from a register to memory), it is spilled as 80 bits (extended precision), not 64 bits (double precision)
• floating-point arithmetic comparisons conform to the IEEE 754 specification, except for NaN behavior
• the exact operations specified in the code are
258. following format (default in decimal):
    cout << Is32vec4 A;
    cout << Iu32vec4 A;
    cout << hex << Iu32vec4 A; /* print in hex format */
    "[3]:A3 [2]:A2 [1]:A1 [0]:A0"
Corresponding Intrinsics: none
The two 32-bit values of A are placed in the output buffer and printed in the following format (default in decimal):
    cout << Is32vec2 A;
    cout << Iu32vec2 A;
    cout << hex << Iu32vec2 A; /* print in hex format */
    "[1]:A1 [0]:A0"
Corresponding Intrinsics: none
The eight 16-bit values of A are placed in the output buffer and printed in the following format (default in decimal):
    cout << Is16vec8 A;
    cout << Iu16vec8 A;
    cout << hex << Iu16vec8 A; /* print in hex format */
    "[7]:A7 [6]:A6 [5]:A5 [4]:A4 [3]:A3 [2]:A2 [1]:A1 [0]:A0"
Corresponding Intrinsics: none
The four 16-bit values of A are placed in the output buffer and printed in the following format (default in decimal):
    cout << Is16vec4 A;
    cout << Iu16vec4 A;
    cout << hex << Iu16vec4 A; /* print in hex format */
    "[3]:A3 [2]:A2 [1]:A1 [0]:A0"
Corresponding Intrinsics: none
The sixteen 8-bit values of A are placed in the output buffer and printed in the following format (default is decimal):
    cout << Is8vec16 A;
    cout << Iu8vec16 A;
    cout << hex << Iu8vec16 A; /* print in hex format instead of decimal */
    "[15]:A15 [14]:A14 [13]:A13
259. for (i = 1; i < n; i++)
The following example shows that using this option and the IVDEP directive ensures there is no loop-carried dependency for the store into a().
    #pragma ivdep
    for (j = 0; j < n; j++)
PREFETCH Directive
The PREFETCH directive is supported on Itanium-based systems only.
Syntax:
    #pragma prefetch var:hint:distance
where hint value can be 0 (T0), 1 (NT1), 2 (NT2), or 3 (NTA).
Example:
    for (i = i0; i != i1; i += is) {
        float sum = b[i];
        int ip = srow[i];
        int c = col[ip];
    #pragma NOPREFETCH col
    #pragma PREFETCH value:1:80
    #pragma PREFETCH x:1:40
        for (; ip < srow[i+1]; c = col[++ip])
            sum -= value[ip] * x[c];
        y[i] = sum;
    }
Prefetching
The goal of prefetch insertion optimization is to reduce cache misses by providing hints to the processor about when data should be loaded into the cache. The prefetch optimization is enabled or disabled by the -prefetch[-] compiler option. -prefetch enables prefetch insertion optimization and is the default. Note that -O3 must be specified for this option to work. To disable prefetch insertion optimization, use -prefetch-.
To facilitate compiler optimization:
• Minimize use of global variables and pointers.
• Minimize use of complex control flow.
• Choose data types carefully and avoid type casting.
For more information on how to optimize with -prefetch, refer to the Intel Pen
260. from a.
    r0 := (a0 >= b0) ? 0xffffffffffffffff : 0x0
    r1 := a1
__m128d _mm_cmpord_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for ordered. The upper DP FP value is passed through from a.
    r0 := (a0 ord b0) ? 0xffffffffffffffff : 0x0
    r1 := a1
__m128d _mm_cmpunord_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for unordered. The upper DP FP value is passed through from a.
    r0 := (a0 unord b0) ? 0xffffffffffffffff : 0x0
    r1 := a1
__m128d _mm_cmpneq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for inequality. The upper DP FP value is passed through from a.
    r0 := (a0 != b0) ? 0xffffffffffffffff : 0x0
    r1 := a1
__m128d _mm_cmpnlt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not less than b. The upper DP FP value is passed through from a.
    r0 := !(a0 < b0) ? 0xffffffffffffffff : 0x0
    r1 := a1
__m128d _mm_cmpnle_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not less than or equal to b. The upper DP FP value is passed through from a.
    r0 := !(a0 <= b0) ? 0xffffffffffffffff : 0x0
    r1 := a1
__m128d _mm_cmpngt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not greater than b. The upper DP FP value is passed through from a.
    r0 := !(a0 > b0) ? 0xffffffffffffffff : 0x0
    r1 := a1
__m128d _mm_cmpnge_sd(_
261. function as an #undef preprocessor directive. You can use the -no-gcc option to disable the __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ macros.
Customizing the Compilation Environment
For IA-32 and the Intel Itanium architecture, you will need to set a compilation environment. To customize the environment used during compilation, you can specify:
• Environment Variables: the paths where the compiler and other tools can search for specific files
• Configuration Files: the options to use with each compilation
• Response Files: the options and files to use for individual projects
• Include Files: the names and locations of source header files
Environment Variables
You can customize your environment by specifying paths where the compiler can search for special files such as libraries and include files.
• LD_LIBRARY_PATH specifies the location for shared objects.
• PATH specifies the directories the system searches for binary executable files.
• ICCCFG specifies the configuration file for customizing compilations when invoking the compiler using icc.
• ICPCCFG specifies the configuration file for customizing compilations when invoking the compiler using icpc.
Several environment variables are supported to specify the location for temporary files. The compiler searches for the following variables, in the order specified: TMP, TMPDIR, and TEMP. If none of these variabl
262. functions cout and element access
• rsqrt_nr
Note: MMX registers are aliased on the floating-point registers, so you should clear the MMX state with the EMMS instruction intrinsic before issuing an x87 floating-point instruction, as in the following example:
    ivecA = ivecA & ivecB;   /* Ivec logical operation that uses MMX instructions */
    empty();                 /* clear state */
    cout << f32vec4a;        /* F32vec4 operation that uses x87 floating-point instructions */
Failure to clear the MMX registers can result in incorrect execution or poor performance due to an incorrect register state.
Follow EMMS Instruction Guidelines
Intel strongly recommends that you follow the guidelines for using the EMMS instruction. Refer to this topic before coding with the Ivec classes.
Capabilities
The fundamental capabilities of each C++ SIMD class include:
• computation
• horizontal data motion
• branch compression/elimination
• caching hints
Understanding each of these capabilities and how they interact is crucial to achieving desired results.
Computation
The SIMD C++ classes contain vertical operator support for most arithmetic operations, including shifting and saturation. Computation operations include reciprocal (rcp and rcp_nr), square root (sqrt), and reciprocal square root (rsqrt and rsqrt_nr). Operations rcp and rsqrt are new approximating instructions with very short latencies that produce results with at least 12 bits of accuracy. O
263. g Word and Allows FP Data Types in Registers
Failure to empty the multimedia state after using an MMX instruction and before using a floating-point instruction can result in unexpected execution or poor performance.
EMMS Usage Guidelines
The guidelines for when to use EMMS are:
• Do not use on Itanium-based systems. There are no special registers (or overlay) for the MMX(TM) instructions or Streaming SIMD Extensions on Itanium-based systems, even though the intrinsics are supported.
• Use _mm_empty() after an MMX instruction if the next instruction is a floating-point (FP) instruction, for example, before calculations on float, double, or long double. You must be aware of all situations when your code generates an MMX instruction with the Intel C++ Compiler, that is:
    • when using an MMX technology intrinsic
    • when using Streaming SIMD Extension integer intrinsics that use the __m64 data type
    • when referencing an __m64 data type variable
    • when using an MMX instruction through inline assembly
• Do not use _mm_empty() before an MMX instruction, since using _mm_empty() before an MMX instruction incurs an operation with no benefit (no-op).
• Use different functions for operations that use FP instructions and those that use MMX instructions. This eliminates the need to empty the multimedia state within the body of a critical loop.
• Use _mm_empty() during runtime initialization of __m64 and FP data types. This ensures res
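The guideline to keep MMX work and floating-point work in different functions can be sketched as follows. This is an illustration only, assuming an IA-32 or Intel 64 target where the mmintrin.h header is available; the function names add_dwords_mmx and scaled_sum are hypothetical. The MMX helper empties the multimedia state once, just before it returns, so the caller is free to use floating-point types.

```c
#include <mmintrin.h>

/* MMX-only helper: adds a and b with a packed doubleword add and returns
   the low 32 bits of the result. The multimedia state is emptied once,
   after the last MMX operation and before control returns to FP code. */
int add_dwords_mmx(int a, int b)
{
    __m64 x = _m_paddd(_m_from_int(a), _m_from_int(b));
    int low = _m_to_int(x);
    _mm_empty();
    return low;
}

/* Floating-point code lives in its own function and never touches __m64. */
float scaled_sum(int a, int b, float scale)
{
    return (float)add_dwords_mmx(a, b) * scale;
}
```

If add_dwords_mmx were called inside a hot loop that performs no floating-point work, the _mm_empty() call could instead be hoisted to the point where the loop finishes, in line with the guideline about critical loops.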
264. g configuration boxes. Click Next to proceed.
[Screenshot: New Project dialog, Select a Target page: "Select the platform and configurations you wish to deploy on"; Platform: Linux Executable Using Intel(R) C/C++ Compiler; Release and Debug configurations checked]
5. The Additional Project Settings dialog lets you create dependencies between your new project and other existing projects. There should not be any other existing projects at this point. Click Finish to complete creation of your new helloworld project.
[Screenshot: New Project dialog, Additional Project Settings page: "Define the inter-project dependencies, if any"; Referenced C/C++ Projects]
6. If you are not currently in the C/C++ Development Perspective, you will see the Confirm Perspective Switch dialog. Click Yes to proceed.
[Screenshot: Confirm Perspective Switch dialog: "This kind of project is associated with the C/C++ Development Perspective. Do you want to switch to this perspective now?"]
7. [Screenshot: Eclipse C/C++ Development perspective]
The next step is Adding a C Source File.
Adding a C Source File
After Creating a New Project, you can add source files, then build and run your completed project. Follow these steps to add
265. gcc option is ON by default if you are using gcc 3.2, 3.3, or 3.4. When you compile and link your application using the -cxxlib-gcc option, the resulting C++ object files and libraries can interoperate with C++ object files and libraries generated by gcc 3.2. This means that third-party C++ libraries built with gcc 3.2 will work with C++ code generated by the Intel Compiler. The -cxxlib-gcc option can only be used on Linux distributions that include gcc 3.2. This is required for C++ ABI conformance.
By default, the Intel C++ Compiler uses headers and libraries included with the product. If you are linking with code compiled with g++, which was compiled against GNU C++ headers, then differences in the headers might cause incompatibilities that result in run-time errors.
If you build one shared library against the Intel C++ libraries, build a second shared library against the GNU C++ libraries, and use both libraries in a single application, you will have two C++ run-time libraries in use. Since the application might use symbols from both libraries, the following problems may occur:
• partially initialized libraries
• lost I/O operations from data put in unaccessed buffers
• other unpredictable results, such as jumbled output
The Intel C++ Compiler does not support more than one run-time library in one application.
Warning: If you successfully compile your application using more than one run t
266. gisters
• Monitors module-level static variables
You can refine interprocedural optimizations by using the following -Qoption specifiers. To have an effect, the -Qoption option must be entered with either -ip or -ipo also specified, as in this example:
    -ip -Qoption,f,ip_specifier
where ip_specifier is one of the -Qoption specifiers described in the following table.
-Qoption Specifiers and Descriptions
    ip_args_in_regs=0
        Disables the passing of arguments in registers. By default, external functions can pass arguments in registers when called locally. Normally, only static functions can pass arguments in registers, provided the address of the function is not taken and the function does not use a variable number of arguments.
    ip_ninl_max_stats=n
        Sets the valid number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The number of intermediate language statements usually exceeds the actual number of source language statements. The default value for n is 230.
    ip_ninl_min_stats=n
        Sets the valid min number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The default value for ip_ninl_min_stats is:
            IA-32 compiler: ip_ninl_min_stats = 7
            Itanium compiler: ip_ninl_min_stats = 15
    ip_ninl_max_total_stats=n
        Sets the maximum increase in size of a function, me
267. gned or unsigned 8-bit integers in a and zero-extends the upper bits.
    r := a15[7] << 15 | a14[7] << 14 | ... a1[7] << 1 | a0[7]
__m128i _mm_shuffle_epi32(__m128i a, int imm)
Shuffles the 4 signed or unsigned 32-bit integers in a as specified by imm. The shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.
__m128i _mm_shufflehi_epi16(__m128i a, int imm)
Shuffles the upper 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.
__m128i _mm_shufflelo_epi16(__m128i a, int imm)
Shuffles the lower 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.
__m128i _mm_unpackhi_epi8(__m128i a, __m128i b)
Interleaves the upper 8 signed or unsigned 8-bit integers in a with the upper 8 signed or unsigned 8-bit integers in b.
    r0 := a8 ; r1 := b8
    r2 := a9 ; r3 := b9
    ...
    r14 := a15 ; r15 := b15
__m128i _mm_unpackhi_epi16(__m128i a, __m128i b)
Interleaves the upper 4 signed or unsigned 16-bit integers in a with the upper 4 signed or unsigned 16-bit integers in b.
    r0 := a4 ; r1 := b4
    r2 := a5 ; r3 := b5
    r4 := a6 ; r5 := b6
    r6 := a7 ; r7 := b7
__m128i _mm_unpackhi_epi32(__m128i a, __m128i b)
Interleaves the upper 2 signed or uns
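As an illustration of the shuffle and interleave semantics above, the following sketch uses the Streaming SIMD Extensions 2 intrinsics from emmintrin.h together with the _MM_SHUFFLE macro. It assumes an IA-32 or Intel 64 target with SSE2; the wrapper names reverse_dwords and interleave_high_words are hypothetical, and unaligned loads and stores are used so the arrays need no special alignment.

```c
#include <emmintrin.h>

/* Reverses the order of four 32-bit integers.
   _MM_SHUFFLE(0, 1, 2, 3) selects element 3 for r0, 2 for r1,
   1 for r2, and 0 for r3. */
void reverse_dwords(const int in[4], int out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}

/* Interleaves the upper four 16-bit integers of a and b:
   out = a4, b4, a5, b5, a6, b6, a7, b7. */
void interleave_high_words(const short a[8], const short b[8], short out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_unpackhi_epi16(va, vb));
}
```

The shuffle selector must be a compile-time constant, which is why it is written with the macro rather than passed in as a function argument.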
268. gnment constraint. This constraint must be a power of two. The pointer that is returned from _mm_malloc is guaranteed to be aligned on the specified boundary.
Note: Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling free on memory allocated with _mm_malloc, or calling _mm_free on memory allocated with malloc, will cause unpredictable behavior.
Inline Assembly
By default, the compiler inlines a number of standard C, C++, and math library functions. This usually results in faster execution of your program. Sometimes inline expansion of library functions can cause unexpected results. The inlined library functions do not set the errno variable. So, in code that relies upon the setting of the errno variable, you should use the -nolib_inline option, which turns off inline expansion of library functions. Also, if one of your functions has the same name as one of the compiler's supplied library functions, the compiler assumes that it is one of the latter and replaces the call with the inlined version. Consequently, if the program defines a function with the same name as one of the known library routines, you must use the -nolib_inline option to ensure that the program's function is the one used.
Note: Automatic inline expansion of library functions is not related to the inline expansion that the compiler does during interprocedural optimizations. For e
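A minimal sketch of the allocation rule above follows. It assumes that the compiler's xmmintrin.h header declares _mm_malloc and _mm_free, as the common IA-32 and Intel 64 toolchains do; the helper names alloc_floats_aligned16, is_aligned, and free_floats_aligned are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Allocates count floats on a 16-byte boundary (a power of two, as the
   alignment constraint requires). Returns NULL on failure. */
float *alloc_floats_aligned16(size_t count)
{
    return (float *)_mm_malloc(count * sizeof(float), 16);
}

/* Returns 1 if p is a multiple of align, 0 otherwise. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Memory obtained from _mm_malloc must be released with _mm_free,
   never with free(). */
void free_floats_aligned(float *p)
{
    _mm_free(p);
}
```

Wrapping the allocation and release in a matched pair of helpers like this makes it harder to mix _mm_malloc with free by accident.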
269. h the numbers continuing from those in the output list.
A clobber list tells the compiler that the asm uses or changes a specific machine register that is either coded directly into the asm or is changed implicitly by the assembly instruction. The clobber list is a comma-separated list of clobber-specs.
The input-specs tell the compiler about expressions whose values may be needed by the inserted assembly instruction. In order to describe fully the input requirements of the asm, you can list input-specs that are not actually referenced in the asm template.
Each clobber-spec specifies the name of a single machine register that is clobbered. The register name may optionally be preceded by a %. The following are the valid register names: eax, ebx, ecx, edx, esi, edi, ebp, esp, ax, bx, cx, dx, si, di, bp, sp, al, bl, cl, dl, ah, bh, ch, dh, st, st(1) through st(7), mm0 through mm7, xmm0 through xmm7, and cc. It is also legal to specify "memory" in a clobber-spec. This prevents the compiler from keeping data cached in registers across the asm statement.
Intrinsics Cross-processor Implementation
This section provides a series of tables that compare intrinsics performance across architectures. Before implementing intrinsics across architectures, please note the following:
• Intrinsics may generate code that does not run on all IA processors. Therefore the programmer is responsible for using CPUID
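The output list, input list, and clobber list described above fit together as in the following sketch. It uses the GNU-style extended asm syntax with AT&T operand order and assumes an IA-32 or Intel 64 target; the function names add_with_asm and compiler_barrier are hypothetical. Operand %0 is the output, and the inputs are numbered %1 and %2, continuing from the output list.

```c
/* Adds b to a with a single ADD instruction.
   "=r"(result): output operand %0, any general register.
   "0"(a):       input operand %1, placed in the same register as %0.
   "r"(b):       input operand %2, any general register.
   "cc":         clobber-spec; ADD modifies the condition codes. */
int add_with_asm(int a, int b)
{
    int result;
    __asm__ ("addl %2, %0"
             : "=r" (result)
             : "0" (a), "r" (b)
             : "cc");
    return result;
}

/* An empty asm with a "memory" clobber-spec: it emits no instruction but
   stops the compiler from keeping data cached in registers across it. */
void compiler_barrier(void)
{
    __asm__ __volatile__ ("" : : : "memory");
}
```

Listing "cc" is the conservative choice here; leaving a modified register or flag out of the clobber list can produce code that fails only at higher optimization levels.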
270. hdup_ps(__m128 a)
Duplicates odd vector elements into even vector elements.
    r0 := a1 ; r1 := a1 ; r2 := a3 ; r3 := a3
extern __m128 _mm_moveldup_ps(__m128 a)
Duplicates even vector elements into odd vector elements.
    r0 := a0 ; r1 := a0 ; r2 := a2 ; r3 := a2
Double-precision Floating-point Vector Intrinsics
extern __m128d _mm_addsub_pd(__m128d a, __m128d b)
Adds upper vector element while subtracting lower vector element.
    r0 := a0 - b0 ; r1 := a1 + b1
extern __m128d _mm_hadd_pd(__m128d a, __m128d b)
Adds adjacent vector elements.
    r0 := a0 + a1 ; r1 := b0 + b1
extern __m128d _mm_hsub_pd(__m128d a, __m128d b)
Subtracts adjacent vector elements.
    r0 := a0 - a1 ; r1 := b0 - b1
extern __m128d _mm_loaddup_pd(double const *dp)
Duplicates a double value into upper and lower vector elements.
    r0 := *dp ; r1 := *dp
extern __m128d _mm_movedup_pd(__m128d a)
Duplicates lower vector element into upper vector element.
    r0 := a0 ; r1 := a0
Integer Vector Intrinsics for Streaming SIMD Extensions 3
The integer vector intrinsic listed here is designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). The prototypes for these intrinsics are in the pmmintrin.h header file.
extern __m128i _mm_lddqu_si128(__m128i const *p)
Loads an unaligned 128-bit value. This differs from movdqu in that it can provide higher performance
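The element formulas above translate directly into a scalar reference, which can be handy on machines without SSE3 support. The sketch below is a plain C model and not the intrinsics themselves: the names movehdup_ps_model, addsub_pd_model, and hadd_pd_model are hypothetical, and arrays stand in for vector elements with index 0 as the lowest element.

```c
/* Scalar model of _mm_movehdup_ps: odd elements copied into even ones. */
void movehdup_ps_model(const float a[4], float r[4])
{
    r[0] = a[1];
    r[1] = a[1];
    r[2] = a[3];
    r[3] = a[3];
}

/* Scalar model of _mm_addsub_pd: subtract in the lower element,
   add in the upper element. */
void addsub_pd_model(const double a[2], const double b[2], double r[2])
{
    r[0] = a[0] - b[0];
    r[1] = a[1] + b[1];
}

/* Scalar model of _mm_hadd_pd: each result element is the sum of the
   two adjacent elements of one source operand. */
void hadd_pd_model(const double a[2], const double b[2], double r[2])
{
    r[0] = a[0] + a[1];
    r[1] = b[0] + b[1];
}
```

The add/subtract pattern of addsub is the building block for complex multiplication, where the real part needs a subtraction and the imaginary part an addition.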
271. he invalid floating-point exception.
Calling interface:
    int isless(double x, double y);
    int islessl(long double x, long double y);
    int islessf(float x, float y);
ISLESSEQUAL
Description: The islessequal function returns 1 if x is less than or equal to y. This function does not raise the invalid floating-point exception.
Calling interface:
    int islessequal(double x, double y);
    int islessequall(long double x, long double y);
    int islessequalf(float x, float y);
ISLESSGREATER
Description: The islessgreater function returns 1 if x is less than or greater than y. This function does not raise the invalid floating-point exception.
Calling interface:
    int islessgreater(double x, double y);
    int islessgreaterl(long double x, long double y);
    int islessgreaterf(float x, float y);
ISNAN
Description: The isnan function returns a non-zero value if and only if x has a NaN value.
Calling interface:
    int isnan(double x);
    int isnanl(long double x);
    int isnanf(float x);
ISNORMAL
Description: The isnormal function returns a non-zero value if and only if x is normal.
Calling interface:
    int isnormal(double x);
    int isnormall(long double x);
    int isnormalf(float x);
ISUNORDERED
Description: The isunordered function returns 1 if either x or y is a NaN. This function does not raise the invalid floating-point exception.
Calling interface:
    int isunordered(double x, double y);
    int isunorderedl(l
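The quiet comparison functions above are mainly useful when one operand may be a NaN. The following sketch uses the type-generic comparison macros that standard C99 provides in <math.h> under the same base names; the l-suffixed and f-suffixed variants listed above are not used, and the function name compare_quiet is hypothetical.

```c
#include <math.h>

/* Three-way comparison that never raises the invalid exception.
   Returns -1 if x < y, 1 if x > y, 0 if they are equal,
   and 2 if the operands are unordered (at least one is a NaN). */
int compare_quiet(double x, double y)
{
    if (isunordered(x, y))
        return 2;
    if (isless(x, y))
        return -1;
    if (isgreater(x, y))
        return 1;
    return 0;
}
```

With the built-in operators, x < y with a NaN operand also evaluates to false, but it may raise the invalid floating-point exception; the macros give the same answer without that side effect.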
272. he native intrinsics for the Itanium processor give programmers access to Itanium instructions that cannot be generated using the standard constructs of the C and C++ languages. The Intel C++ Compiler also supports general-purpose intrinsics that work across all IA-32 and Itanium-based platforms. For more information on intrinsics, please refer to the following publications: Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, Intel Corporation, doc. number 243191. Intrinsics Availability on Intel Processors. Columns: Processors; MMX(TM) Technology Intrinsics; Streaming SIMD Extensions; Streaming SIMD Extensions 2; Itanium Processor Instructions. Itanium Processor: X, X, N/A. Pentium 4 Processor: X, X, X. Pentium III Processor: X, X, N/A. Pentium II Processor: X, N/A, N/A. Pentium with MMX Technology: X, N/A, N/A. Pentium Pro Processor: N/A, N/A, N/A. Pentium Processor: N/A, N/A, N/A. Benefits of Using Intrinsics. The major benefit of using intrinsics is that you now have access to key features that are not available using conventional coding practices. Intrinsics enable you to code with the syntax of C function calls and variables instead of assembly language. Most MMX(TM) technology, Streaming SIMD Extensions, and Streaming SIMD Extensions 2 intrinsics have a corresponding C intrinsic that implements that instruction directly. Th
273. he optimization report. The min argument provides the minimal summary and max produces the full report. The default is -opt_report_levelmin. -opt_report_routine routine_substring generates reports from all routines with names containing the substring as part of their name. If not specified, reports from all routines are generated. By default, the compiler generates reports for all routines. Specifying Optimizations to Generate Reports. The compiler can generate reports for an optimizer you specify in the phase argument of the -opt_report_phasephase option. The option can be used multiple times on the same command line to generate reports for multiple optimizers. Currently, the following optimizer reports are supported: [Table: Optimizer Logical Name, Optimizer Full Name]. When one of the logical names for optimizers is specified, all reports from that optimizer are generated. For example, -opt_report_phaseipo -opt_report_phaseecg generates reports from the interprocedural optimizer and the code generator. Each of the optimizers can potentially have specific optimizations within them. Each of these optimizations is prefixed with one of the optimizer logical names. For example: Optimizer_optimization / Full Name: ipo_inline: Interprocedural Optimizer, inline expansion of functions; ipo_constant_propagation: Interprocedural Optimizer, constant propagation
274. here A, B are F32vec4 object variables. __m128 mm = A & B; where A, B are F32vec1 object variables. Constructors and Initialization. The following table shows how to create and initialize F32vec objects with the Fvec classes. Constructors and Initialization for Fvec Classes (columns: Example; Intrinsic; Returns). Constructor Declaration: F64vec2 A; F32vec4 B; F32vec1 C; Intrinsic: N/A; Returns: N/A. __m128 Object Initialization: F64vec2 A(__m128d mm); F32vec4 B(__m128 mm); F32vec1 C(__m128 mm); Intrinsic: N/A; Returns: N/A. Double Initialization: F64vec2 A(double d0, double d1); F64vec2 A = F64vec2(double d0, double d1); Initializes two doubles; _mm_set_pd; A0 := d0. F64vec2 A(double d0); Initializes both return values with the same double-precision value; _mm_set1_pd; A0 := d0. Float Initialization: F32vec4 A(float f3, float f2, float f1, float f0); F32vec4 A = F32vec4(float f3, float f2, float f1, float f0); _mm_set_ps; A0 := f0; A1 := f1; A2 := f2; A3 := f3. F32vec4 A(float f0); Initializes all return values with the same floating-point value; _mm_set1_ps; A0 := f0; A1 := f0; A2 := f0. F32vec4 A(double d0); Initializes all return values with the same double-precision value; _mm_set1_ps(d); A0 := d0. F32vec1 A(double d0); Initializes the lowest value of A with d0 and the other values with 0; _mm_set_ss(d); A0 := d0. F32vec1 B(float f0
275. huffle_ps(m1, m2, _MM_SHUFFLE(1, 0, 3, 2)). Macro Functions to Read and Write the Control Registers. The following macro functions enable you to read and write bits to and from the control register. For details, see Set Operations. For Itanium-based systems, these macros do not allow you to access all of the bits of the FPSR. See the descriptions for the getfpsr and setfpsr intrinsics in the Native Intrinsics for Itanium Instructions topic. Exception State Macros: _MM_SET_EXCEPTION_STATE(x), _MM_GET_EXCEPTION_STATE(). Macro Arguments: _MM_EXCEPT_INVALID, _MM_EXCEPT_DIV_ZERO, _MM_EXCEPT_DENORM, _MM_EXCEPT_OVERFLOW, _MM_EXCEPT_UNDERFLOW, _MM_EXCEPT_INEXACT. Macro Definitions: Write to and read from the six least significant control register bits, respectively. The following example tests for a divide-by-zero exception. Exception State Macros with _MM_EXCEPT_DIV_ZERO: if (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_DIV_ZERO) { /* Exception has occurred */ } Exception Mask Macros: _MM_SET_EXCEPTION_MASK(x), _MM_GET_EXCEPTION_MASK(). Macro Arguments: _MM_MASK_INVALID, _MM_MASK_DIV_ZERO, _MM_MASK_DENORM. Macro Definitions: Write to and read from the seventh through twelfth control register bits, respectively. Note: All
276. ic: Corresponding Intrinsics and Classes, Part 1. [Table: columns Operators, Corresponding Intrinsic, I64vec2, I32vec4, I16vec8, I8vec16; rows +, -, *, /, mul_high, mul_add, sqrt, rcp, rcp_nr, rsqrt, rsqrt_nr; cell contents not recoverable.] Arithmetic: Corresponding Intrinsics and Classes, Part 2. Columns: Operators; Corresponding Intrinsic; I32vec2; I16vec4; I8vec8; F64vec2; F32vec4; F32vec1.
+: _mm_add_[x]: pi32, pi16, pi8, pd, ps, ss
-: _mm_sub_[x]: pi32, pi16, pi8, pd, ps, ss
*: _mm_mullo_[x]: N/A, pi16, N/A, pd, ps, ss
/: _mm_div_[x]: N/A, N/A, N/A, pd, ps, ss
mul_high: _mm_mulhi_[x]: N/A, pi16, N/A, N/A, N/A, N/A
mul_add: _mm_madd_[x]: N/A, pi16, N/A, N/A, N/A, N/A
sqrt: _mm_sqrt_[x]: N/A, N/A, N/A, pd, ps, ss
rcp: _mm_rcp_[x]: N/A, N/A, N/A, pd, ps, ss
rcp_nr; rsqrt: _mm_rsqrt_[x]; rsqrt_nr: _mm_rsqrt_[x], _mm_sub_[x], _mm_mul_[x] (cell contents not recoverable)
Shift Operators: Corresponding Intrinsics and Classes, Part 1. Oper
277. ic math library. libimf.so: Default shared math library. Using the Intel Math Library. To use the Intel math library, include the header file mathimf.h in your program. Here are two example programs that illustrate the use of the math library. Example Using Real Functions:
    // real_math.c
    #include <stdio.h>
    #include <mathimf.h>
    int main() {
        float fp32bits;
        double fp64bits;
        long double fp80bits;
        long double pi_by_four = 3.141592653589793238/4.0;
        // pi/4 radians is about 45 degrees.
        fp32bits = (float) pi_by_four;   // float approximation to pi/4
        fp64bits = (double) pi_by_four;  // double approximation to pi/4
        fp80bits = pi_by_four;           // long double (extended) approximation to pi/4
        // The sin(pi/4) is known to be 1/sqrt(2) or approximately .7071067
        printf("When x = %8.8f, sinf(x) = %8.8f \n", fp32bits, sinf(fp32bits));
        printf("When x = %16.16f, sin(x) = %16.16f \n", fp64bits, sin(fp64bits));
        printf("When x = %20.20Lf, sinl(x) = %20.20Lf \n", fp80bits, sinl(fp80bits));
        return 0;
    }
Compiling real_math.c: prompt> icc real_math.c
The output of a.out will look like this:
    When x = 0.78539816, sinf(x) = 0.70710678
    When x = 0.7853981633974483, sin(x) = 0.7071067811865475
    When x = 0.78539816339744827900, sinl(x) = 0.70710678118654750275
Example Using Complex Functions: complex_math
278. ic messages, you can modify your program to overcome the known limitations and enable effective vectorizations. The following topics summarize the capabilities and restrictions of the vectorizer with respect to loop structures. Data Dependence. Data dependence relations represent the required ordering constraints on the operations in serial loops. Because vectorization rearranges the order in which operations are executed, any auto-vectorizer must have at its disposal some form of data dependence analysis. The Data-dependent Loop example shows some code that exhibits data dependence. The value of each element of an array is dependent on itself and its two neighbors. Data-dependent Loop: float data[N]; int i; for (i = 1; i < N-1; i++) { data[i] = data[i-1]*0.25 + data[i]*0.5 + data[i+1]*0.25; } The loop in this example is not vectorizable because the write to the current element data[i] is dependent on the use of the preceding element data[i-1], which has already been written to and changed in the previous iteration. To see this, look at the access patterns of the array for the first two iterations as shown in the following example. Data Dependence Vectorization Patterns: for (i = 0; i < 100; i++) a[i] = b[i]; has access pattern: read b[0], write a[0], read b[1], write a[1]. i=1: READ data[0], READ data[1], READ data[2], WRITE data[1]. i=2: READ data[1], READ data[2], READ data[3], WRITE data[2]. In the
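The loop-carried dependence described above can be made concrete with a small sketch in C. The first function is the in-place smoothing loop from the text. The second is an out-of-place variant, added here purely for contrast, that reads only from a source array and writes only to a destination array, so no iteration consumes a value written by an earlier one. The function names are hypothetical, and the two functions intentionally compute different results.

```c
#include <stddef.h>

/* In-place 3-point smoothing, as in the Data-dependent Loop example.
   Iteration i reads data[i-1], which iteration i-1 has just overwritten,
   so the iterations must run in order. */
void smooth_in_place(float *data, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++) {
        data[i] = data[i - 1] * 0.25f + data[i] * 0.5f + data[i + 1] * 0.25f;
    }
}

/* Out-of-place variant: every read comes from src and every write goes
   to dst. With restrict promising that the arrays do not overlap, the
   iterations are independent of one another. */
void smooth_out_of_place(float *restrict dst, const float *restrict src,
                         size_t n)
{
    if (n == 0) {
        return;
    }
    dst[0] = src[0];            /* boundary elements are copied unchanged */
    dst[n - 1] = src[n - 1];
    for (size_t i = 1; i + 1 < n; i++) {
        dst[i] = src[i - 1] * 0.25f + src[i] * 0.5f + src[i + 1] * 0.25f;
    }
}
```

For the input {0, 4, 0, 0}, the in-place loop yields {0, 2, 0.5, 0} because element 2 sees the already-smoothed element 1, while the out-of-place loop yields {0, 2, 1, 0}.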
279. ide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word. __m64 _m64_psub1uus(__m64 a, __m64 b): a is subtracted from b as eight separate byte-wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word. __m64 _m64_psub2uus(__m64 a, __m64 b): a is subtracted from b as four separate 16-bit wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word. __m64 _m64_pavg1_nraz(__m64 a, __m64 b): The unsigned byte-wide data elements of a are added to the unsigned byte-wide data elements of b, and the results of each add are then independently shifted to the right by one position. The high-order bits of each element are filled with the carry bits of the sums. __m64 _m64_pavg2_nraz(__m64 a, __m64 b): The unsigned 16-bit wide data elements of a are added to the unsigned 16-bit wide data elements of b, and the results of each add are then independently shifted to the right by one position. The high-order bits of each element are filled with the carry bits of the sums. Synchronization Primitives. The synchronization primitive intrinsics provide a variety of operations. Besides performing these operations, ea
280. igned 32-bit integers in a with the upper 2 signed or unsigned 32-bit integers in b. r0 := a2; r1 := b2; r2 := a3; r3 := b3. __m128i _mm_unpackhi_epi64(__m128i a, __m128i b): Interleaves the upper signed or unsigned 64-bit integer in a with the upper signed or unsigned 64-bit integer in b. r0 := a1; r1 := b1. __m128i _mm_unpacklo_epi8(__m128i a, __m128i b): Interleaves the lower 8 signed or unsigned 8-bit integers in a with the lower 8 signed or unsigned 8-bit integers in b. r0 := a0; r1 := b0; r2 := a1; r3 := b1; ... r14 := a7; r15 := b7. __m128i _mm_unpacklo_epi16(__m128i a, __m128i b): Interleaves the lower 4 signed or unsigned 16-bit integers in a with the lower 4 signed or unsigned 16-bit integers in b. r0 := a0; r1 := b0; r2 := a1; r3 := b1; r4 := a2; r5 := b2; r6 := a3; r7 := b3. __m128i _mm_unpacklo_epi32(__m128i a, __m128i b): Interleaves the lower 2 signed or unsigned 32-bit integers in a with the lower 2 signed or unsigned 32-bit integers in b. r0 := a0; r1 := b0; r2 := a1; r3 := b1. __m128i _mm_unpacklo_epi64(__m128i a, __m128i b): Interleaves the lower signed or unsigned 64-bit integer in a with the lower signed or unsigned 64-bit integer in b. r0 := a0; r1 := b0. __m64 _mm_movepi64_pi64(__m128i a): Returns the lower 64 bits of a as an __m64 type. r0 := a0. __m128i _mm_movpi64_epi64(__m64 a): Moves the 64 bits of a to the lower 64 bits of the r
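The interleaving rules given for the 32-bit unpack intrinsics can be written out as a scalar model in C. This is a sketch of the documented element ordering only, with index 0 as the lowest element. The `model_*` function names are hypothetical, and the output array is assumed not to alias either input.

```c
#include <stdint.h>

/* Models _mm_unpacklo_epi32: interleave the lower two elements.
   r0 := a0; r1 := b0; r2 := a1; r3 := b1. */
void model_unpacklo_epi32(const int32_t a[4], const int32_t b[4],
                          int32_t r[4])
{
    r[0] = a[0];
    r[1] = b[0];
    r[2] = a[1];
    r[3] = b[1];
}

/* Models _mm_unpackhi_epi32: interleave the upper two elements.
   r0 := a2; r1 := b2; r2 := a3; r3 := b3. */
void model_unpackhi_epi32(const int32_t a[4], const int32_t b[4],
                          int32_t r[4])
{
    r[0] = a[2];
    r[1] = b[2];
    r[2] = a[3];
    r[3] = b[3];
}
```

Applying the low and high models to the same pair of inputs produces, between them, every element of a and b exactly once, which is why the pair is commonly used to zip two vectors.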
281. ile-guided optimizations to select and prioritize application tests based on prior execution profiles of the application. The tool offers a potential of significant time saving in testing and developing large-scale applications where testing is the major bottleneck. The tool can be used for both IA-32 and Itanium architectures. This tool lets you select and prioritize the tests that are most relevant for any subset of the application's code. When certain modules of an application are changed, the Test-prioritization Tool suggests the tests that are most probably affected by the change. The tool analyzes the profile data from previous runs of the application, discovers the dependency between the application's components and its tests, and uses this information to guide the process of testing. Features and Benefits. The tool provides an effective testing hierarchy based on the application's code coverage. The advantages of the tool usage can be summarized as follows. Minimizing the number of tests that are required to achieve a given overall coverage for any subset of the application: the tool defines the smallest subset of the application tests that achieve exactly the same code coverage as the entire set of tests. Reducing the turn-around time of testing: instead of spending a long time on finding a possibly large number of failures, the tool enables the users to quickly find a small number of tests that expose the defects associated with regres
282. ill use PCH files created from other sources if the header files are the same. For example, if you compile source1.cpp using -pch, then source1.pchi is created. If you then compile source2.cpp using -pch, the compiler will use source1.pchi if it detects the same headers. -create_pch. Use the -create_pch filename option if you want the compiler to create a PCH file called filename. Note the following regarding this option: The filename parameter must be specified. The filename parameter can be a full path name. The full path to filename must exist. The .pchi extension is not automatically appended to filename. This option cannot be used in the same compilation as -use_pch filename. The -create_pch filename option is supported for single source file compilations only. Example 2 command line: prompt> icpc -create_pch /pch/source32.pchi source.cpp Example 2 output: "source.cpp": creating precompiled header file "/pch/source32.pchi" -use_pch filename. This option directs the compiler to use the PCH file specified by filename. It cannot be used in the same compilation as -create_pch filename. The -use_pch filename option supports full path names and supports multiple source files when all source files use the same .pchi file. Example 3 command line: prompt> icpc -use_pch /pch/source32.pchi source.cpp Example 3 output: "source.cpp": using precompiled header file "/pch/sou
283. ime library, the resulting program will likely be very unstable, especially when new code is linked against the shared libraries. You should use the -cxxlib-gcc option if your application includes source files generated by g++ and source files generated by the Intel C++ Compiler. This option directs the Intel compiler to use the g++ header and library files to build one set of run-time libraries. As a result, your program should run correctly. -cxxlib-icc option. The -cxxlib-icc option directs the Intel compiler to use the C++ run-time libraries and C++ header files included with the Intel compiler. They include: libcprts (standard C++ headers); libcprts (standard C++ library); libcxa and libunwind (C++ language support). Note: The -cxxlib-icc option is ON by default if you are using a gcc version less than 3.2. -fabi-version. The -fabi-version=n option directs the compiler to select a specific ABI implementation. By default, the Intel compiler uses the ABI implementation that corresponds to the installed version of gcc. Both gcc 3.2 and 3.3 are not fully ABI-compliant. Value of n / Description: Select most recent ABI implementation. Select g++ 3.2 compatible ABI implementation. Select most conformant ABI implementation. See http://www.codesourcery.com for more information on ABI conformance. See Specifying Alternate Tools and Paths for informa
284. ime test for the profitability of executing in parallel for loop with loop parameters that are not compile-time constants. Coding Guidelines. Enhance the power and effectiveness of the auto-parallelizer by following these coding guidelines: Expose the trip count of loops whenever possible. Specifically, use constants where the trip count is known and save loop parameters in local variables. Avoid placing structures inside loop bodies that the compiler may assume to carry dependent data, for example, function calls, ambiguous indirect references, or global references. Auto-parallelization Data Flow. For auto-parallelization processing, the compiler performs the following steps: data flow analysis, loop classification, dependence analysis, high-level parallelization, data partitioning, and multi-threaded code generation. These steps include: Data flow analysis: compute the flow of data through the program. Loop classification: determine loop candidates for parallelization based on correctness and efficiency as shown by threshold analysis. Dependence analysis: compute the dependence analysis for references in each loop nest. High-level parallelization: analyze the dependence graph to determine loops which can execute in parallel; compute run-time dependency. Data partitioning: examine data reference and partition based on the following types of access: shared, private, and firstprivate. Multi-threaded code generation: modify loop param
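The guideline about saving loop parameters in local variables can be illustrated with a short sketch in C. The `params_t` structure and both function names are hypothetical. The sketch shows only the source-level pattern and makes no claim about what any particular compiler does with either version.

```c
/* Hypothetical parameter block, used only for this illustration. */
typedef struct {
    int count;      /* number of elements to process */
    double scale;   /* multiplier applied to each element */
} params_t;

/* Harder to analyze: the bound and the multiplier are read through p on
   every iteration, and a compiler may have to assume that a store to
   out[i] could change them. */
void scale_indirect(double *out, const double *in, const params_t *p)
{
    for (int i = 0; i < p->count; i++) {
        out[i] = in[i] * p->scale;
    }
}

/* Follows the guideline: the trip count and the multiplier are copied
   into locals, so both are visibly loop-invariant. */
void scale_local(double *out, const double *in, const params_t *p)
{
    const int n = p->count;
    const double s = p->scale;
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * s;
    }
}
```

Both functions compute the same result when out does not overlap the parameter block. The second form simply states that fact in a way the loop analysis can see.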
285. in some cases. However, it also may provide lower performance than movdqu if the memory value being read was just previously written. r := *p. Macro Functions for Streaming SIMD Extensions 3. The macro function intrinsics listed here are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). The prototypes for these intrinsics are in the pmmintrin.h header file. _MM_SET_DENORMALS_ZERO_MODE(x): Macro arguments: one of _MM_DENORMALS_ZERO_ON, _MM_DENORMALS_ZERO_OFF. This causes "denormals are zero" mode to be turned on or off by setting the appropriate bit of the control register. _MM_GET_DENORMALS_ZERO_MODE(): No arguments. This returns the current value of the denormals-are-zero mode bit of the control register. Miscellaneous Intrinsics for Streaming SIMD Extensions 3. The miscellaneous intrinsics listed here are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). The prototypes for these intrinsics are in the pmmintrin.h header file. extern void _mm_monitor(void const *p, unsigned extensions, unsigned hints); Generates the MONITOR instruction. This sets up an address range for the monitor hardware using p to provide the logical address, and will be passed to the monitor instruction in register eax. The extensions parameter contains optional extensions to the monitor hardware which will be pass
286. in maximum performance of your application. This documentation assumes that you are familiar with the C++ standard programming language and with the Intel processor architecture. You should also be familiar with the host computer's operating system. Note: This document explains how information and instructions apply differently to each targeted architecture. If there is no specific indication to either architecture, the description is applicable to both architectures. Notation Conventions (Style / Definition). This type style: indicates an element of syntax, a reserved word, a keyword, a file name, or part of a program example (text appears in lowercase unless UPPERCASE is required). This type style: indicates what you type as input. This type style: indicates an argument on a command line or an option's argument. [items]: indicates that the items enclosed in brackets are optional. {item | item}: indicates a set of choices from which you must select one. ... (ellipses): indicates that an argument can be repeated several times. Compiler Options Quick Reference. Conventions Used in the Options Quick Reference Tables (Convention / Definition). Conventions: [-]; values in { } with vertical bars; words in this style following an option. New Options. Definition: If an option includes "[-]" as part of the definition, then the option can be used to enable or disable the feature. For example, the
287. ing capability is also known as single-instruction multiple data processing (SIMD). For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and assembly programming. Further, the compiler optimizes the instruction scheduling so that your executable runs faster. Note: The MM and XMM registers are the SIMD registers used by the IA-32 platforms to implement MMX technology and Streaming SIMD Extensions/Streaming SIMD Extensions 2 intrinsics. On the Itanium-based platforms, the MMX and Streaming SIMD Extension intrinsics use the 64-bit general registers and the 64-bit significand of the 80-bit floating-point register. Data Types. Intrinsic functions use four new C data types as operands, representing the new registers that are used as the operands to these intrinsic functions. The following table shows the data type availability, marked with "X". New Data Types Available. Columns: New Data Type; MMX(TM) Technology; Streaming SIMD Extensions; Streaming SIMD Extensions 2; Itanium Processor. __m64: X, X, X, X. __m128: N/A, X, X, X. __m128d: N/A, N/A, X, X. __m128i: N/A, N/A, X, X. __m64 Data Type. The __m64 data type is used to represent the contents of an MMX register, which is the register that is used by the MMX technology intrinsics. The __m64 data
288. ing is enabled by -prof_use in Phase 3 to improve code locality by splitting routines into different sections: one section to contain the cold or very infrequently executed code, and one section to contain the rest of the code (hot code). You can use -fnsplit- to disable function splitting for the following reasons: Most importantly, to get improved debugging capability. In the debug symbol table, it is difficult to represent a split routine, that is, a routine with some of its code in the hot code section and some of its code in the cold code section. The -fnsplit- option disables the splitting within a routine but enables function grouping, an optimization in which entire routines are placed either in the cold code section or the hot code section. Function grouping does not degrade debugging capability. Another reason can arise when the profile data does not represent the actual program behavior, that is, when the routine is actually used frequently rather than infrequently. Example of Profile-guided Optimization. The three basic phases of PGO are: Instrumentation Compilation and Linking; Instrumented Execution; Feedback Compilation. Instrumentation Compilation and Linking. Use -prof_gen to produce an executable with instrumented information. Use also the -prof_dir option as recommended for most programs, especially if the application includes the source files located in multiple direc
289. ing thread to wait until the nested lock associated with lock is available. The thread is granted ownership of the nested lock when it becomes available. omp_unset_nest_lock(lock): Releases the executing thread from ownership of the nested lock associated with lock if the nesting count is zero. Behavior is undefined if the executing thread does not own the nested lock associated with lock. omp_test_nest_lock(lock): Attempts to set the nested lock associated with lock. If successful, returns the nesting count; otherwise, returns zero. Timing Routines (Function / Description). omp_get_wtime(): Returns a double-precision value equal to the elapsed wallclock time (in seconds) relative to an arbitrary reference time. The reference time does not change during program execution. omp_get_wtick(): Returns a double-precision value equal to the number of seconds between successive clock ticks. Intel Extensions. The Intel C++ Compiler implements the following groups of functions as extensions to the OpenMP run-time library: getting and setting stack size for parallel threads; memory allocation. The Intel extensions described in this section can be used for low-level debugging to verify that the library code and application are functioning as intended. It is recommended to use these functions with caution because using them requires the use of the -openmp_stubs command-line option to execute the progra
290. ing used. The name of the first file is taken from the value of the -o option. The name of subsequent files is derived from this file by appending a numeric value to the file name. For example, if the first object file is named foo.o, the second object file will be named foo1.o. The compiler generates a message indicating the name of each object or assembly file it is generating. These files can be added to the real link step to build the final application. Creating a Multifile IPO Executable Using xild. Use the Intel linker, xild, instead of step 2 in Command Line for Creating an IPO Executable. The xild linker performs the following steps: 1. Invokes the compiler to perform IPO if objects containing IR are found. 2. Invokes the GCC linker, ld, to link the application. The command-line syntax for xild is the same as that of the GCC linker: prompt> xild [<options>] <LINK_commandline> where: [<options>] (optional) may include any GCC linker options or options supported only by xild; <LINK_commandline> is your linker command line containing a set of valid arguments to ld. To create app using IPO, use the option -o filename as shown in the following example: prompt> xild -oapp a.o b.o c.o The xild linker calls the compiler to perform IPO for objects containing IR and creates a new list of object(s) to be linked. Then xild calls ld to link the object files that are specified in
291. __int64 to type __m64. Translates to nop since both types reside in the same register on Itanium-based systems. Convert its double-precision argument to a signed integer. Map the getf.exp instruction and return the 16-bit exponent and the sign of its operand. The prototypes for getReg and setReg intrinsics are in the ia64regs.h header file. whichReg: _IA64_REG_IP, _IA64_REG_PSR_L. General Integer Registers (whichReg). Application Registers (name: whichReg):
_IA64_REG_AR_KR2: 3074
_IA64_REG_AR_KR3: 3075
_IA64_REG_AR_KR4: 3076
_IA64_REG_AR_KR5: 3077
_IA64_REG_AR_KR6: 3078
_IA64_REG_AR_KR7: 3079
_IA64_REG_AR_RSC: 3088
_IA64_REG_AR_BSP: 3089
_IA64_REG_AR_BSPSTORE: 3090
_IA64_REG_AR_RNAT: 3091
_IA64_REG_AR_FCR: 3093
_IA64_REG_AR_EFLAG: 3096
_IA64_REG_AR_CSD: 3097
_IA64_REG_AR_SSD: 3098
_IA64_REG_AR_CFLAG: 3099
_IA64_REG_AR_FSR: 3100
_IA64_REG_AR_FIR: 3101
_IA64_REG_AR_FDR: 3102
_IA64_REG_AR_CCV: 3104
_IA64_REG_AR_UNAT: 3108
_IA64_REG_AR_FPSR: 3112
_IA64_REG_AR_ITC: 3116
_IA64_REG_AR_PFS: 3136
_IA64_REG_AR_LC: 3137
_IA64_REG_AR_EC: 3138
Control Registers: _IA64_REG_CR_PTA, _IA64_REG_CR_IP
292. ion is on by default. The Intel C++ Compiler improves debuggability of optimized code through enhanced support for tracebacks, variable locations, and breakpoints and stepping. The options described in the following table control emission of enhanced debug information. They must be used in conjunction with the -g option. Option / Description. -debug inline_info: This option produces enhanced source position information for inlined code. This leads to greater accuracy when reporting the source location of any instruction. It also provides more information to debuggers for function call traceback. The Intel debugger, idb, has been enhanced to use the richer debug information to show simulated call frames for inlined functions. This option is off by default. -debug variable_locations: This option produces additional debug information for scalar local variables using a feature of the DWARF object module format known as "location lists". The runtime locations of local scalar variables are specified more accurately using this feature, i.e., whether at a given position in the code, a variable value is found in memory or a machine register. The Intel debugger is able to process location lists and display values of local variables at runtime with improved accuracy. This option is off by default. -debug extended: This option turns on the debug options described previously: -debug inline_info and -debug variable_locations.
293. ional select operators, the return value is stored in C if the comparison is true or in D if false. The following table shows the return values for each class of the conditional select operators, using the Return Value Notation described earlier. Compare Operator Return Value Mapping. Columns: R; Operators; F32vec4; F64vec2; F32vec1. The operator is one of select_[eq | lt | le | gt | ge] or select_[ne | nlt | nle | ngt | nge], applied as (A select_op B) ? C : D per element. R0 (A0, B0, C0, D0): X, X, X. R1 (A1, B1, C1, D1): X, X, N/A. R2 (A2, B2, C2, D2): X, N/A, N/A. R3 (A3, B3, C3, D3): X, N/A, N/A. The following table shows examples for conditional select operations and corresponding intrinsics. Conditional Select Operations for Fvec Classes (Returns / Example Syntax Usage / Intrinsic). Compare for Equality. 4 floats: F32vec4 R = select_eq(F32vec4 A); _mm_cmpeq_ps. 2 doubles: F64vec2 R = select_eq(F64vec2 A); _mm_cmpeq_pd. 1 float: F32vec1 R = select_eq(F32vec1 A); _mm_cmpeq_ss. Compare for Inequality. 4 floats: F32vec4 R = select_neq(F32vec4 A); _mm_cmpneq_ps. 2 doubles: F64vec2 R = select_neq(F64vec2 A); _mm_cmpneq_pd. 1 float: F32vec1 R = select_neq(F32vec1 A); _mm_cmpneq_ss. Compare for Less Than. 4 floats: F32vec4 R = select_lt(F32vec4 A); _mm_cmplt_ps. 2 doubles: F64vec2 R = select_lt(F64vec2 A); _mm_cmplt_pd. 1 float 2 doubles 2 doubles Com
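The per-element rule behind the conditional select operators, R[i] = (A[i] op B[i]) ? C[i] : D[i], can be sketched as a scalar model in C for the four-float case. This is an illustration of the mapping only, not the Fvec class implementation. The `model_*` function names are hypothetical.

```c
/* Scalar model of select_eq on four floats:
   r[i] = (a[i] == b[i]) ? c[i] : d[i] for each element. */
void model_select_eq(const float a[4], const float b[4],
                     const float c[4], const float d[4], float r[4])
{
    for (int i = 0; i < 4; i++) {
        r[i] = (a[i] == b[i]) ? c[i] : d[i];
    }
}

/* Scalar model of select_lt on four floats:
   r[i] = (a[i] < b[i]) ? c[i] : d[i] for each element. */
void model_select_lt(const float a[4], const float b[4],
                     const float c[4], const float d[4], float r[4])
{
    for (int i = 0; i < 4; i++) {
        r[i] = (a[i] < b[i]) ? c[i] : d[i];
    }
}
```

A common use of this pattern is a branch-free element-wise minimum: passing the same vectors as (A, B, A, B) to the less-than select returns the smaller value in each position.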
294. is not known (vectorization would be illegal if k < 0). Example of ivdep Directive: #pragma ivdep for (i = 0; i < m; i++) { a[i] = a[i + k] * c; } vector aligned Directive. The vector aligned directive means the loop should be vectorized, if it is legal to do so, ignoring normal heuristic decisions about profitability. When the aligned or unaligned qualifier is used, the loop should be vectorized using aligned or unaligned operations. Specify either aligned or unaligned, but not both. If you specify aligned as an argument, you must be absolutely sure that the loop will be vectorizable using this instruction. Otherwise, the compiler will generate incorrect code. The loop in the following example uses the aligned qualifier to request that the loop be vectorized with aligned instructions, as the arrays are declared in such a way that the compiler could not normally prove this would be safe to do so. Example of vector aligned Directive: void foo(float *a) { #pragma vector aligned for (i = 0; i < m; i++) { a[i] = a[i] * c; } } The compiler includes several alignment strategies in case the alignment of data structures is not known at compile time. A simple example follows, but several other strategies are supported as well. If, in the following loop, the alignment of a is unknown, the compiler will generate a prelude loop that iterates until the array reference that occurs the most h
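The reason vectorizing the ivdep example would be illegal for k < 0 can be shown with a scalar emulation in C. The first function runs the loop serially. The second imitates vector execution by loading a whole chunk of inputs before storing any outputs. The chunk length `VLEN` and both function names are assumptions made for this sketch, and the caller is assumed to keep every index i + k inside the array.

```c
enum { VLEN = 4 };  /* assumed vector length for the emulation */

/* Serial semantics: a[i] = a[i + k] * c for i in [begin, end). */
void scale_shift_serial(float *a, int begin, int end, int k, float c)
{
    for (int i = begin; i < end; i++) {
        a[i] = a[i + k] * c;
    }
}

/* Emulated vector semantics: for each chunk of up to VLEN iterations,
   perform all loads first and all stores afterwards. */
void scale_shift_chunked(float *a, int begin, int end, int k, float c)
{
    for (int i = begin; i < end; i += VLEN) {
        float tmp[VLEN];
        int len = (end - i < VLEN) ? (end - i) : VLEN;
        for (int j = 0; j < len; j++) {
            tmp[j] = a[i + j + k] * c;   /* loads */
        }
        for (int j = 0; j < len; j++) {
            a[i + j] = tmp[j];           /* stores */
        }
    }
}
```

With k >= 0, each iteration reads an element that no earlier iteration has written, so both versions agree. With k = -1, the serial loop feeds each freshly written element into the next iteration, while the chunked version reads the old values, so the results diverge. That divergence is what the programmer promises cannot happen when asserting ivdep.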
295. is frees you from managing registers and enables the compiler to optimize the instruction scheduling. The MMX technology and Streaming SIMD Extension instructions use the following new features: New Registers: enable packed data of up to 128 bits in length for optimal SIMD processing. New Data Types: enable packing of up to 16 elements of data in one register. The Streaming SIMD Extensions 2 intrinsics are defined only for IA-32, not for Itanium-based systems. Streaming SIMD Extensions 2 operate on 128-bit quantities (2 64-bit double-precision floating-point values). The Itanium architecture does not support parallel double-precision computation, so Streaming SIMD Extensions 2 are not implemented on Itanium-based systems. New Registers. A key feature provided by the architecture of the processors are new register sets. The MMX instructions use eight 64-bit registers (mm0 to mm7) which are aliased on the floating-point stack registers. [Figure: MMX(TM) Technology Registers: tag word and 64-bit registers MM0 to MM7.] Streaming SIMD Extensions Registers. The Streaming SIMD Extensions use eight 128-bit registers (xmm0 to xmm7). [Figure: Streaming SIMD Extension Registers.] These new data registers enable the processing of data elements in parallel. Because each register can hold more than one data element, the processor can process more than one data element simultaneously. This process
296. is returned. Otherwise 0 is returned.

    r := (a0 >= b0) ? 0x1 : 0x0

int _mm_comineq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.

    r := (a0 != b0) ? 0x1 : 0x0

int _mm_ucomieq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.

    r := (a0 == b0) ? 0x1 : 0x0

int _mm_ucomilt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.

    r := (a0 < b0) ? 0x1 : 0x0

int _mm_ucomile_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.

    r := (a0 <= b0) ? 0x1 : 0x0

int _mm_ucomigt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.

    r := (a0 > b0) ? 0x1 : 0x0

int _mm_ucomige_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.

    r := (a0 >= b0) ? 0x1 : 0x0

int _mm_ucomineq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Other
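A short sketch may help show how these scalar comparison intrinsics are called. The wrapper names below are hypothetical and only the intrinsics themselves (_mm_set_ss, _mm_ucomilt_ss, _mm_ucomigt_ss, _mm_ucomige_ss, _mm_ucomineq_ss from xmmintrin.h) come from the SSE header. It assumes an x86 or x86-64 target with SSE enabled and ordinary (non-NaN) inputs.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 and x86-64 targets only

// Hypothetical wrappers (not from the guide). Each one places a scalar in
// the lower single-precision element of an __m128 value and compares the
// two lower elements, returning 1 or 0.
inline int lower_less_than(float x, float y)
{
    return _mm_ucomilt_ss(_mm_set_ss(x), _mm_set_ss(y));   // (x < y) ? 1 : 0
}

inline int lower_greater_than(float x, float y)
{
    return _mm_ucomigt_ss(_mm_set_ss(x), _mm_set_ss(y));   // (x > y) ? 1 : 0
}

inline int lower_greater_equal(float x, float y)
{
    return _mm_ucomige_ss(_mm_set_ss(x), _mm_set_ss(y));   // (x >= y) ? 1 : 0
}

inline int lower_not_equal(float x, float y)
{
    return _mm_ucomineq_ss(_mm_set_ss(x), _mm_set_ss(y));  // (x != y) ? 1 : 0
}
```

Comparing a value with itself separates the strict and non-strict forms: lower_greater_than(2.0f, 2.0f) yields 0, while lower_greater_equal(2.0f, 2.0f) yields 1.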
297. ision using the IEEE rounding mode

Intrinsic                 Description
double sinh(double)       Computes the hyperbolic sine of the double precision argument
float sinhf(float)        Computes the hyperbolic sine of the single precision argument
float sqrtf(float)        Computes the square root of the single precision argument
double tanh(double)       Computes the hyperbolic tangent of the double precision argument
float tanhf(float)        Computes the hyperbolic tangent of the single precision argument

Not implemented on Itanium-based systems.
double in this case is a complex number made up of two single precision (32-bit floating point) elements (real and imaginary parts).

String and Block Copy Related

The following are not implemented as intrinsics on Itanium-based platforms.

Intrinsic                                               Description
char *_strset(char *, _int32)                           Sets all characters in a string to a fixed value
int memcmp(const void *cs, const void *ct, size_t n)    Compares two regions of memory. Returns <0 if cs<ct, 0 if cs=ct, or >0 if cs>ct
void *memcpy(void *s, const void *ct, size_t n)         Copies from memory. Returns s
void *memset(void *s, int c, size_t n)                  Sets memory to a fixed value. Returns s

Intrinsic
size_t strlen(const char *cs)
int strncmp(char *, char *, int)
int strncpy(char *, char *, int)

Miscellaneous Intrinsics

char *strcat(char *s, const char *ct) int
298. issue an informational message indicating that it is generating an explicit linker script, ipo_layout.script. When ipo_layout.script is generated, the typical response is to modify your link command to use this script:

    --script=ipo_layout.script

If your application already requires a custom linker script, you can place the necessary contents of ipo_layout.script in your script. The layout-specific content of ipo_layout.script is at the beginning of the description of the .text section. For example, to describe the layout order for 12 routines:

    .text :
      *(.text00001) *(.text00002) *(.text00003) *(.text00004)
      *(.text00005) *(.text00006) *(.text00007) *(.text00008)
      *(.text00009) *(.text00010) *(.text00011) *(.text00012)

For applications that already require a linker script, you can add this section of the .text section description to the customized linker script. If you add these lines to your linker script, it is desirable to add additional entries to account for future development. This is harmless, since the syntax makes these contributions optional.

If you choose not to use the linker script, your application will still build, but the layout order will be random. This may have an adverse effect on application performance, particularly for large applications.

Compilation with Real Object Files

In certain situations you might need to generate real object files with ipo
299. ith single precision

Description
Returns the sine of x with double precision
Returns the sine of x with single precision
Returns the cosine of x with double precision
Returns the cosine of x with single precision
Returns the tangent of x with double precision
Returns the tangent of x with single precision
Returns the arccosine of x with double precision
Returns the arccosine of x with single precision
Compute the inverse hyperbolic cosine of the argument with double precision
Compute the inverse hyperbolic cosine of the argument with single precision
Compute arc sine of the argument with double precision

Intrinsic
float asinf(float)
double asinh(double)
float asinhf(float)
double atan(double)
float atanf(float)
double atanh(double)
float atanhf(float)
float cabs(double)
double ceil(double)
float ceilf(float)
double cosh(double)
float coshf(float)
float fabsf(float)
double floor(double)
float floorf(float)
double fmod(double)
float fmodf(float)
double hypot(double, double)
float hypotf(float)

Description
Compute arc sine of the argument with single precision
Compute inverse hyperbolic sine of the argument with double precision
Compute inverse hyperbolic sine of the argument with single precision
Compute arc tangent of the argument with double precision
Comput
300. its an aligned address. This makes the alignment properties of a known, and the vector loop is optimized accordingly.

Example of Alignment Strategies

    float *a;

    /* Alignment unknown */
    for (i = 0; i < 100; i++)
    {
        a[i] = a[i] + 1.0f;
    }

    /* Dynamic loop peeling */
    p = a & 0x0f;
    if (p != 0)
    {
        p = (16 - p) / 4;
        for (i = 0; i < p; i++)
        {
            a[i] = a[i] + 1.0f;
        }
    }

    /* Loop with a aligned. Will be vectorized accordingly. */
    for (i = p; i < 100; i++)
    {
        a[i] = a[i] + 1.0f;
    }

novector Directive

The novector directive specifies that the loop should never be vectorized, even if it is legal to do so. In this example, suppose you know the trip count (ub - lb) is too low to make vectorization worthwhile. You can use novector to tell the compiler not to vectorize, even if the loop is considered vectorizable.

Example of novector Directive

    void foo(int lb, int ub)
    {
        #pragma novector
        for (j = lb; j < ub; j++)
        {
            a[j] = a[j] + b[j];
        }
    }

Optimizer Report Generation

The Intel C++ Compiler provides options to generate and manage optimization reports:

  • -opt_report generates an optimization report and directs it to stderr. By default, the compiler does not generate optimization reports.
  • -opt_report_filefilename generates an optimization report and directs it to a file specified in filename.
  • -opt_report_level{min|med|max} specifies the detail level of t
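The dynamic loop peeling strategy is easier to follow as compilable code. The sketch below is an illustration written by hand, not what the compiler emits: the function name add_one_peeled is hypothetical, it assumes 4-byte floats, and it adds a clamp for arrays shorter than the peel count that the guide's pseudocode does not show.

```cpp
#include <cstdint>

// Hypothetical helper (not from the guide): adds 1.0f to each of the n
// elements of a, first peeling scalar iterations until the address of the
// next element is 16-byte aligned. Returns the number of peeled iterations.
// Assumes sizeof(float) == 4 and that a is at least 4-byte aligned.
int add_one_peeled(float* a, int n)
{
    // Low four bits of the address: distance past the previous 16-byte boundary.
    int p = static_cast<int>(reinterpret_cast<std::uintptr_t>(a) & 0x0f);
    if (p != 0) {
        p = (16 - p) / 4;   // number of floats up to the next boundary
    }
    if (p > n) {
        p = n;              // short array: every iteration is peeled
    }

    // Prelude loop: at most three scalar iterations.
    for (int i = 0; i < p; i++) {
        a[i] = a[i] + 1.0f;
    }

    // Main loop: &a[p] is 16-byte aligned, so a vectorizer may use
    // aligned loads and stores here.
    for (int i = p; i < n; i++) {
        a[i] = a[i] + 1.0f;
    }
    return p;
}
```

For a pointer that sits 4 bytes past a 16-byte boundary, three elements are peeled before the aligned loop starts; for an already aligned pointer, none are.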
301. ity function computes the present value factor for an annuity, (1 - (1+x)^(-y)) / x, where x is a rate and y is a period.

errno: ERANGE, for underflow and overflow conditions

Calling interface:
double annuity(double x, double y);
long double annuityl(double x, double y);
float annuityf(float x, double y);

COMPOUND

Description: The compound function computes the compound interest factor, (1+x)^y, where x is a rate and y is a period.

errno: ERANGE, for underflow and overflow conditions

Calling interface:
double compound(double x, double y);
long double compoundl(double x, double y);
float compoundf(float x, double y);

ERF

Description: The erf function returns the error function value.

Calling interface:
double erf(double x);
long double erfl(long double x);
float erff(float x);

ERFC

Description: The erfc function returns the complementary error function value.

errno: ERANGE, for underflow conditions

Calling interface:
double erfc(double x);
long double erfcl(long double x);
float erfcf(float x);

GAMMA

Description: The gamma function returns the value of the logarithm of the absolute value of gamma.

errno: ERANGE, for overflow conditions when x is a negative integer

Calling interface:
double gamma(double x);
long double gammal(long double x);
float gammaf(float x);

GAMMA_R

Description: The gamma_r function returns the value of the log
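As a rough cross-check of what the annuity and compound factors represent, the sketch below computes them with the standard library only. It assumes the usual financial definitions, (1 + x)^y for the compound interest factor and (1 - (1 + x)^(-y)) / x for the present value factor of an annuity. The function names are hypothetical stand-ins, not the Intel math library routines, and no errno handling is attempted.

```cpp
#include <cmath>

// Hypothetical stand-in (not the Intel library routine):
// compound interest factor (1 + x)^y for rate x and period y.
double compound_factor(double x, double y)
{
    return std::pow(1.0 + x, y);
}

// Hypothetical stand-in (not the Intel library routine):
// present value factor of an annuity, (1 - (1 + x)^(-y)) / x.
// The rate x must be nonzero; this sketch does not handle x == 0.
double annuity_factor(double x, double y)
{
    return (1.0 - std::pow(1.0 + x, -y)) / x;
}
```

For example, a 10% rate over two periods gives a compound factor of about 1.21, and a 10% rate over one period gives an annuity factor of about 0.909, which is 1 / 1.1.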
302. ium 4 processors, as long as there is a likely performance benefit:

    prompt> icpc -axKW prog.cpp

Manual CPU Dispatch (IA-32 only)

Use __declspec(cpu_specific) and __declspec(cpu_dispatch) in your code to generate instructions specific to the Intel processor on which the application is running, and also to execute correctly on other IA-32 processors.

Note: Manual CPU dispatch cannot be used to recognize Intel Itanium processors.

The syntax of these extended attributes is as follows:

  • cpu_specific(cpuid)
  • cpu_dispatch(cpuid-list)

The values for cpuid and cpuid-list are shown in the following tables:

Processor                                                            Values for cpuid
x86 processors not provided by Intel Corporation                     generic
Intel Pentium processors                                             pentium
Intel Pentium processors with MMX Technology                         pentium_mmx
Intel Pentium Pro processors                                         pentium_pro
Intel Pentium II processors                                          pentium_ii
Intel Pentium III processors                                         pentium_iii
Intel Pentium III (exclude xmm registers)                            pentium_iii_no_xmm_regs
Intel Pentium 4 processors                                           pentium_4
Intel Pentium M processors                                           pentium_m
Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3)    pentium_4_sse3

Values for cpuid-list

cpuid-list    cpuid

The attributes are not case sensitive. The body of a function declared with __declspec(cpu_dispatch) must be empty, and is referred to as a stub (an empty-bodied function). Use the following guid
303. k Reference Linux Windows Description Linux Default nobss_init Qnobss_init Disable placement of zero initialized variables in BSS use DATA nolib_inline Disable inline expansion of intrinsic functions 0 702 OFF ofile Fefile or Fofile Name output file OFF 00 Od Disable optimizations OFF 01 O1 Optimizes for speed OFF 02 02 ON P EP Preprocess to file OFF pc32 Qpc 32 Set internal FPU OFF precision to 24 bit significand pc64 Qpc 64 Set internal FPU OFF precision to 53 bit significand pc80 Qpc 80 Set internal FPU ON precision to 64 bit significand prec_div Qprec_div Improve precisionof OFF floating point divides some speed impact prof_dirdirectory Qprof_dirdirectory Specify directory for OFF profiling output files dyn and dpi prof_filefilename Qprof_filefilename Specify file name for OFF profiling summary file prof_gen x Qprof_genx Instrument program OFF for profiling with the x qualifier extra information is gathered prof_use Qprof_use Enable use of OFF profiling information during optimization 33 Intel C Compiler for Linux Systems User s Guide Linux Windows Description Linux Default Qinstall dir Set diras root of compiler installation Qlocation str dir Qlocation tool path Set diras the OFF location of tool specified by str Qoption str opts Qoption tool list Pass o
304. lass Libraries

Header File   Extension Set                   Available on These Processors
ivec.h        MMX(TM) technology              Pentium with MMX technology, Pentium II, Pentium III, Pentium 4, Intel Xeon(TM), and Itanium processors
fvec.h        Streaming SIMD Extensions       Pentium III, Pentium 4, Intel Xeon, and Itanium processors
dvec.h        Streaming SIMD Extensions 2     Pentium 4 and Intel Xeon processors

About the Classes

The Intel C++ Class Libraries for SIMD Operations include:

  • Integer vector (Ivec) classes
  • Floating-point vector (Fvec) classes

You can find the definitions for these operations in three header files: ivec.h, fvec.h, and dvec.h. The classes themselves are not partitioned like this. The classes are named according to the underlying type of operation. The header files are partitioned according to architecture:

  • ivec.h is specific to architectures with MMX(TM) technology
  • fvec.h is specific to architectures with Streaming SIMD Extensions
  • dvec.h is specific to architectures with Streaming SIMD Extensions 2

Streaming SIMD Extensions 2 intrinsics cannot be used on Itanium-based systems. The mmclass.h header file includes the classes that are usable on the Itanium architecture.

This documentation is intended for programmers writing code for the Intel architecture, particularly code that would benefit from the use of SIMD instructions. You should be familiar with C++ and the use of C++ classes.

Details About the Libraries
305. lds Change default bitfield type to OFF unsigned funsigned char Change default char type to OFF unsigned f no verbose asm Produce assemblable file with ON compiler comments Default fverbose asm fvisibility default file Space separated symbols listed OFF inthe file argument will get visibility set to default fvisibility extern file Space separated symbols listed OFF in the file argument will get visibility set to extern fvisibility hidden file Space separated symbols listed OFF in the file argument will get visibility set to hidden fvisibility internal file Space separated symbols listed OFF in the file argument will get visibility set to internal fvisibility protected file Space separated symbols listed OFF in the file argument will get visibility set to protected fvisibility extern default protected hidden internal Global symbols common and OFF defined data and functions will get the visibility attribute given by default Symbol visibility attributes explicitly set in the source code or using the symbol visibility attribute file options will override the fvisibility setting 15 Intel C Compiler for Linux Systems User s Guide Option Description Default fwritable strings Ensure that string literals are OFF 132 only placed in a writable data section J Generates symbolic debugging OFF information in the
306. le Direct linker to read link OFF commands from file tcheck The tcheck compiler option OFF enables analysis of threaded applications with Intel Thread Checker which is required to use this option 26 Compiler Options Quick Reference Option Description Default tppl Targets optimization for the 164 only Itanium processor tpp2 Targets optimization for the 164 only Itanium 2 processor Generated code is compatible with the Itanium processor tpp5 Targets the optimizations for the 132 only Pentium processor tpp6 Targets the optimizations for the i32 only Pentium Pro Pentium II and Pentium III processors tpp7 Targets optimizations for the 132 132em Intel Pentium 4 processors no traceback Tells the compiler to generate OFF not generate extra information in the object file to allow the display of source file traceback information at run time when a severe error occurs Suppresses any definition of a OFF macro name Equivalent to a undef preprocessing directive unrolin Disable loop unrolling for n 0 OFF unroll n Disable loop unrolling for n 0 OFF use_asm Produce objects through OFF assembler use_msasm Accept the Microsoft MASM OFF i32 only style inlined assembly format instead of GNU style use_pch filename Manual use of precompiled OFF header filename pchi u symbol Pretend the symbo1 is OFF undefined
307. le on Itanium based systems e Ifno label is present the option is available on all supported systems e If only appears in the label that option is only available on the identified system Option alias_args ansi_alias complex_limited_range falias ffnalias frtti fverbose asm mcpu pentium4 132 only mcpu itanium2 164 only 01 pc80 i32 132em prefetch 50X Description Enable C C rule that function arguments may be aliased Enable use of ANSI aliasing rules in optimizations user asserts that the program adheres to these rules Disable the use of the basic algebraic expansions of some complex arithmetic operations Assume aliasing in program Assume aliasing within functions Support for RTTI Produce assembly file with compiler comments requires S Optimizes for Intel Pentium 4 processor Optimizes for Intel Itanium 2 processor Same as 02 on IA 32 Same as O on Itanium based systems Set internal floating point precision to 64 bit significand Enables the insertion of software prefetching by the compiler Disable saving of compiler options and version in the executable 37 Intel C Compiler for Linux Systems User s Guide Option Description std gnu89 ISO C90 plus GNU extensions Includes some C99 features tpp2 Target optimization to the Itanium 2 processor 164 only tpp7 Target optimization to the Pentium 4 2 processor 132 only
308. lel are specified on the command line, the -parallel option is honored only in routines that do not contain OpenMP directives. For routines that contain OpenMP directives, only the -openmp option is honored.

Vectorization

The vectorizer is a component of the Intel C++ Compiler that automatically uses SIMD instructions in the MMX(TM), SSE, and SSE2 instruction sets. The vectorizer detects operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8, or 16 elements in one operation, depending on the data type.

This section provides guidelines, option descriptions, and examples for the Intel C++ Compiler vectorization on IA-32 systems only. The following list summarizes this section's contents:

  • a quick reference of vectorization functionality and features
  • descriptions of compiler switches to control vectorization
  • descriptions of the C++ language features to control vectorization
  • discussion and general guidelines on vectorization levels:
      • automatic vectorization
      • vectorization with user intervention
  • examples demonstrating typical vectorization issues and resolutions

Vectorizer Options

Option            Description
-ax{K|W|N|B|P}    Enables the vectorizer and generates specialized and generic IA-32 code. The generic code is usually slower than the specialized code.
-x{K|W|N|B|P}     Turns on the vectorizer and generates processor s
309. lity list file and that do not have __attribute__ visibility in their declaration. For example, the command-line options

    -fvisibility=protected -fvisibility-default=prot.txt

where file prot.txt is as previously described, will cause all global symbols except a, b, c, d, and e to have protected visibility. Those five symbols, however, will have default visibility and thus be preemptable.

Other Visibility-related Command-line Options

-fminshared

The -fminshared option specifies that the compilation unit will be part of a main program component and will not be linked as part of a shareable object. Since symbols defined in the main program cannot be preempted, this allows the compiler to treat symbols declared with default visibility as though they have protected visibility (i.e., -fminshared implies -fvisibility=protected). Also, the compiler need not generate position-independent code for the main program. It can use absolute addressing, which may reduce the size of the global offset table (GOT) and may reduce memory traffic.

-fpic

The -fpic option specifies full symbol preemption. Global symbol definitions as well as global symbol references get default (i.e., preemptable) visibility unless explicitly specified otherwise.

-fno-common

Normally, a C/C++ file-scope declaration with no initializer and without the extern or static keyword (for example, int i;) is represented as a co
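Visibility can also be set per symbol in the source, which is the case the text above refers to when it mentions declarations that carry a visibility attribute. The sketch below uses the GNU-style attribute syntax accepted by gcc-compatible compilers on ELF targets; the function names are hypothetical, and only the "default" and "hidden" values are shown. An attribute in the source takes precedence over the command-line visibility setting.

```cpp
// Hypothetical functions (not from the guide) showing per-symbol visibility
// with the GNU attribute syntax. Requires a gcc-compatible compiler.

// Default visibility: the symbol is exported from a shared object and can
// be preempted by a definition in another component.
__attribute__((visibility("default"))) int exported_counter()
{
    return 1;
}

// Hidden visibility: the symbol is not exported from the shared object,
// so references to it inside the component can be bound directly.
__attribute__((visibility("hidden"))) int internal_helper()
{
    return 2;
}
```

Within a single program both functions are called the same way; the difference only shows up in the symbol table of a shared object built from this code.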
310. lized

schedule    Specifies how iterations of the for loop are divided among the threads of the team.
copyin      Provides a mechanism to assign the same name to threadprivate variables for each thread in the team executing the parallel region.

OpenMP Support Libraries

The Intel C++ Compiler with OpenMP support provides a production support library, libguide.a. This library enables you to run an application under different execution modes. It is used for normal or performance-critical runs on applications that have already been tuned.

Note: The libguide.lib library is linked dynamically, regardless of command-line options, to avoid performance issues that are hard to debug.

Execution modes

The Intel compiler with OpenMP enables you to run an application under different execution modes that can be specified at run time. The libraries support the serial, turnaround, and throughput modes. These modes are selected by using the KMP_LIBRARY environment variable at run time.

Serial

The serial mode forces parallel applications to run on a single processor.

Turnaround

In a dedicated (batch or single user) parallel environment where all processors are exclusively allocated to the program for its entire run, it is most important to effectively utilize all of the processors all of the time. The turnaround mode is designed to keep active all of the processors involved in the parallel
311. lly More than one basic block was generated for the code at this position Some covered code of the blocks were covered while some were not The default color can be overridden with the pcolor option Unknown No code was generated for this source line Most probably the source at this position is a comment a header file inclusion or a variable declaration The default color can be overridden with the ucolor option The default colors can be customized to be any valid HTML color by using the options mentioned for each coverage category in the preceding table For code coverage colored presentation the coverage tool uses the following heuristic Source characters are scanned until reaching a position in the source that is indicated by the profile information as the beginning of a basic block If the profile information for that basic block indicates that a coverage category changes then the tool changes the color corresponding to the coverage condition of that portion of the code and the coverage tool inserts the appropriate color change in the HTML files 5 Note You need to interpret the colors in the context of the code For instance comment lines that follow a basic block that was never executed would be colored in the same color as the uncovered blocks Another example is the closing brackets in C C applications 141 Intel C Compiler for Linux Systems User s Guide Coverage Analysis of a Modules Subset One of th
312. lume I Building Applications

Note: When the compiler needs to choose between optimization and quality of debug information, optimization is given priority.

Creating and Using Libraries

The Intel C++ Compiler uses the GNU C Library, Dinkumware C++ Library, and the Standard C++ Library. These libraries are documented at the following Internet locations:

  • GNU C Library: http://www.gnu.org/software/libc/manual/
  • Dinkumware C++ Library: http://www.dinkumware.com/htm_cpl/lib_cpp.html
  • Standard C++ Library: http://gcc.gnu.org/onlinedocs/libstdc++/documentation.html

Creating Libraries

Libraries are simply an indexed collection of object files that are included as needed in a linked program. Combining object files into a library makes it easy to distribute your code without disclosing the source. It also reduces the number of command-line entries needed to compile your project.

Static Libraries

Executables generated using static libraries are no different than executables generated from individual source or object files. Static libraries are not required at runtime, so you do not need to include them when you distribute your executable. At compile time, linking to a static library is generally faster than linking to individual source files.

To build a static library:

1. Use the -c option to generate object files from the source files:

    prompt> icpc -c my_source1.cpp my_source2.cpp my_source3.cpp

2. Use the GNU ar tool to
313. m sequentially. These functions are also generally not recognized by other vendors' OpenMP-compliant compilers, which may cause the link stage to fail for these other compilers.

Note: The following functions require the pre-processor directive #include <omp.h>.

Stack Size

In most cases, directives can be used in place of extensions. For example, the stack size of the parallel threads may be set using the KMP_STACKSIZE environment variable rather than the kmp_set_stacksize_s() function.

Note: A run-time call to an Intel extension takes precedence over the corresponding environment variable setting.

See the definitions of stack size functions in the Stack Size table.

Memory Allocation

The Intel C++ Compiler implements a group of memory allocation functions as extensions to the OpenMP run-time library to enable threads to allocate memory from a heap local to each thread. These functions are kmp_malloc(), kmp_calloc(), and kmp_realloc(). The memory allocated by these functions must also be freed by the kmp_free() function. While it is legal for the memory to be allocated by one thread and freed with kmp_free() by a different thread, this mode of operation has a slight performance penalty.

See the definitions of these functions in the Memory Allocation table.

Stack Size

Function               Description
kmp_get_stacksize_s    Returns the number of bytes that will be allocated for
314. m_cmpge_ x pd ps ss _mm_andnot_l y cmplt _mm_cmplt_ x pd ps ss cmple _mm_cmple_ x pd ps ss _mm_andnot_l y cmpngt _mm_cmpngt_ x pd ps ss cmpnge _mm_cmpnge_ x pd ps ss cmnpnlt _mm_cmpnit_ x pd ps ss cmpnle _mm_cmpnie_ x pd ps ss Conditional Select Operators Corresponding Intrinsics and Classes Part 1 Operators Corresponding I32vec4 l16vec8 I8veci6 I32vec2 116vec4 I8vec8 Intrinsic select_eq mm_cmpeq epi32 epil6 epig pi32 pil6 pis _mm_and_l y sil28 sil28 sil28 si64 si64 si64 _mm_andnot_ sil28 sil28 sil28 si64 si64 si64 _mm_or_l y sil28 sil28 sil28 si64 si64 si64 select_neq _mm_cmpeq epi32 epil epi8 pi32 pil pis _mm_and_l y sil28 sil28 sil28 si64 si64 si64 _mm_andnot_ sil28 sil28 sil28 si64 si64 si64 _mm_or_l y sil28 sil28 sil28 si6 4 si6 4 si6 4 select_gt mm_cmpgt epi32 epil epi8 pi32 pile pis _mm_and_l y sil28 sil28 sil28 si64 si64 si64 _mm_andnot_ sil28 sil28 sil28 si64 si64 si64 _mm_or_l y sil28 sil28 sil28 si64 si64 si64 select_ge mm_cmpge epi32 epil6 epi8 pi32 pile pis _mm_and_l y sil28 sil28 sil28 si6 4 si6 4 si64 _mm_andnot_ sil28 sil28 sil28 si64 si64 si64 _mm_or_ y sil28 sil28 sil28 si64 si64 si64 select_lt epi32 epil epis8 pi32 pile pis sil28 sil28 sil28 si64 si64 si64 1128 sil28 sil28 si64 si64 si64 1128 sil28 sil28 si64 si64 si64 select_l
315. m_cmpngt_pd _mm_cmpngt_ss

select_ng (4 floats: F32vec4; 2 doubles: F64vec2; 1 float: F32vec1). Corresponding intrinsics: _mm_cmpnge_ps, _mm_cmpnge_pd, _mm_cmpnge_ss

Cacheability Support Operations

Stores (non-temporal) the two double-precision, floating-point values of A. Requires a 16-byte aligned address.

    void store_nta(double *p, F64vec2 A);

Corresponding intrinsic: _mm_stream_pd

Stores (non-temporal) the four single-precision, floating-point values of A. Requires a 16-byte aligned address.

    void store_nta(float *p, F32vec4 A);

Corresponding intrinsic: _mm_stream_ps

Debugging

The debug operations do not map to any compiler intrinsics for MMX(TM) technology or Streaming SIMD Extensions. They are provided for debugging programs only. Use of these operations may result in loss of performance, so you should not use them outside of debugging.

Output Operations

The two double-precision, floating-point values of A are placed in the output buffer and printed in decimal format as follows:

    cout << F64vec2 A;
    "[1]:A1 [0]:A0"

Corresponding intrinsics: none

The four single-precision, floating-point values of A are placed in the output buffer and printed in decimal format as follows:

    cout << F32vec4 A;
    "[3]:A3 [2]:A2 [1]:A1 [0]:A0"

Corresponding intrinsi
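The non-temporal store can also be reached directly through the intrinsic that store_nta maps to. The following is a minimal sketch under stated assumptions: the helper name store_four_nta is hypothetical, the target is x86 or x86-64 with SSE, and the destination must be 16-byte aligned as the text requires. The trailing _mm_sfence call is an addition of this sketch to order the streaming store before later stores.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 and x86-64 targets only

// Hypothetical helper (not from the guide): writes four floats to dst16
// with a non-temporal (streaming) store. dst16 must be 16-byte aligned.
void store_four_nta(float* dst16, float v0, float v1, float v2, float v3)
{
    // _mm_set_ps takes its arguments from the highest element to the
    // lowest, so v0 ends up in element 0 and is stored at dst16[0].
    __m128 v = _mm_set_ps(v3, v2, v1, v0);
    _mm_stream_ps(dst16, v);  // store that bypasses the cache hierarchy
    _mm_sfence();             // make the streaming store globally visible in order
}
```

Streaming stores are most useful for large output buffers that will not be read again soon; for a four-element buffer like the one a quick test would use, there is no performance benefit.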
316. me checks described here.

Check for Supported Processor with -xN, -xB, or -xP

To prevent execution errors, the compiler inserts code in the program to check for proper processor usage. Programs compiled with options -xN, -xB, or -xP will check at run time whether they are being executed on the Intel Pentium 4 processor, Intel Pentium M processor, or the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3), respectively, or a compatible Intel processor. If the program is not executed on one of these processors, the program terminates with an error.

Example

To optimize the program prog.cpp for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3), issue the following command:

    prompt> icpc -xP prog.cpp

The resulting executable aborts if it is executed on a processor that is not compatible with the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3), such as the Intel Pentium III or an Intel Pentium 4 processor without SSE3.

If you intend to run your programs on multiple IA-32 processors, do not use the -x options that optimize for processor-specific features; consider using -ax to attain processor-specific performance and portability among different processors.

Setting FTZ and DAZ Flags

Previously, the values of the flags flush-to-zero (FTZ) and denormals-as-zero (DAZ) for IA-32 processors were off by default. However, even at the cost of
317. ment, and the index of the first zero element is returned. The element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.

__int64 _m64_czx1r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to the most significant element, and the index of the first zero element is returned. The element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.

__int64 _m64_czx2l(__m64 a)
The 64-bit value a is scanned for a zero element from the most significant element to the least significant element, and the index of the first zero element is returned. The element width is 16 bits, so the range of the result is from 0 - 3. If no zero element is found, the default result is 4.

__int64 _m64_czx2r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to the most significant element, and the index of the first zero element is returned. The element width is 16 bits, so the range of the result is from 0 - 3. If no zero element is found, the default result is 4.

__m64 _m64_mix1l(__m64 a, __m64 b)
Interleave 64-bit quantities a and b in 1-byte groups, starting from the left, as shown in Figure 1, and return the result.

__m64 _m64_mix1r(__m64 a, __m64 b)
Interleave 64-bit quantities a and b in 1-byte grou
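Because the czx intrinsics exist only on Itanium-based systems, a portable emulation can make their described behavior easier to test elsewhere. The sketch below covers only the 8-bit element forms and is not the intrinsic itself: the function names are hypothetical, and it assumes that the returned index counts elements from the end where the scan starts (index 0 is the most significant byte for the left form and the least significant byte for the right form).

```cpp
#include <cstdint>

// Hypothetical emulation (not the Itanium intrinsic): scan the eight bytes
// of a from most significant to least significant and return the index of
// the first zero byte, counting from the most significant byte.
// Returns 8 when no byte is zero.
int zero_byte_index_from_left(std::uint64_t a)
{
    for (int i = 0; i < 8; i++) {
        unsigned byte = static_cast<unsigned>((a >> (56 - 8 * i)) & 0xffu);
        if (byte == 0) {
            return i;
        }
    }
    return 8;
}

// Hypothetical emulation: same scan from least significant to most
// significant, counting from the least significant byte.
int zero_byte_index_from_right(std::uint64_t a)
{
    for (int i = 0; i < 8; i++) {
        unsigned byte = static_cast<unsigned>((a >> (8 * i)) & 0xffu);
        if (byte == 0) {
            return i;
        }
    }
    return 8;
}
```

A typical use of this kind of operation is locating a string terminator inside an 8-byte word loaded from memory.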
318. ment specifies the hint type Generate the Lfetch fault 1lfhint instruction The value of the first argument specifies the hint type Generate the Lfetch excl 1lfhint instruction The value 0 1 2 3 of the first argument specifies the hint type Generate the lfetch fault excl 1lfhint instruction The value of the first argument specifies the hint type _ CacheSize n returns the size in bytes of the cache at level n 1 represents the first level cache 0 is returned for a non existent cache level For example an application may query the cache size and use it to select block sizes in algorithms that operate on matrices Creates a barrier across which the compiler will not schedule any data access instruction The compiler may allocate local data in registers across a memory barrier but not global data Sets the system mask Maps to the ssm imm24 instruction Resets the system mask bits of PSR Maps to the rsm imm24 instruction Conversion Intrinsics Reference The prototypes for these intrinsics are in the ia64intrin h header file Intrinsic int64 _m_to_int64 __m64 a m64 _m_from_int64 __int64 a int64 _ round_double_to_int64 double d unsigned _ int64 getf_exp double d Register Names for getReg and setReg Description Convert a of type ___ m64 to type __int64 Translates to nop since both types reside in the same register on Itanium based systems Convert a of type ___
319. mmon symbol. Such a symbol is treated as an external reference, except that if no other compilation unit has a global definition for the name, the linker allocates memory for it. The -fno-common option causes the compiler to treat what otherwise would be common symbols as global definitions, and to allocate memory for the symbol at compile time. This may permit the compiler to use the more efficient GP-relative addressing mode when accessing the symbol.

gcc Compatibility

C language object files created with the Intel C++ Compiler are binary compatible with the GNU gcc compiler and glibc, the GNU C language library. You can use the Intel compiler or the gcc compiler to pass object files to the linker. However, to correctly pass the Intel libraries to the linker, use the Intel compiler. See Linking and Default Libraries for more information.

The Intel C++ Compiler provides many of the language extensions provided by the GNU C compiler, gcc, and the GNU C++ compiler, g++.

gcc Extensions to the C Language

GNU C includes several non-standard features not found in ISO standard C. This version of the Intel C++ Compiler supports most of these extensions, listed in the following table. See http://www.gnu.org for more information.

gcc Language Extension                          Intel Support   GNU Description and Examples
Statements and Declarations in Expressions      Yes             http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Statement-Exprs.html#St
320. move (undefine) a pre-defined macro.
Syntax: -Uname
Argument: name. Description: The name of the macro to undefine.
Note: If you use -D and -U in the same compilation, the compiler processes the -D option before -U, rather than processing them in the order they appear on the command line.
Predefined Macros
The Intel C++ Compiler supports the predefined macros listed in the following table. The compiler also includes predefined macros specified by the ISO/ANSI standard. See Conformance to the C Standard.
Macro Name                     Value                                                Architecture
__BASE_FILE__                  Name of source file                                  Both
__cplusplus                    1                                                    Both
__EDG__                        1                                                    Both
__EDG_VERSION__                303                                                  Both
__ELF__                        1                                                    Both
__EXCEPTIONS                   Defined when -fno-exceptions is not used             IA-32 only
__GNUC__                       2 if gcc version is less than 3.2;                   Both
                               3 if gcc version is 3.2, 3.3, or 3.4
__gnu_linux__                  1                                                    Both
__GNUC_MINOR__                 95 if gcc version is less than 3.2;                  Both
                               2 if gcc version is 3.2; 3 if gcc version is 3.3;
                               4 if gcc version is 3.4
__GNUC_PATCHLEVEL__            0                                                    Both
__GXX_ABI_VERSION              102                                                  Both
__i386                         1                                                    IA-32 only
__i386__                       1                                                    IA-32 only
i386                           1                                                    IA-32 only
__ia64                         1                                                    Itanium architecture only
__ia64__                       1                                                    Itanium architecture only
ia64                           1                                                    Itanium architecture only
__INTEL_COMPILER               810                                                  Both
__INTEL_COMPILER_BUILD_DATE    YYYYMMDD                                             Both
__INTEL_CXXLIB_ICC             1 when -cxxlib-icc option is specified during c      Both
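As a rough illustration of how the predefined macros listed above might be used, the following sketch selects a code path at compile time. The helper names `compiler_name` and `intel_version` are hypothetical and exist only for this example; only the macro names `__INTEL_COMPILER` and `__GNUC__` are taken from the table.

```c
#include <stdio.h>

/* Hypothetical helper: reports which compiler family built this file,
   based on predefined macros. */
static const char *compiler_name(void)
{
#if defined(__INTEL_COMPILER)
    return "Intel";            /* __INTEL_COMPILER carries the version, e.g. 810 */
#elif defined(__GNUC__)
    return "GNU-compatible";   /* __GNUC__ carries the major gcc version */
#else
    return "unknown";
#endif
}

/* Hypothetical helper: returns the value of __INTEL_COMPILER,
   or 0 when the macro is not defined. */
static int intel_version(void)
{
#if defined(__INTEL_COMPILER)
    return __INTEL_COMPILER;
#else
    return 0;
#endif
}

/* Prints a one-line summary of the detected compiler. */
static void print_compiler_info(void)
{
    printf("compiler: %s, Intel version macro: %d\n",
           compiler_name(), intel_version());
}
```

Because the Intel compiler also defines `__GNUC__` for gcc compatibility, the check for `__INTEL_COMPILER` is placed first.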
321. mple Difference Operator
This example shows a simple parallel loop where the amount of work in each iteration is different. Dynamic scheduling is used to get good load balancing. The for has a nowait because there is an implicit barrier at the end of the parallel region.

    void for_1(float a[], float b[], int n)
    {
        int i, j;
        #pragma omp parallel shared(a, b, n) private(i, j)
        {
            #pragma omp for schedule(dynamic, 1) nowait
            for (i = 1; i < n; i++)
                for (j = 0; j < i; j++)
                    b[j + n*i] = (a[j + n*i] + a[j + n*(i-1)]) / 2.0;
        }
    }

Two Difference Operators
The following example uses two parallel loops fused to reduce fork/join overhead. The first for has a nowait because all the data used in the second loop is different than all the data used in the first loop.

    void for_2(float a[], float b[], float c[], float d[], int n, int m)
    {
        int i, j;
        #pragma omp parallel shared(a, b, c, d, n, m) private(i, j)
        {
            #pragma omp for schedule(dynamic, 1) nowait
            for (i = 1; i < n; i++)
                for (j = 0; j < i; j++)
                    b[j + n*i] = (a[j + n*i] + a[j + n*(i-1)]) / 2.0;
            #pragma omp for schedule(dynamic, 1) nowait
            for (i = 1; i < m; i++)
                for (j = 0; j < i; j++)
                    d[j + m*i] = (c[j + m*i] + c[j + m*(i-1)]) / 2.0;
        }
    }

Optimization Support Features
This section describes language extensions to the Intel C++ Compiler that let you optimize your source code directly
322. mplex z); float _Complex clog2f(float _Complex z);
CLOG10
Description: The clog10 function returns the complex logarithm base 10 of z.
Calling interface: double _Complex clog10(double _Complex z); long double _Complex clog10l(long double _Complex z); float _Complex clog10f(float _Complex z);
CONJ
Description: The conj function returns the complex conjugate of z, by reversing the sign of its imaginary part.
Calling interface: double _Complex conj(double _Complex z); long double _Complex conjl(long double _Complex z); float _Complex conjf(float _Complex z);
CPOW
Description: The cpow function returns the complex power function, x raised to the power y.
Calling interface: double _Complex cpow(double _Complex x, double _Complex y); long double _Complex cpowl(long double _Complex x, long double _Complex y); float _Complex cpowf(float _Complex x, float _Complex y);
CPROJ
Description: The cproj function returns a projection of z onto the Riemann sphere.
Calling interface: double _Complex cproj(double _Complex z); long double _Complex cprojl(long double _Complex z); float _Complex cprojf(float _Complex z);
CREAL
Description: The creal function returns the real part value of z.
Calling interface: double creal(double _Complex z); long double creall(long double _Complex z); float crealf(float _Complex z);
CSIN
Description: The csin function
323. mplicitly i e by use only emit code for explicit instantiations For C only fno implicit templates f no rtti Enable disable RTTI support frtti i32 and i64 fnsplit Enables disables function OFF splitting Default is ON with prof_use To disable function splitting when you use prof_use also specify fnsplit fp Disable using the EBP register OFF i32 132em as general purpose register fpic fPIC For IA 32 this option generates OFF position independent code For Itanium based systems this option generates code allowing full symbol preemption fp_port Round fp results at assignments OFF i32 only and casts Some speed impact fpstkchk Generates extra code after every OFF i32 only function call to assure the FP stack is in the expected state r32 Use only lower 32 floating point OFF i64 only registers fshort enums Allocate as many bytes as OFF needed for enumerated types fsource asm Produce assemblable file with OFF optional code annotations Requires S fsyntax only Same as syntax OFF 14 Compiler Options Quick Reference Option Description Default ftls model model Change thread local storage OFF model where model canbe the following global dynamic local dynamic initial exec local exec ftz Flushes denormal results to zero OFF i32em 164 The option is turned ON with 03 funsigned bitfie
324. ms so you know each system's effect on your timings. For programs that run for less than a few seconds, run several timings to ensure that the results are not misleading. Certain overhead functions, like loading external programs, might influence short timings considerably. If your program displays a lot of text, consider redirecting the output from the program. Redirecting output from the program will change the times reported because of reduced screen I/O. The following program illustrates a model for program timing.

Sample Timing

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        clock_t start, finish;
        long loop;
        double duration, loop_calc;

        start = clock();
        for (loop = 0; loop < 2000; loop++) {
            loop_calc = 123.456 * 789;
            /* printf() included to facilitate example */
            printf("\nThe value of loop is: %ld", loop);
        }
        finish = clock();
        /* clock() counts processor time in units of 1/CLOCKS_PER_SEC seconds */
        duration = (double)(finish - start) / CLOCKS_PER_SEC;
        printf("\n%2.3f seconds\n", duration);
        return 0;
    }

Reference
Compiler Limits
The following table shows the size or number of each item that the compiler can process. All capacities shown in the table are tested values; the actual number can be greater than the number shown.
Item
Control structure nesting (block nesting)
Conditional compilation nesting
Declarator modifiers
Parenthesis nesting levels
Significant characters, internal identifier
External identifier nam
325. must be provided on the same line of the dpi_list file, after the test name, in dd:hh:mm:ss format.
-verbose: Generates more logging information about the program progress.
Usage Requirements
To run the Test-prioritization Tool on an application's tests, the following files are required:
• The .spi file generated by the Intel compilers when compiling the application for the instrumented binaries with the -prof_genx option.
• The .dpi files generated by the Intel compiler profmerge tool as a result of merging the dynamic profile information (.dyn) files of each of the application tests. The user needs to apply the profmerge tool to all .dyn files that are generated for each individual test and name the resulting .dpi in a fashion that uniquely identifies the test. The profmerge tool merges all the .dyn files that exist in the given directory.
Note: It is very important that you make sure that unrelated .dyn files, oftentimes from previous runs or from other tests, are not present in that directory. Otherwise, profile information will be based on invalid profile data. This can negatively impact the performance of optimized code as well as generate misleading coverage information.
Note: For successful tool execution, you should:
• Name each test .dpi file so that the file names uniquely identify each test.
• Create a DPI list file: a text file that contains the names of all .dpi test files. The name of this file serves as
326. n Itanium-based systems:
• -ftz
• -IPF_fma
• -IPF_fp_speculationmode
• -IPF_flt_eval_method0
• -IPF_fltacc (Default: -IPF_fltacc-)
• -IPF_fp_relaxed
Flush Denormal Results to Zero
Use the -ftz option to flush denormal results to zero.
Contraction of FP Multiply and Add/Subtract Operations
-IPF_fma[-] enables or disables the contraction of floating-point multiply and add/subtract operations into a single operation. Unless -mp is specified, the compiler contracts these operations whenever possible. The -mp option disables the contractions. Use -IPF_fma and -IPF_fma- to override the default compiler behavior. For example, a combination of -mp and -IPF_fma enables the compiler to contract operations on Itanium-based systems only:
prompt> icpc -mp -IPF_fma prog.cpp
FP Speculation
-IPF_fp_speculationmode sets the compiler to speculate on floating-point operations in one of the following modes:
• fast: sets the compiler to speculate on floating-point operations
• safe: enables the compiler to speculate on floating-point operations only when it is safe
• strict: disables the speculation of floating-point operations
• off: disables the speculation on floating-point operations
Note: -IPF_fp_speculationsafe is the default when -O0 is specified.
FP Operations Evaluation
-IPF_flt_eval_method0 directs the compiler to evaluate the expressions involving floating-point operands in
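To make the effect of flushing denormals more concrete, here is a small sketch of what a denormal result looks like in portable C. It does not call any Intel-specific API; the helper names `is_denormal` and `half_of_smallest_normal` are hypothetical. Under default IEEE behavior the division below typically produces a denormal; when flush-to-zero is in effect, one would expect 0.0 instead.

```c
#include <float.h>

/* Hypothetical helper: returns 1 when x is nonzero but smaller in magnitude
   than DBL_MIN, the smallest normalized double, i.e. x is a denormal. */
static int is_denormal(double x)
{
    double mag = (x < 0.0) ? -x : x;
    return mag > 0.0 && mag < DBL_MIN;
}

/* Hypothetical helper: halves the smallest normalized double at run time.
   The result is a denormal unless results are being flushed to zero. */
static double half_of_smallest_normal(void)
{
    volatile double tiny = DBL_MIN;  /* volatile keeps the divide at run time */
    return tiny / 2.0;
}
```

A program that depends on gradual underflow (for example, one that tests `x != 0.0` after such a division) may behave differently once denormal results are flushed, which is worth checking before enabling the option.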
327. n No _m_psubusw _mm_subs_pul6 PSUBUSW Subtraction No _m_pmaddwd _mm_madd_pil6 PMADDWD Multiplication PMULHW Multiplication Yes PMULLW Multiplication Packed Arithmetic Intrinsics Part 2 Intrinsic Alternate Name Corresponding Argument Result Name Instruction Values Bits Values Bits _m_paddb _mm_add_pi8 PADDB 8 8 8 8 _m_paddw _mm_add_pil6 PADDW 4 16 4 16 _m_paddd _mm_add_pi32 PADDD 2 32 2 32 _m_paddsb _mm_adds_pi8 PADDSB 8 8 8 8 _m_paddsw _mm_adds_pil6 PADDSW 4 16 4 16 258 Intrinsic Name _m_paddusb m_paddusw m_psubb m_psubw _m_psubd m_psubsb m_psubsw _m_psubusb m_psubusw _m_pmaddwd Alternate Name _mm_adds_pu8 _mm_adds_pul6 _mm_sub_pi8 _mm_sub_pil _mm_sub_pi32 _mm_subs_pi8 _mm_subs_pil6 _mm_subs_pu8 _mm_subs_pul6 _mm_madd_pil _mm_mullo_pil6 Corresponding Instruction PADDUSB PADDUSW PSUBB PSUBW PSUBD PSUBSB PSUBSW PSUBUSB PSUBUSW PMADDWD PMULHW PMULL m64 _m_paddb __m64 ml __m64 m2 Argument Values Bits 8 8 4 16 8 8 4 16 2 32 8 8 4 16 8 8 4 16 4 16 4 16 4 16 Add the eight 8 bit values in m1 to the eight 8 bit values in m2 __m64 _m_paddw __m64 ml __m64 m2 Add the four 16 bit values in m1 to the four 16 bit values in m2 m64 _m_paddd __m64 ml __m64 m2 Add the two 32 bit values in m1 to the two 32 bit values in m2 __m64 _m_paddsb __m64 ml __m64 m2 Reference Result Values Bits
328. n and options used in the assembly listing for xild. If the xild invocation leads to an IPO multi-object compilation, either because the application is big or because the user explicitly asked for multiple objects, the first .s file takes its name from the -qipo_fa option. The compiler derives the names of subsequent .s files by appending a number to the name, for example, foo.s and foo1.s for -qipo_fafoo.s. The same is true for the -qipo_fo option.
Code Layout and Multi-Object IPO
One of the optimizations performed during an IPO compilation is code layout. IPO analysis determines a layout order for all of the routines for which it has IR. If a single object is being generated, the compiler generates the layout simply by compiling the routines in the desired order. For a multi-object IPO compilation, the compiler must tell the linker about the desired order. The compiler first puts each routine in a named text section: the first routine in .text00001, the second in .text00002, and so forth. It then generates a linker script that tells the linker to first link contributions from .text00001, then .text00002. This happens transparently when the same invocation is used for both the link-time compilation and the final link. However, the linker script must be taken into account by the user if -ipo_c or -ipo_S is used. With these switches, the IPO compilation and actual link are done by different invocations. When this occurs, the compiler will
329. n of the associated structured block to a single thread at a time Synchronizes all the threads in a team Ensures that a specific memory location is updated atomically Specifies a cross thread sequence point at which the implementation is required to ensure that all the threads in a team have a consistent view of certain objects in memory The structured block following an ordered directive is executed in the order in which iterations would be executed in a sequential loop Makes the named file scope or namespace scope variables specified private to a thread but file scope visible within the thread Volume II Optimizing Applications OpenMP Clauses Clause Description private Declares variables to be private to each thread in a team firstprivate Provides a superset of the functionality provided by the private clause lastprivate Provides a superset of the functionality provided by the private clause shared Shares variables among all the threads in a team default Enables you to affect the data scope attributes of variables reduction Performs a reduction on scalar variables ordered The structured block following an ordered directive is executed in the order in which iterations would be executed in a sequential loop if Ifthe if scalar_logical_expression clause is present the enclosed code block is executed in parallel only if the scalar_logical_expression evaluates to TRUE Otherwise the code block is seria
330. n the most frequently executed call sites, based on the profile information gathered for the program.
• By default, the compiler will not inline functions with more than 230 intermediate statements. You can change this value by specifying the option -Qoption,c,-ip_ninl_max_stats=new_value. Note: there is a higher limit for functions declared by the user as inline or __inline.
• The default inline heuristic will stop inlining when direct recursion is detected.
• The default heuristic will always inline very small functions that meet the minimum inline criteria.
  • Default for Itanium-based applications: ip_ninl_min_stats=15.
  • Default for IA-32 applications: ip_ninl_min_stats=7. This limit can be modified with the option -Qoption,c,-ip_ninl_min_stats=new_value.
If you do not use profile-guided optimizations with -ip or -ipo[value], the compiler uses less aggressive inlining heuristics:
• Inline a function if the inline expansion will not increase the size of the final program.
• Inline a function if it is declared with the inline or __inline keywords.
Profile-guided Optimizations
Profile-guided optimizations (PGO) tell the compiler which areas of an application are most frequently executed. By knowing these areas, the compiler is able to use feedback from a previous compilation to be more selective in optimizing the application. For example, the use of PGO often enables the compiler to make better de
331. n with Eclipse CDT includes online help. From the Eclipse toolbar, select Help > Help Contents. The Help Contents option lets you narrow your search for help information by presenting all the help modules registered with Eclipse. Select Intel(R) C++ Compiler for Linux to open the Compiler User's Guide (this document). The Help Contents may also include links to the Eclipse Workbench User Guide, the C/C++ Development User Guide, and other similar documents.
Selecting a Different Browser
If you want to select a different browser to view the Help Contents, open Window > Preferences > Help from the Eclipse toolbar. Check Custom Browser (user defined program), then complete the necessary information in the Custom Browser command text box. Click OK to complete your browser selection.
332. nary increases because it contains processor-specific versions of some of the code, as well as a generic version of the code.
• Performance is affected slightly by the run-time checks to determine which code to use.
Note: Applications that you compile with this option will execute on any IA-32 processor. If you specify both the -x and -ax options, the -x option forces the generic code to execute only on processors compatible with the processor type specified by the -x option.
Option   Optimizes Your Code for...
-axK     Intel Pentium III and compatible Intel processors.
-axW     Intel Pentium 4 and compatible Intel processors.
-axN     Intel Pentium 4 and compatible Intel processors. This option also enables new optimizations in addition to Intel processor-specific optimizations.
-axB     Intel Pentium M and compatible Intel processors. This option also enables new optimizations in addition to Intel processor-specific optimizations.
-axP     Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). This option also enables new optimizations in addition to Intel processor-specific optimizations.
Example
The following compilation will generate a single executable that includes:
• a generic version for use on any IA-32 processor
• a version optimized for Intel Pentium III processors, as long as there is a likely performance benefit
• a version optimized for Intel Pent
333. ndependence, the compiler will eventually resort to a powerful hierarchical dependence solver that uses Fourier-Motzkin elimination to solve the data dependence problem in all dimensions.
Loop Constructs
Loops can be formed with the usual for and while constructs. However, the loops must have a single entry and a single exit to be vectorized.

Correct Usage

    while (i < n) {
        /* If branch is inside body of loop */
        a[i] = b[i] * c[i];
        if (a[i] < 0.0) {
            a[i] = 0.0;
        }
        i++;
    }

Incorrect Usage

    while (i < n) {
        if (condition) break;   /* 2nd exit */
        ++i;
    }

Loop Exit Conditions
Loop exit conditions determine the number of iterations that a loop executes. For example, fixed indexes for loops determine the iterations. The loop iterations must be countable; that is, the number of iterations must be expressed as one of the following:
• a constant
• a loop invariant term
• a linear function of outermost loop indices
Loops whose exit depends on computation are not countable. The following examples illustrate countable and non-countable loop constructs.

Correct Usage for Countable Loop

    /* Exit condition specified by "N-lb+1" */
    count = N;
    while (count != lb) {
        /* lb is not affected within loop */
        a[i] = b[i] * x;
        b[i] = c[i] + sqrt(d[i]);
        --count;
    }

Correct Usage for Countable Loop

    /* Exit condition is "(n-m+2)/2" */
    i = 0;
    for (l = m; l < n; l += 2) {
        a[i] = b[i] * x;
        b[i] = c[i] + sqrt(d
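The distinction between countable and non-countable loops can be sketched with two small, self-contained functions. This is an illustration only: the function names `scale_countable` and `scale_while_positive` are hypothetical, and whether a given compiler actually vectorizes the first loop depends on options and target.

```c
#include <stddef.h>

/* Countable: the trip count n is known before the loop is entered, and the
   loop has a single entry and a single exit. */
static void scale_countable(float *a, const float *b, float x, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * x;
}

/* Not countable: the loop also stops at the first non-positive element of b,
   so the number of iterations depends on data examined inside the loop.
   Returns the number of elements written. */
static size_t scale_while_positive(float *a, const float *b, float x, size_t n)
{
    size_t i = 0;
    while (i < n && b[i] > 0.0f) {
        a[i] = b[i] * x;
        ++i;
    }
    return i;
}
```

For `b = {1, 2, -1, 4}` and `x = 2`, the first function writes all four elements, while the second stops after two.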
334. ndex, __builtin_bcmp, __builtin_bzero, __builtin_sinl, __builtin_cos, __builtin_sqrtl, __builtin_fabsl, __builtin_frame_address, __builtin_return_address
IA-32 only IA-32 only
For more information on gcc built-in functions, see http://gcc.gnu.org/onlinedocs/gcc-3.4.1/gcc/Other-Builtins.html#Other%20Builtins
gcc Function Attributes
This version of the Intel C++ Compiler supports the following gcc function attributes:
• noinline: prevents a function from being inlined
• always_inline: inlines the function even if no optimization is specified
• used: code must be emitted for the function even if the function is not referenced
Example

    int round_sqrt(int) __attribute__((always_inline));

In this example, the function round_sqrt is inlined even if no optimization is specified.
Thread-local Storage
The Intel C++ Compiler supports the storage class keyword __thread, which can be used in variable definitions and declarations. Variables defined and declared this way are automatically allocated locally to each thread:

    __thread int i;
    __thread struct state s;
    extern __thread char *p;

Note: The __thread keyword is only recognized when the GNU compatibility version is 3.3 or higher. You may need to specify the -gcc-version=330 compiler option to enable thread-local storage.
See also http://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html
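The three function attributes and the __thread keyword described above can be combined in a short sketch. All function and variable names here (`square`, `add_one`, `unreferenced`, `call_count`, `counted_square`) are hypothetical, and the example assumes a gcc-compatible compiler; it runs on a single thread, so it shows the syntax rather than per-thread behavior.

```c
/* always_inline: ask for inlining even when optimization is off.
   gcc-compatible compilers expect the function to be declared inline too. */
static inline int square(int x) __attribute__((always_inline));
static inline int square(int x) { return x * x; }

/* noinline: keep this function as a real call. */
static int add_one(int x) __attribute__((noinline));
static int add_one(int x) { return x + 1; }

/* used: emit code for the function even if nothing references it. */
static int unreferenced(void) __attribute__((used));
static int unreferenced(void) { return 42; }

/* __thread: each thread gets its own copy of this counter. */
static __thread int call_count = 0;

/* Increments the calling thread's counter and returns x squared. */
static int counted_square(int x)
{
    ++call_count;
    return square(x);
}
```

In a multithreaded program, each thread calling `counted_square` would see and update only its own `call_count`.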
335. nedness I slu 32vec2 two 32 bit values of any signedness I s u 1l6vec4 four 16 bit values of any signedness I s u 8vec8 eight 8 bit values of any signedness Rules for Operators To use operators with the Ivec classes you must use one of the following three syntax conventions Ivec_Class R Ivec_Class A operator Ivec_Class B Example 1 164vecl R I64vecl A amp I64vecl B Ivec_Class R operator Ivec_Class A Ivec_Class B Example 2 164vecl R andnot I164vecl A I64vecl B Ivec_Class R operator Ivec_Class A Example 3 164vecl R amp I64vecl A operator Jan operator for example amp or Ivec_Class an Ivec class R A B variables declared using the pertinent Ivec classes The table that follows shows automatic and explicit sign and size typecasting Explicit means that it is illegal to mix different types without an explicit typecasting Automatic means that you can mix types freely and the compiler will do the typecasting for you 390 Reference Summary of Rules Major Operators ae Operators Sign Size Other Typecasting Requirements Typecasting Typecasting N A N A c Assignment Logical Automatic Automatic Explicit typecasting is required for to left different types used in non logical expressions on the right side of the Addition and Automatic Explicit Subtraction assignment cl Multiplication Automatic Explicit N A
336. ng. No other effort by the programmer is needed. The following example illustrates how a loop's iteration space can be divided so that it can be executed concurrently on two threads:

Original Serial Code

    for (i = 1; i < 100; i++) {
        a[i] = a[i] + b[i] * c[i];
    }

Transformed Parallel Code

    /* Thread 1 */
    for (i = 1; i < 50; i++) {
        a[i] = a[i] + b[i] * c[i];
    }
    /* Thread 2 */
    for (i = 50; i < 100; i++) {
        a[i] = a[i] + b[i] * c[i];
    }

Programming with Auto-parallelization
The auto-parallelization feature implements some concepts of OpenMP, such as the worksharing construct (with the parallel for directive). This section provides specifics of auto-parallelization.
Guidelines for Effective Auto-parallelization Usage
A loop is parallelizable if:
• The loop is countable at compile time. This means that an expression representing how many times the loop will execute (also called "the loop trip count") can be generated just before entering the loop.
• There are no FLOW (READ after WRITE), OUTPUT (WRITE after WRITE), or ANTI (WRITE after READ) loop-carried data dependences. A loop-carried data dependence occurs when the same memory location is referenced in different iterations of the loop. At the compiler's discretion, a loop may be parallelized if any assumed inhibiting loop-carried dependencies can be resolved by run-time dependency testing.
The compiler may generate a run t
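The iteration-space split described above can be checked with a small sketch that runs the two halves one after the other instead of on real threads. The names `update_range`, `update_serial`, and `update_split` are hypothetical, and the 0-based bounds are a simplification for the example; because no iteration reads a location written by another iteration, both versions should produce identical results.

```c
#define N 100

/* Applies a[i] = a[i] + b[i] * c[i] over the half-open range [lo, hi). */
static void update_range(float *a, const float *b, const float *c,
                         int lo, int hi)
{
    int i;
    for (i = lo; i < hi; i++)
        a[i] = a[i] + b[i] * c[i];
}

/* Serial version: one pass over the whole iteration space. */
static void update_serial(float *a, const float *b, const float *c)
{
    update_range(a, b, c, 0, N);
}

/* Split version: two independent chunks. Each chunk could be handed to its
   own thread; here they simply run in sequence. */
static void update_split(float *a, const float *b, const float *c)
{
    update_range(a, b, c, 0, N / 2);   /* work for "thread 1" */
    update_range(a, b, c, N / 2, N);   /* work for "thread 2" */
}
```

If the loop body instead read `a[i-1]`, the second chunk would depend on the last value written by the first, and the split would no longer be safe without synchronization.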
337. nhibits many valuable compiler optimizations, because symbols with default visibility are not bound to a memory address until runtime. For example, calls to a routine with default visibility cannot be inlined because the routine might be preempted if the compilation unit is linked into a shareable object. A preemptable data symbol cannot be accessed using GP-relative addressing because the name may be bound to a symbol in a different component; the GP-relative address is not known at compile time. Symbol preemption is a very rarely used feature that has drastic negative consequences for compiler optimization. For this reason, by default the compiler treats all global symbol definitions as non-preemptable (i.e., protected visibility). Global references to symbols defined in other compilation units are assumed by default to be preemptable (i.e., default visibility). In those rare cases when you need all global definitions, as well as references, to be preemptable, specify the -fpic option to override this default.
Specifying Symbol Visibility Explicitly
You can explicitly set the visibility of an individual symbol using the visibility attribute on a data or function declaration. For example:

    int i __attribute__ ((visibility("default")));
    void __attribute__ ((visibility("hidden"))) x(void);
    extern void y(void) __attribute__ ((visibility("protected")));

The visibility declaration attribute accepts one of the five keywords: external, default, p
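As a minimal sketch of the visibility attribute on definitions, assuming a gcc-compatible compiler targeting ELF, the following marks one function as hidden and leaves the other symbols with default visibility. The names `exported_counter`, `internal_helper`, and `public_api` are hypothetical.

```c
/* Default visibility: the symbol is exported and may be preempted when the
   code is linked into a shared object. */
int exported_counter __attribute__((visibility("default"))) = 0;

/* Hidden visibility: the symbol can be referenced only from within the
   component that defines it. */
int __attribute__((visibility("hidden"))) internal_helper(int x)
{
    return x * 2;
}

/* Default visibility on a function that uses the hidden helper. */
int __attribute__((visibility("default"))) public_api(int x)
{
    ++exported_counter;
    return internal_helper(x) + 1;
}
```

Within a single executable the attribute does not change behavior; the difference shows up when the code is built as a shared object and other components try to reference or override these symbols.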
338. nt in line 12 was covered. However, only one of the conditions in line 11 was ever true. With the -nopartial option, the tool treats the partially covered code, like the code on line 11, as covered.
Differential Coverage
Using the code-coverage tool, you can compare the profiles of the application's two runs, a reference run and a new run, identifying the code that is covered by the new run but not covered by the reference run. This feature can be used to find the portion of the application's code that is not covered by the application's tests but is executed when the application is run by a customer. It can also be used to find the incremental coverage impact of newly added tests to an application's test space. The dynamic profile information of the reference run for differential coverage is specified by the -ref option, such as in the following command:
codecov -prj Project_Name -dpi customer.dpi -ref appTests.dpi
The coverage statistics of a differential-coverage run show the percentage of the code that was exercised on a new run but was missed in the reference run. In such cases, the coverage tool shows only the modules that included the code that was uncovered. The coloring scheme in the source views also should be interpreted accordingly. The code that has the same coverage property (covered or not covered) on both runs is considered as covered code. Otherwise, if the new run indicates that
339. nt to the Intel C++ Compiler:
• ISO/IEC 9899:1990, Programming Languages: C
• ISO/IEC 14882:1998, Programming Languages: C++
• The Annotated C++ Reference Manual, Special Edition. Ellis, Margaret; Stroustrup, Bjarne. Addison Wesley, 1991. Provides information on the C++ programming language.
• The C++ Programming Language, 3rd edition, 1997. Addison-Wesley Publishing Company, One Jacob Way, Reading, MA 01867.
• The C Programming Language, 2nd edition. Kernighan, Brian W.; Ritchie, Dennis W. Prentice Hall, 1988. Provides information on the K&R definition of the C language.
• C: A Reference Manual, 3rd edition. Harbison, Samuel P.; Steele, Guy L. Prentice Hall, 1991. Provides information on the ANSI standard and extensions of the C language.
• Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture. Intel Corporation, doc. number 243190.
• Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual. Intel Corporation, doc. number 243191.
• Intel Architecture Software Developer's Manual, Volume 3: System Programming. Intel Corporation, doc. number 243192.
• Intel Itanium Assembler User's Guide.
• Intel Itanium-based Assembly Language Reference Manual.
• Itanium Architecture Software Developer's Manual, Vol. 1: Application Architecture. Intel Corporation, doc. number 245317-001.
• Itanium Architecture Software Developer's Manual, Vol. 2: System Architecture. I
340. Intel C/C++ Error Parser
The Intel C/C++ Error Parser lets you track compile-time errors in Eclipse CDT. However, you must enable the Error Parser to see the results.
1. On the Eclipse toolbar, select Window > Preferences.
2. On the Preferences dialog, select C/C++ > New Make Projects.
3. Click the Error Parsers tab. Check the Intel(R) C/C++ Error Parser selection to enable this feature.
4. Click OK to update your choices and close the dialog.
Using the Intel C/C++ Error Parser
If you introduce an error into your hello.c program, such as #include <xstdio.h>, then compile hello.c, the error is reported in the Tasks view and a marker appears in the source file at the line where the error was detected.
341. ntel Corporation, doc. number 245318-001.
• Itanium Architecture Software Developer's Manual, Vol. 3: Instruction Set Reference. Intel Corporation, doc. number 245319-001.
• Itanium Architecture Software Developer's Manual, Vol. 4: Itanium Processor Programmer's Guide. Intel Corporation, doc. number 245319-001.
• Intel Architecture Optimization Manual. Intel Corporation, doc. number 245127.
• Intel Processor Identification with the CPUID Instruction. Intel Corporation, doc. number 241618.
• Intel Architecture MMX(TM) Technology Programmer's Reference Manual. Intel Corporation, doc. number 241618.
• Pentium Pro Processor Developer's Manual (3-volume set). Intel Corporation, doc. number 242693.
• Pentium II Processor Developer's Manual. Intel Corporation, doc. number 243502-001.
• Pentium Processor Specification Update. Intel Corporation, doc. number 242480.
• Pentium Processor Family Developer's Manual. Intel Corporation, doc. numbers 241428-005.
Most Intel documents are also available from the Intel Corporation Web site at http://developer.intel.com/software/products
Intel C++ Compiler User's Guide
How to Use This Document
This User's Guide explains how to use the Intel C++ Compiler. It provides information on how to get started with the Intel C++ Compiler, how this compiler operates, and what capabilities it offers for high performance. You learn how to use the standard and advanced compiler optimizations to ga
342. nterleave the four 8-bit values from the low half of m1 with the four values from the low half of m2. The interleaving begins with the data from m1.
__m64 _m_punpcklwd(__m64 m1, __m64 m2)
Interleave the two 16-bit values from the low half of m1 with the two values from the low half of m2. The interleaving begins with the data from m1.
__m64 _m_punpckldq(__m64 m1, __m64 m2)
Interleave the 32-bit value from the low half of m1 with the 32-bit value from the low half of m2. The interleaving begins with the data from m1.
MMX(TM) Technology Packed Arithmetic Intrinsics
The prototypes for MMX(TM) technology intrinsics are in the mmintrin.h header file.
Packed Arithmetic Intrinsics, Part 1
Intrinsic Name   Alternate Name   Corresponding Instruction   Operation     Signed
_m_paddb         _mm_add_pi8      PADDB                       Addition
_m_paddw         _mm_add_pi16     PADDW                       Addition
_m_paddd         _mm_add_pi32     PADDD                       Addition
_m_paddsb        _mm_adds_pi8     PADDSB                      Addition      Yes
_m_paddsw        _mm_adds_pi16    PADDSW                      Addition      Yes
_m_paddusb       _mm_adds_pu8     PADDUSB                     Addition      No
_m_paddusw       _mm_adds_pu16    PADDUSW                     Addition      No
_m_psubb         _mm_sub_pi8      PSUBB                       Subtraction
_m_psubw         _mm_sub_pi16     PSUBW                       Subtraction
_m_psubd         _mm_sub_pi32     PSUBD                       Subtraction
_m_psubsb        _mm_subs_pi8     PSUBSB                      Subtraction   Yes
_m_psubsw        _mm_subs_pi16    PSUBSW                      Subtraction   Yes
_m_psubusb       _mm_subs_pu8     PSUBUSB                     Subtractio
343. nteroperability The Intel C Compiler options that affect gcc interoperability include gcc name gcc version cxxlib gcec cxxlib icc fabi version no gcc see Predefined Macros for Interoperability gcc name option The gcc name name option used with cxxlib gcc lets you specify the location of gcc if the compiler cannot locate the gcc C libraries Use this option when referencing a non standard gcc installation 99 Intel C Compiler for Linux Systems User s Guide gcc version option The gcc version nnn option provides compatible behavior with gcc where nnn indicates the gcc version The gcc version option is ON by default and the value of nnn depends on the version of gcc installed on your system This option selects the version of gcc with which you achieve ABI interoperability Installed Version of gcc Default Value of gcc version older than version 3 2 cxxlib gcc option The cxxlib gec GCC root dir option lets you to build your applications using the C libraries and header files included with the gcc compiler They include e libstdc standard C header files e libstdc standard C library e libgcec C language support Use the optional argument GCC root dir to specify the top level location for the gcc binaries and libraries f Note The Intel C Compiler is compatible with gcc 3 2 3 3 and 3 4 The cxxlib
344. o, b.o, and c.o all contain IR, so the compiler will generate ipo_out.o, ipo_out1.o, ipo_out2.o, and ipo_out3.o. The first object file contains global symbols. The other object files correspond to the source files. This naming convention is also applied to user-specified names. For example: prompt> icpc -ipo_separate -ipo_c -o app1.o a.o b.o c.o This will generate app1.o, app11.o, app12.o, and app13.o. Capturing Intermediate Outputs of IPO: The -ipo_c and -ipo_S options are useful either for analyzing the effects of IPO, or when using IPO on modules that do not make up a complete program. Use the -ipo_c option to optimize across files and produce an object file. This option performs optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized object file. The default name for this file is ipo_out.o. You can use the -o option to specify a different name. For example: prompt> icpc -tpp6 -ipo_c -ofilename a.cpp b.cpp c.cpp Use the -ipo_S option to optimize across files and produce an assembly file. This option performs optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized assembly file. The default name for this file is ipo_out.s. You can use the -o option to specify a different name. For example: prompt> icpc -tpp6 -ipo_S -ofilename a.cpp b.cpp c.cpp The -ipo_c and -ipo_S options generate multiple outputs if multi-object IPO is be
345. object code for use by source-level debuggers. The -g option changes the default optimization from -O2 to -O0.
    -g0 (i32 only): Disable generation of symbolic debug information. Default: OFF
    -gcc-name=name: Use this option to specify the location of g++ when compiler cannot locate gcc C++ libraries. For use with -cxxlib-gcc configuration. Use this option when referencing a non-standard gcc installation. Default: OFF
    -gcc-version=nnn: This option provides compatible behavior with gcc, where nnn indicates the gcc version. Default: OFF
    -[no-]global-hoist: Enables/disables hoisting and speculative loads of global variables. Default: OFF
    -H: Print include file order and continue compilation. Default: OFF
    -help: Prints compiler options summary. Default: OFF
    -idirafterdir: Add directory (dir) to the second include file search path (after -I). Default: OFF
    -Idirectory: Specifies an additional directory to search for include files. Default: OFF
    -i_dynamic: Link Intel-provided libraries dynamically. Default: OFF
    -inline_debug_info: Produces enhanced source position information for inlined code. It also provides enhanced debug information useful for function call traceback. To use this option for debugging, you must also specify -g. Default: OFF
    Options listed without descriptions: -ip, -IPF_fma (i64 only), -IPF_fltacc (i64 only), -IPF_flt_eval_method0 (i64 only), -IPF_fp_relaxed (i64 only), -IPF_fp_speculationmode (i64 only), -ip_no_inlining, -ip_no_pinlining i32 i32em i
346. ock where clause can be any of the following: if (scalar-expression), num_threads (integer-expression), copyin (variable-list), default (shared | none), shared (variable-list), private (variable-list), firstprivate (variable-list), lastprivate (variable-list), reduction (operator : variable-list), ordered. Clause descriptions are the same for the parallel and taskq constructs. Example Function: The test1 function is a natural candidate to be parallelized using the workqueuing model. You can express the parallelism by annotating the loop with a parallel taskq pragma and the work in the loop body with a task pragma. The parallel taskq pragma specifies an environment for the while loop in which to enqueue the units of work specified by the enclosed task pragma. Thus, the loop's control structure and the enqueuing are executed single-threaded, while the other threads in the team participate in dequeuing the work from the taskq queue and executing it. The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is being enqueued, hence preserving the sequential semantics.
    void test1(LIST p)
    {
        #pragma intel omp parallel taskq shared(p)
        {
            while (p != NULL)
            {
                #pragma intel omp task captureprivate(p)
                {
                    do_work1(p);
                }
                p = p->next;
            }
        }
    }
    Examples of OpenMP Usage: The following examples show how to use the OpenMP feature. A Si
347. of data loaded from the address p. r0 := a0; r1 := a1; r2 := *p0; r3 := *p1
    void _mm_storeh_pi(__m64 *p, __m128 a): Stores the upper two SP FP values to the address p. *p0 := a2; *p1 := a3
    __m128 _mm_movehl_ps(__m128 a, __m128 b): Moves the upper 2 SP FP values of b to the lower 2 SP FP values of the result. The upper 2 SP FP values of a are passed through to the result. r3 := a3; r2 := a2; r1 := b3; r0 := b2
    __m128 _mm_movelh_ps(__m128 a, __m128 b): Moves the lower 2 SP FP values of b to the upper 2 SP FP values of the result. The lower 2 SP FP values of a are passed through to the result. r3 := b1; r2 := b0; r1 := a1; r0 := a0
    __m128 _mm_loadl_pi(__m128 a, __m64 const *p): Sets the lower two SP FP values with 64 bits of data loaded from the address p; the upper two values are passed through from a. r0 := *p0; r1 := *p1; r2 := a2; r3 := a3
    void _mm_storel_pi(__m64 *p, __m128 a): Stores the lower two SP FP values of a to the address p. *p0 := a0; *p1 := a1
    int _mm_movemask_ps(__m128 a): Creates a 4-bit mask from the most significant bits of the four SP FP values. r := sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)
    Using Streaming SIMD Extensions on Itanium(R) Architecture: The Streaming SIMD Extensions (SSE) intrinsics provide access to Itanium(R) instructions for Streaming SIMD Extensions. To provide source compatibility with the IA-32 architecture
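The element-wise definitions for _mm_movehl_ps and _mm_movemask_ps can be checked against a small scalar model. The sketch below assumes a four-element float array stands in for an __m128 value, with index 0 as the lowest element (a0). The function names movemask_ps_model and movehl_ps_model are hypothetical and are not part of any Intel header.

```c
#include <assert.h>
#include <math.h>

/* Scalar model of _mm_movemask_ps: bit i of the result is the sign bit of a[i]. */
static int movemask_ps_model(const float a[4])
{
    assert(a);
    int mask = 0;
    for (int i = 0; i < 4; ++i) {
        if (signbit(a[i]))      /* true for negative values and for -0.0f */
            mask |= 1 << i;
    }
    return mask;
}

/* Scalar model of _mm_movehl_ps: r3 := a3, r2 := a2, r1 := b3, r0 := b2. */
static void movehl_ps_model(const float a[4], const float b[4], float r[4])
{
    assert(a && b && r);
    /* Read all inputs first so that r may alias a or b. */
    float a2 = a[2], a3 = a[3];
    float b2 = b[2], b3 = b[3];
    r[0] = b2;
    r[1] = b3;
    r[2] = a2;
    r[3] = a3;
}
```

With the input {-1.0f, 2.0f, -0.0f, 4.0f}, the mask model sets bits 0 and 2, because the sign bit is taken literally and negative zero counts as signed.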
348. of explicit instantiations Volume II: Optimizing Applications. Optimization Levels: This section discusses the command-line options -O0, -O1, -O2, and -O3. The -O0 option disables optimizations. Each of the other three turns on several compiler capabilities. To specify one of these optimizations, take into consideration the nature and structure of your application, as indicated in the more detailed description of the options. In general terms, -O1, -O2, and -O3 optimize as follows:
    • -O1: code size and locality
    • -O2: code speed; this is the default option
    • -O3: enables -O2 with more aggressive optimizations
    These options behave similarly on IA-32 and Itanium(R) architectures, with some specifics that are detailed in the sections that follow. Setting Optimization Levels: The following table details the effects of the -O0, -O1, -O2, -O3, and -fast options. The table first describes the characteristics shared by both IA-32 and Itanium architectures and then explicitly describes the specifics (if any) of the -On options' behavior on each architecture. Option / Effect:
    -O0: Disables optimizations.
    -O1: Optimizes to favor code size and code locality. Disables loop unrolling. May improve performance for applications with very large code size, many branches, and execution time not dominated by code within loops. In most cases, -O2 is recommended over -O1. IA-32 systems: Disables intrinsics inlining to reduce code size
349. og.cpp and displays compiler errors, but not warnings: prompt> icpc -W0 newprog.cpp Use the -ww, -we, or -wd option to indicate specific diagnostics. Option / Description:
    -wwL1[,L2,...,Ln]: Changes the severity of diagnostics L1 through Ln to warning.
    -weL1[,L2,...,Ln]: Changes the severity of diagnostics L1 through Ln to error.
    -wdL1[,L2,...,Ln]: Disables diagnostics L1 through Ln.
    Example: /* test.c */ int main() { int x=0; } If you compile test.c using the -Wall option (enable all warnings), the compiler will emit warning #177: prompt> icc -Wall test.c remark #177: variable "x" was declared but never referenced To disable warning #177, use the -wd option: prompt> icc -Wall -wd177 test.c Likewise, using the -we option will result in a compile-time error: prompt> icc -Wall -we177 test.c error #177: variable "x" was declared but never referenced compilation aborted for test.c Limiting the Number of Errors Reported: Use the -wnn option to limit the number of error messages displayed before the compiler aborts. By default, if more than 100 errors are displayed, compilation aborts. Description: Limit the number of error diagnostics that will be displayed prior to aborting compilation to n. Remarks and warnings do not count towards this limit. For example, the following command line specifies that i
350. ollowing registers. See Register Names for getReg() and setReg().
    void __setReg(const int whichReg, unsigned __int64 value): Sets the value for a hardware register based on the index passed in. Produces a corresponding "mov = value" instruction. See Register Names for getReg() and setReg().
    unsigned __int64 __getIndReg(const int whichIndReg, __int64 index): Return the value of an indexed register. The index is the 2nd argument; the register file is the first argument.
    Intrinsic:
    void __setIndReg(const int whichIndReg, __int64 index, unsigned __int64 value)
    void *__ptr64 _rdteb(void)
    void __isrlz(void)
    void __dsrlz(void)
    unsigned __int64 __fetchadd4_acq(unsigned int *addend, const int increment)
    unsigned __int64 __fetchadd4_rel(unsigned int *addend, const int increment)
    unsigned __int64 __fetchadd8_acq(unsigned __int64 *addend, const int increment)
    unsigned __int64 __fetchadd8_rel(unsigned __int64 *addend, const int increment)
    void __fwb(void)
    void __ldfs(const int whichFloatReg, void *src)
    void __ldfd(const int whichFloatReg, void *src)
    void __ldfe(const int whichFloatReg, void *src)
    void __ldf8(const int whichFloatReg, void *src)
    void __ldf_fill(const int whichFloatReg, void *src)
    void __stfs(void *dst, const int whichFlo
351. ompatibility with gcc. Data Alignment, Memory Allocation Intrinsics, and Inline Assembly: This section describes features that support usage of the intrinsics. The following topics are described: Alignment Support; Allocating and Freeing Aligned Memory Blocks. Alignment Support: To improve intrinsics performance, you need to align data. For example, when you are using the Streaming SIMD Extensions, you should align data to 16 bytes in memory operations to improve performance. Specifically, you must align __m128 objects as addresses passed to the _mm_load and _mm_store intrinsics. If you want to declare arrays of floats and treat them as __m128 objects by casting, you need to ensure that the float arrays are properly aligned. Use __declspec(align) to direct the compiler to align data more strictly than it otherwise does on both IA-32 and Itanium(R)-based systems. For example, a data object of type int is allocated at a byte address which is a multiple of 4 by default (the size of an int). However, by using __declspec(align), you can direct the compiler to instead use an address which is a multiple of 8, 16, or 32, with the following restrictions on IA-32:
    • 32-byte addresses must be statically allocated
    • 16-byte addresses can be locally or statically allocated
    You can use this data alignment support as an advantage in optimizing cache line usage. By clustering small objects that
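The 16-byte alignment requirement described above can be illustrated in portable C. The sketch below deliberately swaps in standard C11 facilities (alignas and aligned_alloc) in place of the compiler-specific __declspec(align) and the guide's own aligned-allocation intrinsics, so it shows the idea rather than the Intel syntax. The names static_buf, is_aligned, and alloc_floats_16 are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

/* Statically allocated float array placed on a 16-byte boundary (C11 alignas). */
static alignas(16) float static_buf[8];

/* Returns 1 if p is a multiple of `alignment` bytes, 0 otherwise. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocate n floats on a 16-byte boundary; release the block with free(). */
static float *alloc_floats_16(size_t n)
{
    assert(n > 0);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t bytes = ((n * sizeof(float) + 15) / 16) * 16;
    return aligned_alloc(16, bytes);
}
```

A caller would typically check the returned pointer for NULL, and could assert is_aligned(p, 16) before casting the buffer for use with 16-byte vector loads and stores.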
352. ompilation Volume I: Building Applications
    Macro Name               Value                               Architecture
    __INTEL_RTTI__           1 when -fno-rtti is not specified   Both
    __INTEL_STRICT_ANSI__    1 when -strict_ansi is specified    Both
    _INTEGRAL_MAX_BITS       64                                  Itanium architecture only
    __itanium__              1                                   Itanium architecture only
    linux                    1                                   Both
    __linux                  1                                   Both
    __linux__                1                                   Both
    __LONG_DOUBLE_SIZE__     80                                  IA-32 only
    __LONG_MAX__             9223372036854775807L                Itanium architecture only
    __lp64                   1                                   Itanium architecture only
    __LP64__                 1                                   Itanium architecture only
    _LP64                    1                                   Itanium architecture only
    __NO_INLINE__            1                                   Both
    __NO_MATH_INLINES        1                                   Both
    __NO_STRING_INLINES      1                                   Both
    __OPTIMIZE__             1                                   Both
    __PIC__                  1 when -fPIC is used                Both
    __pic__                  1 when -fPIC is used                Both
    _PGO_INSTRUMENT          1 when -prof_gen[x] is used         Both
    __PTRDIFF_TYPE__         int on IA-32; long on Itanium architecture   Both
    __REGISTER_PREFIX__      no value                            Both
    __SIGNED_CHARS__         1                                   Both
    Macro Name / Value / Architecture (continued): __SIZE_TYPE__: unsigned on IA-32, unsigned long on Itanium architecture; no value; __VERSION__: Intel C++ gcc 3.0 mode; WCHAR_T 1 long int on IA-32, int on Itanium architecture; WINT_TYPE: unsigned int. Suppress Macro Definition: Use the -Uname option to suppress any macro definition currently in effect for the specified name. The -U option performs the same
353. ompiler to specify alternate tools for preprocessing, compilation, assembly, and linking. Further, you can invoke options specific to your alternate tools on the command line. The following sections explain how to use -Qlocation and -Qoption to do this. How to Specify an Alternate Component: Use -Qlocation to specify an alternate path for a tool. This option accepts two arguments using the following syntax: prompt> icpc -Qlocation,tool,path
    tool / Description:
    cpp: Specifies the compiler front-end preprocessor.
    c: Specifies the C++ compiler.
    asm: Specifies the assembler.
    ld: Specifies the linker.
    gas: Specifies the GNU assembler.
    gld: Specifies the GNU linker.
    path is the complete path to the tool. How to Pass Options to Other Programs: Use -Qoption to pass an option specified by optlist to a tool, where optlist is a comma-separated list of options. The syntax for this command is the following: prompt> icpc -Qoption,tool,optlist optlist indicates one or more valid argument strings for the designated program. If the argument is a command-line option, you must include the hyphen. If the argument contains a space or tab character, you must enclose the entire argument in quotation characters (""). You must separate multiple arguments with commas. The following example directs the linker to create a memory map when the compiler produces the executable file from the source: prompt> icpc -Qoption,link,-map
354. ompletes the compilation, producing a real object file or executable. Generally, different compiler versions produce IL based on different definitions, and therefore the ILs from different compilations can be incompatible. The Intel C++ Compiler assigns a unique version number to each compiler's IL definition. If a compiler attempts to read IL in a file with a version number other than its own, the compilation proceeds, but the IL is discarded and not used in the compilation. The compiler then issues a warning message about an incompatible IL detected and discarded. IL in Libraries: More Optimizations: The IL produced by the Intel compiler is stored in a file with a .il suffix. Then the .il file is placed in the library. If this library is used in an IPO compilation invoked with the same compiler as produced the IL for the library, the compiler can extract the .il file from the library and use it to optimize the program. For example, it is possible to inline functions defined in the libraries into your source code. Creating a Library from IPO Objects: Normally, libraries are created using a library manager such as ar. Given a list of objects, the library manager will insert the objects into a named library to be used in subsequent link steps. prompt> xiar cru user.a a.o b.o The above command creates a library named user.a that contains the a.o and b.o objects. If, however,
355. omponents of the program. Each global symbol definition or reference in a compilation unit has a visibility attribute that controls how (or if) it may be referenced from outside the component in which it is defined. There are five possible values for visibility:
    • EXTERNAL: The compiler must treat the symbol as though it is defined in another component. For a definition, this means that the compiler must assume that the symbol will be overridden (preempted) by a definition of the same name in another component. See Symbol Preemption. If a function symbol has external visibility, the compiler knows that it must be called indirectly and can inline the indirect call stub.
    • DEFAULT: Other components can reference the symbol. Furthermore, the symbol definition may be overridden (preempted) by a definition of the same name in another component.
    • PROTECTED: Other components can reference the symbol, but it cannot be preempted by a definition of the same name in another component.
    • HIDDEN: Other components cannot directly reference the symbol. However, its address might be passed to other components indirectly; for example, as an argument to a call to a function in another component, or by having its address stored in a data item referenced by a function in another component.
    • INTERNAL: The symbol cannot be referenced outside its defining component, either directly or indirectly.
    Static local symbols (in C/C++, declared at file scope or elsewhere
356. on -O0, which is what slows the program down. If both -O2 and -g are specified, the code should run nearly the same speed as if -g were not specified. Refer to the following table for the summary of the effects of using the -g option with the optimization options.
    These options   Produce these results
    -g              Debugging information produced, -O0 enabled (optimizations disabled), -fp enabled for IA-32-targeted compilations
    -g -O1          Debugging information produced, -O1 optimizations enabled
    -g -O2          Debugging information produced, -O2 optimizations enabled
    -g -O3 -fp      Debugging information produced, -O3 optimizations enabled, -fp enabled for IA-32-targeted compilations
    Debugging and Assembling: The assembly file is generated without debugging information, but if you produce an object file, it will contain debugging information. If you link the object file and then use the GDB debugger on it, you will get full symbolic representation. Options for Debug Information: The Intel C++ Compiler provides basic debugging information and new features for enhanced debugging of optimized code. The basic debugging switches are listed in the following table. Option / Description: -debug all, -debug full: These options are equivalent to -g. They turn on production of basic debug information. They are off by default. -debug none: This option turns off production of debug information. This opt
357. on libguide_stats.a, libguide_stats.so: OpenMP static library for the parallelizer tool, with performance statistics and profile information. Library that resolves references to OpenMP subroutines when OpenMP is not in use. Short vector math library. Intel support library for PGO and CPU dispatch. Intel math library. Intel math library.
    Library / Description:
    libcprts.a, libcprts.so, libcprts.so.3: Dinkumware C++ Library
    libunwind.a, libunwind.so, libunwind.so.3: Unwinder library
    libcxa.a, libcxa.so, libcxa.so.3: Intel run-time support for C++ features
    libcxaguard.a, libcxaguard.so, libcxaguard.so.3: Used for interoperability support with the -cxxlib-gcc option. See gcc Interoperability.
    Key Files Summary for Itanium(R) Compiler: The following tables list and briefly describe files that are installed for use by the Itanium(R) compiler. /bin Files. File: codecov idc cetg icc icpc iccbin icpcbin mcpcom iccbin icpcbin profmerge proforder tselect xiar xild iccvars.sh Description: Code coverage tool. Batch file to set environment variables. Configuration file for use from command line. Scripts that check for license file and call compiler driver. Compiler drivers. Intel C++ Compiler. Compiler drivers. Utility used for Profile Guided Optimizations. Utility used for Profile Guided Optimizations. Test prioritization tool. Tool used for Interprocedur
358. on a(j) is used within a loop, by placing prefetch a in front of the loop, the compiler will insert prefetches for a(j + d) within the loop, where d is determined by the compiler. This directive is supported when option -O3 is on. Example of prefetch Directive: #pragma noprefetch b #pragma prefetch a for(i=0; i<m; i++) Vectorization Support (IA-32): The vector directives control the vectorization of the subsequent loop in the program, but the compiler does not apply them to nested loops. Each nested loop needs its own directive preceding it. You must place the vector directive before the loop control statement. vector always Directive: The vector always directive instructs the compiler to override any efficiency heuristic during the decision to vectorize or not, and will vectorize non-unit strides or very unaligned memory accesses. Example of vector always Directive: #pragma vector always for(i = 0; i < N; i++) { a[32*i] = b[99*i]; } ivdep Directive: The ivdep directive instructs the compiler to ignore assumed vector dependences. To ensure correct code, the compiler treats an assumed dependence as a proven dependence, which prevents vectorization. This directive overrides that decision. Use ivdep only when you know that the assumed loop dependences are safe to ignore. The loop in the following example will not vectorize without the ivdep, since the value of k
359. on and automatic loop vectorization in the same compilation. In most cases, the compiler will consider outermost loops for parallelization and innermost loops for vectorization. If deemed profitable, however, the compiler may even apply loop parallelization and vectorization to the same loop. Note that in some cases successful loop parallelization (either automatically or by means of OpenMP directives) may affect the messages reported by the compiler for loop vectorization; for example, under the -vec_report2 option, indicating loops not successfully vectorized. Vectorization Key Programming Guidelines: The goal of vectorizing compilers is to exploit single-instruction multiple data (SIMD) processing automatically. Review these guidelines and restrictions, see code examples in further topics, and check them against your code to eliminate ambiguities that prevent the compiler from achieving optimal vectorization. Guidelines for loop bodies:
    • use straight-line code (a single basic block)
    • use vector data only; that is, arrays and invariant expressions on the right-hand side of assignments. Array references can appear on the left-hand side of assignments.
    • use only assignment statements
    Avoid the following in loop bodies:
    • function calls
    • unvectorizable operations
    • mixing vectorizable types in the same loop
    • data-dependent loop exit conditions
    Preparing your code for vectorization: To make you
360. on floating-point from a into low 2 SP FP and shuffle any 2 SP FP from b into high 2 SP FP of destination */
    #define SHUFFLE(a,b,i) (F32vec4)_mm_shuffle_ps(a,b,i)
    #include <stdio.h>
    #define SIZE 20
    /* Global variables */
    float result;
    _MM_ALIGN16 float array[SIZE];
    /* Function: Add20ArrayElements */
    /* Add all the elements of a 20 element array */
    void Add20ArrayElements (F32vec4 *array, float *result)
    {
        F32vec4 vec0, vec1;
        vec0 = _mm_load_ps ((float *) array); // Load array's first 4 floats
        /* Add all elements of the array, 4 elements at a time */
        vec0 += array[1]; // Add elements 5-8
        vec0 += array[2]; // Add elements 9-12
        vec0 += array[3]; // Add elements 13-16
        vec0 += array[4]; // Add elements 17-20
        /* There are now 4 partial sums. Add the 2 lowers to the 2 raises, then add those 2 results together */
        vec1 = SHUFFLE(vec1, vec0, 0x40);
        vec0 += vec1;
        vec1 = SHUFFLE(vec1, vec0, 0x30);
        vec0 += vec1;
        vec0 = SHUFFLE(vec0, vec0, 2);
        _mm_store_ss (result, vec0); // Store the final sum
    }
    void main(int argc, char *argv[])
    {
        int i;
        /* Initialize the array */
        for (i=0; i < SIZE; i++) { array[i
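The reduction strategy in the Add20ArrayElements example (keep four running partial sums, then fold them pairwise) can be followed in plain scalar C. This is a sketch of the same arithmetic order without the Fvec classes or SSE intrinsics. The function name add20_array_elements and the partial array are hypothetical, and the pairing of lanes in the final step is one reading of "add the 2 lowers to the 2 raises".

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 20

/* Sum a 20-element array using four lane-wise partial sums,
   mirroring how a 4-wide SIMD register would accumulate them. */
static float add20_array_elements(const float array[SIZE])
{
    assert(array);

    /* Load the first 4 floats as the initial partial sums. */
    float partial[4] = { array[0], array[1], array[2], array[3] };

    /* Add the remaining elements, 4 at a time (elements 5-8, 9-12, ...). */
    for (size_t i = 4; i < SIZE; i += 4)
        for (size_t lane = 0; lane < 4; ++lane)
            partial[lane] += array[i + lane];

    /* Fold the two lower lanes into the two upper lanes,
       then add those two results together. */
    return (partial[0] + partial[2]) + (partial[1] + partial[3]);
}
```

Because floating-point addition is not associative, this lane-wise order can differ in the last bits from a simple left-to-right loop. For small integer-valued inputs the results are identical.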
361. onformance dialect: -strict_ansi. OpenMP report: -openmp_report0, -openmp_report1, -openmp_report2. Auto-parallelizer report: -par_report0, -par_report1, -par_report2, -par_report3. Vectorizer report: -vec_report0, -vec_report1, -vec_report2, -vec_report3, -vec_report4, -vec_report5.
    Option Category / Use / Option Name
    Data:
        Disable argument aliasing: -alias_args
        Assume no aliasing in program: -fno-alias
        Allow gprel addressing of common data variables: -fno-common
        Allocate as many bytes as needed for enumerated types: -fshort-enums
        Change default bitfield type to unsigned: -funsigned-bitfields
        Change default char type to unsigned: -funsigned-char
        Store string literals in a writable section: -fwritable-strings
        Disable placement of zero-initialized variables in bss (use data): -nobss_init
        Default symbol visibility: -fvisibility=extern, -fvisibility=default, -fvisibility=protected, -fvisibility=hidden, -fvisibility=internal
        Structure member alignment: -Zp1, -Zp2, -Zp4, -Zp8, -Zp16 (default)
    Floating Point:
        Improve floating-point consistency: -mp
        Round floating-point results: -fp_port
        Limit Complex range: -complex_limited_range
        Check floating-point stack: -fpstkchk
    Output Files:
        Generate assembler source file: -S
    Code Generation:
        Generate position-independent code: -fpic
        Use Intel processor extensions: -axK, -axN, -axB, -axP
        Require Intel processor extensions: -xK
362. ong double x, long double y); int isunorderedf(float x, float y); NEXTAFTER Description: The nextafter function returns the next representable value in the specified format after x in the direction of y. errno: ERANGE, for values too large. Calling interface: double nextafter(double x, double y); long double nextafterl(long double x, long double y); float nextafterf(float x, float y); NEXTTOWARD Description: The nexttoward function returns the next representable value in the specified format after x in the direction of y. If x equals y, then the function returns y converted to the type of the function. errno: ERANGE, for values too large. Calling interface: double nexttoward(double x, double y); long double nexttowardl(long double x, long double y); float nexttowardf(float x, float y); SIGNBIT Description: The signbit function returns a non-zero value if and only if the sign of x is negative. Calling interface: int signbit(double x); int signbitl(long double x); int signbitf(float x); SIGNIFICAND Description: The significand function returns the significand of x in the interval [1,2). For x equal to zero, NaN, or +/- infinity, the original x is returned. Calling interface: double significand(double x); long double significandl(long double x); float significandf(float x); Complex Functions: The Intel Math library supports th
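To make "next representable value" concrete, the sketch below models the upward step of nextafter for positive finite doubles by incrementing the bit pattern, and wraps the standard C99 signbit macro. It assumes double is IEEE 754 binary64, which holds on IA-32 and Itanium but is an assumption of this sketch. The names next_up_positive and is_negative are hypothetical and are not part of the Intel Math library.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Model of nextafter(x, +infinity) for a positive finite x below DBL_MAX.
   For positive IEEE 754 doubles, adjacent values have adjacent bit patterns. */
static double next_up_positive(double x)
{
    assert(x > 0.0 && x < DBL_MAX);
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret without aliasing issues */
    bits += 1;                        /* step to the next bit pattern */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Returns 1 if x carries a negative sign, including -0.0. */
static int is_negative(double x)
{
    return signbit(x) != 0;
}
```

Under these assumptions, the step above 1.0 is exactly DBL_EPSILON, and is_negative distinguishes -0.0 from 0.0 even though the two compare equal.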
363. onlinedocs/gcc-3.4.0/gcc/Variable-Length.html#Variable%20Length
    gcc Language Extension / Intel Support / gcc.gnu.org URL:
    Macros with a Variable Number of Arguments: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Variadic-Macros.html#Variadic%20Macros
    Slightly Looser Rules for Escaped Newlines: No. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Escaped-Newlines.html#Escaped%20Newlines
    String Literals with Embedded Newlines: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.3/gcc/Multi-line-Strings.html#Multi-line%20Strings
    Non-Lvalue Arrays May Have Subscripts: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Subscripting.html#Subscripting
    Arithmetic on void-Pointers: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Pointer-Arith.html#Pointer%20Arith
    Arithmetic on Function-Pointers: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Pointer-Arith.html#Pointer%20Arith
    Non-Constant Initializers: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Initializers.html#Initializers
    Compound Literals: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Compound-Literals.html#Compound%20Literals
    Designated Initializers: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Designated-Inits.html#Designated%20Inits
    Cast to a Union Type: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Cast-to-Union.html#Cast%20to%20Union
    Case Ranges: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Case-Ranges.html#Case%20Ranges
    Mixed Declarations and Code: Yes. http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Mixed
364. ons as a NOP for source compatibility only. _mm_cmpeq_pi8 _mm_cmpeq_pi16 _mm_cmpeq_pi32 _mm_cmpgt_pi8 _mm_cmpgt_pi16 _mm_cmpgt_pi32 _mm_setzero_si64 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A Streaming SIMD Extensions Intrinsics Implementation: Regular Streaming SIMD Extensions intrinsics work on 4 32-bit single precision values. On Itanium(R)-based systems, basic operations like add or compare will require two SIMD instructions. Both can be executed in the same cycle, so the throughput is one basic Streaming SIMD Extensions operation per cycle, or 4 32-bit single-precision operations per cycle. Key to the table entries:
    • A = Expected to give significant performance gain over non-intrinsic-based code equivalent.
    • B = Non-intrinsic-based source code would be better; the intrinsic's implementation may map directly to native instructions, but they offer no significant performance gain.
    • C = Requires contorted implementation for particular microarchitecture. Will result in very poor performance if used.
    Intrinsic Name: _mm_add_ss _mm_add_ps _mm_sub_ss _mm_sub_ps _mm_mul_ss _mm_mul_ps _mm_div_ss _mm_div_ps _mm_sqrt_ss _mm_sqrt_ps _mm_rcp_ss _mm_rcp_ps _mm_rsqrt_ss _mm_rsqrt_ps _mm_min_ss _mm_min_ps _mm_ma
365. op level (for a group of selected modules); Individual module source view. Top Level Coverage: The top-level coverage reports the overall code coverage of the modules that were selected. The following options are provided: You can select the modules of interest. For the selected modules, the tool generates a list with their coverage information. The information includes the total number of functions and blocks in a module and the portions that were covered. By clicking on the title of columns in the reported tables, the lists may be sorted in ascending or descending order based on: basic block coverage, function coverage, function name. The example that follows shows a top-level coverage summary for a project. By clicking on a module name (for example, SAMPLE.C), the browser will display the coverage source view of that particular module. [Figure: browser screenshot of the code coverage tool's top-level summary page, "Covered Files in Sample_Project"]
366. op transformation technique for enabling SIMD encodings of loops, as well as providing a means of improving memory performance. By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure in two ways:
    • It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
    • It reduces the number of iterations of the loop by a factor of the length of each "vector," or number of operations being performed per SIMD operation. In the case of Streaming SIMD Extensions, this vector or strip length is reduced by 4 times: four floating-point data items per single Streaming SIMD Extensions single-precision floating-point SIMD operation are processed.
    First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine. The compiler automatically strip-mines your loop and generates a cleanup loop.
    Before Vectorization:
    i = 0;
    while (i < n) {
        // Original loop code
        a[i] = b[i] + c[i];
        ++i;
    }
    After Vectorization: The vectorizer generates the following two loops:
    i = 0;
    while (i < (n - n%4)) {
        // Vector strip-mined loop
        // Subscript [i:i+3] denotes SIMD execution
        a[i:i+3] = b[i:i+3] + c[i:i+3];
        i = i + 4;
    }
    while (i < n) {
        // Scalar clean-u
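The two-loop structure described above can be written out in standard C, with an inner lane loop standing in for the [i:i+3] SIMD notation. This is a sketch of the transformation done by hand, assuming a strip length of 4. The function name add_strip_mined and the STRIP macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STRIP 4   /* four single-precision values per SIMD operation */

/* Compute a[i] = b[i] + c[i] for i in [0, n) as a strip-mined loop
   followed by a scalar cleanup loop. */
static void add_strip_mined(float *a, const float *b, const float *c, size_t n)
{
    assert(a && b && c);

    size_t i = 0;
    size_t vec_end = n - n % STRIP;   /* largest multiple of STRIP not above n */

    /* Strip-mined loop: each pass handles one strip of STRIP elements. */
    for (; i < vec_end; i += STRIP) {
        for (size_t lane = 0; lane < STRIP; ++lane)
            a[i + lane] = b[i + lane] + c[i + lane];
    }

    /* Scalar cleanup loop: the remaining n % STRIP elements. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

With n = 10, the strip-mined loop covers elements 0 through 7 in two passes and the cleanup loop handles elements 8 and 9.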
367. ormance, count should be a constant. MMX(TM) Technology Logical Intrinsics: The prototypes for MMX(TM) technology intrinsics are in the mmintrin.h header file.
    Intrinsic Name   Alternate Name     Operation              Corresponding Instruction
    _m_pand          _mm_and_si64       Bitwise AND            PAND
    _m_pandn         _mm_andnot_si64    Logical NOT            PANDN
    _m_por           _mm_or_si64        Bitwise OR             POR
    _m_pxor          _mm_xor_si64       Bitwise Exclusive OR   PXOR
    __m64 _m_pand(__m64 m1, __m64 m2): Perform a bitwise AND of the 64-bit value in m1 with the 64-bit value in m2.
    __m64 _m_pandn(__m64 m1, __m64 m2): Perform a logical NOT on the 64-bit value in m1 and use the result in a bitwise AND with the 64-bit value in m2.
    __m64 _m_por(__m64 m1, __m64 m2): Perform a bitwise OR of the 64-bit value in m1 with the 64-bit value in m2.
    __m64 _m_pxor(__m64 m1, __m64 m2): Perform a bitwise XOR of the 64-bit value in m1 with the 64-bit value in m2.
    MMX(TM) Technology Compare Intrinsics: The prototypes for MMX(TM) technology intrinsics are in the mmintrin.h header file.
    Intrinsic Name   Alternate Name    Comparison     Number of Elements   Element Bit Size   Corresponding Instruction
    _m_pcmpeqb       _mm_cmpeq_pi8     Equal          8                    8                  PCMPEQB
    _m_pcmpeqw       _mm_cmpeq_pi16    Equal          4                    16                 PCMPEQW
    _m_pcmpeqd       _mm_cmpeq_pi32    Equal          2                    32                 PCMPEQD
    _m_pcmpgtb       _mm_cmpgt_pi8     Greater Than   8                    8                  PCMPGTB
    _m_pcmpgtw       _mm_cmpgt_pi16    Greater
368. ors A F32vec4 F64vec2 F32vec1 efap Ea Es NA This table lists standard arithmetic operator syntax and intrinsics TE nan TE ji eo Ss Standard Arithmetic Operations for Fvec Classes c Operation Returns Example Syntax Usage Intrinsic Addition 4 floats F32vec4 R F32vec4 A F32vec4 _mm_add_ps B F32vec4 R F32vec4 A m 2 F64vec2 R F64vec2 A F32vec2 _mm_add_pd doubles B F64vec2 R F64vec2 A l 1 float F32vec1 R F32vec1 A F32vec1 _mm_add_ss B F32vec1 R F32vec1 A m Subtraction 4 floats F32vec4 R F32vec4 A F32vec4 _mm_sub_ps B F32vec4 R F32vec4 A m 2 F64vec2 R F64vec2 A F32vec2 _mm_sub_pd doubles By F6 4vec2 R F64vec2 A lt 1 float F32vec1 R F32vec1 A F32vec1 _mm_sub_ss B F32vecl R F32vecl A a Multiplication 4 floats F32vec4 R F32vec4 A F32vec4 _mm_mul_ps B F32vec4 R F32vec4 A m 2 F64vec2 R F64vec2 A F364vec2 _mm_mul_pd doubles B F64vec2 R F64vec2 A Ee 1 float F32vecl R F32vecl A F32vecl _mm_mul_ss B F32vecl R F32vecl A m Division 4 floats F32vec4 R F32vec4 A F32vec4 _mm_div_ps B F32vec4 R F32vec4 A L ul 2 F64vec2 R F 6 4vec2 A F64vec2 _mm_div_pd doubles Bs F6 4vec2 R F64vec2 A 419 Intel C Compiler for Linux Systems User s Guide Operation Returns Example Synt
369. ototypes for SSE2 intrinsics are in the emmintrin.h header file. Intrinsic | Shift Direction | Shift Type | Corresponding Instruction: _mm_slli_si128 | Left | Logical | PSLLDQ; _mm_slli_epi16 | Left | Logical | PSLLW; _mm_sll_epi16 | Left | Logical | PSLLW; _mm_slli_epi32 | Left | Logical | PSLLD; _mm_sll_epi32 | Left | Logical | PSLLD; _mm_slli_epi64 | Left | Logical | PSLLQ; _mm_sll_epi64 | Left | Logical | PSLLQ; _mm_srai_epi16 | Right | Arithmetic | PSRAW; _mm_sra_epi16 | Right | Arithmetic | PSRAW; _mm_srai_epi32 | Right | Arithmetic | PSRAD; _mm_sra_epi32 | Right | Arithmetic | PSRAD; _mm_srli_si128 | Right | Logical | PSRLDQ; _mm_srli_epi16 | Right | Logical | PSRLW; _mm_srl_epi16 | Right | Logical | PSRLW; _mm_srli_epi32 | Right | Logical | PSRLD; _mm_srl_epi32 | Right | Logical | PSRLD; _mm_srli_epi64 | Right | Logical | PSRLQ; _mm_srl_epi64 | Right | Logical | PSRLQ. __m128i _mm_slli_si128(__m128i a, int imm): Shifts the 128-bit value in a left by imm bytes while shifting in zeros. imm must be an immediate. r := a << (imm * 8). __m128i _mm_slli_epi16(__m128i a, int count): Shifts the 8 signed or unsigned 16-bit integers in a left by count bits while shifting in zeros. r0 := a0 << count; r1 := a1 << count; ...; r7 := a7 << count. __m128i _mm_sll_epi16(__m128i a, __m128i count): Shifts the 8 sign
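The difference between the logical and arithmetic shift types in the table above can be shown on a single 16-bit element. The following is a scalar sketch with hypothetical `model_*` names; the treatment of counts above 15 (zero for logical shifts, sign fill for arithmetic shifts) reflects the underlying PSLLW, PSRLW, and PSRAW instructions and is not spelled out in the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Logical left shift of one 16-bit element: zeros are shifted in. */
uint16_t model_sll16(uint16_t a, unsigned count)
{
    return count > 15 ? 0 : (uint16_t)(a << count);
}

/* Logical right shift of one 16-bit element: zeros are shifted in. */
uint16_t model_srl16(uint16_t a, unsigned count)
{
    return count > 15 ? 0 : (uint16_t)(a >> count);
}

/* Arithmetic right shift of one 16-bit element: the sign bit is shifted in.
   Written without relying on how the C implementation shifts negative values. */
int16_t model_sra16(int16_t a, unsigned count)
{
    if (count > 15)
        count = 15;
    if (a < 0)
        return (int16_t)~(~(int32_t)a >> count);
    return (int16_t)(a >> count);
}

/* Small self-check contrasting the two right-shift types on the same bits. */
void check_shift_models(void)
{
    assert(model_srl16(0x8000, 15) == 1);          /* logical: zero fill */
    assert(model_sra16((int16_t)-32768, 15) == -1); /* arithmetic: sign fill */
}
```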
370. ould be substituting for the scalar form whenever possible. The address of a __m128 object may be taken. For more information, see Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, Intel Corporation, doc. number 243191. Implementation on Itanium-based systems: SSE intrinsics are defined for the __m128 data type, a 128-bit quantity consisting of four single-precision FP values. SIMD instructions for Itanium-based systems operate on 64-bit FP register quantities containing two single-precision floating-point values. Thus, each __m128 operand is actually a pair of FP registers, and therefore each intrinsic corresponds to at least one pair of Itanium instructions operating on the pair of FP register operands. Compatibility versus Performance: Many of the SSE intrinsics for Itanium-based systems were created for compatibility with existing IA-32 intrinsics and not for performance. In some situations, intrinsic usage that improved performance on IA-32 will not do so on Itanium-based systems. One reason for this is that some intrinsics map nicely into the IA-32 instruction set but not into the Itanium instruction set. Thus, it is important to differentiate between intrinsics which were implemented for a performance advantage on Itanium-based systems and those implemented simply to provide compatibility with existing IA-32 code. The following intrinsics are likely to reduce performance and sho
371. ounter Returns the current value of the 40-bit performance monitoring counter specified by p. A fast version of setjmp(), which bypasses the termination handling. Saves the callee-save registers, stack pointer, and return address. MMX(TM) technology is an extension to the Intel architecture (IA) instruction set. The MMX instruction set adds 57 opcodes, a 64-bit quadword data type, and eight 64-bit registers. Each of the eight registers can be directly addressed using the register names mm0 to mm7. The prototypes for MMX technology intrinsics are in the mmintrin.h header file. The EMMS Instruction: Why You Need It. Using EMMS is like emptying a container to accommodate new content. For instance, MMX(TM) instructions automatically enable an FP tag word in the register to enable use of the __m64 data type. This resets the FP register set to alias it as the MMX register set. To enable the FP register set again, reset the register state with the EMMS instruction or via the _mm_empty() intrinsic. [Figure: Why You Need EMMS to Reset After an MMX(TM) Instruction]
372. p LOG Description: The log function returns the natural log of x, ln(x). This function may be inlined by the Itanium compiler. errno: EDOM, for x < 0; ERANGE, for x = 0. Calling interface: double log(double x); long double logl(long double x); float logf(float x); LOG10 Description: The log10 function returns the base-10 log of x, log10(x). This function may be inlined by the Itanium compiler. errno: EDOM, for x < 0; ERANGE, for x = 0. Calling interface: double log10(double x); long double log10l(long double x); float log10f(float x); LOG1P Description: The log1p function returns the natural log of (x + 1), ln(x + 1). errno: EDOM, for x < -1; ERANGE, for x = -1. Calling interface: double log1p(double x); long double log1pl(long double x); float log1pf(float x); LOG2 Description: The log2 function returns the base-2 log of x, log2(x). errno: EDOM, for x < 0; ERANGE, for x = 0. Calling interface: double log2(double x); long double log2l(long double x); float log2f(float x); LOGB Description: The logb function returns the signed exponent of x. errno: EDOM, for x = 0. Calling interface: double logb(double x); long double logbl(long double x); float logbf(float x); POW Description: The pow function returns x raised to the power of y, x^y. Calling interface: errno: EDOM, for x
373. p loop */ a[i] = b[i] + c[i]; ++i; } Statements in the Loop Body: The vectorizable operations are different for floating-point and integer data. Floating-point Array Operations: The statements within the loop body may contain float operations (typically on arrays). Supported arithmetic operations include addition, subtraction, multiplication, division, negation, square root, max, and min. Operation on double-precision types is not permitted unless optimizing for a Pentium 4 processor system. Integer Array Operations: The statements within the loop body may contain char, unsigned char, short, unsigned short, int, and unsigned int. Calls to functions such as sqrt and fabs are also supported. Arithmetic operations are limited to addition, subtraction, bitwise AND, OR, and XOR operators, division (16-bit only), multiplication (16-bit only), min, and max. You can mix data types only if the conversion can be done without a loss of precision. Some example operators where you can mix data types are multiplication, shift, or unary operators. Other Operations: No statements other than the preceding floating-point and integer operations are allowed. In particular, note that the special __m64 and __m128 datatypes are not vectorizable. The loop body cannot contain any function calls. Use of the Streaming SIMD Extensions intrinsics (for example, _mm_add_ps) is not allowed. Language Support and Directives: This topic addresses language features that better help
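A loop that stays within the constraints listed above (float arrays, only multiplication and addition, unit stride, no function calls in the body) might look like the following sketch. The function name `scale_and_add` is hypothetical, and the use of the C99 `restrict` qualifier is an assumption added here so that the pointers are known to refer to distinct objects; whether a particular compiler actually vectorizes the loop depends on its options and target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: out[i] = scale * x[i] + y[i] for i in [0, n). */
void scale_and_add(float *restrict out, const float *restrict x,
                   const float *restrict y, float scale, size_t n)
{
    assert(out != NULL && x != NULL && y != NULL);

    /* Single statement, float operands, stride-1 accesses, no calls:
       the form the restrictions above describe as vectorizable. */
    for (size_t i = 0; i < n; ++i)
        out[i] = scale * x[i] + y[i];
}
```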
374. pare for Greater Than F32vecl R select_lt F32vecl A Compare for Less Than or Equal F32vec4 R s F32vec4 F64vec2 R s Q ca F64vec2 Q ct F32vecl R s F32vec4 R sel F64vec2 R sel lect_gt F32vec4 lect_gt F6 4vec2 F32vecl R sel lect_gt F32vecl Compare for Greater Than or Equal To 2 doubles F32vecl R sel ct_ge F32vec4 A F64vec2 R sel ct_ge F64vec2 A F32vecl R sel Compare for Not Less Than 4 floats F32vecl R sel 2 doubles ct_ge F32vecl A F64vec2 R sel ect_nlt F32vec4 A ect_nlt F64vec2 A 1 float F32vecl R sel ect_nlt F32vecl A Compare for Not Less Than or Equal 4 floats 2 doubles F64vec2 R sel 1 float F32vecl R sel F32vecl R select_nle F32vec4 A ct_nle F64vec2 A ct_nle F32vecl A Compare for Not Greater Than 4 floats 2 doubles F32vecl R se F64vec2 sel lect_ngt F32vec4 A lect_ngt F6 4vec2 A F32vecl R se lect_ngt F32vecl A Compare for Not Greater Than or Equal _mm_cmplt_ss _mm_cmpgt_ps _mm_cmpgt_pd _mm_cmpgt_ss _mm_cmpge_ps _mm_cmpge_pd _mm_cmpge_ss _mm_cmpn _mm_cmpn _mm_cmpn _mm_cmpnl _mm_cmpn _mm_cmpn lt_ps lt_pd Lt ss e_ps le_pd Le_SS _mm_cmpngt_ps _m
375. pe __m128i. Note: The Pentium 4 processor SSE2 intrinsics are defined only for IA-32 platforms, not Itanium-based platforms. Pentium 4 processor SSE2 operate on 128-bit quantities (two 64-bit double-precision floating-point values). The Itanium processor does not support parallel double-precision computation, so Pentium 4 processor SSE2 are not implemented on Itanium-based systems. For more details, refer to the Pentium 4 processor Streaming SIMD Extensions 2 External Architecture Specification (EAS) and other Pentium 4 processor manuals available for download from the developer.intel.com web site. You should be familiar with the hardware features provided by the SSE2 when writing programs with the intrinsics. The following are three important issues to keep in mind: Certain intrinsics, such as _mm_loadr_pd and _mm_cmpgt_sd, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful of their implementation cost. Data loaded or stored as __m128d objects must be generally 16-byte-aligned. Some intrinsics require that their argument be immediates, that is, constant integers (literals), due to the nature of the instruction. The prototypes for SSE2 intrinsics are in the emmintrin.h header file. Note: You can also use the single ia32intrin.h header file for any IA-32 intrinsics. Floating-point Arithmetic Operations for Streaming SIMD Extensions 2 Reference
376. pecific specialized code. -vec_report{n}: Controls the vectorizer's level of diagnostic messages: n = 0: no diagnostic information is displayed; n = 1: display diagnostics indicating loops successfully vectorized (default); n = 2: same as n = 1, plus diagnostics indicating loops not successfully vectorized; n = 3: same as n = 2, plus additional information about any proven or assumed dependences. Usage: If you use -c, -ipo with the -vec_report{n} option, or -c, -x{K|W|N|B|P} or -ax{K|W|N|B|P} with -vec_report{n}, the compiler issues a warning and no report is generated. To produce a report when using the aforementioned options, you need to add the -ipo_obj option. The combination of -c and -ipo_obj produces a single-file compilation, and hence does generate object code, and eventually a report is generated. The following commands generate a vectorization report: prompt> icpc -x{K|W|N|B|P} -vec_report3 file.cpp; prompt> icpc -x{K|W|N|B|P} -ipo -ipo_obj -vec_report3 file.cpp; prompt> icpc -c -x{K|W|N|B|P} -ipo -ipo_obj -vec_report3 file.cpp. The following commands do not generate a vectorization report: prompt> icpc -c -x{K|W|N|B|P} -vec_report3 file.cpp; prompt> icpc -x{K|W|N|B|P} -ipo -vec_report3 file.cpp; prompt> icpc -c -x{K|W|N|B|P} -ipo -vec_report3 file.cpp. Loop Parallelization and Vectorization: Combining the -parallel and -x{K|W|N|B|P} options instructs the compiler to attempt both automatic loop parallelizati
377. perands are of the same size but different signs Il6vec4 R select_negq Isl6 vec4 Isl vec4 Isl6vec4 Iul6vec4 Conditional Select for Equality RO AO BO CO DO Rl Al Bl Cl Di R2 A2 B2 C2 D2 R3 A3 B3 C3 D3 Conditional Select for Inequality RO AO BO CO DO Rl Al Bl Cl Dil R2 A2 B2 C2 D2 R3 A3 B3 C3 D3 Conditional Select Symbols and Corresponding Intrinsics Conditional Select For Operators Corresponding Additional Intrinsic Intrinsic Applies to All Equality R mm_cmpegq_pi32 _mm_and_si64 select_eq A mm_cmpeq_pil6 mm_or_si64 B C D mm_cmpeq_pi8 mm_andnot_si64 R select_ned A B C D _mm_cmpeq_pi32 mm_cmpeq_pil6 _mm_cmpeq_pi8 Inequality Greater Than select_gt R _mm_cmpgt_pi32 select_gt A _mm_cmpgt_pil6 B C D _mm_cmpgt_pi8 402 Reference Conditional Select For Operators Corresponding Additional Intrinsic Intrinsic Applies to All R select_gt A B C D _mm_cmpge_pi32 _mm_cmpge_pil6 _mm_cmpge_pi8 Greater Than or Equal To select_lt R select_1t A B C D Less Than Less Than R or Equal To select_le A B C D All conditional select operands must be of the same size The ret
378. perations rcp_nr and rsqrt_nr use software refining techniques to enhance the accuracy of the approximations with a minimal impact on performance. (The nr stands for Newton-Raphson, a mathematical technique for improving performance using an approximate result.) Horizontal Data Support: The C++ SIMD classes provide horizontal support for some arithmetic operations. The term horizontal indicates computation across the elements of one vector, as opposed to the vertical, element-by-element operations on two different vectors. The add_horizontal, unpack_low, and pack_sat functions are examples of horizontal data support. This support enables certain algorithms that cannot exploit the full potential of SIMD instructions. Shuffle intrinsics are another example of horizontal data flow. Shuffle intrinsics are not expressed in the C++ classes due to their immediate arguments. However, the C++ class implementation enables you to mix shuffle intrinsics with the other C++ functions. For example: F32vec4 fveca, fvecb, fvecd; fveca += fvecb; fvecd = _mm_shuffle_ps(fveca, fvecb, 0); Typically, every instruction with horizontal data flow contains some inefficiency in the implementation. If possible, implement your algorithms without using the horizontal capabilities. Branch Compression/Elimination: Branching in SIMD architectures can be complicated and expensive, possibly resulting in poor predictability and code expansion. The SIMD C++ classes provide f
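The Newton-Raphson refinement mentioned above can be illustrated on a single scalar value. This is a sketch of the standard iteration formulas, not the library's implementation: the function names `rcp_refine` and `rsqrt_refine` are hypothetical, and the initial approximation x0 is passed in by the caller, where the SIMD classes would obtain it from a hardware approximation instruction.

```c
#include <assert.h>

/* One Newton-Raphson step toward 1/a:
   x1 = x0 * (2 - a * x0). */
float rcp_refine(float a, float x0)
{
    assert(a != 0.0f);
    return x0 * (2.0f - a * x0);
}

/* One Newton-Raphson step toward 1/sqrt(a):
   x1 = 0.5 * x0 * (3 - a * x0 * x0). */
float rsqrt_refine(float a, float x0)
{
    assert(a > 0.0f);
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}
```

For a = 3 and a rough guess x0 = 0.3, one step gives about 0.33, which is noticeably closer to 1/3. Each step roughly doubles the number of correct bits, which is why a single refinement of a low-precision hardware estimate is often enough.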
379. plex cexp21 long double _Complex z float _Complex cexp2f float _Complex z CEXP10 Description The cexp10 function computes 10 Calling interface double _Complex cexp10 double _Complex z long double _Complex cexpl01l long double _Complex z float _Complex cexpl0f float _Complex z CIMAG Description The cimag function returns the imaginary part value of z Calling interface double cimag double _Complex z long double cimagl long double _Complex z float cimagf float _Complex z 238 Reference CIS Description The cis function returns the cosine and sine as a complex value of z measured in radians Calling interface double _Complex cis double z long double _Complex cisl long double z float _Complex cisf float z CISD Description The cis function returns the cosine and sine as a complex value of z measured in degrees Calling interface double _Complex cis double z long double _Complex cisl long double z float _Complex cisf float z CLOG Description The clog function returns the complex natural logarithm of z Calling interface double _Complex clog double _Complex z long double _Complex clogl long double _Complex z float _Complex clogf float _Complex z CLOG2 Description The clog2 function returns the complex logarithm base 2 of z Calling interface double _Complex clog2 double _Complex z long double _Complex clog21 long double _Co
380. pnile F32vecl A Compare for Not Greater Than 4 floats F32vec4 R cmpngt F32vec4 A _mm_cmpngt_ps 2 doubles F64vec2 R cmpngt F6 4vec2 A _mm_cmpngt_pd 1 float F32vecl R cmpngt F32vecl A _mm_cmpngt_ss Compare for Not Greater Than or Equal 4 floats F32vec4 R cmpnge F32vec4 A _mm_cmpnge_ps 2 doubles F64vec2 R cmpnge F64vec2 A _mm_cmpnge_pd 1 float F32vecl R cmpnge F32vecl A _mm_cmpnge_ss Conditional Select Operators for Fvec Classes Reference Each conditional function compares single precision floating point values of A and B The C and D parameters are used for return value Comparison between objects of any Fvec class returns the same class Conditional Select Operators for Fvec Classes Conditional Select for Operators Syntax Equality select_eq R select_eq A B Inequality select_neq R select_neq A B Greater Than select_gt R select_gt A B Greater Than or Equal To select_g R select_ge A B Not Greater Than select_gt R select_gt A B Not Greater Than or Equal To select_g R select_ge A B Less Than select_lt R select_lt A B Less Than or Equal To select_ l R select_le A B Not Less Than select_nlt R select_nlt A B Not Less Than or Equal To select_nle R select_nle A B 427 Intel C Compiler for Linux Systems User s Guide Conditional Select Operator Usage For condit
381. po n Compiler Options Quick Reference Option Description Default Enables interprocedural OFF optimizations for single file compilation Enable disable the combining OFF of floating point multiplies and add subtract operations Enable disable optimizations OFF that affect floating point accuracy Floating point operands OFF evaluated to the precision indicated by the program Enable disable use of faster but OFF slightly less accurate code sequences for math functions such as divide and square root Enable floating point OFF speculations with the following mode conditions e fast speculate floating point operations e safe speculate only when safe e strict same as off e off disables speculation of floating point operations Disables inlining that would OFF result from the ip interprocedural optimization but has no effect on other interprocedural optimizations Disable partial inlining OFF Requires ip or ipo value Enables interprocedural OFF optimizations across files The optional n argument controls the maximum number of link time compilations or number of object files that are spawned The default for value is 1 when value is not specified Intel C Compiler for Linux Systems User s Guide Option Description Default Generates a multifile object file OFF ipo_out o that can be used in further link steps Forces the compiler to cr
382. proto.map proto.cpp. The -Qoption,link option in the preceding example is passing the -map option to the linker. This is an explicit way to pass arguments to other tools in the compilation process. Also, you can use the -Xlinker val option to pass values (val) to the linker. Monitoring Data Settings: The options described here provide monitoring of Intel compiler-generated code. Specifying Structure Tag Alignments: You can specify an alignment constraint for structures and unions in two ways: place a pack pragma in your source file, or enter the alignment option on the command line. Both specifications change structure tag alignment constraints. Flushing Denormal Values to Zero (for Itanium-based Systems Only): Option -ftz flushes denormal results to zero when the application is in the gradual underflow mode. Use this option if the denormal values are not critical to application behavior. Flushing the denormal values to zero with -ftz may improve performance of your application. The default status of -ftz is OFF. By default, the compiler lets results gradually underflow. The -ftz switch only needs to be used on the source containing function main(). The effect of the -ftz switch is to turn on FTZ mode for the process started by main(). The initial thread, and any threads subsequently created by that process, will operate in FTZ mode. Note: The -O3 option turns -ftz ON. Use -ftz- to disable fl
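The effect of a pack pragma on structure layout can be seen in a small sketch. The text above does not spell out the pragma syntax, so this example assumes the widely supported `#pragma pack(push, 1)` and `#pragma pack(pop)` forms; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Default layout: the compiler may insert padding after tag so that
   value is naturally aligned. */
struct natural_record { char tag; int value; };

/* Packed layout: alignment constraint of 1 byte, so no padding is inserted. */
#pragma pack(push, 1)
struct packed_record { char tag; int value; };
#pragma pack(pop)

size_t natural_value_offset(void) { return offsetof(struct natural_record, value); }
size_t packed_value_offset(void)  { return offsetof(struct packed_record, value); }

/* Self-check: packing never makes the structure larger. */
void check_packing(void)
{
    assert(sizeof(struct packed_record) <= sizeof(struct natural_record));
}
```

Packing saves space but places `value` at an odd offset, so accesses to it may be misaligned; that trade-off is worth weighing before tightening alignment globally from the command line.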
383. ps starting from the right as shown in Figure 2 and return the result ENEN EED ne a en s J WaT Foz m64 _m64_mix21 __m64 a _ m64 b Interleave 64 bit quantities a and b in 2 byte groups starting from the left as shown in Figure 3 and return the result m m eae Fios m64 _m64_mix2r __m64 a _ m64 b Interleave 64 bit quantities a and b in 2 byte groups starting from the right as shown in Figure 4 and return the result m64 _m64_mix41 __m64 a _ m64 b Interleave 64 bit quantities a and b in 4 byte groups starting from the left as shown in Figure 5 and return the result ay l a A Fig 5 354 Reference m64 _m64_mix4r __m64 a _ m64 b Interleave 64 bit quantities a and b in 4 byte groups starting from the right as shown in Figure 6 and return the result m64 _m64_mux1 _ m64 a const int n Based on the value of n a permutation is performed on a as shown in Figure 7 and the result is returned Table 1 shows the possible values of n rev amix brest 355 Intel C Compiler for Linux Systems User s Guide Values of n for m64_mux1 Operation brest m64 _m64_ mux2 __m64 a const int n Based on the value of n a permutation is performed on a as shown in Figure 8 and the result is returned mux2 ri r2 Oxe4 alternate 11 01 10 00 mux2 ri r2 Oxaa broadcast 10 10 10 10 Fig 8 __m64 _m64_pavgsubl __m64 a __m64 b The unsigned data elements
384. ption: The tanh function returns the hyperbolic tangent of x, tanh(x) = (e^x - e^-x) / (e^x + e^-x). Calling interface: double tanh(double x); long double tanhl(long double x); float tanhf(float x); Exponential Functions: The Intel Math library supports the following exponential functions. CBRT Description: The cbrt function returns the cube root of x. Calling interface: double cbrt(double x); long double cbrtl(long double x); float cbrtf(float x); EXP Description: The exp function returns e raised to the x power, e^x. This function may be inlined by the Itanium compiler. errno: ERANGE, for underflow and overflow conditions. Calling interface: double exp(double x); long double expl(long double x); float expf(float x); EXP10 Description: The exp10 function returns 10 raised to the x power, 10^x. errno: ERANGE, for underflow and overflow conditions. Calling interface: double exp10(double x); long double exp10l(long double x); float exp10f(float x); EXP2 Description: The exp2 function returns 2 raised to the x power, 2^x. errno: ERANGE, for underflow and overflow conditions. Calling interface: double exp2(double x); long double exp2l(long double x); float exp2f(float x); EXPM1 Description: The expm1 function returns e raised to the x power minus 1, e^x - 1. errno: ERANGE, for overflow conditions. Calling in
385. ptions New predefined macros Support for exported templates Support for template Instantiation Invoking the compiler with icc and icpc New defaults for gcc interoperability options Support for thread local storage Support for high level optimization for C on A 32 Support for additional debug information Deprecated compiler options For further information on New Features see the Release Notes Features and Benefits The Intel C Compiler allows your software to perform best on computers based on the Intel architecture Using new compiler optimizations such as profile guided optimization prefetch instruction and support for Streaming SIMD Extensions SSE and Streaming SIMD Extensions 2 SSE2 the Intel C Compiler provides high performance Feature Benefit High Performance Achieve a significant performance gain by using optimizations Support for Streaming Advantage of Intel microarchitecture SIMD Extensions Automatic vectorizer Advantage of SIMD parallelism in your code achieved automatically OpenMP Support Shared memory parallel programming Floating point Improved floating point performance optimizations Data prefetching Improved performance due to the accelerated data delivery Interprocedural Larger application modules perform better optimizations Profile guided Improved performance based on profiling frequently used functions optimization Processor dispatch Taking advantage of the latest Intel architectu
386. ptions opts to OFF tool specified by str qp pP NA Compile and link for OFF function profiling with UNIX gprof tool rcd Qrcd Enable fast floating OFF point to integer conversions restrict Qrestrict Enable the restrict OFF keyword for disambiguating pointers S S Generates OFF assemblable files with s suffix then stops the compilation sox Qsox Enable disable s0x saving of compiler options and version in the executable syntax Zs Perform syntax check OFF only tpps Optimize for Pentium OFF processor tpp6 Optimize for Pentium OFF Pro Pentium II and Pentium III processors tpp7 G7 Optimize for Pentium OFF 4 processor 34 Compiler Options Quick Reference Linux Windows Description Linux Default no traceback no traceback Generate do not OFF generate extra information in the object file that allows the display of source file traceback information at run time when a severe error occurs Uname Uname Remove predefined OFF macro unroll0 Qunroll0 Disable loop OFF unrolling Display compiler version n pw fe isptay errors OE errors Enable remarks warnings and errors Wbrief Produces less verbose OFF diagnostics Control diagnostics OFF Display errors n 0 Display warnings and errors n 1 Display remarks warnings and errors n 2 wdL1 L2 Qwd tag Disable diagnostics OFF L1 through LN weLl L2
387. r 32 bit values found in A and B into eight 16 bit values with signed saturation Isl6ovec4 pack_sat Is32vec2 A Is32vec2 B Corresponding intrinsic _mm_packs_pi32 Pack the sixteen 16 bit values found in A and B into sixteen 8 bit values with signed saturation Is8vecl6 pack_sat Isl6vec4 A Isl6vec4 B Corresponding intrinsic __mm_packs_epil6 411 Intel C Compiler for Linux Systems User s Guide Pack the eight 16 bit values found in A and B into eight 8 bit values with signed saturation Is8vec8 pack_sat Isl6vec4 A Isl6vec4 B Corresponding intrinsic _mm_packs_pil6 Pack the sixteen 16 bit values found in A and B into sixteen 8 bit values with unsigned saturation Tu8vecl6 packu_sat Isl6vec4 A Isl6vec4 B Corresponding intrinsic __mm_packus_epil6 Pack the eight 16 bit values found in A and B into eight 8 bit values with unsigned saturation Tu8vec8 packu_sat Isl6vec4 A Isl6vec4 B Corresponding intrinsic _mm_packs_pul6 Clear MMX TM Instructions State Operator Empty the MMX TM registers and clear the MMX state Read the guidelines for using the EMMS instruction intrinsic void empty void Corresponding intrinsic _mm_empty Integer Intrinsics for Streaming SIMD Extensions F Note You must include fvec h header file for the following functionality Compute the element wise maximum of the respective signed integer words in A and B Isl6vec4 simd_max Isl6vec4 A Isl6vec4 B Corresponding intrinsi
388. r a specific processor, use one of the -x options. For example: prompt> icpc -fast -xW source_file.cpp. The options set by -fast may change from release to release. Restricting Optimizations: The following options restrict or preclude the compiler's ability to optimize your program. Option | Description: -O0 | Disables optimizations. Enables the -fp option. -mp | Restricts optimizations that cause some minor loss or gain of precision in floating-point arithmetic to maintain a declared level of precision and to ensure that floating-point arithmetic more nearly conforms to the ANSI and IEEE standards. -g | Specifying the -g option turns off the default -O2 option and makes -O0 the default unless -O1, -O2, or -O3 is explicitly specified in the command line together with -g. -nolib_inline | Disables inline expansion of intrinsic functions. Note: You can turn off all optimizations for specific functions by using #pragma optimize. In the following example, all optimization is turned off for function foo(): #pragma optimize("", off) foo() { ... } Valid second arguments for #pragma optimize are on or off. With the on argument, foo() is compiled with the same optimization as the rest of the program. The compiler ignores first argument values. Floating-point Optimizations. Floating-point Arithmetic Precision: There are several compiler options that affect
389. r code vectorizable, you will often need to make some changes to your loops. However, you should make only the changes needed to enable vectorization and no others. In particular, you should avoid these common changes: do not unroll your loops, because the compiler does this automatically; do not decompose one loop with several statements in the body into several single-statement loops. Restrictions. Hardware: The compiler is limited by restrictions imposed by the underlying hardware. In the case of Streaming SIMD Extensions, the vector memory operations are limited to stride-1 accesses, with a preference for 16-byte-aligned memory references. This means that if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it for a distinct target architecture. Style: The style in which you write source code can inhibit optimization. For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references are at distinct locations. Consequently, this prevents certain reordering transformations. Many stylistic issues that prevent automatic vectorization by compilers are found in loop structures. The ambiguity arises from the complexity of the keywords, operators, data references, and memory operations within the loop bodies. However, by understanding these limitations and by knowing how to interpret diagnost
390. r_ps mm_Move_ss mm_getcsr mm_setcsr 288 all four words The address must be 16 byte aligned Store four values address aligned Store four values address unaligned Store four values in reverse order Set the low word and pass in three high values Return register contents Control Register Corresponding Instruction OVSS MOVSS Shuffling MOVAPS OVUPS MOVAPS Shuffling Composite Composite Composite Composite Composite MOVSS Shuffling OVSS OVAPS MOVUPS OVAPS Shuffling OVSS STMXCSR LDMXCSR Reference Intrinsic Alternate Operation Corresponding Name Name Instruction mm_prefetch mm_stream_pi mm_stream_ps mm_sfence mm_cvtss_f32 m128 _mm_load_ss float const a Loads an SP FP value into the low word and clears the upper three words r0 a rl 0 0 r2 0 0 r3 0 0 m128 _mm_load_psl float const a Boas a single SP FP value copying it into all four words r0 a rl a r2 a r3 a m128 _mm_load_ps float const a fond four SP FP values The address must be 16 byte aligned ro a 0 ri 1 r2 2 r3 3 m128 _mm_loadu_ps float const a Poa four SP FP values The address need not be 16 byte aligned ro ri Y2 r3 a 0 1 2 3 ot Al oo m128 _mm_loadr_ps float const a load four SP FP values in reverse order The address must be 16 byte aligned ro
391. rates an object file for each C or C source file or preprocessed source file Also takes an assembler file and invokes the assembler to generate an object file c99 Enables disables C99 support for C programs complex_limited_range Enables the use of delete basic algebraic expansions of some arithmetic operations involving data of type _Complex This can cause some performance improvements in programs that use _Complex arithmetic but values at the extremes of the exponent range may not compute correctly Default is complex_limited_range create_pch filename Manual creation of precompiled OFF header filename pchi cxxlib gcec GCC root dir Link using C run time OFF libraries provided with gcc This option is ON by default if your gcc version is 3 2 3 3 or 3 4 Use the optional argument GCC root dir to specify the top level location of the gcc binaries and libraries cxxlib icc Link using C run time libraries provided by Intel This option is ON by default if your gcc version is less than 3 2 debug no inline_debug_info Produces enhanced source position information for inlined code debug no variable_locations Produces additional debug information for scalar local variables using a feature of the DWARF object module format known as location lists Intel C Compiler for Linux Systems User s Guide Option Description
392. rating arithmetic. r0 := SignedSaturate(a0 + b0); r1 := SignedSaturate(a1 + b1); ...; r7 := SignedSaturate(a7 + b7). __m128i _mm_adds_epu8(__m128i a, __m128i b): Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using saturating arithmetic. r0 := UnsignedSaturate(a0 + b0); r1 := UnsignedSaturate(a1 + b1); ...; r15 := UnsignedSaturate(a15 + b15). __m128i _mm_adds_epu16(__m128i a, __m128i b): Adds the 8 unsigned 16-bit integers in a to the 8 unsigned 16-bit integers in b using saturating arithmetic. r0 := UnsignedSaturate(a0 + b0); r1 := UnsignedSaturate(a1 + b1); ...; r7 := UnsignedSaturate(a7 + b7). __m128i _mm_avg_epu8(__m128i a, __m128i b): Computes the average of the 16 unsigned 8-bit integers in a and the 16 unsigned 8-bit integers in b and rounds. r0 := (a0 + b0) / 2; r1 := (a1 + b1) / 2; ...; r15 := (a15 + b15) / 2. __m128i _mm_avg_epu16(__m128i a, __m128i b): Computes the average of the 8 unsigned 16-bit integers in a and the 8 unsigned 16-bit integers in b and rounds. r0 := (a0 + b0) / 2; r1 := (a1 + b1) / 2; ...; r7 := (a7 + b7) / 2. __m128i _mm_madd_epi16(__m128i a, __m128i b): Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results. r0 := (a0 * b0) + (a1 * b1); r1 := (a2 * b2) + (a3 * b3); r2 := (a4
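The SignedSaturate, UnsignedSaturate, and rounding-average operations above can be modeled per element in plain C. This is a scalar sketch with hypothetical `model_*` names, not the intrinsics themselves. The rounding in the average model uses (a + b + 1) >> 1, which is how the underlying PAVGB instruction rounds; the text above only says that the result "rounds".

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturating add of one 16-bit element: results outside
   [INT16_MIN, INT16_MAX] clamp to the nearest bound instead of wrapping. */
int16_t model_adds_epi16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Unsigned saturating add of one 8-bit element: clamps at 255. */
uint8_t model_adds_epu8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)sum;
}

/* Rounding average of one unsigned 8-bit element: halves round up. */
uint8_t model_avg_epu8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Self-check contrasting saturation with ordinary wraparound. */
void check_saturation_models(void)
{
    assert(model_adds_epu8(200, 100) == 255);      /* saturates */
    assert((uint8_t)(200 + 100) == 44);            /* plain add wraps */
}
```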
393. ration file to automate often-used command-line entries. You can insert any valid command-line option into the configuration file. The compiler processes options in the configuration file in the order they appear, followed by the command-line options that you specify when you invoke the compiler. Note: Options in the configuration file are applied every time you run the compiler. If you have varying option requirements for different projects, see Response Files. How to Use Configuration Files: The following example illustrates a basic configuration file. After you have written the .cfg file, simply ensure it is in the same directory as the compiler's executable file when you run the compiler. The text following the pound (#) character is recognized as a comment. The configuration file is icc.cfg. Sample configuration file: # Define preprocessor macro MY_PROJECT: -DMY_PROJECT; # Additional directories to be searched for INCLUDE files, before the default: -I project/include. Specifying the Location with ICCCFG: You can use the ICCCFG environment variable to specify the location of your configuration file: ICCCFG=cpp/config/my_options.cfg. Each time you invoke the compiler with icc, my_options.cfg is used as your configuration file. The ICPCCFG environment variable is supported for invoking the compiler with icpc. See Environment Variables
394. rators. Operation: Syntax; Intrinsics. Equality: R = cmpeq(A, B); _mm_cmpeq_pi32, _mm_cmpeq_pi16, _mm_cmpeq_pi8. Inequality: R = cmpneq(A, B); _mm_cmpeq_pi32, _mm_cmpeq_pi16, _mm_cmpeq_pi8, with _mm_andnot_si64. Greater Than: R = cmpgt(A, B); _mm_cmpgt_pi32, _mm_cmpgt_pi16, _mm_cmpgt_pi8. Greater Than or Equal To: R = cmpge(A, B); _mm_cmpgt_pi32, _mm_cmpgt_pi16, _mm_cmpgt_pi8, with _mm_andnot_si64. Less Than: R = cmplt(A, B); _mm_cmpgt_pi32, _mm_cmpgt_pi16, _mm_cmpgt_pi8. Less Than or Equal To: R = cmple(A, B); _mm_cmpgt_pi32, _mm_cmpgt_pi16, _mm_cmpgt_pi8, with _mm_andnot_si64. Comparison operators have the restriction that the operands must be the size and sign as listed in the Compare Operator Overloading table. Compare Operator Overloading (R; Comparison; A, B): for cmpeq and cmpne, R is I32vec2, I16vec4, or I8vec8 and the operands are I[s|u]32vec2, I[s|u]16vec4, or I[s|u]8vec8 respectively; for cmpgt, cmpge, cmplt, and cmple, R is I32vec2, I16vec4, or I8vec8 and the operands are Is32vec2, Is16vec4, or Is8vec8 respectively. Conditional Select Operators: For conditional select operands, the third and fourth operands determine the type returned. Third and fourth operands with same size but different signedness return the nearest common ancestor data type. Conditional Select Syntax Usage: Return the nearest common ancestor data type if third and fourth o
395. rce32.pchi. -pch_dir dirname: Use the -pch_dir dirname option to specify the path (dirname) to the PCH file. You can use this option with -pch, -create_pch filename, and -use_pch filename. Example 4 command line: prompt> icpc -pch -pch_dir pch source32.cpp. Example 4 output: "source32.cpp": creating precompiled header file "pch/source32.pchi". Organizing Source Files: If many of your source files include a common set of header files, place the common headers first, followed by the #pragma hdrstop directive. This pragma instructs the compiler to stop generating PCH files. For example, if source1.cpp, source2.cpp, and source3.cpp all include common.h, then place #pragma hdrstop after common.h to optimize compile times: #include "common.h"; #pragma hdrstop; #include "noncommon.h". When you compile using the -pch option (prompt> icpc -pch source1.cpp source2.cpp source3.cpp), the compiler will generate one PCH file for all three source files: "source1.cpp": creating precompiled header file "source1.pchi"; "source2.cpp": using precompiled header file "source1.pchi"; "source3.cpp": using precompiled header file "source1.pchi". If you don't use #pragma hdrstop, a different PCH file is created for each source file if different headers follow common.h, and the subsequent compile times will be longer. #pragma hdrstop has no effect on compilation
396. re features while maintaining object code compatibility with previous generations of Intel Pentium processors (for IA-32-based systems only). Product Web Site and Support: For the latest information about the Intel C++ Compiler, visit http://www.intel.com/software/products/compilers/clin. For specific details on the Itanium architecture, visit the web site at http://www.intel.com/software/products/browse/itanium.htm. System Requirements. IA-32 Processor System Requirements: a computer based on a Pentium processor or subsequent IA-32-based processor (Pentium 4 processor recommended); 128 MB of RAM (256 MB recommended); 100 MB of disk space. Itanium Processor System Requirements: a computer with an Itanium processor; 256 MB of RAM; 100 MB of disk space. Software Requirements: See the Release Notes for a complete list of system requirements. FLEXlm Electronic Licensing: The Intel C++ Compiler uses Macrovision's FLEXlm licensing technology. The compiler requires a valid license file in the licenses directory in the installation path. The default directory is /opt/intel_cc_80/licenses. The license files have a .lic file extension. If you require a counted license, see Using the Intel License Manager for FLEXlm (flex_ug.pdf). Related Publications: The following documents provide additional information releva
397. re for equality. r0 := (a0 == b0) ? 0xffffffff : 0x0; r1 := a1; r2 := a2; r3 := a3. __m128 _mm_cmpeq_ps(__m128 a, __m128 b): Compare for equality. r0 := (a0 == b0) ? 0xffffffff : 0x0; r1 := (a1 == b1) ? 0xffffffff : 0x0; r2 := (a2 == b2) ? 0xffffffff : 0x0; r3 := (a3 == b3) ? 0xffffffff : 0x0. __m128 _mm_cmplt_ss(__m128 a, __m128 b): Compare for less than. r0 := (a0 < b0) ? 0xffffffff : 0x0; r1 := a1; r2 := a2; r3 := a3. __m128 _mm_cmplt_ps(__m128 a, __m128 b): Compare for less than. r0 := (a0 < b0) ? 0xffffffff : 0x0; r1 := (a1 < b1) ? 0xffffffff : 0x0; r2 := (a2 < b2) ? 0xffffffff : 0x0; r3 := (a3 < b3) ? 0xffffffff : 0x0. [Table residue, Corresponding Instruction column: CMPUNORDPS, followed by COMISS and UCOMISS entries.] __m128 _mm_cmple_ss(__m128 a, __m128 b): Compare for less than or equal. r0 := (a0 <= b0) ? 0xffffffff : 0x0; r1 := a1; r2 := a2; r3 := a3. __m128 _mm_cmple_ps(__m128 a, __m128 b): Compare for less than or equal. r0 := (a0 <= b0) ? 0xffffffff : 0x0; r1 := (a1 <= b1) ? 0xffffffff : 0x0; r2 := (a2 <= b2) ? 0xffffffff : 0x0; r3 := (a3 <= b3) ? 0xffffffff : 0x0. __m128 _mm_cmpgt_ss(__m128 a, __m128 b): Compare for greater than. r0 := (a0 > b0) ? 0xffffffff : 0x0; r1 := a1; r2 := a2; r3 := a3. __m128 _mm_cmpgt_ps(__m128 a, __m128 b): Compare for greater than. r0 := (a0 > b0) ? 0xffffffff : 0x0; r1 := (a1 > b1) ? 0xffffffff : 0x0; r2 := (a2 > b2) ? 0xffffffff : 0x0; r3 := (a3 > b3) ? 0xfffff
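The all-ones/all-zeros result described in entry 397 is what makes these compares useful for branch-free selection. The following sketch models one 32-bit lane in portable C; the names cmplt_lane, select_bits, and compare_mask_demo are hypothetical, and the select helper is an illustration of how such a mask is commonly consumed, not something taken from the guide.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of a packed "less than" compare:
   all ones when the comparison is true, zero when it is false. */
uint32_t cmplt_lane(float a, float b)
{
    return (a < b) ? 0xffffffffu : 0x0u;
}

/* Branch-free select: picks if_true where the mask is all ones and
   if_false where the mask is zero. */
uint32_t select_bits(uint32_t mask, uint32_t if_true, uint32_t if_false)
{
    return (mask & if_true) | (~mask & if_false);
}

/* Small usage check: the mask steers the select without an if statement. */
void compare_mask_demo(void)
{
    uint32_t m = cmplt_lane(1.0f, 2.0f);
    assert(m == 0xffffffffu);
    assert(select_bits(m, 0x11111111u, 0x22222222u) == 0x11111111u);
}
```

The scalar (_ss) forms apply this rule to element 0 only and pass the upper three elements of a through unchanged.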
398. rectories: cpp/export/helloworld; cpp/export/helloworld/Release. [Figure: Export dialog (File system: Export resources to the local file system) showing the helloworld project with hello.o, helloworld, makefile, subdir.dep, and subdir.mk; Select Types, Select All, and Deselect All buttons; the To directory field; and the options Overwrite existing files without warning, Create directory structure for files, and Create only selected directories.] 6. Click Finish to complete the export. Running make: In a terminal window, change to the cpp/export/helloworld/Release directory, then run make by typing: make clean all. You should see the following output: rm -rf hello.o helloworld; icc -O2 -w1 -Ob1 -tpp7 -unroll -par_threshold75 -wn100 -Zp16 -c -o hello.o hello.c; icc -o helloworld hello.o. Compilation Options: This section describes the Intel C++ Compiler options that determine the compilation process and output. By default, the compiler converts source code directly to an executable file. Appropriate options allow you to control the process by directing the compiler to produce: preprocessed files (.i) with the -P option; assembly files (.s) with the -S option; object files (.o) with the -c option; executable files (a.out by default). You can also name the output file or designate a set of options that are passed to the link
399. red and a 128-bit mask is returned. For the scalar form, the lower DP FP values of a and b are compared and a 64-bit mask is returned; the upper DP FP value is passed through from a. The mask is set to 0xffffffffffffffff for each element where the comparison is true and 0x0 where the comparison is false. The r following the instruction name indicates that the operands to the instruction are reversed in the actual implementation. The comparison intrinsics for the Streaming SIMD Extensions 2 (SSE2) are listed in the following table, followed by detailed descriptions. The prototypes for SSE2 intrinsics are in the emmintrin.h header file. Intrinsic Name; Corresponding Instruction; Compare For: _mm_cmpeq_pd; CMPEQPD; Equality. _mm_cmplt_pd; CMPLTPD; Less Than. _mm_cmple_pd; CMPLEPD; Less Than or Equal. _mm_cmpgt_pd; CMPLTPDr; Greater Than. _mm_cmpge_pd; CMPLEPDr; Greater Than or Equal. _mm_cmpord_pd; CMPORDPD; Ordered. _mm_cmpunord_pd; CMPUNORDPD; Unordered. Rows whose intrinsic names did not survive: Inequality; Not Less Than; Not Less Than or Equal. _mm_cmpngt_pd; CMPNLTPDr; Not Greater Than. Rows whose intrinsic names did not survive: Not Greater Than or Equal; Equality; Less Than; Less Than or Equal; Greater Than. _mm_cmpge_sd; CMPLESDr; Greater Than or Equal. _mm_cmpord_sd; CMPORDSD; Ordered. _mm_cmpunord_sd; CMPUNORDSD; Unordered. Intrinsic Corresponding Compar
400. returns the complex sine of z. Calling interface: double _Complex csin(double _Complex z); long double _Complex csinl(long double _Complex z); float _Complex csinf(float _Complex z). CSINH. Description: The csinh function returns the complex hyperbolic sine of z. Calling interface: double _Complex csinh(double _Complex z); long double _Complex csinhl(long double _Complex z); float _Complex csinhf(float _Complex z). CSQRT. Description: The csqrt function returns the complex square root of z. Calling interface: double _Complex csqrt(double _Complex z); long double _Complex csqrtl(long double _Complex z); float _Complex csqrtf(float _Complex z). CTAN. Description: The ctan function returns the complex tangent of z. Calling interface: double _Complex ctan(double _Complex z); long double _Complex ctanl(long double _Complex z); float _Complex ctanf(float _Complex z). CTANH. Description: The ctanh function returns the complex hyperbolic tangent of z. Calling interface: double _Complex ctanh(double _Complex z); long double _Complex ctanhl(long double _Complex z); float _Complex ctanhf(float _Complex z). C99 Macros: The Intel Math library and mathimf.h header file support the following C99 macros: int fpclassify(x); int isfinite(x); int isgreater(x, y); int isgreaterequal(x, y); int isinf(x); int isl
401. rgument Maps to the cmpxchg2 rel instruction with appropriate setup Same as the previous intrinsic but using acquire semantic Atomically increment by one the value specified by its argument Maps to the fetchadd4 instruction Atomically decrement by one the value specified by its argument Maps to the fetchadd4 instruction Do an exchange operation atomically Maps to the xchg4 instruction Intrinsic int _InterlockedCompareExchange volatile int Destination int Exchange int Comparand int _InterlockedExchangeAdd volatile int x addend int void increment int _InterlockedAdd volatile int addend int increment _InterlockedCompareExchangePointer void volatile Destination void Exchange void Comparand unsigned _ i nt64 _Interlocked int Target ExchangeU volatile unsigned unsigned __int64 value unsigned _ int64 _InterlockedCompareExchange_rel volatile unsigned int int64 Comparand Exchange Destination unsigned unsigned __int64 unsigned __int64 _InterlockedCompareExchange_acq volatile unsigned int int64 Comparand Destination unsigned Exchange unsigned _ int64 void _ReleaseSpinLock volatile int x int64 _InterlockedIncrement64 volatile int64 int64 addend InterlockedDecrement 64 volatile int64 addend int64 InterlockedExchange64 volatile
402. rotected, hidden, internal. The value of the visibility declaration attribute overrides the default set by the -fvisibility, -fpic, or -fno-common attributes. If you have a number of symbols for which you wish to specify the same visibility attribute, you can set the visibility using one of the five command-line options -fvisibility-external=file, -fvisibility-default=file, -fvisibility-protected=file, -fvisibility-hidden=file, or -fvisibility-internal=file, where file is the pathname of a file containing a list of the symbol names whose visibility you wish to set. The symbol names in the file are separated by white space (blanks, TAB characters, or newlines). For example, the command-line option -fvisibility-protected=prot.txt, where file prot.txt contains the symbol names, sets protected visibility for symbols a, b, c, d, and e. This has the same effect as __attribute__ ((visibility ("protected"))) on the declaration for each of the symbols. Note that these two ways to explicitly set visibility are mutually exclusive: you may use __attribute__ ((visibility ())) on the declaration, or specify the symbol name in a file, but not both. You can set the default visibility for symbols using one of the command-line options -fvisibility=external, -fvisibility=default, -fvisibility=protected, -fvisibility=hidden, or -fvisibility=internal. This option sets the visibility for symbols not specified in a visibi
403. routines are in user name space. The omp.h and omp_lib.h header files are provided in the INCLUDE directory of your compiler installation. There are definitions for two different locks, omp_lock_kind and omp_nest_lock_kind, which are used by the functions in the table that follows. Execution Environment Routines (Function: Description). omp_set_num_threads(nthreads): Sets the number of threads to use for subsequent parallel regions. omp_get_num_threads(): Returns the number of threads that are being used in the current parallel region. omp_get_max_threads(): Returns the maximum number of threads that are available for parallel execution. omp_get_thread_num(): Returns the unique thread number of the thread currently executing this section of code. omp_get_num_procs(): Returns the number of processors available to the program. Functions also listed in this table: omp_get_dynamic(), omp_set_nested(nested), omp_get_nested(). Lock Routines (functions listed): omp_init_lock(lock), omp_destroy_lock(lock), omp_set_lock(lock), omp_unset_lock(lock), omp_test_lock(lock), omp_init_nest_lock(lock), omp_destroy_nest_lock(lock). omp_in_parallel(): Returns TRUE if called within the dynamic extent of a parallel region executing in parallel; otherwise returns FALSE. omp_set_dynamic(dynamic_threads): Enables or dis
404. s. Intrinsic: Description. void _fsetc(int amask, int omask): Sets the control bits of FPSR.sf0. Maps to the fsetc.sf0 r, r instruction. There is no corresponding instruction to read the control bits; use _mm_getfpsr(). void _fclrf(void): Clears the floating-point status flags (the 6-bit flags of FPSR.sf0). Maps to the fclrf.sf0 instruction. __int64 _m64_dep_mr(__int64 r, __int64 s, const int pos, const int len): The right-justified 64-bit value r is deposited into the value in s at an arbitrary bit position and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len. __int64 _m64_dep_mi(const int v, __int64 s, const int p, const int len): The sign-extended value v (either all 1s or all 0s) is deposited into the value in s at an arbitrary bit position and the result is returned. The deposited bit field begins at bit position p and extends to the left (toward the most significant bit) the number of bits specified by len. __int64 _m64_dep_zr(__int64 s, const int pos, const int len): The right-justified 64-bit value s is deposited into a 64-bit field of all zeros at an arbitrary bit position and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len. __int64 _m64_dep_zi(const int v
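The deposit operations in entry 404 are easier to follow with a portable model of the bit manipulation. The sketch below is not the Itanium intrinsic; dep_mr_model and dep_zr_model are hypothetical names, and the accepted range (len at least 1, pos + len at most 64) is an assumption of this model rather than a documented limit of the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a "deposit, merge form": the low len bits of r replace the
   len-bit field of s that starts at bit position pos. */
uint64_t dep_mr_model(uint64_t r, uint64_t s, int pos, int len)
{
    assert(pos >= 0 && len >= 1 && pos + len <= 64);
    /* len ones, right justified; a 64-bit shift would be undefined in C. */
    uint64_t field = (len == 64) ? ~UINT64_C(0)
                                 : ((UINT64_C(1) << len) - UINT64_C(1));
    uint64_t mask = field << pos;
    return (s & ~mask) | ((r << pos) & mask);
}

/* Model of a "deposit, zero form": same field placement, but every bit
   outside the field is zero. */
uint64_t dep_zr_model(uint64_t s, int pos, int len)
{
    return dep_mr_model(s, UINT64_C(0), pos, len);
}
```

For example, depositing 0xAB with pos 8 and len 8 into an all-ones value leaves every bit set except that bits 8 through 15 become 0xAB.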
405. s e B Intel Pentium M and compatible Intel processors e P Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 Only the xW and xP options are available on Intel EM64T Xlinker val Pass val directly to the linker for processing Z2p 1 2 4 8 16 Packs structures on 1 2 4 8 or 16 byte boundaries Compiler Options Cross Reference Linux Windows Description QA Remove all predefined macros Aname val QAname val Create an assertion name having value val Linux Default Enable disable assumption of ANSI conformance ansi Za 30 Compiler Options Quick Reference Linux Windows Description Linux Default ax K W N BIP Qax K WINIBIP ne Dname value Dname value Dname value value E E fp Oy g Zi Generates specialized OFF code for processor specific codes K W N B and P while also generating generic IA 32 code K Intel Pentium III and compatible Intel processors W Intel Pentium 4 and compatible Intel processors N Intel Pentium 4 and compatible Intel processors B Intel Pentium M and compatible Intel processors P Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 Pe pe ot strip comments OFF Compile to object OFF o only do not link Dname value Dname value Define macro OFF Preprocess to stdout OFF Use EBP based stack OFF
406. s. For example: prompt> profmerge -prof_dir <p1> -src_old <p2> -src_new <p3>, where <p1> is the full path to the dynamic information file (.dpi), <p2> is the old full path to source files, and <p3> is the new full path to source files. This command will read the pgopti.dpi file. For each function represented in the pgopti.dpi file whose source path begins with the <p2> prefix, profmerge replaces that prefix with <p3>. The pgopti.dpi file is updated with the new source path information. You can execute profmerge more than once on a given pgopti.dpi file. You may need to do this if the source files are located in multiple directories. For example: prompt> profmerge -prof_dir -src_old src/prog_1 -src_new src/prog_2; prompt> profmerge -prof_dir -src_old proj_1 -src_new proj_2. In the values specified for -src_old and -src_new, uppercase and lowercase characters are treated as identical. Likewise, forward slash (/) and backward slash (\) characters are treated as identical. Because the source relocation feature of profmerge modifies the pgopti.dpi file, you may wish to make a backup copy of the file prior to performing the source relocation. PGO API Support Overview: Profile Information Generation Support lets you control the generation of profile information during the instrumented execution phase of profile-guided optimizations. Normally
407. Example: Selectively collect profile information for the portion of the application involved in processing input data. input_data = get_input_data(); while (input_data) { _PGOPTI_Prof_Reset(); process_data(input_data); _PGOPTI_Prof_Dump(); input_data = get_input_data(); } Resetting the Dynamic Profile Counters: void _PGOPTI_Prof_Reset(void); Description: This function resets the dynamic profile counters. Recommended Usage: Use this function to clear the profile counters prior to collecting profile information on a section of the instrumented application. See the example under _PGOPTI_Prof_Dump(). Dumping and Resetting Profile Information: void _PGOPTI_Prof_Dump_And_Reset(void); Description: This function may be called more than once. Each call will dump the profile information to a new .dyn file. The dynamic profile counters are then reset, and execution of the instrumented application continues. Recommended Usage: Periodic calls to this function allow a non-terminating application to generate one or more profile information files. These files are merged during the feedback phase of profile-guided optimization. The direct use of this function allows your application to control precisely when the profile information is generated. Interval Profile Dumping: void _PGOPTI_Set_Interval_Prof_Dump(int interval); Description: This function ac
408. s a 16-byte boundary, which results in an additional memory access causing a six- to twelve-cycle stall. You can avoid the stalls if you know that the data is aligned and you specify to assume alignment. [Figure: Misaligned Data Crossing 16-Byte Boundary.] For example, if you know that elements a[0] and b[0] are aligned on a 16-byte boundary, then the following loop can be vectorized with the alignment option on (#pragma vector aligned). Alignment of Pointers is Known: float *a, *b; int i; for (int i = 0; i < 10; i++) (loop body not preserved in this excerpt). After vectorization, the loop is executed as shown here. [Figure: Vector and Scalar Clean-up Iterations: 2 vector iterations (i = 0 to 3 and i = 4 to 7), then 2 clean-up iterations in scalar mode (i = 8 and i = 9).] Both the vector iterations a[0:3] = b[0:3] and a[4:7] = b[4:7] can be implemented with aligned moves if both the elements a[0] and b[0] (or, likewise, a[4] and b[4]) are 16-byte aligned. If you specify the vectorizer with incorrect alignment options, the compiler will generate code with unexpected behavior. Specifically, using aligned moves on unaligned data will result in an illegal instruction exception. Data Alignment Examples: This example contains a loop that vectorizes, but only with unaligned memory instructions. The compiler can align the local arrays, but because lb is not known at compile time, the correct
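The split into vector iterations and scalar clean-up iterations described in entry 408 can be sketched in plain C. This is a model of the iteration schedule only, not compiler output: copy_strip_mined and VLEN are hypothetical names, the copy statement is an assumed loop body, and the inner loop merely stands in for a single four-float vector move.

```c
#include <assert.h>
#include <stddef.h>

enum { VLEN = 4 };  /* four floats fit in one 16-byte vector */

/* Copies n floats in the order a vectorized loop would: whole groups of
   VLEN elements first, then the leftover elements one at a time.
   Returns the number of scalar clean-up iterations that were needed. */
size_t copy_strip_mined(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    size_t vec_end = n - (n % VLEN);       /* last index covered by vectors */
    for (; i < vec_end; i += VLEN) {
        for (size_t k = 0; k < VLEN; k++)  /* stands in for one vector move */
            dst[i + k] = src[i + k];
    }
    size_t cleanup = 0;
    for (; i < n; i++, cleanup++)          /* scalar clean-up iterations */
        dst[i] = src[i];
    return cleanup;
}
```

With n equal to 10 this performs two vector groups (elements 0 to 3 and 4 to 7) and two clean-up iterations (elements 8 and 9), matching the figure described in the entry.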
409. s that do not use these PCH options. Linking: This topic describes the options that let you control and customize the linking with tools and libraries and define the output of the ld linker. See the ld man page for more information on the linker. Options covered: -Ldirectory, -Qoption,tool,list, -shared, -shared-libcxa, -i_dynamic, -static, -static-libcxa, -Bstatic. Option: Description. -Ldirectory: Instructs the linker to search directory for libraries. -Qoption,tool,list: Passes an argument list to another program in the compilation sequence, such as the assembler or linker. -shared: Instructs the compiler to build a Dynamic Shared Object (DSO) instead of an executable. -shared-libcxa: Has the opposite effect of -static-libcxa. When it is used, the Intel-provided libcxa C++ library is linked in dynamically, allowing the user to override the static linking behavior when the -static option is used. Note: By default, all C++ standard and support libraries are linked dynamically. -i_dynamic: Specifies that all Intel-provided libraries should be linked dynamically. -static: Causes the executable to link all libraries statically, as opposed to dynamically. When -static is not used, /lib/ld-linux.so.2 is linked in and all other libs are linked dynamically. When -static is used, /lib/ld-linux.so.2 is not linked in and all other libs are linked statically. -static-libcxa: By default, the Intel-provided libcxa C++ library is linked in dynamically. Use -static-libcxa on the command line to link lib
410. s to sdtout There is no output for source files free of syntax errors Executable file out files Controlling Compilation Flow Option C Kpic KPIC lname nobss_init sox Z2p 1 2 4 8 16 Description Stops the compilation process after an object file has been generated The compiler generates an object file for each C or C source file or preprocessed source file Also takes an assembler file and invokes the assembler to generate an object file Generate position independent code Link with a library indicated in name Places variables that are initialized with zeroes in the DATA section Stops the compilation process after C or C source files have been preprocessed and writes the results to files named according to the compiler s default file naming conventions Generates assemblable file only with s suffix then stops the compilation Enables disables the saving of compiler options and version information in the executable file Default is sox Packs structures on 1 2 4 8 or 16 byte boundaries Controlling Compilation Output Option oname iit Description Produces an assembly file with the specified file name or the default file name if name is not specified Generates assemblable file only with s suffix then stops the compilation 77 Intel C Compiler for Linux Systems User s Guide Specifying Alternate Tools and Paths You can direct the c
411. shed specifications. Current characterized software defects are available on request. Intel SpeedStep, Intel Thread Checker, Celeron, Dialogic, i386, i486, iCOMP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetStructure, Intel Xeon, Intel XScale, Itanium, MMX, MMX logo, Pentium, Pentium II Xeon, Pentium III Xeon, Pentium M, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. Copyright Intel Corporation 1996-2004. Welcome to the Intel C++ Compiler. Before you use the compiler, see System Requirements. Most Linux distributions include the GNU C library, assembler, linker, and others. The Intel C++ Compiler includes the Dinkumware C++ library. See Libraries Overview. Please look at the individual sections within each main section of this User's Guide to gain an overview of the topics presented. For the latest information, visit the Intel Web site: http://www.intel.com/software/products/compilers/clin. See Getting Started for basic information on running the compiler. What's New in This Release: New features for this version of the Intel C++ Compiler include: new Eclipse IDE integration; new compiler o
412. signed char Destination unsigned int64 Exchange unsigned _ int64 Comparand unsigned _ int64 _InterlockedCompareExchange8_acq volatile unsigned char Destination unsigned __int64 Exchange unsigned __int64 Comparand unsigned __int64 _InterlockedExchangel6 volatile unsigned short Target unsigned __int64 value unsigned _ int64 InterlockedCompareExchangel6 rel volatile unsigned short Destination unsigned int64 Exchange unsigned _ int64 Comparand unsigned _ int64 InterlockedCompareExchangel6 acq volatile unsigned short Destination unsigned int64 Exchange unsigned __int64 Comparand int _InterlockedIncrement volatile int addend int _InterlockedDecrement volatile int addend int _InterlockedExchange volatile int Target long value 342 Description Map to the xchg 1 instruction Atomically write the least significant byte of its 2nd argument to address specified by its 1st argument Compare and exchange atomically the least significant byte at the address specified by its 1st argument Maps to the cmpxchgl1 rel instruction with appropriate setup Same as the previous intrinsic but using acquire semantic Map to the xchg2 instruction Atomically write the least significant word of its 2nd argument to address specified by its 1st argument Compare and exchange atomically the least significant word at the address specified by its 1st a
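The compare-and-exchange behavior described in entry 412 can be sketched with standard C11 atomics in place of the Itanium-specific intrinsics. This is a swapped-in technique for illustration: compare_exchange_model and exchange_model are hypothetical names, and returning the previous value is an assumption about the return convention, since the excerpt does not state what the intrinsics return.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* If *dest equals comparand, atomically store exchange into *dest.
   Returns the value *dest held before the call (assumed convention). */
int compare_exchange_model(atomic_int *dest, int exchange, int comparand)
{
    assert(dest != NULL);
    int expected = comparand;
    /* On failure the call writes the current value into expected, so in
       both cases expected ends up holding the previous value of *dest. */
    atomic_compare_exchange_strong_explicit(dest, &expected, exchange,
                                            memory_order_acq_rel,
                                            memory_order_acquire);
    return expected;
}

/* Atomically store value into *target and return the previous value. */
int exchange_model(atomic_int *target, int value)
{
    assert(target != NULL);
    return atomic_exchange(target, value);
}
```

The _rel and _acq suffixes in the entry correspond to release and acquire memory ordering; the sketch uses a combined acquire-release ordering on success for simplicity.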
413. si128_pd(__m128i in). Streaming SIMD Extensions 3: The Intel C++ intrinsics listed in this section are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). They will not function correctly on other IA-32 processors. New SSE3 intrinsics include: Floating-point Vector Intrinsics; Integer Vector Intrinsics; Miscellaneous Intrinsics; Macro Functions. The prototypes for these intrinsics are in the pmmintrin.h header file. Note: You can also use the single ia32intrin.h header file for any IA-32 intrinsics. Floating-point Vector Intrinsics for Streaming SIMD Extensions 3: The floating-point intrinsics listed here are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). The prototypes for these intrinsics are in the pmmintrin.h header file. Single-precision Floating-point Vector Intrinsics: extern __m128 _mm_addsub_ps(__m128 a, __m128 b): Subtracts even vector elements while adding odd vector elements. r0 := a0 - b0; r1 := a1 + b1; r2 := a2 - b2; r3 := a3 + b3. extern __m128 _mm_hadd_ps(__m128 a, __m128 b): Adds adjacent vector elements. r0 := a0 + a1; r1 := a2 + a3; r2 := b0 + b1; r3 := b2 + b3. extern __m128 _mm_hsub_ps(__m128 a, __m128 b): Subtracts adjacent vector elements. r0 := a0 - a1; r1 := a2 - a3; r2 := b0 - b1; r3 := b2 - b3. extern __m128 _mm_move
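The element pairings used by the three SSE3 operations in entry 413 are easy to mix up, so a scalar model may help. The sketch below uses a hypothetical vec4 struct as a stand-in for __m128 and hypothetical function names ending in _model; it reproduces the arithmetic only and does not use the real intrinsics or pmmintrin.h.

```c
#include <assert.h>

/* Hypothetical stand-in for __m128: element 0 is the lowest element. */
typedef struct { float f[4]; } vec4;

/* Subtract in even positions, add in odd positions. */
vec4 addsub_ps_model(vec4 a, vec4 b)
{
    vec4 r = {{ a.f[0] - b.f[0], a.f[1] + b.f[1],
                a.f[2] - b.f[2], a.f[3] + b.f[3] }};
    return r;
}

/* Horizontal add: pairs from a fill the low half, pairs from b the high half. */
vec4 hadd_ps_model(vec4 a, vec4 b)
{
    vec4 r = {{ a.f[0] + a.f[1], a.f[2] + a.f[3],
                b.f[0] + b.f[1], b.f[2] + b.f[3] }};
    return r;
}

/* Horizontal subtract: same pairing as the horizontal add. */
vec4 hsub_ps_model(vec4 a, vec4 b)
{
    vec4 r = {{ a.f[0] - a.f[1], a.f[2] - a.f[3],
                b.f[0] - b.f[1], b.f[2] - b.f[3] }};
    return r;
}

/* Small usage check with exactly representable values. */
void sse3_model_demo(void)
{
    vec4 a = {{ 1.0f, 2.0f, 3.0f, 4.0f }};
    vec4 b = {{ 10.0f, 20.0f, 30.0f, 40.0f }};
    vec4 h = hadd_ps_model(a, b);
    assert(h.f[0] == 3.0f && h.f[3] == 70.0f);
}
```

Note that the horizontal forms combine neighbors within one operand, unlike the ordinary packed add, which combines matching elements of a and b.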
414. sions caused by a change set. Selecting and prioritizing the tests to achieve a certain level of code coverage in minimal time, based on the data of the tests' execution time. Command-line Syntax: The syntax for this tool is as follows: tselect -dpi_list file, where -dpi_list is a required tool option that sets the path to the DPI list file that contains the list of the .dpi files of the tests you need to prioritize. Tool Options: The tool uses options that are listed in the table that follows. Option: Description. -help: Prints all the options of the test prioritization tool. -spi file: Sets the path name of the static profile information file (.spi). Default is pgopti.spi. -dpi_list file: Sets the path name of the file that contains the names of the dynamic profile information (.dpi) files. Each line of the file should contain one .dpi name, optionally followed by its execution time. The name must uniquely identify the test. -prof_dpi file: Sets the path name of the output report file. -comp: Sets the filename that contains the list of files of interest. -cutoff value: Terminates when the cumulative block coverage reaches value% of the pre-computed total coverage. value must be greater than 0.0 (for example, 99.00). It may be set to 100. -nototal: Does not pre-compute the total coverage. -mintime: Minimizes testing execution time. The execution time of each test
415. six exception mask bits are always affected Bits not set explicitly are cleared fd _ _MM_MASK_IN 296 Macro Definitions _MM MASK _OVE _MM_MASK_UN EXACT Reference The following example masks the overflow and underflow exceptions and unmasks all other exceptions Exception Mask with __MM_MASK_OVERFLOW and_MM_MASK_UNDERFLOW EPTION_MASK MM_MASK_OVERFLOW _MM MASK _UNDERFLOW Macro Arguments SET_ROUNDING_MOD _MM_ROUND_N _ROUNDING_MOD _MM_ROUND_DOWN Macro Definition _MM_ROUND_UP Write to and read from bits thirteen and fourteen of the control register _MM_ROUND_TOWARD_Z The following example tests the rounding mode for round toward zero Rounding Mode with_MM_ROUND_TOWARD_ZERO if _MM_GET_ROUNDING_MODE _MM_ROUND_TOWARD_Z Rounding mode is round toward zero Flush to Zero Mode Macro Arguments MM_SET_FLUSH_ZERO_MODE x MM_FLUSH_ZERO_ON MM_GET_FLUSH_ZERO_MODE MM_FLUSH_ZERO_OFF Macro Definition Write to and read from bit fifteen of the control register The following example disables flush to zero mode Flush to Zero Mode with _MM_FLUSH_ZERO_OFF ET_FLUSH_ZERO_MODE _MM_FLUSH_ZERO_OFF Macro Function for Matrix Transposition The Streaming SIMD Extensions SSE also provide the following macro function to transpose a 4 by
416. size 02 Same as 01 on JA 32 Same as ON O on Itanium based systems 164 03 Enable 02 plus more OFF aggressive optimizations that may increase the compilation time Impact on performance is application dependent some applications may not see a performance improvement 0bn Controls the compiler s inline OFF expansion The amount of inline expansion performed varies with the value of n as follows e 0 Disables inlining e 1 Enables default inlining of functions declared with the __ inline keyword Also enables inlining according to the C language e 2 Enables inlining of any function However the compiler decides which functions to inline Enables interprocedural optimizations and has the same effect as ip ofile Name output file OFF 21 Intel C Compiler for Linux Systems User s Guide Option Description Default Enables the parallelizer to OFF generate multi threaded code based on the OpenMP directives The openmp option works with both 00 and any optimization level of 01 02 and 03 openmp openmp_profile The openmp_profile OFF option enables analysis of OpenMP applications with Thread Profiler which is required to use this option openmp_report 0 1 2 Controls the OpenMP OFF parallelizer s diagnostic levels openmp_stubs Enables OpenMP programs to OFF compile in sequential mode The OpenMP directives are ignored and a stub OpenMP library is
417. slation units; one is not included in the other. That allows the two functions trace() to coexist (with internal linkage). Usage: prompt> icpc -export -export_dir usr2/export -c file1.cpp; prompt> icpc -export -export_dir usr2/export -c file2.cpp; prompt> icpc -export -export_dir usr2/export file1.o file2.o. Template Instantiation: The Intel C++ Compiler supports extern template, which lets you specify that a template in a specific translation unit will not be instantiated because it will be instantiated in a different translation unit or different library. The compiler now includes additional support for: inline template, which instantiates the compiler support data for the class (i.e., the vtable) without instantiating its members; static template, which instantiates the static data members of the template, but not the virtual tables or member functions. You can now use the following options to gain more control over the point of template instantiation. Option: Description. -fno-implicit-templates: Never emit code for non-inline templates which are instantiated implicitly (i.e., by use); only emit code for explicit instantiations. -fno-implicit-inline-templates: Do not emit code for implicit instantiations of inline templates either. The default is to handle inlines differently so that compilations with and without optimization will need the same set
418. stems
Key to the table entries:
• A = Expected to give significant performance gain over non-intrinsic-based code equivalent.
• B = Non-intrinsic-based source code would be better; the intrinsic's implementation may map directly to native instructions, but they offer no significant performance gain.
• C = Requires contorted implementation for particular microarchitecture. Will result in very poor performance if used.
Intrinsic | Across All IA | MMX(TM) Technology | Streaming SIMD Extensions | Streaming SIMD Extensions 2 | Itanium Architecture
  _mm_add_sd   | N/A | N/A | N/A | A | N/A
  _mm_add_pd   | N/A | N/A | N/A | A | N/A
  _mm_sub_sd   | N/A | N/A | N/A | A | N/A
  _mm_sub_pd   | N/A | N/A | N/A | A | N/A
  _mm_mul_sd   | N/A | N/A | N/A | A | N/A
  _mm_mul_pd   | N/A | N/A | N/A | A | N/A
  _mm_sqrt_sd  | N/A | N/A | N/A | A | N/A
  _mm_sqrt_pd  | N/A | N/A | N/A | A | N/A
  _mm_div_sd   | N/A | N/A | N/A | A | N/A
  _mm_div_pd   | N/A | N/A | N/A | A | N/A
  _mm_min_sd   | N/A | N/A | N/A | A | N/A
  _mm_min_pd   | N/A | N/A | N/A | A | N/A
  _mm_max_sd   | N/A | N/A | N/A | A | N/A
Further intrinsic names in the table: _mm_max_pd, _mm_and_pd, _mm_andnot_pd, _mm_or_pd, _mm_xor_pd, _mm_cmpeq_sd, _mm_cmpeq_pd, _mm_cmplt_sd, _mm_cmplt_pd, _mm_cmple_sd, _mm_cmple_pd, _mm_cmpgt_sd, _mm_cmpgt_pd, _mm_cmpge_sd, _mm_cmpge_pd, _mm_cmpneq_sd, _mm_cmpneq_pd, _mm_cmpnlt_sd, _mm_cmpnlt_pd, _mm_cmpnle_sd, _mm_cmpnle_pd, _mm_cmpngt_sd, _mm_cmpngt_pd, _mm_cmpnge_sd, _mm_cmpnge_pd, _mm_cmpord_pd, _mm_cmpord_sd, _mm_cmpunord_pd, _mm_cmpunord_sd, _mm_comieq_sd, _mm_comilt_sd
  N/A N/A N/A N/A N/A N/A N/A N/A N/A N
419. strcmp(const char *, const char *); char *strcpy(char *s, const char *ct)
Description: Appends to a string. Returns s. | Compares two strings. Returns <0 if cs<ct, 0 if cs=ct, or >0 if cs>ct. | Copies a string. Returns s. | Returns the length of string cs. | Compares two strings, but only specified number of characters. | Copies a string, but only specified number of characters.
The intrinsic functions listed here are common to IA-32 and the Itanium architecture.
Intrinsic | Description
  _abnormal_termination(void)  Can be invoked only by termination handlers. Returns TRUE if the termination handler is invoked as a result of a premature exit of the corresponding try-finally region.
  void *_alloca(int)  Allocates the buffers.
  extern int _bit_scan_forward(int x)  Returns the bit index of the least significant set bit of x. If x is 0, the result is undefined.
Intrinsic: extern int _bit_scan_reverse(int); extern int _bswap(int); _exception_code(void); _exception_info(void); void _enable(void); void _disable(void); int _in_byte(int); int _in_dword(int); int _in_word(int); int _inp(int); int _inpd(int); int _inpw(int); int _out_byte(int, int); int _out_dword(int, int); int _out_word(int, int); int _outp(int, int); int _outpd(int, int); int _outpw(int, int); extern int _popcnt32(int x)
Description: Returns th
420. struction reduces the penalty of exiting from the spin-loop.
Example of loop with the PAUSE instruction:
  spin_loop: pause
             cmp eax, A
             jne spin_loop
In this example, the program spins until memory location A matches the value in register eax. The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only after the attempt to get a lock has failed.
  get_lock:  mov eax, 1
             xchg eax, A      ; Try to get lock
             cmp eax, 0       ; Test if successful
             jne spin_loop
  critical_section:
             <critical_section code>
             mov A, 0         ; Release lock
             jmp continue
  spin_loop: pause            ; spin-loop hint
             cmp 0, A         ; check lock availability
             jne spin_loop
             jmp get_lock
  continue:  <other code>
Note that the first branch is predicted to fall through to the critical section in anticipation of successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE instruction. Since PAUSE is backwards compatible to all existing IA-32 processor generations, a test for processor type (a CPUID test) is not needed. All legacy processors will execute PAUSE as a NOP, but in processors which use the PAUSE as a hint there can be significant performance benefit.
Integer Intrinsics Using Streaming SIMD Extensions
The integer intrinsics are listed in the following table followed by a description of each intrinsic with the most recent mnemonic naming convention
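The test-and-test-and-set pattern in entry 420 can also be sketched in C. This is a minimal illustration, not code from the guide: it assumes an x86 target, uses the `_mm_pause` intrinsic for the PAUSE hint and the gcc-style `__sync_lock_test_and_set` and `__sync_lock_release` built-ins for the exchange and the release, and the names `spin_lock_t`, `spin_acquire`, and `spin_release` are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_pause */

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef volatile int spin_lock_t;

/* Try to take the lock first; spin only after the attempt has failed. */
static void spin_acquire(spin_lock_t *lock)
{
    while (__sync_lock_test_and_set(lock, 1) != 0) {
        /* Read-only wait: issue the PAUSE hint until the lock looks free,
           then go back and retry the atomic exchange. */
        while (*lock != 0) {
            _mm_pause();
        }
    }
}

/* Release the lock by storing 0 (release semantics). */
static void spin_release(spin_lock_t *lock)
{
    assert(*lock != 0);  /* releasing a lock that is not held is a bug */
    __sync_lock_release(lock);
}
```

A caller would bracket its critical section with `spin_acquire(&lock)` and `spin_release(&lock)`. Spinning on a plain read keeps the waiting thread from repeatedly issuing the locked exchange while another thread holds the lock.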
421. t The goal of scalar replacement, which is enabled by -scalar_rep, is to reduce memory references. This is done mainly by replacing array references with register references. While the compiler replaces some array references with register references when -O1 or -O2 is specified, more aggressive replacement is performed when -O3 and -scalar_rep are specified. For example, with -O3 the compiler attempts replacement when there are loop-carried dependences or when data-dependence analysis is required for memory disambiguation. The -scalar_rep compiler option enables (default) scalar replacement performed during loop transformations. The -scalar_rep- option disables this scalar replacement.
Loop Unrolling with -unroll[n]
The -unroll[n] option is used in the following way:
• -unrolln specifies the maximum number of times you want to unroll a loop. The following example unrolls a loop at most four times:
  prompt>icpc -unroll4 a.cpp
  To disable loop unrolling, specify n as 0. The following example disables loop unrolling:
  prompt>icpc -unroll0 a.cpp
• -unroll (n omitted) lets the compiler decide whether to perform unrolling or not. This is the default; the compiler uses default heuristics or defines n.
• -unroll0 (n = 0) disables the loop unroller.
The Itanium compiler currently recognizes only n = 0; any other value is ignored.
Benefits and Limitations of Loop Unrolling
The benefits of loop unrolling are as follows: Unrolling eliminates
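To make the effect of -unroll4 in entry 421 concrete, here is a hand-written sketch of what unrolling a simple loop by four looks like at the source level. It only illustrates the transformation; the code the compiler actually generates is not shown in the guide, and the function names `sum_rolled` and `sum_unrolled4` are hypothetical.

```c
#include <stddef.h>

/* Original loop: one add and one loop-control test per element. */
static long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}

/* The same loop unrolled by four: one loop-control test per four elements,
   followed by a remainder loop for the last n % 4 elements. */
static long sum_unrolled4(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; ++i) {
        s += a[i];
    }
    return s;
}
```

Both functions return the same sum. The unrolled form trades larger code for fewer branch and counter updates, which is why the option takes an upper bound n rather than always unrolling as far as possible.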
422. t _in_word(int); _inp(int); _inpd(int); _inpw(int); _out_byte(int, int); _out_dword(int, int); _out_word(int, int); _outp(int, int); _outpd(int, int); _outpw(int, int)
MMX(TM) Technology Intrinsics Implementation
Key to the table entries:
• A = Expected to give significant performance gain over non-intrinsic-based code equivalent.
• B = Non-intrinsic-based source code would be better; the intrinsic's implementation may map directly to native instructions, but they offer no significant performance gain.
• C = Requires contorted implementation for particular microarchitecture. Will result in very poor performance if used.
Columns: Intrinsic Name | Alternate Name | Across All IA | MMX(TM) Technology | Streaming SIMD Extensions | Streaming SIMD Extensions 2 | Itanium Archi
  _m_empty      | _mm_empty          | N/A | A
  _m_from_int   | _mm_cvtsi32_si64   | N/A | A
  _m_to_int     | _mm_cvtsi64_si32   | N/A | A
  _m_packsswb   | _mm_packs_pi16     | N/A | A
  _m_packssdw   | _mm_packs_pi32     | N/A | A
  _m_packuswb   | _mm_packs_pu16     | N/A | A
  _m_punpckhbw  | _mm_unpackhi_pi8   | N/A | A
  _m_punpckhwd  | _mm_unpackhi_pi16  | N/A | A
  _m_punpckhdq  | _mm_unpackhi_pi32  | N/A | A
  _m_punpcklbw  | _mm_unpacklo_pi8   | N/A | A
  _m_punpcklwd  | _mm_unpacklo_pi16  | N/A | A
  _m_punpckldq  | _mm_unpacklo_pi32  | N/A | A
  _m_paddb      | _mm_add_pi8        | N/A | A
  _m_paddw      | _mm_add_pi16       | N/A | A
  _m_paddd      | _mm_add_pi32       | N/A | A
  _m_paddsb     | _mm_adds_pi8       | N/A | A
  _m_paddsw     | _mm_adds_pi16      | N/A | A
  _m_paddusb    | _mm_adds_pu8       | N/A | A
423. t  Same as -MT, but quotes special Make characters. Default: OFF.
  -mserialize-volatile (I64 only)  Impose strict memory access ordering for volatile data object references. Default: OFF.
  -mno-serialize-volatile (I64 only)  The compiler may suppress both run-time and compile-time memory access ordering for volatile data object references. Specifically, the .rel/.acq completers will not be issued on referencing loads and stores. Default: OFF.
  -MTtarget  Change the default target rule for dependency generation. Default: OFF.
  -nobss_init  Places variables that are initialized with zeroes in the DATA section. Disables placement of zero-initialized variables in BSS (use DATA). Default: OFF.
  -no_cpprt  Do not link in C++ run-time libraries. Default: OFF.
  -nodefaultlibs  Do not use standard libraries when linking.
  -no-gcc  Do not predefine the __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ macros. Default: OFF.
  -nolib_inline  Disables inline expansion of standard library functions.
  -nostartfiles  Do not use standard startup files when linking.
  -nostdinc  Same as -X.
  -nostdlib  Do not use standard libraries and startup files when linking. Default: OFF.
  -O  Same as -O1 on IA-32. Same as -O2 on Itanium-based systems. Default: OFF.
  -O0  Disables optimizations. Default: OFF.
  -O1  Enable optimizations. Optimizes for speed. For Itanium compiler, -O1 turns off software pipelining to reduce code. Default: ON (I32).
424. t int len)
Intrinsic:
  __int64 _m64_dep_zr(__int64 s, const int pos, const int len)
  __int64 _m64_dep_zi(const int v, const int pos, const int len)
  __int64 _m64_extr(__int64 r, const int pos, const int len)
  __int64 _m64_extru(__int64 r, const int pos, const int len)
  __int64 _m64_xmal(__int64 a, __int64 b, __int64 c)
  __int64 _m64_xmalu(__int64 a, __int64 b, __int64 c)
  __int64 _m64_xmah(__int64 a, __int64 b, __int64 c)
  __int64 _m64_xmahu(__int64 a, __int64 b, __int64 c)
  __int64 _m64_popcnt(__int64 a)
Corresponding Instruction:
  dep (Deposit)
  dep (Deposit)
  dep.z (Deposit)
  dep.z (Deposit)
  extr (Extract)
  extr.u (Extract)
  xma.l (Fixed-point multiply add using the low 64 bits of the 128-bit result. The result is signed.)
  xma.lu (Fixed-point multiply add using the low 64 bits of the 128-bit result. The result is unsigned.)
  xma.h (Fixed-point multiply add using the high 64 bits of the 128-bit result. The result is signed.)
  xma.hu (Fixed-point multiply add using the high 64 bits of the 128-bit result. The result is unsigned.)
  popcnt (Population count)
Intrinsic | Corresponding Instruction
  __int64 _m64_shladd(__int64 a, const int count, __int64 b) | shladd (Shift left and add)
  __int64 _m64_shrp(__int64 a, __int64 b, const int count) | shrp (Shift right pair)
FSR Operation
425. <intrin_op>  Indicates the intrinsic's basic operation; for example, add for addition and sub for subtraction.
<suffix>  Denotes the type of data operated on by the instruction. The first one or two letters of each suffix denotes whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters denote the type:
• s: single-precision floating point
• d: double-precision floating point
• i128: signed 128-bit integer
• i64: signed 64-bit integer
• u64: unsigned 64-bit integer
• i32: signed 32-bit integer
• u32: unsigned 32-bit integer
• i16: signed 16-bit integer
• u16: unsigned 16-bit integer
• i8: signed 8-bit integer
• u8: unsigned 8-bit integer
A number appended to a variable name indicates the element of a packed object. For example, r0 is the lowest word of r. Some intrinsics are "composites" because they require more than one instruction to implement them.
The packed values are represented in right-to-left order, with the lowest value being used for scalar operations. Consider the following example operation:
  double a[2] = {1.0, 2.0};
  __m128d t = _mm_load_pd(a);
The result is the same as either of the following:
  __m128d t = _mm_set_pd(2.0, 1.0);
  __m128d t = _mm_setr_pd(1.0, 2.0);
In other words, the xmm register that holds the value t will look as follows:
  2.0 (bits 127-64) | 1.0 (bits 63-0)
The "scalar" element is 1.0. Due to the nature of the instruction, some intrinsics require their arguments to be immediates (constant integer literals).
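The element ordering described in entry 425 can be checked with a short C sketch. It assumes an SSE2-capable x86 target and the standard `emmintrin.h` intrinsics; the helper names `load_pair`, `same_pd`, and `low_lane` are hypothetical. Note that it uses `_mm_loadu_pd` rather than `_mm_load_pd`, because `_mm_load_pd` requires a 16-byte aligned address and a plain `double a[2]` is not guaranteed to be aligned that way.

```c
#include <emmintrin.h>

/* Load two doubles from memory; p[0] goes to the low element. */
static __m128d load_pair(const double *p)
{
    return _mm_loadu_pd(p);   /* unaligned load, safe for any address */
}

/* Returns 1 if both elements of x and y compare equal, 0 otherwise. */
static int same_pd(__m128d x, __m128d y)
{
    return _mm_movemask_pd(_mm_cmpeq_pd(x, y)) == 0x3;
}

/* Returns the low ("scalar") element of x. */
static double low_lane(__m128d x)
{
    return _mm_cvtsd_f64(x);
}
```

With `double a[2] = {1.0, 2.0}`, the value from `load_pair(a)` compares equal to both `_mm_set_pd(2.0, 1.0)` and `_mm_setr_pd(1.0, 2.0)`, and `low_lane` returns 1.0: `_mm_set_pd` lists elements from high to low, while `_mm_setr_pd` lists them in memory order.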
426. <type> new_val);
  int __sync_bool_compare_and_swap(<type> *ptr, <type> old_val, <type> new_val);
Atomic Synchronize Operation
  void __sync_synchronize(void);
Atomic Lock-test-and-set Operation
  <type> __sync_lock_test_and_set(<type> *ptr, <type> val);
Atomic Lock-release Operation
  void __sync_lock_release(<type> *ptr);
Miscellaneous Intrinsics
  void *__get_return_address(unsigned int level);
This intrinsic yields the return address of the current function. The level argument must be a constant value. A value of 0 yields the return address of the current function. Any other value yields a zero return address. On Linux systems, this intrinsic is synonymous with __builtin_return_address. The name and the argument are provided for compatibility with gcc.
  void __set_return_address(void *addr);
This intrinsic overwrites the default return address of the current function with the address indicated by its argument. On return from the current invocation, program execution continues at the address provided.
  void *__get_frame_address(unsigned int level);
This intrinsic returns the frame address of the current function. The level argument must be a constant value. A value of 0 yields the frame address of the current function. Any other value yields a zero return value. On Linux systems, this intrinsic is synonymous with __builtin_frame_address. The name and the argument are provided for c
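As an illustration of the compare-and-swap built-in listed in entry 426, the following C sketch raises a shared integer to at least a given value. It is an assumption-laden example rather than code from the guide: it relies on the gcc-compatible `__sync_bool_compare_and_swap` built-in with an `int` operand, and the function name `atomic_raise_to` is hypothetical.

```c
/* Atomically raise *target to at least candidate.
   Returns 1 if this call stored candidate, 0 if *target was already
   greater than or equal to candidate. */
static int atomic_raise_to(volatile int *target, int candidate)
{
    int seen = *target;
    while (candidate > seen) {
        /* The swap succeeds only if *target still holds the value we read. */
        if (__sync_bool_compare_and_swap(target, seen, candidate)) {
            return 1;
        }
        seen = *target;   /* another thread changed it; re-read and retry */
    }
    return 0;
}
```

The retry loop is the usual shape for compare-and-swap code: read the current value, compute the desired value, and attempt the swap, repeating only when another thread modified the location in between.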
427. ta type as the pre-declared value of R as listed in the table that follows.
Ivec Logical Operator Overloading with Assignment
Columns: Return Type | Left Side (R) | AND (&=) | OR (|=) | XOR (^=) | Right Side (Any Ivec Type)
  I128vec1    | I128vec1 R
  I64vec2     | I64vec2 R
  I[x]32vec4  | I[x]32vec4 R
  I[x]32vec2  | I[x]32vec2 R
  I[x]16vec8  | I[x]16vec8 R
  I[x]16vec4  | I[x]16vec4 R
  I[x]8vec16  | I[x]8vec16 R
  I[x]8vec8   | I[x]8vec8 R
Addition and Subtraction Operators
The addition and subtraction operators return the class of the nearest common ancestor when the right-side operands are of different signs. The following code provides examples of usage and miscellaneous exceptions.
Syntax Usage for Addition and Subtraction Operators
Return nearest common ancestor type, I16vec4:
  Is16vec4 A;
  Iu16vec4 B;
  I16vec4 C;
  C = A + B;
Returns type left-hand operand type:
  Is16vec4 A;
  Iu16vec4 B;
Explicitly convert B to Is16vec4:
  Is16vec4 A, C;
  Iu32vec24 B;
  C = A + C;
  C = A + (Is16vec4)B;
Addition and Subtraction Operators with Corresponding Intrinsics
  Addition, corresponding intrinsics: _mm_add_epi64, _mm_add_epi32, _mm_add_epi16, _mm_add_epi8, _mm_add_pi32, _mm_add_pi16, _mm_add_pi8
  Subtraction, corresponding intrinsics: _mm_sub_epi64, _mm_sub_epi32, _mm_sub_epi16, _mm_sub_epi8, _mm_sub_pi32, _mm_sub_pi16
428. target. Maps to a loop with the cmpxchg instruction to guarantee atomicity.
Load and Store
You can use the load and store intrinsic to force the strict memory access ordering of specific data objects. This intended use is for the case when the user suppresses the strict memory access ordering by using the serialize-volatile option.
Intrinsic | Prototype | Description
  __st1_rel | void __st1_rel(void *dst, const char value);      | Generates an st1.rel instruction.
  __st2_rel | void __st2_rel(void *dst, const short value);     | Generates an st2.rel instruction.
  __st4_rel | void __st4_rel(void *dst, const int value);       | Generates an st4.rel instruction.
  __st8_rel | void __st8_rel(void *dst, const __int64 value);   | Generates an st8.rel instruction.
  __ld1_acq | unsigned char __ld1_acq(void *src);              | Generates an ld1.acq instruction.
  __ld2_acq | unsigned short __ld2_acq(void *src);             | Generates an ld2.acq instruction.
  __ld4_acq | unsigned int __ld4_acq(void *src);               | Generates an ld4.acq instruction.
  __ld8_acq | unsigned __int64 __ld8_acq(void *src);           | Generates an ld8.acq instruction.
Operating System Related Intrinsics
The prototypes for these intrinsics are in the ia64intrin.h header file.
Intrinsic | Description
  unsigned __int64 __getReg(const int whichReg) | Gets the value from a hardware register based on the index passed in. Produces a corresponding mov = r instruction. Provides access to the f
429. tecture
Intrinsic Name: _m_paddusw, _m_psubb, _m_psubw, _m_psubd, _m_psubsb, _m_psubsw, _m_psubusb, _m_psubusw, _m_pmaddwd, _m_pslldi, _m_psllq, _m_psllqi, _m_psraw, _m_psrawi, _m_psrad, _m_psradi, _m_psrlw, _m_psrlwi, _m_psrld, _m_psrlqi, _m_pand, _m_pandn, _m_por, _m_pxor
Alternate Name: _mm_adds_pu16, _mm_sub_pi8, _mm_sub_pi16, _mm_sub_pi32, _mm_subs_pi8, _mm_subs_pi16, _mm_subs_pu8, _mm_subs_pu16, _mm_madd_pi16, _mm_mulhi_pi16, _mm_mullo_pi16, _mm_sll_pi16, _mm_slli_pi16, _mm_sll_pi32, _mm_slli_pi32, _mm_sll_si64, _mm_slli_si64, _mm_sra_pi16, _mm_srai_pi16, _mm_sra_pi32, _mm_srai_pi32, _mm_srl_pi16, _mm_srli_pi16, _mm_srl_pi32, _mm_srli_si64, _mm_and_si64, _mm_andnot_si64, _mm_or_si64, _mm_xor_si64
Across All IA: every value shown for these rows is N/A.
Intrinsic Name: _m_pcmpeqb, _m_pcmpeqw, _m_pcmpeqd, _m_pcmpgtb, _m_pcmpgtw, _m_pcmpgtd, _mm_set_pi32, _mm_set_pi16, _mm_set_pi8, _mm_set1_pi32, _mm_set1_pi16, _mm_set1_pi8, _mm_setr_pi32, _mm_setr_pi16, _mm_setr_pi8
_mm_empty is implemented in Itanium instructi
430. tegers and saturates.
  r0 := SignedSaturate(a0)
  r1 := SignedSaturate(a1)
  ...
  r7 := SignedSaturate(a7)
  r8 := SignedSaturate(b0)
  r9 := SignedSaturate(b1)
  ...
  r15 := SignedSaturate(b7)
__m128i _mm_packs_epi32(__m128i a, __m128i b)
Packs the 8 signed 32-bit integers from a and b into signed 16-bit integers and saturates.
  r0 := SignedSaturate(a0)
  r1 := SignedSaturate(a1)
  r2 := SignedSaturate(a2)
  r3 := SignedSaturate(a3)
  r4 := SignedSaturate(b0)
  r5 := SignedSaturate(b1)
  r6 := SignedSaturate(b2)
  r7 := SignedSaturate(b3)
__m128i _mm_packus_epi16(__m128i a, __m128i b)
Packs the 16 signed 16-bit integers from a and b into 8-bit unsigned integers and saturates.
  r0 := UnsignedSaturate(a0)
  r1 := UnsignedSaturate(a1)
  ...
  r7 := UnsignedSaturate(a7)
  r8 := UnsignedSaturate(b0)
  r9 := UnsignedSaturate(b1)
  ...
  r15 := UnsignedSaturate(b7)
int _mm_extract_epi16(__m128i a, int imm)
Extracts the selected signed or unsigned 16-bit integer from a and zero extends. The selector imm must be an immediate.
  r := (imm == 0) ? a0 : ((imm == 1) ? a1 : ... (imm == 7) ? a7)
__m128i _mm_insert_epi16(__m128i a, int b, int imm)
Inserts the least significant 16 bits of b into the selected 16-bit integer of a. The selector imm must be an immediate.
  r0 := (imm == 0) ? b : a0
  r1 := (imm == 1) ? b : a1
  ...
  r7 := (imm == 7) ? b : a7
int _mm_movemask_epi8(__m128i a)
Creates a 16-bit mask from the most significant bits of the 16 si
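The saturating pack and the zero-extending extract described in entry 430 can be exercised with a small C sketch. It assumes an SSE2-capable x86 target and the standard `emmintrin.h` intrinsics; the wrapper names `pack32_to_16` and `word2_zero_extended` are hypothetical, and unaligned loads and stores are used so that plain arrays can be passed in.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Pack four 32-bit values from a and four from b into eight 16-bit values
   with signed saturation: out[0..3] come from a, out[4..7] from b. */
static void pack32_to_16(const int32_t a[4], const int32_t b[4], int16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(va, vb));
}

/* Extract 16-bit element 2 of w. The selector must be a compile-time
   constant, and the result is zero extended into an int. */
static int word2_zero_extended(const int16_t w[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)w);
    return _mm_extract_epi16(v, 2);
}
```

Inputs outside the 16-bit range clamp to 32767 or -32768 instead of wrapping. Because the extract zero extends, an element holding -1 comes back as 65535, not -1.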
431. terface:
  double expm1(double x);
  long double expm1l(long double x);
  float expm1f(float x);
FREXP
Description: The frexp function converts a floating-point number x into signed normalized fraction in [1/2, 1) multiplied by an integral power of two. The signed normalized fraction is returned, and the integer exponent stored at location exp.
Calling interface:
  double frexp(double x, int *exp);
  long double frexpl(long double x, int *exp);
  float frexpf(float x, int *exp);
HYPOT
Description: The hypot function returns the square root of (x^2 + y^2).
errno: ERANGE, for overflow conditions
Calling interface:
  double hypot(double x, double y);
  long double hypotl(long double x, long double y);
  float hypotf(float x, float y);
ILOGB
Description: The ilogb function returns the exponent of x base two as a signed int value.
errno: ERANGE, for x = 0
Calling interface:
  int ilogb(double x);
  int ilogbl(long double x);
  int ilogbf(float x);
INVSQRT
Description: The invsqrt function returns the inverse square root.
Calling interface:
  double invsqrt(double x);
  long double invsqrtl(long double x);
  float invsqrtf(float x);
LDEXP
Description: The ldexp function returns x * 2^exp, where exp is an integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
  double ldexp(double x, int exp);
  long double ldexpl(long double x, int exp);
  float ldexpf(float x, int ex
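The relationship between frexp and ldexp in entry 431 can be shown with a short C sketch that uses only the standard `math.h` functions. The helper names `rebuild` and `exponent_base2` are hypothetical. The second helper relies on the fact that, for finite nonzero x, the exponent stored by frexp is one greater than the value ilogb would return, because frexp normalizes the fraction to [1/2, 1).

```c
#include <math.h>

/* Split x into fraction and exponent with frexp, then recombine with ldexp.
   For finite x this returns x unchanged. */
static double rebuild(double x)
{
    int e;
    double m = frexp(x, &e);   /* x == m * 2^e, with |m| in [1/2, 1) */
    return ldexp(m, e);
}

/* Base-two exponent of a finite, nonzero x, derived from frexp.
   This matches ilogb(x) for such inputs. */
static int exponent_base2(double x)
{
    int e;
    (void)frexp(x, &e);
    return e - 1;
}
```

For example, 8.0 splits into the fraction 0.5 and the exponent 4, since 0.5 * 2^4 = 8, and its base-two exponent is 3.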
432. the code was executed while in the reference run the code was not executed, then the code is treated as uncovered. On the other hand, if the code is covered in the reference run but not covered in the new run, the differential-coverage source view shows the code as covered.
Running for Differential Coverage
To run the Code-coverage Tool on an application, developers must provide the following three items:
• The application sources.
• The .SPI file generated by Intel Compilers when compiling the application for the instrumented binaries through the -prof_genx option.
• The .DPI file generated by the Intel Compiler's profmerge tool, which results from merging the dynamic profile information files (.DYN), or the .DPI file generated implicitly by Intel Compilers when compiling the application with the -prof_use option.
Once the required files are available, the coverage tool may be launched from this command line:
  codecov -prj Project_Name -spi pgopti.spi -dpi pgopti.dpi
The -spi and -dpi options specify the paths to the corresponding files. The Code-coverage Tool also has the following additional options for generating a link at the bottom of each HTML page to send an electronic message to a named contact by using -mname and -maddr options:
  codecov -prj Project_Name -mname John_Smith -maddr js@company.com
Test-prioritization Tool
The Intel compiler Test-prioritization Tool enables prof
433. the following order:
• directory of the source file that contains the include
• directories specified by the -I option
How to Remove Include Directories
Use the -X option to prevent the compiler from searching the default system areas. You can use the -X option with the -I option to prevent the compiler from searching the default path for include files and direct it to use an alternate path. For example, to direct the compiler to search the path /alt/include instead of the default path, do the following:
  prompt>icpc -X -I/alt/include prog.cpp
See also Searching for Include Files.
Searching for Include Files
By default, the compiler searches for the standard include files in the directories specified in the CPATH, C_INCLUDE_PATH, and CPLUS_INCLUDE_PATH environment variables. You can indicate the location of include files in the configuration file.
How to Specify an Include Directory
Use the -Idirectory option to specify an additional directory in which to search for include files. For multiple search directories, multiple -Idirectory commands must be used. Included files are brought into the program with a #include preprocessor directive. The compiler searches directories for include files in the following order:
• directory of the source file that contains the include
• directories specified by the -I option
• directories specified in the CPATH, C_INCLUDE_PAT
434. the hardware features provided by the Streaming SIMD Extensions (SSE) when writing programs with the intrinsics. The following are four important issues to keep in mind:
• Certain intrinsics, such as _mm_loadr_ps and _mm_cmpgt_ss, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful that they may consist of more than one machine-language instruction.
• Floating-point data loaded or stored as __m128 objects must be generally 16-byte-aligned.
• Some intrinsics require that their argument be immediates, that is, constant integers (literals), due to the nature of the instruction.
• The result of arithmetic operations acting on two NaN (Not a Number) arguments is undefined. Therefore, FP operations using NaN arguments will not match the expected behavior of the corresponding assembly instructions.
Arithmetic Operations for Streaming SIMD Extensions
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.
Columns: Intrinsic | Operation | R0 | R1 | R2 | R3
  _mm_add_ss | Addition       | a0 [op] b0 | a1 | a2 | a3
  _mm_add_ps | Addition       | a0 [op] b0 | a1 [op] b1 | a2 [op] b2 | a3 [op] b3
  _mm_sub_ss | Subtraction    | a0 [op] b0 | a1 | a2 | a3
  _mm_sub_ps | Subtraction    | a0 [op] b0 | a1 [op] b1 | a2 [op] b2 | a3 [op] b3
  _mm_mul_ss | Multiplication | a0 [op] b0 | a1 | a2 | a3
  _mm_mul_ps | Multiplication | a0 [op] b0 | a1 [op] b1 | a2 [op] b2 | a3 [op] b3
  _mm_div_ss | Division       | a0 al a2 a3
  _mm_div_ps | Division       | a0 al
435. the lower 32 bits of the __m64 object m to an integer.
__m64 _m_packsswb(__m64 m1, __m64 m2)
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with signed saturation, and pack the four 16-bit values from m2 into the upper four 8-bit values of the result with signed saturation.
__m64 _m_packssdw(__m64 m1, __m64 m2)
Pack the two 32-bit values from m1 into the lower two 16-bit values of the result with signed saturation, and pack the two 32-bit values from m2 into the upper two 16-bit values of the result with signed saturation.
__m64 _m_packuswb(__m64 m1, __m64 m2)
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with unsigned saturation, and pack the four 16-bit values from m2 into the upper four 8-bit values of the result with unsigned saturation.
__m64 _m_punpckhbw(__m64 m1, __m64 m2)
Interleave the four 8-bit values from the high half of m1 with the four values from the high half of m2. The interleaving begins with the data from m1.
__m64 _m_punpckhwd(__m64 m1, __m64 m2)
Interleave the two 16-bit values from the high half of m1 with the two values from the high half of m2. The interleaving begins with the data from m1.
__m64 _m_punpckhdq(__m64 m1, __m64 m2)
Interleave the 32-bit value from the high half of m1 with the 32-bit value from the high half of m2. The interleaving begins with the data from m1.
__m64 _m_punpcklbw(__m64 m1, __m64 m2)
I
436. the precision indicated by the variable types declared in the program.
Controlling Accuracy of the FP Results
-IPF_fltacc[-] enables/disables optimizations that affect floating-point accuracy. By default (-IPF_fltacc-), the compiler may apply optimizations that reduce floating-point accuracy. You may use -IPF_fltacc or -mp to improve floating-point accuracy, but at the cost of disabling some optimizations.
-IPF_fp_relaxed[-] enables/disables use of faster but slightly less accurate code sequences for math functions, such as divide and square root. As compared to strict IEEE precision, using this option slightly reduces the accuracy of floating-point calculations performed by these functions, usually limited to the least significant digit.
Optimizing for Specific Processors
Processor Optimization for IA-32 only
The -tpp{5|6|7} options optimize your application's performance for a specific Intel processor. The resulting binary will also run on the other processors listed in the table. The Intel C++ Compiler includes gcc-compatible versions of the -tpp options. These options are listed in the gcc Version column.
Option | gcc Version | Optimizes for...
  -tpp5 | -mcpu=pentium    | Intel Pentium processors
  -tpp6 | -mcpu=pentiumpro | Intel Pentium Pro, Intel Pentium II, and Intel Pentium III processors
  -tpp7 | -mcpu=pentium4   | Intel Pentium 4 processors, Intel Pentium M proc
437. the total block coverage of Test1, Test2, and Test3.
• Elimination of Test1 has no negative impact on the total block coverage.
Example 2: Minimizing Execution Time
Suppose we have the following execution time of each test in the tests_list file:
  Test1.dpi 00:00:60:35
  Test2.dpi 00:00:10:15
  Test3.dpi 00:00:30:45
The following command executes the Test-prioritization Tool to minimize the execution time with the -mintime option:
  tselect -dpi_list tests_list -spi pgopti.spi -mintime
Here is a sample output:
  Total number of tests: 3
  Total block coverage: 52.17
  Total function coverage: 50.00
  Total execution time: 1:41:35
  num elapsedTime RatCvrg BlkCvrg FncCvrg Test Name @ Options
  Test2.dpi
  Test3.dpi
In this case, the results indicate that running all tests sequentially would require one hour, 41 minutes, and 35 seconds, while the selected tests would achieve the same total block coverage in only 41 minutes.
Note: The order of tests when prioritization is based on minimizing time (first Test2, then Test3) could be different than when prioritization is done based on minimizing the number of tests. See the preceding example: first Test3, then Test2. In Example 2, Test2 is the test that gives the highest coverage per execution time, so it is picked as the first test to run.
Using Other Options
The -cutoff option enables the Test
438. then returned.
__int64 _m64_xmah(__int64 a, __int64 b, __int64 c)
The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit signed result. The 64-bit value c is zero-extended and added to the product. The most significant 64 bits of the sum are then returned.
__int64 _m64_xmahu(__int64 a, __int64 b, __int64 c)
The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the product. The most significant 64 bits of the sum are then returned.
__int64 _m64_popcnt(__int64 a)
The number of bits in the 64-bit integer a that have the value 1 are counted, and the resulting sum is returned.
__int64 _m64_shladd(__int64 a, const int count, __int64 b)
a is shifted to the left by count bits and then added to b. The result is returned.
__int64 _m64_shrp(__int64 a, __int64 b, const int count)
a and b are concatenated to form a 128-bit value and shifted to the right count bits. The least significant 64 bits of the result are returned.
Lock and Atomic Operation Related Intrinsics
The prototypes for these intrinsics are in the ia64intrin.h header file.
Intrinsic:
  unsigned __int64 _InterlockedExchange8(volatile unsigned char *Target, unsigned __int64 value)
  unsigned __int64 _InterlockedCompareExchange8_rel(volatile un
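The arithmetic described in entry 438 can be modeled in portable C for readers without an Itanium system. This sketch is a reference model under stated assumptions, not the intrinsics themselves: it uses the `unsigned __int128` extension available in gcc-compatible compilers on 64-bit targets, it assumes that in the shift-right-pair operation a forms the most significant half of the concatenated value, and it assumes shift counts below 64. All function names ending in `_model` are hypothetical.

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, assumed available */

/* Unsigned multiply-add, high half: top 64 bits of a * b + c. */
static uint64_t xmahu_model(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)(((u128)a * b + c) >> 64);
}

/* Unsigned multiply-add, low half: bottom 64 bits of a * b + c. */
static uint64_t xmalu_model(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)((u128)a * b + c);
}

/* Shift left and add: (a << count) + b, with count assumed below 64. */
static uint64_t shladd_model(uint64_t a, unsigned count, uint64_t b)
{
    return (a << count) + b;
}

/* Shift right pair: concatenate a (assumed high) and b (low), shift right
   by count (assumed below 64), and return the low 64 bits. */
static uint64_t shrp_model(uint64_t a, uint64_t b, unsigned count)
{
    u128 v = ((u128)a << 64) | b;
    return (uint64_t)(v >> count);
}

/* Population count: number of bits set to 1. */
static int popcnt_model(uint64_t a)
{
    int n = 0;
    while (a != 0) {
        a &= a - 1;   /* clear the lowest set bit */
        ++n;
    }
    return n;
}
```

The 128-bit intermediate cannot overflow in the multiply-add models, because the largest possible product plus the largest 64-bit addend still fits in 128 bits.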
439. these intrinsics are equivalent both in name and functionality to the set of IA-32-based SSE intrinsics. To write programs with the intrinsics, you should be familiar with the hardware features provided by SSE. Keep the following issues in mind:
• Certain intrinsics are provided only for compatibility with previously-defined IA-32 intrinsics. Using them on Itanium-based systems probably leads to performance degradation.
• Floating-point (FP) data loaded/stored as __m128 objects must be 16-byte-aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals), due to the nature of the instruction.
Data Types
The new data type __m128 is used with the SSE intrinsics. It represents a 128-bit quantity composed of four single-precision FP values. This corresponds to the 128-bit IA-32 Streaming SIMD Extensions register. The compiler aligns __m128 local data to 16-byte boundaries on the stack. Global data of these types is also 16-byte-aligned. To align integer, float, or double arrays, you can use the declspec alignment. Because Itanium instructions treat the SSE registers in the same way whether you are using packed or scalar data, there is no __m32 data type to represent scalar data. For scalar operations, use the __m128 objects and the "scalar" forms of the intrinsics; the compiler and the processor implement these operations with 32-bit memory references. But, for better performance the packed form sh
440. this section, a corresponding value for errno is listed when applicable.
Other Considerations
Some math functions are inlined automatically by the compiler. The functions actually inlined may vary and may depend on any vectorization or processor-specific compilation options used. For more information, see Criteria for Inline Expansion of Functions. A change of the default precision control or rounding mode may affect the results returned by some of the mathematical functions. See Floating-point Arithmetic Precision. It is necessary to include the -c99 compiler option when compiling programs that require support for _Complex data types.
Trigonometric Functions
The Intel Math library supports the following trigonometric functions:
ACOS
Description: The acos function returns the principal value of the inverse cosine of x in the range [0, pi] radians for x in the interval [-1,1].
errno: EDOM, for |x| > 1
Calling interface:
  double acos(double x);
  long double acosl(long double x);
  float acosf(float x);
ACOSD
Description: The acosd function returns the principal value of the inverse cosine of x in the range [0,180] degrees for x in the interval [-1,1].
errno: EDOM, for |x| > 1
Calling interface:
  double acosd(double x);
  long double acosdl(long double x);
  float acosdf(float x);
ASIN
Description: The asin function returns the principal value of the inverse sine of x in the range [-pi/2, +pi/2
441. threaded version of the code, which is then compiled. The output is an executable program with the parallelism implemented by threads that execute parallel regions or constructs.
Targeting a Processor Run-time Check
While parallelizing a loop, the Intel compiler's loop parallelizer, OpenMP, tries to determine the optimal set of configurations for a given processor. At run time, a check is performed to determine for which IA-32 processor OpenMP should optimize a given loop. See detailed information in the Processor-specific Runtime Checks, IA-32 Systems.
Performance Analysis
For performance analysis of your program, you can use the Intel VTune(TM) Performance Analyzer to show performance information. You can obtain detailed information about which portions of the code require the largest amount of time to execute and where parallel performance problems are located.
Parallel Processing Thread Model
This topic explains the processing of the parallelized program and adds more definitions of the terms used in parallel programming.
The Execution Flow
As previously mentioned, a program containing OpenMP C++ API compiler directives begins execution as a single process, called the master thread of execution. The master thread executes sequentially until the first parallel construct is encountered. In the OpenMP C++ API, the #pragma omp parallel directive defines the parallel construct
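The execution flow in entry 441 can be illustrated with a minimal OpenMP loop in C. This is a sketch under assumptions, not an example from the guide: it uses the standard `#pragma omp parallel for` directive with a `reduction` clause, and the function name `parallel_sum` is hypothetical. When the source is compiled without OpenMP enabled, the pragma is ignored and the loop simply runs sequentially on the master thread.

```c
/* Sum n integers. With OpenMP enabled, the master thread forks a team at the
   parallel construct, each thread accumulates a private partial sum, and the
   partial sums are combined when the team joins at the end of the loop. */
static long parallel_sum(const int *a, int n)
{
    long total = 0;
    int i;
#pragma omp parallel for reduction(+ : total)
    for (i = 0; i < n; ++i) {
        total += a[i];
    }
    return total;
}
```

The reduction clause is what makes the shared accumulator safe: without it, several threads would update `total` at the same time and the result would be unpredictable.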
442. through LN weLl L2 Changes severity of diagnostics OFF L1 through LN to error Werror Force warnings to be reported as OFF errors Limits the number of errors OFF displayed prior to aborting compilation to n Changes the severity of OFF diagnostics L1 through LN to remark wwL1 L2 Changes severity of diagnostics OFF L1 through LN to warning W1 o1 02 Pass options 01 02 etc tothe OFF linker for processing Wp ol 02 Pass options 01 02 etc tothe OFF preprocessor Wp64 Print diagnostics for 64 bit OFF 132em 164 porting All source files found OFF subsequent to x type will be recognized as one of the following types x type c C source file c C source file c header C header file cpp output C preprocessed file e assembler assemblable file e assembler with cpp Assemblable file that needs to be preprocessed e none Disable recognition and revert to file extension X Removes the standard OFF directories from the list of directories to be searched for include files 29 Intel C Compiler for Linux Systems User s Guide Option Description Default x K W N B P Generates specialized code for 132 132em processor specific codes K W N B and P e K Intel Pentium II and compatible Intel processors e W Intel Pentium 4 and compatible Intel processors e N Intel Pentium 4 and compatible Intel processor
443. ting-point values are passed through from A. F32vec4 Is32vec2ToF32vec4(F32vec4 A, Is32vec2 B): r0 := (float)B0; r1 := (float)B1; r2 := A2; r3 := A3. Floating-point Vector Classes: The floating-point vector classes, F64vec2, F32vec4, and F32vec1, provide an interface to SIMD operations. The class specifications are as follows: F64vec2 A(double x, double y); F32vec4 A(float z, float y, float x, float w); F32vec1 B(float w); The packed floating-point input values are represented with the right-most value lowest, as shown in the following table. [Table: Single Precision Floating-point Elements] F32vec4 returns four packed single-precision floating-point values (R0, R1, R2, and R3). F32vec1 returns one single-precision floating-point value (R0). Fvec Notation Conventions: This reference uses the following conventions for syntax and return values. Fvec Classes Syntax Notation: Fvec classes use the syntax conventions shown in the following examples. Fvec_Class R = Fvec_Class A operator Ivec_Class B; Example 1: F64vec2 R = F64vec2 A & F64vec2 B; Fvec_Class R = operator(Fvec_Class A, Fvec_Class B); Example 2: F64vec2 R = andnot(F64vec2 A, F64vec2 B); Fvec_Class R operator= Fvec_Class A; Example 3: F64vec2 R &= F64vec2 A;
444. tion: The floor function returns the largest integral value not greater than x as a floating-point value. This function may be inlined with the Itanium compiler. Calling interface: double floor(double x); long double floorl(long double x); float floorf(float x); LLRINT. Description: The llrint function returns the rounded integer value, according to the current rounding direction, as a long long int. errno: ERANGE, for values too large. Calling interface: long long int llrint(double x); long long int llrintl(long double x); long long int llrintf(float x); LLROUND. Description: The llround function returns the rounded integer value as a long long int. errno: ERANGE, for values too large. Calling interface: long long int llround(double x); long long int llroundl(long double x); long long int llroundf(float x); LRINT. Description: The lrint function returns the rounded integer value, according to the current rounding direction, as a long int. Calling interface: long int lrint(double x); long int lrintl(long double x); long int lrintf(float x); LROUND. Description: The lround function returns the rounded integer value as a long int. Halfway cases are rounded away from zero. errno: ERANGE, for values too large. Calling interface: long int lround(double x); long int lroundl(long double x); long int lroundf(float
445. tion on using -Qlocation to specify the location of the GNU assembler and linker. Predefined Macros for Interoperability: The Intel C++ Compiler and gcc support the following predefined macros: • __GNUC__ • __GNUC_MINOR__ • __GNUC_PATCHLEVEL__ You can specify the -no-gcc option to undefine these macros. If you need gcc interoperability (-cxxlib-gcc), do not use the -no-gcc compiler option. Warning: Not defining these macros results in different paths through system header files. These alternate paths may be poorly tested or otherwise incompatible. See also Predefined Macros and GNU Environment Variables. gcc Built-in Functions: This version of the Intel C++ compiler supports the following gcc built-in functions: __builtin_abs, __builtin_labs, __builtin_cos, __builtin_cosf, __builtin_fabs, __builtin_fabsf, __builtin_memcmp, __builtin_memcpy, __builtin_sin, __builtin_sinf, __builtin_sqrt, __builtin_sqrtf, __builtin_strcmp, __builtin_strlen, __builtin_strncmp, __builtin_abort, __builtin_prefetch, __builtin_constant_p, __builtin_printf, __builtin_fprintf, __builtin_fscanf, __builtin_scanf, __builtin_fputs, __builtin_memset, __builtin_strcat, __builtin_strcpy, __builtin_strncpy, __builtin_exit, __builtin_strchr, __builtin_strspn, __builtin_strcspn, __builtin_strstr, __builtin_strpbrk, __builtin_strrchr, __builtin_strncat, __builtin_alloca, __builtin_ffs, __builtin_index, __builtin_ri
446. tium 4 and Intel Xeon Processor Optimization Reference Manual. In addition to the prefetch option, the _mm_prefetch intrinsic and PREFETCH compiler directive are also available. The intrinsic prefetches data from the specified address on one memory cache line. The compiler directive enables a data prefetch from memory. Key Tuning Techniques: Use the following techniques to tune your applications for Itanium-based systems: • Compile your program with the O3 and Qipo options. • Use profile-guided optimization (PGO) whenever possible. • Identify hot spots in your code. • Turn on Optimization Report. • Check why loops are not software pipelined. • Use #pragma ivdep to indicate there is no dependence. You might need to compile with the ivdep_parallel option to absolutely specify no loop-carried dependence. • Use #pragma swp to enable software pipelining (useful for lop-sided controls and unknown loop count). • Use #pragma loop count(n) when needed. • Use of ansi alias is helpful. For example, for p q the ANSI rule indicates the pointer and float data do not overlap. • Add the restrict keyword to ensure there is no aliasing. • Use alias_args to indicate arguments are not aliased. • Use fno_alias only if pointers get traced back to the same base pointer. • Use #pragma distribute point to split large loops (normally, this is done automatically). • For C code, do not use
447. tivates Interval Profile Dumping and sets the approximate frequency at which dumps will occur The interval parameter is measured in milliseconds and specifies the time interval at which profile dumping will occur For example if interval is set to 5000 then a profile dump and reset will occur approximately every 5 seconds The interval is approximate because the time check controlling the dump and reset is only performed upon entry to any instrumented function in your application F Note e Setting interval to zero or a negative number will disable interval profile dumping e Setting interval to a very small value may cause the instrumented application to spend nearly all of its time dumping profile information Be sure to set interval to a large enough value so that the application can perform actual work and collect substantial profile information Recommended Usage Call this function at the start of a non terminating application to initiate Interval Profile Dumping Note that an alternative method of initiating Interval Profile Dumping is by setting the environment variable PROF_DUMP_INTERVAL to the desired interval value prior to starting the application The intention of Interval Profile Dumping is to allow a non terminating application to be profiled with minimal changes to the application source code Environment Variable PROF_DUMP_INTERVAL This environment variable may be used to initiate Interval Profile Dumping in an ins
448. to detect the processor and generating the appropriate code Implement intrinsics by processor family not by specific processor The guiding principle for which family A 32 or Itanium processors the intrinsic is implemented on is performance not compatibility Where there is added performance on both families the intrinsic will be identical Intrinsics For Implementation Across All IA The following intrinsics provide significant performance gain over a non intrinsic based code equivalent d d d f d unsigned unsigned unsigned unsigned int64 int64 oub oub int abs int long labs long long __ 164_rotl va long __lrotr unsigned int __rotl unsigned int val rotl unsigned long lue int shift long value int shift ue int shift int __rotr unsigned int val ue int shift 164 _rotr Fh log loat sinf f oub oub 364 loat cosf fl loat tanf f le cos doub le tan doubl le loat oat loat le fabs double e log double float le log10 double f float double float le sin double le le loat int64 value int64 value int shift int shift f Le acos double loat acosf float Le acosh double Loat acoshf float double asin double loat asinf float do
449. to do so, ignoring normal heuristic decisions about profitability. When the aligned or unaligned qualifier is used with this pragma, the loop should be vectorized using aligned or unaligned operations. Specify one and only one of aligned or unaligned. If you specify aligned as an argument, you must be absolutely sure that the loop will be vectorizable using this instruction. Otherwise, the compiler will generate incorrect code. The loop in the following example uses the aligned qualifier to request that the loop be vectorized with aligned instructions, as the arrays are declared in such a way that the compiler could not normally prove this would be safe to do so. Example: void foo(float *a) ... #pragma vector aligned ... for (i = 0; i < m; i++) ... The compiler has at its disposal several alignment strategies in case the alignment of data structures is not known at compile time. A simple example is shown, but several other strategies are supported as well. If, in the loop, the alignment of a is unknown, the compiler will generate a prelude loop that iterates until the array reference that occurs the most hits an aligned address. This makes the alignment properties of a known, and the vector loop is optimized accordingly. Alignment Strategies Example: float *a; // alignment unknown: for (i = 0; i < 100; i++) { a[i] = a[i] + 1.0f; } // dynamic loop peeling: p = a & 0x0f; if (p
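The prelude-loop idea in the excerpt above can be modeled in portable C. The following is a minimal sketch, not the compiler's generated code: the names peel_count and add_one_peeled are hypothetical, and the main loop is written as plain scalar code in chunks of four to stand in for the aligned vector loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many leading floats must be handled one at a
 * time before the address a + peel is a multiple of 16 bytes. */
size_t peel_count(const float *a, size_t n)
{
    uintptr_t p = (uintptr_t)a & 0x0f;              /* misalignment in bytes */
    size_t peel = p ? (16 - p) / sizeof(float) : 0; /* elements to peel */
    return peel < n ? peel : n;
}

/* Adds 1.0f to each of the n elements of a, using a peeled prelude,
 * a main loop over groups of four floats, and a scalar remainder. */
void add_one_peeled(float *a, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(a, n);

    for (; i < peel; i++)          /* prelude: runs until a + i is aligned */
        a[i] += 1.0f;

    for (; i + 4 <= n; i += 4) {   /* main loop: a + i is 16-byte aligned */
        a[i]     += 1.0f;
        a[i + 1] += 1.0f;
        a[i + 2] += 1.0f;
        a[i + 3] += 1.0f;
    }

    for (; i < n; i++)             /* remainder: fewer than four elements */
        a[i] += 1.0f;
}
```

Calling add_one_peeled on a pointer that is offset by one float from a 16-byte boundary peels three elements before the main loop starts; the result is the same as the plain loop in every case.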
450. tories prof_dir ensures that the profile information is generated in one consistent place For example prompt gt icpe prof_gen prof_dir profdata c al cpp a2 cpp a3 cpp prompt gt icpe al o a2 0 a3 o In place of the second command you could use the linker directly to produce the instrumented program instrumented Execution Run your instrumented program with a representative set of data to create a dynamic information file prompt gt a out The resulting dynamic information file has a unique name and dyn suffix every time you run a o The instrumented file helps predict how the program runs with a particular set of data You can run the program more than once with different input data Feedback Compilation Compile and link the source files with prof_use to use the dynamic information to optimize your program according to its profile prompt gt icpce prof_use ipo al cpp a2 cpp a3 cpp Besides the optimization the compiler produces a pgopti dpi file You typically specify the default optimizations O2 for phase 1 and specify more advanced optimizations with ipo for phase 3 This example used O2 in phase 1 and 02 ipo in phase 3 F Note The compiler ignores the ipo options with prof_gen x With the x qualifier extra information is gathered 133 Intel C Compiler for Linux Systems User s Guide PGO Environment Variables The following table describes environment values to determine the directory to
451. trumented application See the Recommended Usage of _PGOPTI_Set_Interval_Prof_Dump for more information Code coverage Tool The Intel C Compiler Code coverage Tool can be used for both IA 32 and Itanium architectures in a number of ways to improve development efficiency reduce defects and increase application performance The major features of the Intel compiler Code coverage Tool are e Visual presentation of the application s code coverage information with a code coverage coloring scheme e Display ofthe dynamic execution counts of each basic block of the application e Differential coverage or comparison of the profiles of the application s two runs 137 Intel C Compiler for Linux Systems User s Guide Command line Syntax The syntax for this tool is as follows codecov codecov_option where codecov_option isa tool option If you do not use any option the tool will provide the top level code coverage for your whole program Tool Options The tool uses options that are listed in the table that follows Optio prj ref bcol fcol pcol ccol ucol 138 n help spi file dpi file counts nopartial comp demang mname maddr LOr Lor Lor LOr Lor Description Default Prints all the options of the code coverage tool Sets the path name of the static profile information file pgopti spi Spi Sets the path name of the dynamic profil
452. type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value. __m128 Data Types: The __m128 data type is used to represent the contents of a Streaming SIMD Extension register used by the Streaming SIMD Extension intrinsics. The __m128 data type can hold four 32-bit floating-point values. The __m128d data type can hold two 64-bit floating-point values. The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values. The compiler aligns __m128 local and global data to 16-byte boundaries on the stack. To align integer, float, or double arrays, you can use the declspec statement. New Data Types Usage Guidelines: Since these new data types are not basic ANSI C data types, you must observe the following usage restrictions: • Use new data types only on either side of an assignment, as a return value, or as a parameter. You cannot use them with other arithmetic expressions, etc. • Use new data types as objects in aggregates, such as unions to access the byte elements and structures. • Use new data types only with the respective intrinsics described in this documentation. The new data types are supported on both sides of an assignment statement, as parameters to a function call, and as a return value from a function call. Naming and Usage Syntax: Most of the intrinsic names use a notational convention as follows: _mm_<intrin_op>_<suffix>
453. tzero_ps mm_prefetch mm_stream_pi mm_stream_ps mm_sfence m_pextrw m_pinsrw _m_pmaxsw _m_pmaxub _m_pminsw m_pminub _m_pmovmskb N A N A N A N A N A N A N A N A _mm_loadl_ps N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A _mm_storel_ps N A N A N A N A N A N A N A N A _mm_setil_ps N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A _mm_extract_pil6 N A N A _mm_insert_pil6 N A N A _mm_max_pil6 N A N A _mm_max_pu8 N A N A _mm_min_pil6 N A N A _mm_min_pu8s N A N A _mm_movemask_pi8 N A N A gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt gt Reference a a a a a a a a a a 373 Streaming SIMD Extensions 2 Intrinsics Implementation Intel C Compiler for Linux Systems User s Guide _m_pmulhuw _mm_mulhi_pul6 N A N A A _m_pshufw _mm_shuffle_pil N A N A A _m_maskmovq _mm_maskmove_si64 N A N A A _m_pavgb _mm_avg_pu8s N A N A A _m_pavgw _mm_avg_pul6 N A N A A _m_psadbw _mm_sad_pu8 N A N A A Streaming SIMD Extensions 2 operate on 128 bit quantities with 64 bit double precision floating point values The Intel Itantum processor does not support parallel double precision computation so Streaming SIMD Extensions 2 are not implemented on Itanium based sy
454. uble asinh double f loat asinhf float double atan double f loat atanf float double atanh double f f Loat atanhf float loat cabs double double ceil double f loat ceilf float double cosh double f f loat coshf float loat fabsf float double floor double f loat floorf float double fmod double f Loat fmodf float double hypot double double f Loat hypotf float double rint double oat rintf float double sinh double oat sinhf float f loat sqrtf float double tanh double f loat tanhf float char _strset char _int32 void memcmp const void cs const void ct size_t n void memcpy void s const void ct size_t n Reference 365 Intel C Compiler for Linux Systems User s Guide int int int int int int int int int int int int int int int int int 366 void memset void s int c size_t n char Strcat char const char ct strcemp const const char char strcpy char s const char ct size_t strlen const char cs strnemp char char int strncpy char char int void __alloca int _set jmp jmp_buf _exception_code void _exception_info void _abnormal_termination void void _enable void _disable _bswap int _in_byte int _in_dword in
455. __m128i _mm_sub_epi16(__m128i a, __m128i b): Subtracts the 8 signed or unsigned 16-bit integers of b from the 8 signed or unsigned 16-bit integers of a. r0 := a0 - b0; r1 := a1 - b1; ... r7 := a7 - b7. __m128i _mm_sub_epi32(__m128i a, __m128i b): Subtracts the 4 signed or unsigned 32-bit integers of b from the 4 signed or unsigned 32-bit integers of a. r0 := a0 - b0; r1 := a1 - b1; r2 := a2 - b2; r3 := a3 - b3. __m64 _mm_sub_si64(__m64 a, __m64 b): Subtracts the signed or unsigned 64-bit integer b from the signed or unsigned 64-bit integer a. r := a - b. __m128i _mm_sub_epi64(__m128i a, __m128i b): Subtracts the 2 signed or unsigned 64-bit integers in b from the 2 signed or unsigned 64-bit integers in a. r0 := a0 - b0; r1 := a1 - b1. __m128i _mm_subs_epi8(__m128i a, __m128i b): Subtracts the 16 signed 8-bit integers of b from the 16 signed 8-bit integers of a, using saturating arithmetic. r0 := SignedSaturate(a0 - b0); r1 := SignedSaturate(a1 - b1); ... r15 := SignedSaturate(a15 - b15). __m128i _mm_subs_epi16(__m128i a, __m128i b): Subtracts the 8 signed 16-bit integers of b from the 8 signed 16-bit integers of a, using saturating arithmetic. r0 := SignedSaturate(a0 - b0); r1 := SignedSaturate(a1 - b1); ... r7 := SignedSaturate(a7 - b7). __m128i _mm_subs_epu8(__m128i a, __m128i b): Subtracts the 16 unsigned 8-bit integers of b from the 16 unsigned 8-bit integers of a, using saturating arithmetic. r0 := UnsignedSaturate
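The difference between the plain and the saturating subtractions listed above is easiest to see on a single 8-bit element. The following sketch models one lane in scalar C; the function names subs_i8, subs_u8, and sub_wrap_i8 are hypothetical and are not part of the intrinsics API.

```c
#include <stdint.h>

/* Scalar model of SignedSaturate(a - b) for one signed 8-bit lane:
 * the exact difference is clamped to the range [-128, 127]. */
int8_t subs_i8(int8_t a, int8_t b)
{
    int diff = (int)a - (int)b;   /* exact difference, cannot overflow int */
    if (diff > INT8_MAX) return INT8_MAX;
    if (diff < INT8_MIN) return INT8_MIN;
    return (int8_t)diff;
}

/* Scalar model of UnsignedSaturate(a - b) for one unsigned 8-bit lane:
 * results below zero are clamped to 0. */
uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Scalar model of the non-saturating subtraction for one 8-bit lane:
 * the result wraps modulo 256. The final conversion to int8_t is
 * implementation-defined in C and assumed here to be two's complement. */
int8_t sub_wrap_i8(int8_t a, int8_t b)
{
    return (int8_t)(uint8_t)((uint8_t)a - (uint8_t)b);
}
```

For example, subtracting 1 from -128 gives -128 with subs_i8 and wraps to 127 with sub_wrap_i8.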
456. uld only be used to initially port legacy code or in non-critical code sections. • Any SSE scalar intrinsic (_ss variety): use the packed (_ps) version if possible. • comi and ucomi SSE comparisons: these correspond to the IA-32 COMISS and UCOMISS instructions only. A sequence of Itanium instructions is required to implement these. • Conversions in general are multi-instruction operations. These are particularly expensive: _mm_cvtpi16_ps, _mm_cvtpu16_ps, _mm_cvtpi8_ps, _mm_cvtpu8_ps, _mm_cvtpi32x2_ps, _mm_cvtps_pi16, _mm_cvtps_pi8. • SSE utility intrinsic _mm_movemask_ps. If the inaccuracy is acceptable, the SIMD reciprocal and reciprocal square root approximation intrinsics (rcp and rsqrt) are much faster than the true div and sqrt intrinsics. Macro Function for Shuffle Using Streaming SIMD Extensions: The Streaming SIMD Extensions (SSE) provide a macro function to help create constants that describe shuffle operations. The macro takes four small integers (in the range of 0 to 3) and combines them into an 8-bit immediate value used by the SHUFPS instruction. Shuffle Function Macro: _MM_SHUFFLE(z, y, x, w) expands to the following value: (z << 6) | (y << 4) | (x << 2) | w. You can view the four integers as selectors for choosing which two words from the first input operand and which two words from the second are to be put into the result word. [Figure: View of Original and Result Words with Shuffle Function Macro]
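The shuffle constant described above can be checked without any SIMD hardware. The sketch below uses a hypothetical macro SHUFFLE_IMM that follows the expansion given in the text, and a hypothetical scalar function shuffle4 that models how SHUFPS uses the four 2-bit selectors; the output array r is assumed not to overlap a or b.

```c
/* Hypothetical stand-in with the same expansion as the shuffle macro
 * described in the text: four 2-bit selectors packed into 8 bits. */
#define SHUFFLE_IMM(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* Scalar model of SHUFPS: the two low result words are selected from a,
 * the two high result words are selected from b. */
void shuffle4(const float a[4], const float b[4], unsigned imm, float r[4])
{
    r[0] = a[imm & 3];          /* selector w, bits 1:0 */
    r[1] = a[(imm >> 2) & 3];   /* selector x, bits 3:2 */
    r[2] = b[(imm >> 4) & 3];   /* selector y, bits 5:4 */
    r[3] = b[(imm >> 6) & 3];   /* selector z, bits 7:6 */
}
```

With the selectors (3, 2, 1, 0) the constant is 0xE4, and the result takes words 0 and 1 from the first operand and words 2 and 3 from the second.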
457. ums of the two double-precision floating-point values of A and B. F64vec2 simd_max(F64vec2 A, F64vec2 B): R0 := max(A0, B0); R1 := max(A1, B1). Corresponding intrinsic: _mm_max_pd. Compute the maximums of the four single-precision floating-point values of A and B. F32vec4 R = simd_max(F32vec4 A, F32vec4 B): R0 := max(A0, B0); R1 := max(A1, B1); R2 := max(A2, B2); R3 := max(A3, B3). Corresponding intrinsic: _mm_max_ps. Compute the maximum of the lowest single-precision floating-point values of A and B. F32vec1 simd_max(F32vec1 A, F32vec1 B): R0 := max(A0, B0). Corresponding intrinsic: _mm_max_ss. Logical Operators: The following table lists the logical operators of the Fvec classes and generic syntax. The logical operators for F32vec1 classes use only the lower 32 bits. [Table: Fvec Logical Operators Return Value Mapping; columns Bitwise Operation, Operators, Generic Syntax; entries include AND and andnot] The following table lists standard logical operators syntax and corresponding intrinsics. Note that there is no corresponding scalar intrinsic for the F32vec1 classes, which accesses the lower 32 bits of the packed vector intrinsics. Logical Operations for Fvec Classes (columns: Operation, Returns, Example Syntax Usage): AND returns 4 floats, 2 doubles, or 1 float; OR returns 4 floats, 2 doubles, or 1 float; XOR returns 4 floats, 2 doubles, or 1 float; ANDNOT returns 2 doubles. Example Syntax Usage: F32vec4 A & F3
458. un faster if you lower the precision with the pcn option Set n to one of the following values to round the significand to the indicated number of bits e pc32 24 bit significand single precision e pc64 53 bit significand double precision e pc80 64 bit significand long double precision The default value for n is 80 indicating long double precision This option allows full optimization Using this option does not have the negative performance impact of using the Op option because only the fractional part of the floating point value is affected The range of the exponent is not affected The pcn 111 Intel C Compiler for Linux Systems User s Guide option causes the compiler to change the floating point precision control when the main function is compiled The program that uses pcn must use main as its entry point and the file containing main must be compiled with pcn rcd Option The Intel compiler uses the rcd option to improve the performance of code that requires floating point to integer conversions The optimization is obtained by controlling the change of the rounding mode The system default floating point rounding mode is round to nearest This means that values are rounded during floating point calculations However the C language requires floating point values to be truncated when a conversion to an integer is involved To do this the compiler must change the rounding mode to truncation before
459. unctionality is referred to as dynamic dependence testing. Pragma Scope: These pragmas control the vectorization of only the subsequent loop in the program, but the compiler does not apply them to any nested loops. Each nested loop needs its own pragma preceding it in order for the pragma to be applied. You must place a pragma only before the loop control statement. #pragma vector always. Syntax: #pragma vector always. Definition: This pragma instructs the compiler to override any efficiency heuristic during the decision to vectorize or not. #pragma vector always will vectorize non-unit strides or very unaligned memory accesses. #pragma ivdep. Syntax: #pragma ivdep. Definition: This pragma instructs the compiler to ignore assumed vector dependences. To ensure correct code, the compiler treats an assumed dependence as a proven dependence, which prevents vectorization. This pragma overrides that decision. Only use this when you know that the assumed loop dependences are safe to ignore. The loop in this example will not vectorize without the ivdep pragma, since the value of k is not known (vectorization would be illegal if k < 0). Example: #pragma ivdep for (i = 0; i < m; i++) a[i] = a[i + k] * c; #pragma vector. Syntax: #pragma vector {aligned | unaligned}. Definition: The vector loop pragma means the loop should be vectorized, if it is legal
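The ivdep example above can be written out as a complete function. This is a sketch under stated assumptions: the function name scale_shifted and the float element type are hypothetical, and compilers that do not recognize #pragma ivdep are assumed to ignore it (possibly with a warning), in which case the loop simply runs as ordinary scalar code.

```c
/* Writes a[i + k] * c into a[i] for i in [0, m).
 * The caller must guarantee k >= 0 and that a has at least m + k elements.
 * With k >= 0 each element is read before the iteration that overwrites it,
 * so ignoring the assumed dependence is safe. With k < 0 an iteration would
 * read a value produced by an earlier iteration, and the pragma must not
 * be used. */
void scale_shifted(float *a, int k, float c, int m)
{
    int i;
#pragma ivdep
    for (i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
```

The pragma is a promise from the programmer, not a check: the compiler does not verify that k is non-negative.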
460. unctions to eliminate branches, using logical operations, max and min functions, conditional selects, and compares. Consider the following example: short a[4], b[4], c[4]; for (i = 0; i < 4; i++) c[i] = a[i] > b[i] ? a[i] : b[i]; This operation is independent of the value of i. For each i, the result could be either A or B, depending on the actual values. A simple way of removing the branch altogether is to use the select_gt function, as follows: Is16vec4 a, b, c; c = select_gt(a, b, a, b); Caching Hints: Streaming SIMD Extensions provide prefetching and streaming hints. Prefetching data can minimize the effects of memory latency. Streaming hints allow you to indicate that certain data should not be cached. This results in higher performance for data that should be cached. Integer Vector Classes: The Ivec classes provide an interface to SIMD processing using integer vectors of various sizes. The class hierarchy is represented in the following figure. [Figure: Ivec Class Hierarchy] The M64 and M128 classes define the __m64 and __m128i data types, from which the rest of the Ivec classes are derived. The first generation of child classes are derived based solely on bit sizes of 128, 64, 32, 16, and 8, respectively, for the I128vec1, I64vec1, I64vec2, I32vec2, I32vec4, I16vec4, I16vec8, I8vec16, and I8vec8 classes. The latter seven of these classes require specification of signedness
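The branch removal described above relies on turning a comparison into an all-ones or all-zeros mask and then combining the two candidates with logical operations. The sketch below models this for one 16-bit element in scalar C; select_gt_i16 and max4 are hypothetical names, and the real select_gt applies the same idea to all four elements of an Is16vec4 at once.

```c
#include <stdint.h>

/* Scalar model of select_gt(a, b, c, d) for one 16-bit lane:
 * returns c where a > b, otherwise d, without a branch on the data. */
int16_t select_gt_i16(int16_t a, int16_t b, int16_t c, int16_t d)
{
    uint16_t mask = (uint16_t)-(a > b);   /* 0xFFFF if a > b, else 0x0000 */
    return (int16_t)(((uint16_t)c & mask) | ((uint16_t)d & (uint16_t)~mask));
}

/* The loop from the text, rewritten with the mask-based select:
 * c[i] receives the larger of a[i] and b[i]. */
void max4(const int16_t a[4], const int16_t b[4], int16_t c[4])
{
    int i;
    for (i = 0; i < 4; i++)
        c[i] = select_gt_i16(a[i], b[i], a[i], b[i]);
}
```

Because every element follows the same instruction sequence regardless of the comparison result, the four elements can be processed together in one SIMD operation.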
461. unsigned int for loop indexes HLO may skip optimization due to possible subscripts overflow If upper bounds are pointer references assign it to a local variable whenever possible e Is prefetch distance correct Use pragma prefetch to override the distance when it is needed Volume II Optimizing Applications Parallel Programming For parallel programming the Intel C Compiler supports both the OpenMP 2 0 API and an automatic parallelization capability The following table lists the options that perform OpenMP and auto parallelization support Option openmp openmp_report 0 1 2 openmp_stubs parallel par_threshold n par_report 0 1 2 3 d Note Description Enables the parallelizer to generate multithreaded code based on the OpenMP directives Default OFF Controls the OpenMP parallelizer s diagnostic levels Default openmp_reportl Enables compilation of OpenMP programs in sequential mode The OpenMP directives are ignored and a stub OpenMP library is linked Default OFF Enables the auto parallelizer to generate multithreaded code for loops that can be safely executed in parallel Default OFF Sets a threshold for the auto parallelization of loops based on the probability of profitable execution of the loop in parallel n 0 to 100 n 0 implies always Default n 100 Controls the auto parallelizer s diagnostic levels Default par_report1l When both openmp and paral
462. ur cshrc file and add setenv PATH usr bin lt full path to Intel compiler gt Ff Note To use the Intel compiler your makefile must include the setting CC icc Use the same setting on the command line to instruct the makefile to use the Intel compiler If your makefile is written for gcc the GNU C compiler you will need to change those command line options not recognized by the Intel compiler Then you can compile prompt gt make f my_makefile 42 Volume I Building Applications Compiler Input Files The Intel C Compiler recognizes the file name extensions listed in the following table fil fil fil fil fil fil fil fil fil fil fil fil Filename ename ename ename ename ename ename enam nam nam nam nam nam Coooao Oo 0 oooao oO CC CC Cpp CXX Interpretation Object library When you invoke the compiler with icc the i files are treated as C source files The i files are treated as C sources if you compile with icpc Compiled object module Assembly file Shared object file Assembly file that requires preprocessing C language source file C language source file Building Applications in Eclipse The Intel C Compiler for Linux I A 32 only includes a compiler integration with Eclipse and the C C Development Tools CDT This functionality is an optional part of the compiler installation For more information abo
463. ur new library prompt gt icpc main cpp my_lib so See also Intel Shared Libraries and Compiling for Non shared Libraries Default Libraries The following libraries are supplied with the Intel C Compiler Library Description libguide a For OpenMP implementation libguide so libguide_stats a OpenMP static library for the parallelizer tool with performance libguide_stats so statistics and profile information libompstub a Library that resolves references to OpenMP subroutines when OpenMP is not in use libsvml a Short vector math library libirc a Intel support library for PGO and CPU dispatch libimf a Intel math library libimf so libcprts a Dinkumware C Library libcprts so libcprts so 5 libunwind a Unwinder library libunwind so libunwind so 5 libcxa a Intel run time support for C features libcxa so libcxa so 5 libcxaguard a Used for interoperability support with the cxxlib gcc option libcxaguard so See gcc Interoperability libcxaguard so 5 88 Volume I Building Applications When you invoke the cxxlib gcc option the following replacements occur e libcprts is replaced with libstdc from the gec distribution 3 2 or newer e libcxa and libunwind are replaced by libgcc from the gcc distribution 3 2 or newer The Linux system libraries and the compiler libraries are not built with the align option Therefore if you compile with the align option and make a call to
464. urce position OFF information for inlined code Produces additional debug OFF information for scalar local variables using a feature of the DWARF object module format known as location lists Turns on the three debug options OFF e debug inline_info e debug variable_locations Enable recognition of exported OFF templates Supported in C mode only Specifies a directory name for the OFF exported template search path Directs the compiler to select a OFF specific ABI implementation Inline any function at the compiler s OFF discretion Same as ip The fno exceptions option OFF turns off exception handling table generation resulting in smaller code Any use of exception handling constructs try blocks throw statements will produce an error Exception specifications are parsed but ignored A preprocessor symbol EXCEPTIONS is defined when this option is not used It is undefined when this option is present Do not emit code for implicit OFF instantiations of inline templates For C only Intel C Compiler for Linux Systems User s Guide Option Description Default fno implicit templates Never emit code for non inline templates which are instantiated implicitly i e by use only emit code for explicit instantiations For C only ftls model model Change thread local storage model where model can be the following global dynamic local dynamic initial ex
465. urn data type is the nearest common ancestor of operands C and D For conditional select operations using greater than or less than operations the first and second operands must be signed as listed in the table that follows Conditional Select Operator Overloading R Comparison A and B Cc I32vec2 select_eq I s u 32vec2 I s u 32vec2 I s u 32vec2 Il6vec4 If s uJl6vec4 I s u 16vec4 I s u 1l6vec4 I8vec8 R I s u 8vec8 I s u 8vec8 I s u 8vec8 select_gt Is32vec2 Is32vec2 Ts32vec2 select_ge select_l1t Isl6vec4 Isl6vec4 Isl6vec4 I32vec2 Il6vec4 select_le I8vec8 R Is8vec8 Is8vec8 The following table shows the mapping of return values from RO to R7 for any number of elements The same return value mappings also apply when there are fewer than four return values 403 Intel C Compiler for Linux Systems User s Guide Conditional Select Operator Return Value Mapping Return Value A and B Operands C and D operands AO Available Operators BO Debug The debug operations do not map to any compiler intrinsics for MMX TM instructions They are provided for debugging programs only Use of these operations may result in loss of performance so you should not use them outside of debugging Output The four 32 bit values of A are placed in the output buffer and printed in the
466. ushing denormal results to zero Allocation of Zero initialized Variables By default variables explicitly initialized with zeros are placed in the BSS section But using the nobss_init option you can place any variables that are explicitly initialized with zeros in the DATA section if required 79 Intel C Compiler for Linux Systems User s Guide Precompiled Header Files pch The Intel C Compiler supports precompiled header PCH files to significantly reduce compile times using the following options pch create_pch filename use_pch filename pch_dir dirname T ae Depending on how you organize the header files listed in your sources these options may increase compile times See Organizing Source Files to learn how to optimize compile times using the PCH options The pch option directs the compiler to use appropriate PCH files If none are available they are created as sourcefile pchi This option supports multiple source files such as the ones shown in Example 1 Example 1 command line prompt gt icpc pch sourcel cpp source2 cpp Example 1 output when pchi files do not exist sourcel cpp creating precompiled header file sourcel pchi source2 cpp creating precompiled header file source2 pchi Example 1 output when pchi files do exist sourcel cpp using precompiled header file sourcel pchi source2 cpp using precompiled header file source2 pchi f Note The pch option w
467. ut CDT, see http://www.eclipse.org/cdt.

The Intel C++ Compiler integration with the Eclipse/CDT integrated development environment lets you develop, build, and run your C/C++ projects in a visual, interactive environment. This section includes the following topics:
  Starting Eclipse
  Using Online Help in Eclipse
  Creating a New Project
  Setting Properties
  Standard and Managed Make Files

Starting Eclipse
After you have installed the following:
  Intel C++ Compiler for 32-bit applications
  Eclipse integrated development environment
  Java Runtime Environment (JRE)
  C/C++ Development Tools (CDT)
you can execute the iccec shell script to start Eclipse from a directory where you have write permission. With the default compiler installation, execute iccec as follows:
  prompt> /opt/intel_cc_80/bin/iccec

You can also use iccec to pass Eclipse-specific parameters, such as:
  -data <path>     sets the location for the Eclipse workspace
  -showlocation    shows the location of the workspace in the Eclipse window title bar

For example:
  prompt> /opt/intel_cc_80/bin/iccec -data cpp eclipse -showlocation

From the Eclipse Help menu, select Help Contents > Workbench User's Guide > Tasks > Running Eclipse for the complete list of Eclipse startup parameters.

Using Online Help in Eclipse
The Intel C++ Compiler integratio
468. where:
  operator is an operator (for example, &)
  Fvec_Class is any Fvec class (F64vec2, F32vec4, or F32vec1)
  R, A, B are declared Fvec variables of the type indicated

Return Value Notation
Because the Fvec classes have packed elements, the return values typically follow the conventions presented in the Return Value Convention Notation Mappings table. F32vec4 returns four single-precision floating-point values (R0, R1, R2, and R3); F64vec2 returns two double-precision floating-point values; and F32vec1 returns the lowest single-precision floating-point value (R0).

Return Value Convention Notation Mappings
  Columns: Example 1; Example 2; Example 3; F32vec4; F64vec2; F32vec1. (The table cells are not legible in this excerpt.)

Data Alignment
Memory operations using the Streaming SIMD Extensions should be performed on 16-byte-aligned data whenever possible. F32vec4 and F64vec2 object variables are properly aligned by default. Note that floating-point arrays are not automatically aligned. To get 16-byte alignment, you can use the alignment __declspec:
  __declspec(align(16)) float A[4];

Conversions
All Fvec object variables can be implicitly converted to __m128 data types. For example, the results of computations performed on F32vec4 or F32vec1 object variables can be assigned to __m128 data types:
  __m128d mm = A & B;  /* where A, B are F64vec2 object variables */
  __m128 mm = A & B;   /* w
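The Data Alignment advice above can be checked at run time. The following is a minimal sketch in standard C11, which spells the alignment request as alignas(16) rather than the compiler-specific __declspec(align(16)) shown in the excerpt. The names is_aligned16 and aligned_floats are hypothetical and exist only for this illustration.

```c
#include <stdalign.h>
#include <stdint.h>

/* Returns 1 when the address p is a multiple of 16 bytes, 0 otherwise.
   is_aligned16 is a hypothetical helper for this example. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* C11 equivalent of requesting 16-byte alignment for a float array.
   Without the specifier, a float array is only guaranteed 4-byte alignment
   on typical targets. */
alignas(16) float aligned_floats[4];
```

A check such as is_aligned16(aligned_floats) before an aligned SIMD load is one way to catch a misaligned buffer early in a debug build.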
469. while shifting in the sign bit.

__m64 _m_psrawi(__m64 m, int count)
Shift four 16-bit values in m right the amount specified by count while shifting in the sign bit. For the best performance, count should be a constant.

__m64 _m_psrad(__m64 m, __m64 count)
Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit.

__m64 _m_psradi(__m64 m, int count)
Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit. For the best performance, count should be a constant.

__m64 _m_psrlw(__m64 m, __m64 count)
Shift four 16-bit values in m right the amount specified by count while shifting in zeros.

__m64 _m_psrlwi(__m64 m, int count)
Shift four 16-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _m_psrld(__m64 m, __m64 count)
Shift two 32-bit values in m right the amount specified by count while shifting in zeros.

__m64 _m_psrldi(__m64 m, int count)
Shift two 32-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _m_psrlq(__m64 m, __m64 count)
Shift the 64-bit value in m right the amount specified by count while shifting in zeros.

__m64 _m_psrlqi(__m64 m, int count)
Shift the 64-bit value in m right the amount specified by count while shifting in zeros. For the best perf
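The difference between "shifting in the sign bit" and "shifting in zeros" in the descriptions above is easiest to see on a single 16-bit lane. The sketch below is a portable scalar model, not the intrinsics themselves. The names sra16_model and srl16_model are hypothetical, and the handling of counts above 15 (all sign bits for the arithmetic form, zero for the logical form) reflects my understanding of the underlying PSRAW and PSRLW instructions rather than anything stated in this excerpt.

```c
#include <stdint.h>

/* Arithmetic right shift of one 16-bit lane: vacated bits copy the sign bit.
   Assumption: counts above 15 behave like a shift by 15 (all sign bits). */
int16_t sra16_model(int16_t x, unsigned count)
{
    int v = x;
    if (count > 15) {
        count = 15;
    }
    /* Shift the complement for negative values so the result does not rely
       on implementation-defined >> of a negative operand. */
    return (int16_t)(v < 0 ? ~(~v >> count) : (v >> count));
}

/* Logical right shift of one 16-bit lane: vacated bits are zero.
   Assumption: counts above 15 produce 0. */
uint16_t srl16_model(uint16_t x, unsigned count)
{
    return (count > 15) ? (uint16_t)0 : (uint16_t)(x >> count);
}
```

For the same bit pattern 0xFFF0, the arithmetic model treats the value as -16 and returns -4 for a shift by 2, while the logical model returns 0x3FFC.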
470. wise 0 is returned. r a0 b0 0x1 0x0

Conversion Operations for Streaming SIMD Extensions
The conversion operations are listed in the following table, followed by a description of each intrinsic with the most recent mnemonic naming convention. The alternate name is provided in case you have used these intrinsics before. The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h header file.

  Intrinsic Name      Alternate Name      Corresponding Instruction
  _mm_cvt_ss2si       _mm_cvtss_si32      CVTSS2SI
  _mm_cvt_ps2pi       _mm_cvtps_pi32      CVTPS2PI
  _mm_cvtt_ss2si      _mm_cvttss_si32     CVTTSS2SI
  _mm_cvtt_ps2pi      _mm_cvttps_pi32     CVTTPS2PI
  _mm_cvt_si2ss       _mm_cvtsi32_ss      CVTSI2SS
  _mm_cvt_pi2ps       _mm_cvtpi32_ps      CVTPI2PS
  _mm_cvtpi16_ps                          composite
  _mm_cvtpu16_ps                          composite
  _mm_cvtpi8_ps                           composite
  _mm_cvtpu8_ps                           composite
  _mm_cvtpi32x2_ps                        composite
  _mm_cvtps_pi16                          composite
  _mm_cvtps_pi8                           composite

int _mm_cvt_ss2si(__m128 a)
Convert the lower SP FP value of a to a 32-bit integer according to the current rounding mode.
  r := (int)a0

__m64 _mm_cvt_ps2pi(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers according to the current rounding mode, returning the integers in packed form.
  r0 := (int)a0
  r1 := (int)a1

int _mm_cvtt_ss2si(__m128 a)
Convert the lower SP FP value of a to a 32-bit integer with truncation.
  r := (int)a0
471. x

MODF
Description: The modf function returns the value of the signed fractional part of x and stores the integral part at *iptr as a floating-point number.
Calling interface:
  double modf(double x, double *iptr);
  long double modfl(long double x, long double *iptr);
  float modff(float x, float *iptr);

NEARBYINT, RINT
Description: The nearbyint function returns the rounded integral value as a floating-point number, using the current rounding direction.
Calling interface:
  double nearbyint(double x);
  long double nearbyintl(long double x);
  float nearbyintf(float x);

Description: The rint function returns the rounded integral value as a floating-point number, using the current rounding direction.
Calling interface:
  double rint(double x);
  long double rintl(long double x);
  float rintf(float x);

ROUND
Description: The round function returns the nearest integral value as a floating-point number. Halfway cases are rounded away from zero.
Calling interface:
  double round(double x);
  long double roundl(long double x);
  float roundf(float x);

TRUNC
Description: The trunc function returns the truncated integral value as a floating-point number.
Calling interface:
  double trunc(double x);
  long double truncl(long double x);
  float truncf(float x);

Remainder Functions
The Intel Math library supports the following remainder functions:

FMOD
Description: The fmod function r
472. __m64 _mm_cvtt_ps2pi(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers with truncation, returning the integers in packed form.
  r0 := (int)a0
  r1 := (int)a1

__m128 _mm_cvt_si2ss(__m128 a, int b)
Convert the 32-bit integer value b to an SP FP value; the upper three SP FP values are passed through from a.
  r0 := (float)b
  r1 := a1
  r2 := a2
  r3 := a3

__m128 _mm_cvt_pi2ps(__m128 a, __m64 b)
Convert the two 32-bit integer values in packed form in b to two SP FP values; the upper two SP FP values are passed through from a.
  r0 := (float)b0
  r1 := (float)b1
  r2 := a2
  r3 := a3

__inline __m128 _mm_cvtpi16_ps(__m64 a)
Convert the four 16-bit signed integer values in a to four single-precision FP values.
  r0 := (float)a0
  r1 := (float)a1
  r2 := (float)a2
  r3 := (float)a3

__inline __m128 _mm_cvtpu16_ps(__m64 a)
Convert the four 16-bit unsigned integer values in a to four single-precision FP values.
  r0 := (float)a0
  r1 := (float)a1
  r2 := (float)a2
  r3 := (float)a3

__inline __m128 _mm_cvtpi8_ps(__m64 a)
Convert the lower four 8-bit signed integer values in a to four single-precision FP values.
  r0 := (float)a0
  r1 := (float)a1
  r2 := (float)a2
  r3 := (float)a3

__inline __m128 _mm_cvtpu8_ps(__m64 a)
Convert the lower four 8-bit unsigned integer values in a to four single-precision FP values.
  r0 := (float)a0
  r1 := (float)a1
  r2 := (float)a2
  r3 := (float)a3
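The r0 to r3 notation in these descriptions maps directly onto array elements. As a hedged illustration, the scalar models below reproduce the element-wise results described for _mm_cvt_si2ss and _mm_cvtpi16_ps using plain arrays. The function names cvt_si2ss_model and cvtpi16_ps_model are hypothetical, and the code does not call the intrinsics, so it shows only the data movement, not the instructions generated.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of _mm_cvt_si2ss as described above: r0 := (float)b, and the
   upper three elements are passed through from a. Hypothetical helper. */
void cvt_si2ss_model(const float a[4], int b, float r[4])
{
    r[0] = (float)b;
    r[1] = a[1];
    r[2] = a[2];
    r[3] = a[3];
}

/* Model of _mm_cvtpi16_ps as described above: each of the four signed
   16-bit integers becomes one single-precision value. Hypothetical helper. */
void cvtpi16_ps_model(const int16_t a[4], float r[4])
{
    for (size_t i = 0; i < 4; ++i) {
        r[i] = (float)a[i];
    }
}
```

Every 16-bit integer is exactly representable as a float, so the second conversion is lossless; the first can round when b has more than 24 significant bits.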
473. x0

int _mm_cvttsd_si32(__m128d a)
Converts the lower DP FP value of a to a 32-bit signed integer using truncate.
  r := (int)a0

__m64 _mm_cvtpd_pi32(__m128d a)
Converts the two DP FP values of a to 32-bit signed integer values.
  r0 := (int)a0
  r1 := (int)a1

__m64 _mm_cvttpd_pi32(__m128d a)
Converts the two DP FP values of a to 32-bit signed integer values using truncate.
  r0 := (int)a0
  r1 := (int)a1

__m128d _mm_cvtpi32_pd(__m64 a)
Converts the two 32-bit signed integer values of a to DP FP values.
  r0 := (double)a0
  r1 := (double)a1

_mm_cvtsd_f64(__m128d a)
This intrinsic extracts a double-precision floating-point value from the first vector element of an __m128d. It does so in the most efficient manner possible in the context used. This intrinsic does not map to any specific SSE2 instruction.

Floating-point Memory and Initialization Operations for Streaming SIMD Extensions 2
This section describes the load, set, and store operations, which let you load and store data into memory. The load and set operations are similar in that both initialize __m128d data. However, the set operations take a double argument and are intended for initialization with constants, while the load operations take a double pointer argument and are intended to mimic the instructions for loading data from memory. The store operation assigns the initialized data to the address
474. x_Ss

[Flattened intrinsics cross-reference table. Column headings: Alternate Name; Across All IA; MMX(TM) Technology; Streaming SIMD Extensions; Streaming SIMD Extensions 2; Itanium Architecture. Rows legible in this excerpt: max_ps, and_ps, andnot_ps, or_ps, xor_ps, cmpeq_ss, cmpeq_ps, cmplt_ss, cmplt_ps, cmple_ss, cmple_ps, cmpgt_ss, cmpgt_ps, cmpge_ss, cmpge_ps, cmpneq_ss, cmpneq_ps, cmpnlt_ss, cmpnlt_ps, cmpnle_ss, cmpnle_ps, cmpngt_ss, cmpngt_ps, cmpnge_ss, cmpnge_ps, cmpord_ss, cmpord_ps, cmpunord_ss, cmpunord_ps, comieq_ss, comilt_ss. The per-row entries (N/A, A, B) cannot be matched to their rows.]
475. xample, the following command compiles the program sum.c without expanding the library functions, but with inline expansion from interprocedural optimizations (IPO):
  prompt> icpc -ip -nolib_inline sum.cpp
For details on IPO, see Interprocedural Optimizations.

MASM Style Inline Assembly
The Intel C++ Compiler supports MASM style inline assembly with the -use_msasm option. See your MASM documentation for the proper syntax.

GNU-like Style Inline Assembly (IA-32 only)
The Intel C++ Compiler supports GNU-like style inline assembly. The syntax is as follows:
  asm-keyword [ volatile-keyword ] ( asm-template [ asm-interface ] ) ;

Caution: Under the -use_msasm compilation flag, GNU asm aliases will only work if you use the __asm__ keyword; they will not work correctly if you use the alternate __asm or asm keywords.

Syntax Element      Description
asm-keyword         asm statements begin with the keyword asm. Alternatively, either __asm or __asm__ may be used for compatibility. See Caution statement.
volatile-keyword    If the optional keyword volatile is given, the asm is volatile. Two volatile asm statements will never be moved past each other, and a reference to a volatile variable will not be moved relative to a volatile asm. Alternate keywords __volatile and __volatile__ may be used for compatibility.

Syntax elements listed next in the table: asm-template, asm-interface, output-list, input-list, clobber-list, input-spec, clobber-spec.