
Intel® MPI Benchmarks User Guide and Methodology


Contents

1. MPI data type        MPI_BYTE
   Reported timings     t_ovrl, t_pure, t_CPU, and
                        overlap = 100 * max(0, min(1, (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)))
                        For details, see Measuring Communication and Computation Overlap.
   Reported throughput  None

   Ialltoall_pure
   The benchmark for the MPI_Ialltoall function that measures pure communication time. In the case of np (number of processes), every process inputs X*np bytes (X for each process) and receives X*np bytes (X from each process).

   Property             Description
   Measured pattern     MPI_Ialltoall, MPI_Wait
   MPI data type        MPI_BYTE
   Reported timings     Bare time
   Reported throughput  None

   Ialltoallv
   The benchmark for MPI_Ialltoallv that measures communication and computation overlap.

   Property             Description
   Measured pattern     MPI_Ialltoallv, IMB_cpu_exploit, MPI_Wait
   MPI data type        MPI_BYTE
   Reported timings     t_ovrl, t_pure, t_CPU, and
                        overlap = 100 * max(0, min(1, (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)))
                        For details, see Measuring Communication and Computation Overlap.
   Reported throughput  None

   Ialltoallv_pure
   The benchmark for the MPI_Ialltoallv function that measures pure communication time. In the case of np (number of processes), every process inputs X*np bytes (X for each process) and receives X*np bytes (X from each process).

   Property             Description
   MPI data type        MPI_BYTE
2. Collective benchmark Parallel transfer benchmark 1 0 with explicit offset 1 0 with an individual file pointer I O with a shared file pointer I O with an individual file pointer to one private file for each process opened for MPI_COMM_SELF A placeholder for Read or Write component of the benchmark name MPI 2 Benchmarks Non blocking flavor For example bi S_IWrite_indv is the nonblocking flavor of the S_IWrite_indv benchmark Multi The benchmark runs in the multiple mode IMB MPI 2 Benchmark Classification Intel MPI Benchmarks introduces three classes of benchmarks e Single Transfer e Parallel Transfer e Collective Each class interprets results in a different way NOTE The following benchmarks do not belong to any class e Window measures overhead of one sided communications for the MPI_Win_create MPI_Win_free functions e Open_close measures overhead of input output operations for the MPI_File_open MPI_File_close functions Single Transfer Benchmarks This class contains benchmarks of functions that operate on a single data element transferred between one source and one target For MPI 2 benchmarks the source of the data transfer can be an MPI process or in the case of Read benchmarks an MPI file The target can be an MPI process or an MPI file For I O benchmarks the single transfer is defined as an operation between an MPI process and an individual window or a file e Sin
3. (excluded by default)
   PingPing                  Multi-PingPing
   PingPingSpecificSource    Multi-PingPingSpecificSource    (excluded by default)
   Sendrecv                  Multi-Sendrecv
   Exchange                  Multi-Exchange
   Bcast                     Multi-Bcast
   Allgather                 Multi-Allgather
   Allgatherv                Multi-Allgatherv
   Scatter                   Multi-Scatter
   Scatterv                  Multi-Scatterv
   Gather                    Multi-Gather
   Gatherv                   Multi-Gatherv
   Alltoall                  Multi-Alltoall
   Alltoallv                 Multi-Alltoallv
   Reduce                    Multi-Reduce
   Reduce_scatter            Multi-Reduce_scatter
   Allreduce                 Multi-Allreduce
   Barrier                   Multi-Barrier

   Classification of MPI-1 Benchmarks
   Intel MPI Benchmarks introduces the following classes of benchmarks:
   • Single Transfer
   • Parallel Transfer
   • Collective benchmarks
   Each class interprets results in a different way.

   Single Transfer Benchmarks
   Single transfer benchmarks involve two active processes in the communication. The other processes wait for the communication to complete. Each benchmark is run with varying message lengths. The timing is averaged between the two processes. The basic MPI data type for all messages is MPI_BYTE.
   Throughput values are measured in MBps and can be calculated as follows:

       throughput = X / 2^20 * 10^6 / time = X / 1.048576 / time
4. 36 MPI 1 Benchmarks mw SOS Property Description A FFF Measured pattern MPI_Barrier A A gt A gt _ 24 lt Reported timings Bare time AAA Reported throughput 37 MPI 2 Benchmarks Intel MPI Benchmarks provides benchmarks for MPI 2 functions in two components IMB EXT and IMB IO The table below lists all MPI 2 benchmarks available and specifies whether they support the aggregate mode For I O benchmarks the table also lists nonblocking flavors Benchmark Window ulti Window Unidir_Put ulti Unidir_Put Unidir_Get ulti Unidir_Get Bidir_Get ulti Bidir_Get Bidir Put ulti Bidir Put Accumulate Multi Accumulate Benchmark Open_Close Multi Open_Close S_Write_indv Multi S_Write_indv S_Read_indv Multi S_Read_indv S_Write_expl Aggregate Mode Non blocking Mode IMB EXT Supported Supported Supported Supported Supported Aggregate Mode Non blocking Mode IMB IO S_IWrite_indv Supported Multi S_IWrite_indv S_IRead_indv Multi S_IRead_indv Supported S_IWrite_expl 38 MPI 2 Benchmarks Multi S_Write_expl S_Read_expl Multi S_Read_expl P_Write_indv Supported Multi P_Write_indv P_Read_indv Multi P_Read_indv P_Write_expl Supported Multi P_Write_expl P_Read_expl Multi P_Read_expl P_Write_shared Supported Multi P_Write_shared P_Read_shared Multi P_Read_shared P Weite priv
5. 32 1000 64 1000 128 1000 Benchmark Methodology 256 342 1024 2048 4096 8192 16384 32768 65336 131072 262144 524288 1048576 2097152 4194304 1000 1000 1000 1000 1000 1000 1000 1000 640 320 160 80 40 20 10 4 Benchmarking Allreduce processes 2 bytes repetitions 0 1000 4 1000 8 1000 16 1000 32 1000 64 1000 128 1000 256 1000 512 1000 1024 1000 2048 1000 4096 1000 bmn ses t_max usec t_avg psec 107 Intel R MPI Benchmarks User Guide 8192 1000 16384 1000 32768 1000 65536 640 131072 320 262144 160 524288 80 1048576 40 2097152 20 4194304 10 All processes entering MPI_Finalize Sample 2 IMB MPI1 PingPing Allreduce The following example shows the results of the PingPing iste np 6 IMB MPI1 pingping allreduce map 2x3 msglen Lengths multi 0 Lengths file 0 100 1000 10000 100000 1000000 Intel R MPI Benchmark Suite V3 2 2 MPI1 part Date Thu Sep 4 13 26 03 2008 Machine x86_64 System Linux 108 Benchmark Methodology Release 2 6 9 42 ELsmp Version 1 SMP Wed Jul 12 23 32 02 EDT 2006 MPI Version gt 20 29 MPI Thread Environment MPI_THREAD_SINGLI New default behavior from Version 3 2 on the number of iterations per message size is cut down dynamically when a certain run time per message size sa
6. Supported Multi P_Write_priv P_Read_priv Multi P_Read_priv C_Write_indv Supported Multi C_Write_indv C_Read_indv Multi C_Read_indv C_Write_expl Supported Multi C_Write_expl Multi IS_Write_expl S_IRead_expl Multi IS_Read_expl P_IWrite_indv Multi P_IWrite_indv P_IRead_indv Multi P_IRead_indv P_IWrite_expl Multi P_IWrite_expl P_IRead_expl Multi P_IRead_expl P_IWrite_shared Multi P_IWrite_shared P_IRead_shared Multi P_IRead_shared P_IWrite_priv Multi P_IWrite_priv P_IRead_priv Multi P_IRead_priv C_IWrite_indv Multi C_IWrite_indv C_IRead_indv Multi C_IRead_indv C_IWrite_expl Multi C_IWrite_expl 39 Intel R MPI Benchmarks User Guide C_Read_expl Multi C_Read_expl C_Write_shared Supported Multi C_Write_shared C_Read_shared Multi C_Read_shared See Also Benchmark Modes IMB IO Nonblocking Benchmarks Naming Conventions C_IRead_expl Multi C_IRead_expl C_IWrite_shared Multi C_IWrite_shared C_IRead_shared Multi C_IRead_shared MPI 2 benchmarks have the following naming conventions Convention Unidir Bidir expl mn indv aaa shared u priv ACTION 40 Description Unidirectional bidirectional one sided communications These are the one sided equivalents of PingPong and PingPing Single transfer benchmark
7. than one process group can be created.
   Default: no multi selection. Intel MPI Benchmarks run non-multiple benchmark flavors.

   -off_cache cache_size[,cache_line_size] Option
   Use the off_cache flag to avoid cache reuse. If you do not use this flag (default), the communications buffer is the same within all repetitions of one message size sample. In this case, Intel MPI Benchmarks reuses the cache, so throughput results might be unrealistic.
   The argument after off_cache can be a single number (cache_size), two comma-separated numbers (cache_size,cache_line_size), or -1:
   • cache_size is a float for an upper bound of the size of the last level cache, in MB.
   • cache_line_size is assumed to be the size of a last level cache line (can be an upper estimate).
   • -1 indicates that the default values from IMB_mem_info.h should be used. The cache_size and cache_line_size values are assumed to be statically defined in IMB_mem_info.h.
   The sent/received data is stored in buffers of size 2 x MAX(cache_size, message_size). When repetitively using messages of a particular size, their addresses are advanced within those buffers so that a single message starts at least 2 cache lines after the end of the previous message. When these buffers are filled up, they are reused from the beginning (see the sketch below).
   off_cache is effective for IMB-MPI1 and IMB-EXT. It is not recommended to use this option for IMB-IO.
   Examples:
   Use the default values defined in IMB_mem_info.h.
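The buffer cycling that off_cache performs can be pictured with a short sketch. This is an illustration of the scheme described above, not the IMB implementation; the names (pool, pool_size, next_message) and the rounding logic are assumptions.

    #include <stddef.h>

    /* Illustrative sketch of the -off_cache buffer cycling described above. */
    static char  *pool;          /* send/receive pool of size 2 * MAX(cache_size, message_size) */
    static size_t pool_size;
    static size_t offset;        /* current position inside the pool */

    void *next_message(size_t msg_size, size_t cache_line_size)
    {
        /* Advance so that this message starts at least 2 cache lines after the
           end of the previous one, rounded to a cache-line boundary. */
        size_t stride = msg_size + 2 * cache_line_size;
        stride = (stride + cache_line_size - 1) / cache_line_size * cache_line_size;

        if (offset + msg_size > pool_size)   /* pool exhausted: reuse it from the beginning */
            offset = 0;

        void *msg = pool + offset;
        offset += stride;
        return msg;
    }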
8. Benchmarks
   Intel MPI Benchmarks provides two types of benchmarks for nonblocking collective (NBC) routines that conform to the MPI-3 standard:
   • benchmarks for measuring the overlap of communication and computation
   • benchmarks for measuring pure communication time
   TIP: When you run the IMB-NBC component, only the overlap benchmarks are enabled by default. To measure pure communication time, specify the particular benchmark name or use the -include command-line parameter to run the _pure flavor of the benchmarks.
   The following table lists all IMB-NBC benchmarks:

   Benchmarks Measuring Communication and        Benchmarks Measuring Pure Communication
   Computation Overlap (Enabled by Default)      Time (Disabled by Default)
   Ibcast                                        Ibcast_pure
   Iallgather                                    Iallgather_pure
   Iallgatherv                                   Iallgatherv_pure
   Igather                                       Igather_pure
   Igatherv                                      Igatherv_pure
   Iscatter                                      Iscatter_pure
   Iscatterv                                     Iscatterv_pure
   Ialltoall                                     Ialltoall_pure
   Ialltoallv                                    Ialltoallv_pure
   Ireduce                                       Ireduce_pure
   Ireduce_scatter                               Ireduce_scatter_pure
   Iallreduce                                    Iallreduce_pure
   Ibarrier                                      Ibarrier_pure

   See Also
   Measuring Communication and Computation Overlap
   Measuring Pure Communication Time

   Measuring Communication and Computation Overlap
   The semantics of nonblocking collective operations enables you to run inter-process communication in the background while performing computations. However, the actual overlap depends on the particular MPI library
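The way the overlap benchmarks combine a nonblocking collective with CPU activity can be sketched as follows. This is a simplified illustration of the measured pattern, not the IMB source; compute_for is a hypothetical stand-in for the IMB routine IMB_cpu_exploit, and warm-up, buffer setup, and error handling are omitted.

    #include <mpi.h>

    /* Hypothetical stand-in for IMB_cpu_exploit: burn roughly t_cpu seconds of CPU time. */
    void compute_for(double t_cpu);

    /* Sketch of how an overlap benchmark (modeled here on Ibcast) obtains t_ovrl:
       start the nonblocking collective, run the CPU load concurrently, then wait.
       t_pure and t_CPU are measured in separate passes without the other component. */
    double measure_t_ovrl(char *buf, int count, int root, MPI_Comm comm,
                          int n_sample, double t_cpu)
    {
        MPI_Request req;
        double t0 = MPI_Wtime();

        for (int i = 0; i < n_sample; i++) {
            MPI_Ibcast(buf, count, MPI_BYTE, root, comm, &req);
            compute_for(t_cpu);                  /* computation overlapped with the transfer */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        return (MPI_Wtime() - t0) / n_sample;    /* t_ovrl, in seconds */
    }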
9. Description of Windows version 320714 003 3 1 Hour new benchmarks 107 2007 Scatter v Gather v e IMB IO functional fix The following topics were added 320714 004 3 2 e Runtime control as default 08 2008 e Microsoft Visual Studio solution templates 320714 005 3 2 1 The following updates were added 04 2010 14 Introduction 320714 006 320714 007 320714 008 320714 009 320714 010 3 2 2 3 2 3 3 2 4 3 2 4 4 0 Beta e Fix of the memory corruption e Fix in accumulate benchmark related to using the CHECK conditional compilation macro e Fix for integer overflow in dynamic calculations on the number of iterations e Recipes for building IA 32 executable files within Microsoft Visual Studio 2005 and Microsoft Visual Studio 2008 project folders associated with the Intel MPI Benchmarks The following updates were added e Support for large buffers greater than 2 GB for some MPI benchmark 09 2010 e New benchmarks PingPongSpecificSource and PingPingSpecificSource e New options include exclude The following topics were updated and added e Changes in the Intel MPI Benchmarks 3 2 3 e Command line Control 08 2011 e Parameters Controlling IMB e Microsoft Visual Studio 2010 project folder support The following updates were added 06 2012 e Changes of document layout The following updates were added 06 2013 e Merged What s new section Documented new benchmarks that 1
10. Visual Studio 2010 project folder Changes in Intel MPI Benchmarks 3 2 2 This release includes the following updates as compared to the Intel MPI Benchmarks 3 2 1 e Support for large buffers greater than 2 GB for some MPI collective benchmarks Allgather Alltoall Scatter Gather to support large core counts e New benchmarks PingPongSpecificSource and PingPingSpecificSource The exact destination rank is used for these tests instead of MPI_ANY_ SOURCE as in the PingPong and 10 Introduction PingPing benchmarks These are not executed by default Use the include option to enable the new benchmarks For example mpirun n 2 IMB MPI include PingPongSpecificSource PingPingSpecificSource e New options include exclude for better control over the benchmarks list Use these options to include or exclude benchmarks from the default execution list Changes in Intel MPI Benchmarks 3 2 1 This release includes the following updates as compared to the Intel MPI Benchmarks 3 2 e Fix of the memory corruption issue when the command line option msglen is used with the Intel MPI Benchmarks executable files e Fixin the accumulated benchmark related to using the CHECK conditional compilation macro e Fix for the integer overflow in dynamic calculations on the number of iterations e Recipes for building IA 32 executable files within Microsoft Visual Studio 2005 and Microsoft Visual Studio 2008 project folders
11. associated with the Intel MPI Benchmarks Changes in Intel MPI Benchmarks 3 2 Intel MPI Benchmarks 3 2 has the following changes as compared to the previous version e The default settings are different e Microsoft Visual Studio project folders are added and can be used on the Microsoft Windows platforms e Makefiles for the Microsoft Windows nmake utility provided with the Intel MPI Benchmarks 3 1 are removed Run Time Control by Default The improved run time control that is associated with the time flag This is the default value for the Intel MPI Benchmarks executable files with a maximum run time per sample set to 10 seconds by the SECS_PER_SAMPLE parameter in the include file IMB_settings h Makefiles The nmake files for Windows OS were removed and replaced by Microsoft Visual Studio solutions The Linux OS Makefiles received new targets e Target MPI1 default for building IMB MP11 e Target EXT for building IMB EXT e Target 10 for building IMB IO e Target all for building all three of the above 11 Intel R MPI Benchmarks User Guide Microsoft Visual Studio Project Folders Intel MPI Benchmarks 3 2 contains Microsoft Visual Studio solutions based on an installation of the Intel MPI Library A dedicated folder is created for the Microsoft Windows OS without duplicating source files The solutions refer to the source files that are located at their standard l
12. between in the figure below Measured pattern Gi o MPI routines MPI_Isend MPI_Waitall MPI_Recv m S MPI data type MPI_BYTE SOS Reported timings time At in psec 4X 1 048576 time Reported throughput Exchange Pattern MPI_Isend MPI_Isend MPI_Recv MPI_Recv MPI_Waitall MPI_Isend MPI_Isend MPI_Recv MPI_Recv MPI_Waitall MPI_Isend MPI_Isend MPI_Recv MPI_Recv MPI_Waitall carries x bytes Collective Benchmarks The following benchmarks belong to the collective class e Bcast multi Bcast e Allgather multi Allgather e Allgatherv multi Allgatherv e Alltoall multi Alltoall 31 Intel R MPI Benchmarks User Guide e Alltoallv multi Alltoallv e Scatter multi Scatter e Scatterv multi Scatterv e Gather multi Gather e Gatherv multi Gatherv e Reduce multi Reduce e Reduce_scatter multi Reduce_scatter e Allreduce multi Allreduce e Barrier multi Barrier See sections below for definitions of these benchmarks Reduce The benchmark for the MPI_Reduce function It reduces a vector of length L X sizeof float float items The MPI data type is MPI_FLOAT The MPI operation is MPI_SUM The root of the operation is changed round robin Property Description Measured pattern MPI_Reduce MPI data type MPI_FLOAT MPI operation MPI_SUM Root i num_procs in iteration i Reported timings Bare time Reported throughput None Reduce_scatter The benchmark for the M
13. bytes 2X bytes 17 Intel R MPI Benchmarks User Guide NOTE If you do not select the cache flag add 2X cache size to all of the above For IMB IO benchmarks make sure you have enough disk space available e 16MB in the standard mode e Max X OVERALL VOL bytes in the optional mode For instructions on enabling the optional mode see Parameters Controlling Intel MPI Benchmarks Software Requirements To run the Intel MPI Benchmarks you need e cpp ANSI C compiler gmake on Linux OS or Unix OS e Enclosed Microsoft Visual C solutions as the basis for Microsoft Windows OS e MPI installation including a startup mechanism for parallel MPI programs Installing Intel MPI Benchmarks To install the Intel MPI Benchmarks unpack the installation file The installation directory structure is as follows e ReadMe_IMB txt doc documentation directory that contains the User s guide in PDF and HTML Uncompressed Help formats e IMB_Users_Guide pdf e IMB_Users_Guide htm e license license agreement directory that contains the following files e license txt specifies the source code license granted to you e use of trademark license txt specifies the license for using the name and or trademark of the Intel MPI Benchmarks e src program source and Make files e WINDOWS Microsoft Visual Studio solution files For basic instructions on how to use the Intel MPI Benchma
14. Iallgatherv_pure
    Iallreduce
    Iallreduce_pure
    Ialltoall
    Ialltoall_pure
    Ialltoallv
    Ialltoallv_pure
    Ibarrier
    Ibarrier_pure
    Ibcast
    Ibcast_pure
    Igather
    Igather_pure
    Igatherv
    Igatherv_pure
    Ireduce
    Ireduce_pure
    Ireduce_scatter
    Ireduce_scatter_pure
    Iscatter
    Iscatter_pure
    Iscatterv
    Iscatterv_pure
    IMB-RMA Benchmarks
        IMB-RMA Benchmark Modes
        Classification of IMB-RMA Benchmarks
        Accumulate
        All_get_all
        All_put_all
        Bidir_get
        Bidir_put
        Compare_and_swap
        Exchange_get
        Exchange_put
        Fetch_and_op
        Get_accumulate
        Get_all_local
        Get_local
        One_put_all
        One_get_all
        Put_all_local
        Put_local
        Truly_passive_put
15. iterations does not change during the execution You can also set the policy through the iter option See iter Default ITER_POLICY value defined in IMB_settings h The default policy is dynamic time Option Specifies the number of seconds for the benchmark to run per message size The argument after time is a floating point number The combination of this flag with the iter flag or its default alternative ensures that the Intel MPI Benchmarks always chooses the maximum number of repetitions that conform to all restrictions A rough number of repetitions per sample to fulfill the time request is estimated in preparatory runs that use 1 second overhead Default time is activated The floating point value specifying the run time seconds per sample is set in the SECS_PER_SAMPLE variable defined in IMB_settings h IMB_settings_io h The current value is 10 mem Option Specifies the number of GB to be allocated per process for the message buffers benchmarks message If the size is exceeded a warning is returned stating how much memory is required for the overall run not to be interrupted The argument after mem is a floating point number Default the memory is restricted by MAX_MEM_USAGE defined in IMB_mem_info h input lt File gt Option Use the ASCII input file to select the benchmarks For example the IMB_SELECT_EXT file looks as follows IMB benchmark selection fi
16. same locations for multiple transfers The variation of m provides important information about the system and the MPI implementation crucial for application code optimizations For example the following possible internal strategies of an implementation could influence the timing outcome of the above pattern 43 Intel R MPI Benchmarks User Guide e Accumulative strategy Several successive transfers up to M in the example above are accumulated without an immediate completion At certain stages the accumulated transfers are completed as a whole This approach may save time of expensive synchronizations This strategy is expected to produce better results in the aggregate case as compared to the non aggregate one e Non accumulative strategy Every Transfer is completed before the return from the corresponding function The time of expensive synchronizations is taken into account This strategy is expected to produce equal results for aggregate and non aggregate cases Assured Completion of Transfers Following the MPI standard assured completion of transfers is the minimum sequence of operations after which all processes of the file communicator have a consistent view after a write The aggregate and non aggregate modes differ in when the assured completion of data transfers takes place e after each transfer non aggregate mode e after a bunch of multiple transfers aggregate mode For Intel MPI Benchmarks assured complet
17. size except for Barrier Tmax Tmin and Tavg Results for the multiple mode e multi 0 the same as above with min avg over all groups e multi 1 the same for all groups max min avg over single groups Sample 1 IMB MPI1 PingPong Allreduce The following example shows the results of the PingPong and Allreduce benchmark lt gt np 2 IMB MPI1 PingPong Allreduce Intel R MPI Benchmark Suite V3 2 MPI1 part Date Thu Sep 4 13 20 07 2008 Machine x86_64 System Linux Release 2 6 9 42 ELsmp Version 1 SMP Wed Jul 12 23 32 02 EDT 2006 MPI Version 2 0 MPI Thread Environment MPI_THREAD SINGLE New default behavior from Version 3 2 on the number of iterations per message size is cut down dynamically when a certain run time per message size sample is expected to be exceeded Time limit is defined by variable SECS_PER_SAMPLE gt IMB_settings h or through the flag gt time 105 Intel R MPI Benchmarks User Guide Calling sequence was IMB MPI1 PingPong Allreduce Minimum message length in bytes 0 Maximum message length in bytes 4194304 MPI_Datatype MPI_BYTE MPI_Datatype for reductions 3 MPI_FLOAT MPI_Op E MPI_SUM List of Benchmarks to run PingPong Allreduce Benchmarking PingPong processes 2 bytes repetitions t usec Mbytes sec 106 0 1000 1 1000 2 1000 4 1000 8 1000 16 1000
18. throughput MBps One_put_all This benchmark tests the MPI_Put operation using one active process that transfers data to all other processes All target processes are waiting in the MPI_Barrier call while the origin process performs the transfers Property Description N MPI_Put MPI_Win_flush_all where N is the number of target Measured pattern processes MPI data type MPI_BYTE origin and target Reported timings Bare time Reported throughput MBps 84 MPI 3 Benchmarks One_get_all This benchmark tests the MPI_Get operation using one active process that gets data from all other processes All target processes are waiting in the MPI_Barrier call while the origin process accesses their memory Property Description N MPI_Get MPI_Win_flush_all where N is the number of target Measured pattern processes MPI data type MPI_BYTE origin and target Reported timings Bare time Reported throughput MBps Put_all_local This benchmark tests the MPI_Put operation where one active process transfers data to all other processes All target processes are waiting in the MPI_Barrier call while the origin process performs the transfers The completion of the origin process is ensured by the MPI_Win_flush_local_all operation Property Description N MPI_Put MPI_Win_flush_local_all where N is the number of target processes Measured pattern MPI data type MPI_BYTE origin and target Reported timings B
19. where:
    • time is measured in µsec
    • X is the length of a message in bytes

    Parallel Transfer Benchmarks
    Parallel transfer benchmarks involve more than two active processes in the communication. Each benchmark runs with varying message lengths. The timing is averaged over multiple samples. The basic MPI data type for all messages is MPI_BYTE. The throughput calculations of the benchmarks take into account the multiplicity nmsg of messages outgoing from or incoming to a particular process. For the Sendrecv benchmark, a particular process sends and receives X bytes, so the turnover is 2X bytes and nmsg = 2. For the Exchange benchmark, the turnover is 4X bytes and nmsg = 4.
    Throughput values are measured in MBps and can be calculated as follows (see the sketch below):

        throughput = nmsg * X / 2^20 * 10^6 / time = nmsg * X / 1.048576 / time

    where:
    • time is measured in µsec
    • X is the length of a message in bytes

    Collective Benchmarks
    Collective benchmarks measure MPI collective operations. Each benchmark is run with varying message lengths. The timing is averaged over multiple samples. The basic MPI data type for all messages is MPI_BYTE for pure data movement functions and MPI_FLOAT for reductions. Collective benchmarks show bare timings.
    The following table lists the MPI-1 benchmarks in each class:

    Single Transfer             Parallel Transfer    Collective
    PingPong                    Sendrecv             Bcast, Multi-Bcast
    PingPongSpecificSource      Exchange             Allgather
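The helper below simply evaluates the throughput formulas above; it is not part of IMB. Pass nmsg = 1 for single transfer benchmarks, 2 for Sendrecv, and 4 for Exchange.

    /* Throughput in MBps: X is the message length in bytes, time_usec the
       measured time in microseconds, nmsg the message multiplicity. */
    double throughput_mbps(double x_bytes, double time_usec, int nmsg)
    {
        /* X / 2^20 converts bytes to MB; * 10^6 / time converts per-microsecond to per-second */
        return nmsg * x_bytes / 1048576.0 * 1e6 / time_usec;
    }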
20. 0 50 50 50 50 50 50 32 16 EGAT t_max t_avg Mb sec 113 Intel R MPI Benchmarks User Guide 16777216 1 Benchmarking P_Write_Indv processes 2 4 MODE NON AGGREGATE bytes rep s t_min psec t_max t_avg Mb sec 0 10 1 10 2 0 4 10 8 0 16 10 32 10 64 10 128 10 256 10 31 2 10 1024 10 2048 10 4096 10 8192 10 16384 10 32768 10 65536 10 131072 10 262144 10 524288 10 1048576 10 114 Benchmark Methodology 2097192 8 4194304 4 8388608 2 16777216 1 All processes entering MPI_Finalize Sample 4 IMB EXT exe The example below shows the results for the window benchmark received after running IMB EXT exe on a Microsoft Windows cluster using two processes The performance diagnostics for Unidir_Get Unidir_Put Bidir_Get Bidir_Put and Accumulate are omitted lt gt N 2 IMB EXT exe Intel R MPI Benchmark Suite V3 2 2 MPI 2 part Date Fri Sep 05 12 26 52 2008 Machine Intel64 Family 6 Model 15 Stepping 6 Genuinelntel System Windows Server 2008 Release 0 6001 Version Service Pack 1 MPI Version 2 0 MPI Thread Environment MPI_THREAD_ SINGLE New default behavior from Version 3 2 on the number of iterations per message size is cut down dynamically when a certain run time per message size sample is expected to be exceeded Time limit is defined by variable SECS_PER
21. 0 This release includes the following updates as compared to the Intel MPI Benchmarks 3 2 4 e Introduced new components IMB NBC and IMB RMA that conform to the MPI 3 0 standard e Introduced a new feature to set the appropriate policy for automatic calculation of iterations You can set the policy using the iter and iter_policy options e Added new targets to the Linux OS Makefiles e NBC for building IMB NBC e RMA for building IMB RMA e Updated Microsoft Visual Studio solutions to include the IMB NBC and IMB RMA targets e Support for the Microsoft Visual Studio 2013 Microsoft Visual Studio 2008 support is removed Changes in Intel MPI Benchmarks 3 2 4 This release includes the following updates as compared to the Intel MPI Benchmarks 3 2 3 e Changes of document layout Changes in Intel MPI Benchmarks 3 2 3 This release includes the following updates as compared to the Intel MPI Benchmarks 3 2 2 e Option msglog to control the message length Use this option to control the maximum and the second largest minimum of the message transfer sizes The minimum message transfer size is always 0 e Thread safety support in the MPI initialization phase Use MPI_Init by default because it is supported for all MPI implementations You can choose MPI_Init_thread by defining the appropriate macro e Option thread_level to specify the desired thread level support for MPI_Init_thread e Support for the Microsoft
22. conform to the MPI-3.0 standard (10/2013).

    Document number 320714-011, revision 4.0 Beta Update 1 (03/2014). The following updates were added:
    • New option -iter_policy
    • Changes in the -iter option
    Document number 320714-012, revision 4.0 (05/2014): document enhancements.

    Related Information
    For more information, see the following related resources:
    Intel MPI Benchmarks Download
    Intel MPI Library Product

    Installation and Quick Start
    This section explains how to install and start using the Intel MPI Benchmarks.

    Memory and Disk Space Requirements
    The table below lists memory requirements for benchmarks run with the default settings (standard mode) and with the user-defined settings (optional mode). In this table:
    • Q is the number of active processes
    • X is the maximal size of the passing message

    Benchmarks                                Standard Mode    Optional Mode
    Alltoall                                  Q*8 MB           Q*2X bytes
    Allgather, Allgatherv                     (Q+1)*4 MB       (Q+1)*X bytes
    Exchange                                  12 MB            3X bytes
    All other MPI-1 benchmarks                8 MB             2X bytes
    IMB-EXT                                   80 MB            2*max(X, OVERALL_VOL) bytes
    IMB-IO                                    32 MB            3X bytes
    Ialltoall, Ialltoall_pure                 Q*8 MB           Q*2X bytes
    Iallgather, Iallgatherv,
    Iallgather_pure, Iallgatherv_pure         (Q+1)*4 MB       (Q+1)*X bytes
    All other IMB-NBC benchmarks              8 MB             2X bytes
    Compare_and_swap                          12 B             12 B
    Exchange_put, Exchange_get                16 MB            4X bytes
    All other IMB-RMA benchmarks              8 MB             2X bytes
23. 2 Benchm rk Modes u ee ana ee she 43 Assured Completion of Transfers 00 0 0 rene tenes 44 IMB EXT Benchmarks an e 2 la iba 44 Unidir Put au Be He RE uaa re ne tania hanes eh arctan oie een covets 44 Unidir Geb ieteka re Wake ted Pete REEL vad RR ae Wen ered eed FERNER deed Rhein Rare 45 Bidir a wake De ee 46 Bidir iii ih ae A AAA AAA 47 Accumulate ne a ed acacia HET ee 48 WindoW east ala ernennt 49 IMB Blocking Benchmarks ivi nee le 50 S ACTONI NAV sn a ir ed are 51 S LACHONTZERPI A ee ee ee ee 52 P FACHON INAV ae ah rear Ra chy Karen nee abend 53 P ACTION lada as 55 P TACTMONI Shared eseo o od Soe al ad 56 PACTON Priv o tos lacio 58 Ci IACTONI SINAV a a re A Pe A A Pe 59 CL PACTON I Xp atico coe thd ned eons teehee dd dato 59 E TACTILON Shared sissies nt Ai 60 Opena ElO ii dto 60 IMB IO Non blocking Benchmarks ooocococococococononnnnnncncnnnononorornrnrnnnrnn tne tae 61 Exploiting GPU ac a dd das 62 Displaying Results nad ted eee 62 MPI 3 Benchmarks u u2u20 uananannnnnnnnananananan an un un un un un nn nnnananananananunnnnanan an an an an un un un un un nnnnnanananenn 63 IMB NBE BenchMarks viii een ie eed Denen eng 63 Measuring Communication and Computation Overlap 2 444rHnen ernennen nenn nn 64 Measuring Pure Communication Time sereseserenenenennnnennnnnnnenennnnnnn nenn nennen 65 lallgather an anne is A ON 65 lallgather DUPE maca 2 Halbe nike kein 66 k llgather Vis HERE eb io aa
24. 5 Single Transfer Benchmarks 44HnHnenennnnnnnnn eee teens 25 Parallel Transfer Benchmarks 0 anne nennen nennen nn n nennen 25 Collective Benchmarks Teros 0 rel Bra ri rn ale 26 Single Transfer Benchmarks eich sn Hera nee er ne rn era 27 PingPong PingPongSpecificSource uunsesesesnnennnenenennnnenennnnnn nennen nennen nennen nenn 27 PingPing PingPingSpecifiCSOUrCe nn nennen nennen nnennnnnnn nenn nnnn nn 28 Parallel Transfer Benchmarks siisii eiri a nn eee ener nennen en een 29 SAM een a ab o ee 29 Exchange 2 ana ni nn Soa dd 30 Collective Benchmark S ii ee ne nee 31 REUS A A dl ee A ed Be A 32 Reduce scatter aa daa 32 AUF DU CO a A e a he 33 O nn 33 Allga the ias 34 o ON 34 a A 35 A AAN 35 Legal Information CAM pin Ae anv back atin bees Seah Ae TS nek SOAP Aad EA Sea AE ee 35 ACOA KERNE RER FU ad iaa tanaka sama 36 BCaSE airy ein tae a o ei Date eal oneal ce Sk thon a ad eh 36 BGI Gir a a dais die A o diia 36 MPI 2 Benchma arks 2 us2 20000 ann ann ne nn a m ma am nn a ann m a mn ann a aan mann nah ana 38 Naming GONVeNEIONS u ea on 40 IMB MPI 2 Benchmark Classification ururererennnnnnnnnnnnnnnnnn een enna 41 Single Transfer BenchmarkS ooococccncncnccoconononnnnnnnnnno nett Taa 41 Parallel Transfer BenchMarKkS 0 tenet nets 41 Collective Benchmarks nn een ne en ne nn an nen nn nn ao 42 MPI 2 Benchmarks Classification ne ee een 42 MPI
25. Single Transfer    Parallel Transfer     Collective                    Other
                       P_[I]Read_shared      C_[I]Read_indv                Window, Multi-Window
                       P_[I]Write_priv       C_[I]Read_expl                Open_close, Multi-Open_close
                       P_[I]Read_priv        C_[I]Write_shared
                                             C_[I]Read_shared
                                             Multi-C_[I]Write_indv
                                             Multi-C_[I]Write_shared

    MPI-2 Benchmark Modes
    MPI-2 benchmarks can run in the following modes:
    • Blocking/nonblocking mode. These modes apply to the IMB-IO benchmarks only. For details, see the sections IMB-IO Blocking Benchmarks and IMB-IO Nonblocking Benchmarks.
    • Aggregate/non-aggregate mode. Non-aggregate mode is not available for nonblocking flavors of IMB-IO benchmarks.
    The following example illustrates aggregation of M transfers for IMB-EXT and blocking Write benchmarks (a C sketch of this pattern appears below):

        Select a repetition count M
        time = MPI_Wtime();
        issue M disjoint transfers
        assure completion of all transfers
        time = (MPI_Wtime() - time) / M;

    In this example:
    • M is a repetition count: M = 1 in the non-aggregate mode; M = n_sample in the aggregate mode. For the exact definition of n_sample, see the Actual Benchmarking section.
    • A transfer is issued by the corresponding one-sided communication call (for IMB-EXT) and by an MPI-IO write call (for IMB-IO).
    • Disjoint means that multiple transfers (if M > 1) are to/from disjoint sections of the window or file. This permits avoiding misleading optimizations when using the
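A minimal sketch of the aggregate/non-aggregate timing pattern above for an IMB-EXT style benchmark, assuming MPI_Put transfers completed by one MPI_Win_fence. This is illustrative, not the IMB source; buffer and window setup are omitted.

    #include <mpi.h>

    /* M = 1 gives the non-aggregate mode, M = n_sample the aggregate mode. */
    double time_puts(char *origin, MPI_Win win, int target, int msg_size, int M)
    {
        double t = MPI_Wtime();

        for (int m = 0; m < M; m++) {
            /* disjoint sections of the origin buffer and of the target window */
            MPI_Put(origin + (MPI_Aint)m * msg_size, msg_size, MPI_BYTE,
                    target, (MPI_Aint)m * msg_size, msg_size, MPI_BYTE, win);
        }
        MPI_Win_fence(0, win);                 /* assured completion of all transfers */

        return (MPI_Wtime() - t) / M;          /* time per transfer */
    }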
26. Guide lt install dir gt refers to the Intel MPI Library installation directory NOTE If you have already sourced ictvars sh for the Bourne command line shell you can skip the sourcing of the environment variables controlled by compilervars sh and mpivars sh 2 Build the Intel MPI Benchmarks for the target system based on the Intel MIC Architecture host cd lt path to IMB directory gt src host make f make_ict_mic For details on running the resulting executable files on the Intel MIC Architecture see the Intel MPI Library documentation See Also Running Intel MPI Benchmarks Building Intel MPI Benchmarks on Windows OS To build the benchmarks for IMB MP11 IMB IO IMB EXT IMB NBC Or IMB RMA follow these steps 1 Check the environment variable settings for Include Lib and Path Make sure they are set in accordance with this table Intel 64 Architecture Settings 1A 32 Architecture Settings I_MPI_ROOT intel64 include I_MPI_ROOT ia32 include I_MPI_ROOT intel64 lib I_MPI_ROOT S ia32 lib I_MPI_ROOT intel64 kin I_MPI_ROOT ia32 bin NOTE Intel MPI Library 5 0 does not support the IA 32 architecture Use an earlier version of Intel MPI Library to build IA 32 architecture benchmarks 2 Go to the subfolder that corresponds to the Intel MPI Benchmarks component you would like to build and the Microsoft Visual Studio version installed on your system For example to build IMB EXT exe
27. Intel MPI Benchmarks User Guide and Methodology Description Copyright 2004 2014 Intel Corporation All Rights Reserved Document number 320714 012EN Revision 4 0 Contents Legal Information uuuu2nanananananunununununnnnununanananannnnnnnn nn an an an an an un un un un un un nnnananananananunnun ann anann 6 Getting Help and Support uuuauauanananananunununununununnnnununanananananununnnnnnananananan edapena enaa aniisi naniu 8 SUBMItHING ISSUES a se A O a 8 Introduction uuuanananunununnnnananananununununununnnnananananananununnnnanan an an an an un un ee un un un Aana raaa e aa rnana raa 9 Intended AUGIenGe an era ee rl ah RD A Gd eee Pee 9 Introducing Intel R MPI Benchmarks s2sseseeseesnennennnnnnenen nennen nr narran arar narran sense nnnnn nenn 9 Whats NeW nr res NR rinnen een 10 Changes in Intel MPI Benchmarks 4 0 zs2susenenenenennnnennnnnnnnennnnn nennen nn en en 10 Changes in Intel MPI Benchmarks 3 2 4 2zususesenenenennnnennnnnnnnnennnnn nn nn nennen nennen 10 Changes in Intel MPI Benchmarks 3 2 3 2z2s4s4renenenennnnenennennnennnnn nn nn nn nn en een nn 10 Changes in Intel MPI Benchmarks 3 2 2 2z2s4senenenenennnnnnnnnnnnnnnnnnn nenn nennen een nn 10 Changes in Intel MPI Benchmarks 3 2 1 z4rHrererennnnennnnnnnnnnnnn nennen nn en en een 11 Changes in Intel MPI Benchmarks 3 2 2ususesenenenenennnnennnnnnnennnnn nn nn nennen ne
28. Intel Atom Inside Intel Core Intel Inside Intel Insider the Intel Inside logo Intel NetBurst Intel NetMerge Intel NetStructure Intel SingleDriver Intel SpeedStep Intel Sponsors of Tomorrow the Intel Legal Information Sponsors of Tomorrow logo Intel StrataFlash Intel vPro Intel XScale InTru the InTru logo the InTru Inside logo InTru soundmark Itanium Itanium Inside MCS MMX Moblin Pentium Pentium Inside Puma skoool the skoool logo SMARTi Sound Mark Stay With It The Creators Project The Journey Inside Thunderbolt Ultrabook vPro Inside VTune Xeon Xeon Inside X GOLD XMM X PMU and XPOSYS are trademarks of Intel Corporation in the U S and or other countries Other names and brands may be claimed as the property of others Microsoft Windows and the Windows logo are trademarks or registered trademarks of Microsoft Corporation in the United States and or other countries Java is a registered trademark of Oracle and or its affiliates Copyright O 2004 2014 Intel Corporation All rights reserved Intel s compilers may or may not optimize to the same degree for non Intel microprocessors for optimizations that are not unique to Intel microprocessors These optimizations include SSE2 SSE3 and SSSE3 instruction sets and other optimizations Intel does not guarantee the availability functionality or effectiveness of any optimization on microprocessors not manufactured by Intel Microprocessor d
29. Intel sales office or your distributor to obtain the latest specifications and before placing your product order Copies of documents which have an order number and are referenced in this document or other Intel literature may be obtained by calling 1 800 548 4725 or go to http www intel com design literature htm Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products MPEG 1 MPEG 2 MPEG 4 H 261 H 263 H 264 MP3 DV VC 1 MJ PEG AC3 AAC G 711 G 722 G 722 1 G 722 2 AMRWB Extended AMRWB AMRWB G 167 G 168 G 169 G 723 1 G 726 G 728 G 729 G 729 1 GSM AMR GSM FR are international standards promoted by ISO IEC ITU ETSI 3GPP and other organizations Implementations of these standards or the standard enabled platforms may require licenses from various entities including Intel Corporation BlueMoon BunnyPeople Celeron Celeron Inside Centrino Centrino Inside Cilk Core Inside E GOLD Flexpipe i960 Intel the Intel logo Intel AppUp Intel Atom
30. MPI_Put disjoint MPI_Win_fence Unidir_Get This is the benchmark for the MP1_Get Unidir_Get Definition Property Measured pattern MPI_Win_fence Description As symbolized between in the figure below This benchmark runs on two active processes Q 2 45 Intel R MPI Benchmarks User Guide MPI routine MPI data type Reported timings Reported throughput Unidir_Get Pattern PROCESS 1 Mfold MPI Get disjoint MPI Win fence Bidir_Put MPI_Get MPI_BYTE for both origin and target t t M in psec as indicated in the figure below non aggregate M 1 and aggregate M n_sample For details see Actual Benchmarking x t aggregate and non aggregate PROCESS 2 This is the benchmark for the MPI_Put function with bidirectional transfers See the basic definitions below Bidir_Put Definition Property Measured pattern MPI routine MPI data type 46 Description As symbolized between in the figure below This benchmark runs on two active processes Q 2 MPI_ Put MPI_BYTE for both origin and target MPI 2 Benchmarks Reported timings Reported throughput Bidir_Get This is the benchmark for the MPI_Get function definitions and a schematic view of the pattern Bidir_Get Definition Property Measured pattern MPI routine MPI data type Reported timings Reported throughput Bidir_Get Pattern t t M in usec as indica
31. Single Transfer            Parallel Transfer    Collective
    PingPing                   Multi-PingPong       Multi-Allgather
    PingPingSpecificSource     Multi-PingPing       Allgatherv, Multi-Allgatherv
                               Multi-Sendrecv       Alltoall, Multi-Alltoall
                               Multi-Exchange       Alltoallv, Multi-Alltoallv
                                                    Scatter, Multi-Scatter
                                                    Scatterv, Multi-Scatterv
                                                    Gather, Multi-Gather
                                                    Gatherv, Multi-Gatherv
                                                    Reduce, Multi-Reduce
                                                    Reduce_scatter, Multi-Reduce_scatter
                                                    Allreduce, Multi-Allreduce
                                                    Barrier, Multi-Barrier

    Single Transfer Benchmarks
    The following benchmarks belong to the single transfer class:
    • PingPong
    • PingPongSpecificSource
    • PingPing
    • PingPingSpecificSource
    See the sections below for definitions of these benchmarks.

    PingPong, PingPongSpecificSource
    Use PingPong and PingPongSpecificSource for measuring startup and throughput of a single message sent between two processes. PingPong uses the MPI_ANY_SOURCE value for the source rank, while PingPongSpecificSource uses an explicit value.

    PingPong Definition

    Property             Description
    Measured pattern     As symbolized in the figure below. This benchmark runs on two active processes (Q=2).
    MPI routines         MPI_Send, MPI_Recv
    MPI data type        MPI_BYTE
    Reported timings     time = Δt/2, in µsec, as indicated in the figure below
    Reported throughput  X/1.048576/time

    PingPong Pattern

    PROCESS 1                        PROCESS 2
    MPI_Send   --- x bytes --->      MPI_Recv
    (Δt measured here, time = Δt/2)
    MPI_Recv   <-- x bytes ----      MPI_Send
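A minimal sketch of the PingPong pattern defined above: rank 0 sends X bytes, rank 1 echoes them back, and half of the round-trip time is reported. The actual IMB code adds warm-up, many repetitions, and cache control; this fragment only illustrates the measured pattern.

    #include <mpi.h>

    double pingpong(char *buf, int x_bytes, int rank, MPI_Comm comm)
    {
        double t = MPI_Wtime();

        if (rank == 0) {
            MPI_Send(buf, x_bytes, MPI_BYTE, 1, 0, comm);
            MPI_Recv(buf, x_bytes, MPI_BYTE, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, x_bytes, MPI_BYTE, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(buf, x_bytes, MPI_BYTE, 0, 0, comm);
        }
        return (MPI_Wtime() - t) / 2.0;    /* time = Δt/2, in seconds */
    }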
32. The benchmark for the MPI_Reduce_scatter function. It reduces a vector of length L = X/sizeof(float) float items. The MPI data type is MPI_FLOAT. The MPI operation is MPI_SUM. In the scatter phase, the L items are split as evenly as possible. To be exact, for np (number of processes):

        L = r*np + s, where
        r = floor(L/np)
        s = L mod np

    In this case, the process with rank i gets (see the sketch below):
    • r+1 items when i < s
    • r items when i >= s

    Property             Description
    Measured pattern     MPI_Reduce_scatter
    MPI data type        MPI_FLOAT
    MPI operation        MPI_SUM
    Reported timings     Bare time
    Reported throughput  None

    Allreduce
    The benchmark for the MPI_Allreduce function. It reduces a vector of length L = X/sizeof(float) float items. The MPI data type is MPI_FLOAT. The MPI operation is MPI_SUM.

    Property             Description
    Measured pattern     MPI_Allreduce
    MPI data type        MPI_FLOAT
    MPI operation        MPI_SUM
    Reported timings     Bare time
    Reported throughput  None

    Allgather
    The benchmark for the MPI_Allgather function. Every process inputs X bytes and receives the gathered X*np bytes, where np is the number of processes.
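The even split used by Reduce_scatter (and Ireduce_scatter) can be computed as in the sketch below. It directly implements the rule above; it is not taken from the IMB sources.

    /* Fill recvcounts (np entries) with the number of reduced items each rank gets. */
    void split_counts(int L, int np, int *recvcounts)
    {
        int r = L / np;        /* floor(L/np) */
        int s = L % np;        /* L mod np    */

        for (int i = 0; i < np; i++)
            recvcounts[i] = (i < s) ? r + 1 : r;
    }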
33. Root                 i%num_procs in iteration i
    Reported timings     Bare time
    Reported throughput  None

    Ireduce_scatter
    The benchmark for MPI_Ireduce_scatter that measures communication and computation overlap. It reduces a vector of length L = X/sizeof(float) float items. The MPI data type is MPI_FLOAT. The MPI operation is MPI_SUM. In the scatter phase, the L items are split as evenly as possible. To be exact, for np (number of processes):

        L = r*np + s, where
        r = floor(L/np)
        s = L mod np

    In this case, the process with rank i gets:
    • r+1 items when i < s
    • r items when i >= s

    Property             Description
    Measured pattern     MPI_Ireduce_scatter, IMB_cpu_exploit, MPI_Wait
    MPI data type        MPI_FLOAT
    MPI operation        MPI_SUM
    Reported timings     t_ovrl, t_pure, t_CPU, and
                         overlap = 100 * max(0, min(1, (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)))
                         For details, see Measuring Communication and Computation Overlap.
    Reported throughput  None

    Ireduce_scatter_pure
    The benchmark for the MPI_Ireduce_scatter function that measures pure communication time. It reduces a vector of length L = X/sizeof(float) float items. The MPI data type is MPI_FLOAT. The MPI operation is MPI_SUM. In the scatter phase, the L items are split as evenly as possible. To be exact, for np (number of processes):

        L = r*np + s, where
        r = floor(L/np)
        s = L mod np

    In this case, the process with rank i gets
34. PROCESS 2: MPI_Recv, then MPI_Send (the reported PingPong time is Δt/2).

    PingPing, PingPingSpecificSource
    PingPing and PingPingSpecificSource measure startup and throughput of single messages that are obstructed by oncoming messages. To achieve this, two processes communicate with each other using MPI_Isend/MPI_Recv/MPI_Wait calls. The MPI_Isend calls are issued simultaneously by both processes. For the source rank, PingPing uses the MPI_ANY_SOURCE value, while PingPingSpecificSource uses an explicit value.

    PingPing Definition

    Property             Description
    Measured pattern     As symbolized in the figure below. This benchmark runs on two active processes (Q=2).
    MPI routines         MPI_Isend/MPI_Wait, MPI_Recv
    MPI data type        MPI_BYTE
    Reported timings     time = Δt, in µsec
    Reported throughput  X/1.048576/time

    PingPing Pattern

    PROCESS 1                       PROCESS 2
    MPI_Isend (request = r)         MPI_Isend (request = r)
          x bytes    <------->         x bytes
    MPI_Recv                        MPI_Recv
    MPI_Wait(r)                     MPI_Wait(r)

    Parallel Transfer Benchmarks
    The following benchmarks belong to the parallel transfer class:
    • Sendrecv
    • Exchange
    • Multi-PingPong
    • Multi-PingPing
    • Multi-Sendrecv
    • Multi-Exchange
    See the sections below for definitions of these benchmarks.

    NOTE: The definitions of the multiple mode benchmarks are analogous to their standard mode counterparts in the single transfer class.

    Sendrecv
    The Send
35. TE Reported timings Bare time c Reported throughput None Ibarrier The benchmark for MPI_Ibarrier that measures communication and computation overlap o Property Description lt Measured pattern MPI_Ibarrier IMB_cpu_exploit MPI_Wait Reported timings 69 Intel R MPI Benchmarks User Guide e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl min t_pure t_CPU For details see Measuring Communication and Computation Overlap Reported throughput None Ibarrier_pure The benchmark for the MPI_Ibarrier function that measures pure communication time Property Description fa EEE Measured pattern MPI_Ibarrier MPI_Wait lt A AAA Reported timings Bare time AAA Reported throughput None Ibcast The benchmark for MPI_Ibcast that measures communication and computation overlap Property Description ee Measured pattern MPI_Ibcast IMB_cpu_exploit MPI_Wait mm MPI data type A e t OVEL e t_pure e t CPU Reported timings e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl min t pure t CPU For details see Measuring Communication and Computation Overlap Reported throughput None Ibcast_pure The benchmark for MPI_Ibcast that measures pure communication time The root process broadcasts x bytes to all other processes The root of the operation is changed round robin 70 MPI 3 Benchmarks moam ____ _
36. _ _ Measured pattern a RR MPI data type MPI_BYTE o Reported timings Bare time A Reported throughput None Igather The benchmark for MPI_Igather that measures communication and computation overlap A Property Description Measured pattern MPI_Igather IMB_cpu_exploit MPI_Wait a MPI data type MPI_BYTE _ _ Root iSnum_procs in iteration i m e t ovrd e t_pure e t_CPU Reported timings e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl min t_pure t_CPU For details see Measuring Communication and Computation Overlap Reported throughput None Igather_pure The benchmark for the MPI_Igather function that measures pure communication time The root process inputs X np bytes x from each process All processes receive x bytes The root of the operation is changed round robin _ a _ Measured pattern Ba mr mmn MPI data type we 71 Intel R MPI Benchmarks User Guide m nn Root i num_procs in iteration i Reported timings Bare time Reported throughput None Igatherv The benchmark for MP1_Igatherv that measures communication and computation overlap m Property Description AAA Measured pattern MPI_Igatherv IMB_cpu_exploit MPI_Wait pl MPI data type MPI_BYTE _ _ gt __ _ _ Root iSnum_procs in iteration i ss t_ovrl e t pure E CPU Reported timings e overlap 100 max 0 min 1 t
37. Preprocessor definitions per target:

    IMB-EXT      WIN_IMB, _CRT_SECURE_NO_DEPRECATE, EXT
    IMB-IO       WIN_IMB, _CRT_SECURE_NO_DEPRECATE, MPIIO
    IMB-MPI1     WIN_IMB, _CRT_SECURE_NO_DEPRECATE, MPI1
    IMB-NBC      WIN_IMB, _CRT_SECURE_NO_DEPRECATE, NBC
    IMB-RMA      WIN_IMB, _CRT_SECURE_NO_DEPRECATE, RMA

    Linker > Input > Additional Dependencies:
    • x64:   $(I_MPI_ROOT)\intel64\lib\impi.lib
    • IA-32: $(I_MPI_ROOT)\ia32\lib\impi.lib

    8. Go to Build > Build Solution to create an executable file.
    9. Run the executable file using the Debug > Start Without Debugging command.

    Running Intel MPI Benchmarks
    To run the Intel MPI Benchmarks, use the following command-line syntax:

        mpirun -np <P> IMB-<component> [arguments]

    where:
    • <P> is the number of processes. P=1 is recommended for all I/O and message passing benchmarks except the single transfer ones.
    • <component> is the component-specific suffix that can take MPI1, EXT, IO, NBC, and RMA values.
    By default, all benchmarks run on Q active processes defined as follows: Q = 1, 2, 4, 8, ..., largest 2^x < P, P. For example, if P=11, the benchmarks run on Q = 1, 2, 4, 8, 11 active processes (see the sketch below). Single transfer IMB-IO benchmarks run with Q=1. Single transfer IMB-EXT and IMB-RMA benchmarks run with Q=2.
    To pass control arguments other than P, you can use (argc, argv). Process 0 in MPI_COMM_WORLD reads all command-line arguments and broadcasts them to all other processes. Control arguments
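The default selection of active process counts described above can be reproduced with a few lines of C. This is only a worked illustration of the rule, not IMB code.

    #include <stdio.h>

    /* Q = 1, 2, 4, 8, ..., largest power of two below P, then P itself. */
    void print_active_counts(int P)
    {
        int q;
        for (q = 1; q < P; q *= 2)
            printf("%d ", q);      /* 1, 2, 4, 8, ... largest 2^x < P */
        printf("%d\n", P);         /* finally all P processes */
    }

    /* Example: print_active_counts(11) prints "1 2 4 8 11". */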
38. _SAMPLE gt IMB_settings h or through the flag gt time 115 Intel R MPI Benchmarks User Guide Calling sequence was MPI_Datatype MPI_Op Window Unidir_Get Unidir Put Bidir Get Bidir_Put Accumulate List of Benchmarks to run master node MPI_Share_Area IMB_3 1 src IMB EXT exe Minimum message length in bytes 0 Maximum message length in bytes 4194304 MPT BYTE MPI_Datatype for reductions MPI_FLOAT MPI_SUM Benchmarking Window processes 2 bytes repetitions 0 16 32 64 1283 296 116 100 100 100 100 100 100 100 100 t min usec t_max psec t_avg usec Benchmark Methodology 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 All processes entering MPI_Finalize The above example listing shows the results of running IMB cluster using two processes 100 100 100 100 100 100 100 100 100 100 80 40 20 10 EXT exe on a Microsoft Windows The listing only shows the result for the Window benchmark The performance diagnostics for Unidir_Get Unidir_Put Bidir_Get Bidir_Put and Accumulate are omitted 117
39. _max usec t avg psec 0 1000 100 1000 1000 1000 10000 1000 100000 419 1000000 41 111 Intel R MPI Benchmarks User Guide All processes entering MPI_Finalize Sample 3 IMB 10 p_write_indv The following example shows the results of the p_write_indv benchmark lt gt IMB IO np 2 p write indy npmin 2 Date Machine System Release Version MPI Version MPI Thread Environment MPI_THREAD_SINGL Thu Sep 4 13 43 34 2008 x86_64 Linux 2 6 9 42 ELsmp 1 SMP Wed Jul 12 23 32 02 EDT 2006 2 0 Gl New default behavior from Version 3 2 on the number of iterations per message size is cut down dynamically when a certain run time per message size sample is expected to be exceeded Time limit is defined by variable SECS_PER_SAMPLE gt IMB_settings h or through the flag gt time Calling sequence was IMB IO p_write_indv Minimum io portion in Maximum io portion in List of Benchmarks to P_Write_Indv npmin 2 bytes 0 bytes 16777216 run 112 Benchmark Methodology proce sses Benchmarking P_Write_Indv 2 MOD bytes rep s t_min psec 0 1 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 E AGGR 50 50 50 50 50 50 50 50 50 50 50 50 50 5
40. _pure t_CPU t_ovrl min t pure t CPU For details see Measuring Communication and Computation Overlap Reported throughput None Igatherv_pure The benchmark for the MPI_Igatherv function that measures pure communication time All processes input x bytes The root process receives X np bytes where np is the number of processes The root of the operation is changed round robin Property aaa Fr o Measured pattern a Pr _ MPI data type oa ft 72 MPI 3 Benchmarks Reported timings Bare time Reported throughput None Ireduce The benchmark for MPI_Ireduce that measures communication and computation overlap Property esa m Measured pattern Bee ER m MPI data type a FFF ee Root m i Snum_procs in iteration i e t_ovrl e t_pure e CPU Reported timings e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl min t_pure t_CPU For details see Measuring Communication and Computation Overlap Reported throughput None Ireduce_pure The benchmark for the MPI_Ireduce function that measures pure communication time It reduces a vector of length L X sizeof float float items The MPI data type is MPI_FLOAT The MPI operation is MPI_SUM The root of the operation is changed round robin Property Description em Measured pattern MPI_Ireduce MPI_Wait O MPI data type MPI_FLOAT a MPI operation MPI_SUM _ _ _ _ _
41. simultaneous access to the memory of a process by all other processes, different ranks choose different targets at each particular step. For example, while looping through all the possible target ranks, the next target at step i is chosen as follows:

        target_rank = (current_rank + i) % num_ranks

    Property             Description
    Measured pattern     N * MPI_Get, MPI_Win_flush_all, where N is the number of target processes
    MPI data type        MPI_BYTE (origin and target)
    Reported timings     Bare time
    Reported throughput  None

    All_put_all
    The benchmark tests the scenario when all processes communicate with each other using the MPI_Put operation. To avoid congestion due to simultaneous access to the memory of a process by all other processes, different ranks choose different targets at each particular step. For example, while looping through all the possible target ranks, the next target at step i is chosen as follows (see the sketch below):

        target_rank = (current_rank + i) % num_ranks

    Property             Description
    Measured pattern     N * MPI_Put, MPI_Win_flush_all, where N is the number of target processes
    MPI data type        MPI_BYTE (origin and target)
    Reported timings     Bare time
    Reported throughput  None

    Bidir_get
    This benchmark measures the bidirectional MPI_Get operation in passive target communication mode. The benchmark runs on two active processes. These processes initiate an access epoch to each other using the MPI_Win_lock function, get data from the target, and close the access epoch
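A possible reading of the congestion-avoiding target selection for the All_put_all / All_get_all style benchmarks is sketched below. The offset formula and names are plausible assumptions based on the description above, not a copy of the IMB sources.

    #include <mpi.h>

    void put_to_all(char *origin, MPI_Win win, int my_rank, int num_ranks, int msg_size)
    {
        for (int i = 1; i < num_ranks; i++) {              /* skip i = 0 (self) */
            int target_rank = (my_rank + i) % num_ranks;   /* ranks fan out to different targets */
            MPI_Put(origin, msg_size, MPI_BYTE,
                    target_rank, 0, msg_size, MPI_BYTE, win);
        }
        MPI_Win_flush_all(win);                            /* complete all transfers */
    }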
42. transfer sizes X, an upper bound is set to OVERALL_VOL/X. The OVERALL_VOL value is defined in IMB_settings.h / IMB_settings_io.h with 4 MB and 16 MB values, respectively.
    Given a transfer size X, the repetition count for all aggregate benchmarks is defined as follows (see the sketch below):

        n_sample = MSGSPERSAMPLE                                (X = 0)
        n_sample = max(1, min(MSGSPERSAMPLE, OVERALL_VOL/X))    (X > 0)

    The repetition count for non-aggregate benchmarks is defined completely analogously, with MSGSPERSAMPLE replaced by MSGS_NONAGGR. A reduced count is recommended, as non-aggregate run times are usually much longer.
    In the following examples, an elementary transfer means a pure function (MPI_Send, MPI_Put, MPI_Get, MPI_Accumulate, MPI_File_write_XX, MPI_File_read_XX) without any further function call. Assured completion of transfers is:
    • MPI_Win_fence for IMB-EXT benchmarks
    • a triplet MPI_File_sync / MPI_Barrier(file_communicator) / MPI_File_sync for IMB-IO Write benchmarks
    • MPI_Win_flush, MPI_Win_flush_all, MPI_Win_flush_local, or MPI_Win_flush_local_all for IMB-RMA benchmarks
    • empty for all other benchmarks

    MPI-1 Benchmarks

        for (i = 0; i < N_BARR; i++) MPI_Barrier(MY_COMM);
        time = MPI_Wtime();
        for (i = 0; i < n_sample; i++)
            execute MPI pattern;
        time = (MPI_Wtime() - time) / n_sample;

    IMB-EXT and Blocking I/O Benchmarks
    For aggregate benchmarks, the kernel loop looks as follows:

        for (i = 0; i <
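The repetition-count rule quoted above translates directly into a small helper. MSGSPERSAMPLE and OVERALL_VOL stand for the values from IMB_settings.h / IMB_settings_io.h; the function itself is only an illustration.

    long n_sample_for(long x_bytes, long msgspersample, long overall_vol)
    {
        if (x_bytes == 0)
            return msgspersample;

        long bound = overall_vol / x_bytes;          /* keep the total volume bounded   */
        long n = msgspersample < bound ? msgspersample : bound;
        return n > 1 ? n : 1;                        /* max(1, min(MSGSPERSAMPLE, ...)) */
    }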
43. arations for Benchmarking 44244s4rHrHnenenennnnen nennen nenn nn nn nn nn en nennen 99 Message l O Buffer Lengths 0c nennen nn en nennen nennen nennen nn nnnnnn 101 Buffer PMItialiZAtiOnss u ee ar ee an ar a BIER La Tr rad 101 Warm Up Phase IMB MPI1 IMB EXT IMB NBC and IMB RMA ccccceeeeeeeeeeneenees 101 Synchronization an Ati ls Hi Bun Re Dr ee eee 101 Actual Benchmarking tddi are ee A ah EE 102 CHECKING ReSulES csi Hunt ai tease Enri illa ras 104 O a EEE SED Rede ee 104 Sample 1 IMB MPI1 PingPong Allreduce 4244444nsR seen nennen ennnnn nennen nenn nn 105 Sample 2 IMB MPI1 PingPing Allreduce 224244444n sn Henn nenn nennen nenn nn nnnnn nn 108 Sample 3 IMB 10 p write indV nassen nennen sn nen 112 Sample 4 IMB EXT Xi as 115 Intel R MPI Benchmarks User Guide Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS NO LICENSE EXPRESS OR IMPLIED BY ESTOPPEL OR OTHERWISE TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT EXCEPT AS PROVIDED IN INTEL S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE MERCHANTABILITY OR INFRINGEMENT OF ANY PATENT COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT A Mission Cri
44. Bare time
    Reported throughput  MBps

    Put_local
    This benchmark measures the combination of the MPI_Put and MPI_Win_flush_local operations in passive target communication mode. The benchmark runs on two active processes. The target process is waiting in the MPI_Barrier call.

    Property             Description
    Measured pattern     MPI_Put, MPI_Win_flush_local
    MPI data type        MPI_BYTE (origin and target)
    Reported timings     Bare time
    Reported throughput  MBps

    Truly_passive_put
    This benchmark verifies whether the MPI implementation supports the truly one-sided communication mode. In this mode, the origin process can complete its access epoch even if the target process is outside the MPI stack. The Truly_passive_put benchmark returns two timing values (see the sketch below):
    • The time needed for the origin process to complete the MPI_Put operation while the target process is waiting in the MPI stack in the MPI_Barrier call.
    • The time needed for the origin process to complete the MPI_Put operation while the target process performs computations outside the MPI stack before the MPI_Barrier call.
    To ensure measurement correctness, the time spent by the target process in the computation function should be comparable to the time needed for successful completion of the MPI_Put operation by the origin process.

    Property             Description
    Measured pattern     MPI_Put, MPI_Win_flush, while the target process performs computations before the MPI_Barrier call
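A minimal sketch of the Truly_passive_put case in which the target computes outside the MPI stack before entering MPI_Barrier. It illustrates the description above only; compute_for is a hypothetical stand-in for the target-side computation, and window creation, locking, buffer setup, and the companion run with the target waiting directly in the barrier are omitted.

    #include <mpi.h>

    /* Hypothetical stand-in: burn roughly t_comp seconds of CPU time on the target. */
    void compute_for(double t_comp);

    double truly_passive_put(char *buf, int size, double t_comp,
                             int rank, MPI_Win win, MPI_Comm comm)
    {
        double t = 0.0;

        if (rank == 0) {                      /* origin */
            t = MPI_Wtime();
            MPI_Put(buf, size, MPI_BYTE, 1, 0, size, MPI_BYTE, win);
            MPI_Win_flush(1, win);            /* complete the transfer at the origin */
            t = MPI_Wtime() - t;
        } else if (rank == 1) {               /* target: busy outside the MPI stack */
            compute_for(t_comp);
        }
        MPI_Barrier(comm);
        return t;                             /* meaningful on rank 0 only */
    }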
avors have a Write and a Read component. The [ACTION] placeholder denotes a Read or a Write alternatively. The Write flavors of benchmarks include a file synchronization, with different placements for the aggregate and non-aggregate modes.

Figure: I/O Benchmarks, Aggregation for Output. (The figure shows an M-fold aggregation of elementary I/O actions to disjoint file sections on output, followed by MPI_File_sync; the reported time is t = Δt(M), with M = 1 in non-aggregate mode and M = n_sample in aggregate mode. Input is performed with no aggregation.)

S_[ACTION]_indv
File I/O performed by a single process. This pattern mimics the typical case when a particular master process performs all of the I/O. See the basic definitions and a schematic view of the pattern below.

S_[ACTION]_indv Definition

Property                                Description
Measured pattern                        As symbolized in the figure I/O benchmarks aggregation for output
Elementary I/O action                   As symbolized in the figure below
MPI routines for the blocking mode      MPI_File_write, MPI_File_read
MPI routines for the nonblocking mode   MPI_File_iwrite, MPI_File_iread
etype                                   MPI_BYTE
File type                               MPI_BYTE
MPI data type                           MPI_BYTE
Reported timings                        t (in µsec) as indicated in the f
called with initialize=0, concurrently with the particular I/O action, and always performs the same type and number of operations as in the initialization step.

Displaying Results
Three timings are crucial to interpret the behavior of nonblocking I/O overlapped with CPU exploitation:
- t_pure is the time for the corresponding pure blocking I/O action, non-overlapping with CPU activity
- t_CPU is the time the IMB_cpu_exploit periods, running concurrently with nonblocking I/O, would use when running dedicated
- t_ovrl is the time for the analogous nonblocking I/O action concurrent with CPU activity, exploiting t_CPU when running dedicated

A perfect overlap means t_ovrl = max(t_pure, t_CPU). No overlap means t_ovrl = t_pure + t_CPU. The actual amount of overlap is:

overlap = (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)

The Intel MPI Benchmarks result tables report the timings t_ovrl, t_pure, t_CPU and the estimated overlap obtained by the formula above. At the beginning of a run, the Mflop/s rate corresponding to t_CPU is displayed.

MPI-3 Benchmarks

Intel MPI Benchmarks provides two sets of benchmarks conforming to the MPI-3 standard:
- IMB-NBC: benchmarks for nonblocking collective (NBC) operations
- IMB-RMA: one-sided communications benchmarks that measure the Remote Memory Access (RMA) functionality introduced in the MPI-3 standard

See Also
IMB-NBC Benchmarks
IMB-RMA Benchmarks

IMB-NBC
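For illustration, the overlap estimate can be computed directly from the three reported timings; the sketch below also clamps the value to the [0, 1] range and scales it to percent, matching the percentage formula used elsewhere in this guide (a helper sketch, not library code):

    #include <math.h>

    /* Estimated overlap (in percent) of communication/I/O and computation. */
    double overlap_percent(double t_pure, double t_cpu, double t_ovrl)
    {
        double gain = (t_pure + t_cpu - t_ovrl) / fmin(t_pure, t_cpu);
        if (gain < 0.0) gain = 0.0;   /* no measurable overlap */
        if (gain > 1.0) gain = 1.0;   /* perfect overlap */
        return 100.0 * gain;
    }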
common file. See the basic definitions and a schematic view of the pattern below.

P_[ACTION]_indv Definition

Property                                Description
Measured pattern                        As symbolized in the figure I/O benchmarks aggregation for output
Elementary I/O action                   As symbolized in the figure below. In this figure, Nproc is the number of processes
MPI routines for the blocking mode      MPI_File_write, MPI_File_read
MPI routines for the nonblocking mode   MPI_File_iwrite, MPI_File_iread
etype                                   MPI_BYTE
File type                               Tiled view, disjoint contiguous blocks
MPI data type                           MPI_BYTE
Reported timings                        t (in µsec) as indicated in the figure I/O benchmarks aggregation for output, aggregate and non-aggregate for the Write flavor
Reported throughput                     x/t, aggregate and non-aggregate for the Write flavor

P_[ACTION]_indv Pattern
(Figure: each process issues MPI_File_[ACTION] on X/Nproc bytes of a common tiled file, in disjoint contiguous blocks.)

P_[ACTION]_expl
P_[ACTION]_expl follows the same access pattern as P_[ACTION]_indv, with an explicit file pointer type. See the basic definitions and a schematic view of the pattern below.

P_[ACTION]_expl Definition
Prop
- r+1 items when i < s
- r items when i >= s

Property               Description
Measured pattern       MPI_Ireduce_scatter, MPI_Wait
MPI data type          MPI_FLOAT
MPI operation          MPI_SUM
Reported timings       Bare time
Reported throughput    None

Iscatter
The benchmark for MPI_Iscatter that measures communication and computation overlap.

Property               Description
Measured pattern       MPI_Iscatter, IMB_cpu_exploit, MPI_Wait
MPI data type          MPI_BYTE
Root                   i % num_procs in iteration i
Reported timings       t_ovrl, t_pure, t_CPU, and overlap = 100 * max(0, min(1, (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU))). For details, see Measuring Communication and Computation Overlap.
Reported throughput    None

Iscatter_pure
The benchmark for the MPI_Iscatter function that measures pure communication time. The root process inputs X*np bytes (X for each process). All processes receive X bytes. The root of the operation is changed round-robin.

Property               Description
Measured pattern       MPI_Iscatter, MPI_Wait
MPI data type          MPI_BYTE
Root                   i % num_procs in iteration i
Reported timings       Bare time
Reported throughput    None

Iscatterv
The benchmark for MPI_Iscatterv that measures communication and computation overlap.

Property               Description
Measured pattern       MPI_Iscatterv, IMB_cpu_exploit, MPI_Wait
MPI data ty
ence of blank-separated strings. Each argument is the name of a benchmark in exact spelling (case-insensitive). For example, the string IMB-MPI1 PingPong Allreduce specifies that you want to run the PingPong and Allreduce benchmarks only.

Default: no benchmark selection. All benchmarks of the selected component are run.

-npmin Option
Specifies the minimum number of processes P_min to run all selected benchmarks on. The P_min value after -npmin must be an integer. Given P_min, the benchmarks run on the process numbers selected as follows:

P_min, 2*P_min, 4*P_min, ..., largest 2^x*P_min < P, P

NOTE: You may set P_min to 1. If you set P_min > P, Intel MPI Benchmarks interprets this value as P_min = P.

Default: no -npmin selection. Active processes are selected as described in the Running Intel MPI Benchmarks section.

-multi <outflag> Option
Defines whether the benchmark runs in the multiple mode. The argument after -multi is a meta-symbol <outflag> that can take an integer value of 0 or 1. This flag controls the way of displaying results:
- outflag = 0: only display maximum timings (minimum throughputs) over all active groups
- outflag = 1: report on all groups separately (the report may be long in this case)

When the number of processes running the benchmark is more than half of the overall number in MPI_COMM_WORLD, the multiple benchmark coincides with the non-multiple one, as not more
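A small sketch of how the set of active process counts follows from P_min and the total number of processes P (an illustration of the selection rule above, not the benchmark's internal code):

    #include <stdio.h>

    /* List the process numbers the benchmarks run on: P_min, 2*P_min,
       4*P_min, ..., the largest 2^x*P_min smaller than P, and P itself. */
    void list_active_process_counts(int p_min, int p)
    {
        int n;
        if (p_min > p) p_min = p;      /* -npmin larger than P is treated as P */
        for (n = p_min; n < p; n *= 2)
            printf("%d ", n);
        printf("%d\n", p);             /* the full process count is always included */
    }

For example, with p_min = 4 and p = 30, this prints 4 8 16 30.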
50. entations Introducing Intel R MPI Benchmarks Intel MPI Benchmarks performs a set of MPI performance measurements for point to point and global communication operations for a range of message sizes The generated benchmark data fully characterizes e performance of a cluster system including node performance network latency and throughput e efficiency of the MPI implementation used The Intel MPI Benchmarks package consists of the following components e MB MPI1 benchmarks for MPI 1 functions e Two components for MPI 2 functionality e MB EXT one sided communications benchmarks e IMB IO input output 1 0 benchmarks e Two components for MPI 3 functionality e MB NBC benchmarks for nonblocking collective NBC operations e MB RMA one sided communications benchmarks These benchmarks measure the Remote Memory Access RMA functionality introduced in the MPI 3 standard Each component constitutes a separate executable file You can run all of the supported benchmarks or specify a single executable file in the command line to get results for a specific subset of benchmarks If you do not have the MPI 2 or MPI 3 extensions available you can install and use IMB MPI1 that uses only standard MPI 1 functions Intel R MPI Benchmarks User Guide What s New This section provides changes for the Intel MPI Benchmarks as compared to the previous versions of this product Changes in Intel MPI Benchmarks 4
ents.

Examples
To define MSGSPERSAMPLE as 2000 and OVERALL_VOL as 100, use the following command line:

-iter 2000,100

To define MSGS_NONAGGR as 150, you need to define values for MSGSPERSAMPLE and OVERALL_VOL, as shown in the following command line:

-iter 1000,40,150

To define MSGSPERSAMPLE as 2000 and set the multiple_np policy, use the following command line (see -iter_policy):

-iter 2000,multiple_np

-iter_policy Option
Use this option to set a policy for automatic calculation of the number of iterations. Use one of the following arguments to override the default ITER_POLICY value defined in IMB_settings.h:

Policy         Description
dynamic        Reduces the number of iterations when the maximum run time per sample (see -time) is expected to be reached. Using this policy ensures faster execution, but may lead to inaccuracy of the results.
multiple_np    Reduces the number of iterations when the message size is growing. Using this policy ensures the accuracy of the results, but may lead to longer execution time. You can control the execution time through the -time option.
auto           Automatically chooses which policy to use:
               - applies multiple_np to collective operations where one of the ranks acts as the root of the operation (for example, MPI_Bcast)
               - applies dynamic to all other types of operations
off            The number of
52. ents can define various features such as time measurement message length and selection of communicators For details see Command Line Control See Also Command Line Control Parameters Controlling Intel MPI Benchmarks Running Benchmarks in Multiple Mode Intel MPI Benchmarks provides a set of elementary MPI benchmarks You can run all benchmarks in the following modes e standard default the benchmarks run in a single process group e multiple the benchmarks run in several process groups 22 Installation and Quick Start To run the benchmarks in the multiple mode add the multi prefix to the benchmark name In the multiple mode the number of groups may differ depending on the benchmark For example if PingPong is running on N gt 4 processes N 2 separate groups of two processes are formed These process groups are running PingPong simultaneously Thus the benchmarks of the single transfer class behave as parallel transfer benchmarks when run in the multiple mode See Also Classification of MPI 1 Benchmarks Classification of MPI 2 Benchmarks MPI 3 Benchmarks 23 MPI 7 Benchmarks IMB MPI1 component of the Intel MPI Benchmarks provides benchmarks for MPI 1 functions IMB MPI1 contains the following benchmarks Standard Mode Multiple Mode I YS ST PingPon Multi PingPon g g g g ooo Multi PingPongSpecificSource PingPongSpecificSource excluded by default m
53. ependent optimizations in this product are intended for use with Intel microprocessors Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice Notice revision 20110804 Getting Help and Support Your feedback is very important to us To receive technical support for the tools provided in this product and technical information including FAQ s and product updates you need to register for an Intel R Premier Support account at the Registration Center This package is supported by Intel R Premier Support Direct customer support requests at https premier intel com General information on Intel R product support offerings may be obtained at http www intel com software products support The Intel R MPI Benchmarks home page can be found at http www intel com go imb When submitting a support issue to Intel R Premier Support please provide specific details of your problem including e The Intel R MPI Benchmarks package name and version information e Host architecture for example Intel R 64 architecture e Compiler s and versions e Operating system s and versions e Specifics on how to reproduce the problem Include makefiles command lines small test cases and build instructions Submitting Issue
54. erty Description As symbolized in figure I O benchmarks Measured pattern aggregation for output nn nn As symbolized in the figure below In this Elementary 1 0 action figure Nproc is the number of processes mSS MPI routines for the blocking mode PI_File_write_at MPI_File_read_at a _ gt _ MPI routines for the nonblocking mode MPI_File_iwrite_at MPI_File_iread_at _ __ __ _ _ _ _ _ ___ _ etype MP1_BYTE Bon File type PI_BYTE BEE MPI data type PI_BYTE 55 Intel R MPI Benchmarks User Guide t in psec as indicated in the figure I O benchmarks aggregation for output aggregate and non aggregate for the Write Reported timings flavor x t aggregate and non aggregate for the Reported throughput Write flavor P_ ACTION _expl Pattern sassen PR 1 1 PR PR 44 MPI File ACTION at J ri RE APEN common file disjoint contiguous blocks P_ ACTION _shared Concurrent access to a common file by all participating processes with a shared file pointer See the basic definitions and a schematic view of the pattern below P_ ACTION _shared Definition Property Description As symbolized in figure O benchmarks Measured pattern aggregation for output S SIO As symbolized in the figure below In this figure Nproc is the number of processes E
55. etion of the origin process is ensured by the MPI_Win_flush_local_all operation Since local completion of the MPI_Get operation is semantically equivalent to a regular completion the benchmark flow is very similar to the One_get_all benchmark NOTE This benchmark is not enabled in IMB RMA by default Specify the benchmark name in the command line or use the include command line parameter to run this benchmark Property Description N MPI_Get MPI_Win_flush_local_all where N is the number of target processes Measured pattern MPI data type MPI_BYTE origin and target Reported timings Bare time 83 Intel R MPI Benchmarks User Guide Reported throughput MBps Get_local This benchmark measures the combination of MPI_Get and MPI1_Win_flush_al1 operations in passive target communication mode The benchmark runs on two active processes The target process is waiting in the MPI_Barrier call Since local completion of the MPI_Get operation at the origin side is semantically equivalent to a regular completion performance results are expected to be very close to the Unidir_Get benchmark results NOTE This benchmark is not enabled in IMB RMA by default Specify the benchmark name in the command line or use the include command line parameter to run this benchmark Property Description Measured pattern MPI_Get MPI_Win_flush_local MPI data type MPI_BYTE origin and target Reported timings Bare time Reported
#ifdef EXT
/* Set info for all MPI_Win_create calls */
opt_info = MPI_INFO_NULL;
#endif

The Intel MPI Benchmarks make no assumptions about, and impose no restrictions on, how this routine is implemented.

View (IMB-IO)
The file view is determined by the following settings:
- disp = 0
- datarep = native
- etype, filetype as defined in the benchmark definitions above
- info as defined in the Info section above

Message/I-O Buffer Lengths
IMB-MPI1, IMB-EXT: set in IMB_settings.h and used unless the -msglen flag is selected.
IMB-IO: set in IMB_settings_io.h and used unless the -msglen flag is selected.

Buffer Initialization
Communication and I/O buffers are dynamically allocated as void* and used as MPI_BYTE buffers for all benchmarks except Accumulate (see Memory Requirements). To assign the buffer contents, a cast to an assignment type is performed. This facilitates result checking, which may become necessary. Besides, a sensible data type is mandatory for Accumulate.

Intel MPI Benchmarks sets the buffer assignment type assign_type in IMB_settings.h / IMB_settings_io.h. Currently, int is used for IMB-IO and float for IMB-EXT. The values are set by a CPP macro as follows.

For IMB-EXT benchmarks:

#define BUF_VALUE(rank,i) (0.1*((rank)+1)+(float)(i))

For IMB-IO benchmarks:

#define BUF_VALUE(rank,i) 10000000*(1+rank)+i%10000000

In every initialization, com
57. fies their properties Benchmark Type Aggregated Mode Unidir_put Single Transfer Supported Unidir_get Single Transfer Supported Bidir_put Single Transfer Supported Bidir_get Single Transfer Supported One_put_all Multiple Transfer N A One_get_all Multiple Transfer N A All_put_all Parallel Transfer N A All_get_all Parallel Transfer N A Put_local Single Transfer Supported Put_all_local Multiple Transfer N A Exchange_put Parallel Transfer N A 78 MPI 3 Benchmarks Exchange_get Parallel Transfer N A Accumulate Single Transfer Supported Get_accumulate Single Transfer Supported Fetch_and_op Single Transfer Supported Compare_and_swap Single Transfer Supported Truly_passive_put Single Transfer N A Get_local Single Transfer Supported Get_all_local Multiple Transfer N A The output format differs from the regular Single Transfer output For details see Truly_passive put Accumulate This benchmark measures the MPI_Accumulate operation in passive target communication mode The benchmark runs on two active processes The target process is waiting in the MPI Barrier call Property Description Measured pattern MPI_Accumulate MPI_Win_flush MPI data type MPI_FLOAT origin and target MPI operation MPI_SUM Reported timings Bare time Reported throughput MBps All_get_all The benchmark tests the scenario when all processes communicate with each other using the MPI_Get operation To avoid congestion due to simult
58. ge_Put This benchmark tests the scenario when each process exchanges data with its left and right neighbor processes using the MPI_Put operation MA Property Description EE E SS Measured pattern 2 MPI_Put 2 MPI_Win_flush m MPI data type MPI_BYTE origin and target m Reported timings Bare time None Reported throughput Fetch_and_op This benchmark measures the MPI_Fetch_and_op operation in passive target communication mode The benchmark runs on two active processes The target process is waiting in the MPI_Barrier call mm Property Description fj Measured pattern MPI_Fetch_and_op MPI_Win_flush gt MPI data type MPI_FLOAT origin and target _ MPI operation MPI_SUM mM Reported timings Bare time 82 MPI 3 Benchmarks Reported throughput MBps Get_accumulate This benchmark measures the MPI_Get_Accumulate operation in passive target communication mode The benchmark runs on two active processes The target process is waiting in the MPI_Barrier Call Property Description Measured pattern MPI_Get_Accumulate MPI_Win_flush MPI data type MPI_FLOAT origin and target MPI operation MPI_SUM Reported timings Bare time Reported throughput MBps Get_all_local This benchmark tests the MPI_Get operation where one active process obtains data from all other processes All target processes are waiting in the MPI_Barrier call while the active process performs the transfers The compl
59. gle transfer IMB EXT benchmarks only run with two active processes e Single transfer IMB IO benchmarks only run with one active process Parallel Transfer Benchmarks This class contains benchmarks of functions that operate on several processes in parallel The benchmark timings are produced under a global load The number of participating processes is arbitrary In the Parallel Transfer more than one process participates in the overall pattern 41 Intel R MPI Benchmarks User Guide The final time is measured as the maximum of timings for all single processes The throughput is related to that time and the overall amount of transferred data sum over all processes Collective Benchmarks This class contains benchmarks of functions that are collective as provided by the MPI standard The final time is measured as the maximum of timings for all single processes The throughput is not calculated MPI 2 Benchmarks Classification Unidir_Get Multi_Unidir_Get Accumulate Unidir_Put idir Put idir_Put S_ I Write_indv P_ I Write_indv C_ I Write_ S_ I Write_indv P_ I Write_indv S_ I Read_indv P_ I Read_indv Multi P_ I Write_expl S_ I Write_expl P_ I Read_expl S_ I Read_expl Multi P_ I Write_shared 42 C_ I Write_ Read_expl ule Unidir Put Multi_Accumulate indv indv
60. he P_ ACTION _expl benchmark 59 Intel R MPI Benchmarks User Guide See Also P ACTTON _expl C_ ACTION _shared The benchmark of a collective access from all processes to a common file with a shared file pointer This benchmark is based on the following MPI routines e MPI_File_read_ordered MPI_File_write_ordered for the blocking mode e MPI_File_ _ordered_begin MPI_File_ _ordered_end for the nonblocking mode All other parameters and the measuring method are the same as for the P_ ACTION _shared benchmark See Also P_ ACTION _shared Open_Close The benchmark for the MPI_File_open MPI_File_close functions All processes open the same file To avoid MPI implementation optimizations for an unused file a negligible non trivial action is performed with the file See the basic definitions of the benchmark below Open_Close Definition Property Description Measured pattern MPI_File_open MPI_File_close etype MPI_BYTE File type MPI_BYTE t At in usec as indicated in the figure Reported timings below Reported throughput None Open_Close Pattern 60 MPI 2 Benchmarks all active processes MPI File open MPI File write 1 byte File MPI File close IMB 10 Non blocking Benchmarks Intel MPI Benchmarks implements blocking and nonblocking modes of the IMB IO benchmarks as different benchmark flavors The Read and Write components of the blocking benchmark name a
iba  86
Unidir_get  86
Unidir_put  87
Benchmark Methodology  88
Control Flow  88
Command-line Control  89
Benchmark Selection Arguments  90
-npmin Option  90
-multi Option  91
-off_cache cache_size[,cache_line_size] Option  91
-iter Option  92
-iter_policy Option  92
-time Option  93
-mem Option  93
-input <File> Option  93
-msglen <File> Option  94
-map PxQ Option  94
-include benchmark1 [benchmark2 ...]  95
-exclude benchmark1 [benchmark2 ...]  95
-msglog <minlog>:<maxlog>  95
-thread_level Option  96
Parameters Controlling Intel MPI Benchmarks  96
Hard-Coded Settings  99
Communicators, Active Processes  99
Other Prep
62. igure I O benchmarks aggregation for output aggregate and non aggregate for the Write flavor Reported timings x t aggregate and non aggregate for the Reported throughput Write flavor S_ ACTION _indv Pattern PROCESS 1 PROCESS2 N MPI File ACTION No I O action x bytes rhe S_ ACTION _expl This benchmark mimics the same situation as S_ ACTION _indv with a different strategy to access files See the basic definitions and a schematic view of the pattern below S_ ACTION _expl Definition Property Description As symbolized in figure O benchmarks Measured pattern aggregation for output 52 MPI 2 Benchmarks SS Elementary I O action As symbolized in the figure below EE ooo MPI routines for the blocking mode PI File write at MPI File e read at cs MPI routines for the nonblocking mode MPI_File_iwrite_at MPI_File_iread_at es etype MPI_BYTE Te File type PI_BYTE p MPI data type MPI_BYTE t in usec as indicated in the figure I O benchmarks aggregation for output aggregate and non aggregate for the write flavor Reported timings x t aggregate and non aggregate for the Reported throughput Write flavor S_ ACTION _expl pattern PROCESS 1 PROCESS 2 N MPI File ACTION at Nol O action X bytes FILE P_ ACTION _indv This pattern accesses the file in a concurrent manner All participating processes access a
63. ion means the following e For IMB EXT benchmarks MPI_Win_fence e For IMB IO Write benchmarks a triplet MPI_File_sync MPI_Barrier file_communicator MPI_File_sync This fixes the non sufficient definition in the Intel MPI Benchmarks 3 0 IMB EXT Benchmarks This section provides definitions of IMB EXT benchmarks The benchmarks can run with varying transfer sizes x in bytes The timings are averaged over multiple samples See the Benchmark Methodology section for details In the definitions below a single sample with a fixed transfer size X is used The Unidir and Bidir benchmarks are exact equivalents of the message passing PingPong and PingPing respectively Their interpretation and output are analogous to their message passing equivalents Unidir_Put This is the benchmark for the MPI_Put function The following table and figure provide the basic definitions and a schematic view of the pattern Unidir_Put Definition Property Description 44 MPI 2 Benchmarks Measured pattern MPI routine MPI data type Reported timings Reported throughput Unidir_Put Pattern PROCESS 1 As symbolized between in the figure below This benchmark runs on two active processes Q 2 MPI Put MPI_BYTE origin and target t t M in usec as indicated in the figure below non aggregate M 1 and aggregate M n_sample For details see Actual Benchmarking x t aggregate and non aggregate Mfold
64. lallgather The benchmark for MPI_Iallgather that measures communication and computation overlap Property Description Measured pattern MPI_Iallgather IMB_cpu_exploit MPI_Wait MPI data type MPI_BYTE e E ov l e t pure e ECPU Reported timings e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl min t_pure t_CPU For details see Measuring Communication and Computation Overlap Reported throughput None 65 Intel R MPI Benchmarks User Guide lallgather_pure The benchmark for the MPI_Iallgather function that measures pure communication time Every process inputs x bytes and receives the gathered X np bytes where np is the number of processes Property Description Measured pattern MPI_Iallgather MPI_Wait MPI data type MPI_BYTE Reported timings Bare time Reported throughput None lallgatherv The benchmark for MPI_Iallgatherv that measures communication and computation overlap Property Description Measured pattern MPI_Iallgatherv IMB_cpu_exploit MPI_Wait MPI data type MPI_BYTE e tovil e t pure e E CPU Reported timings e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl min t_pure t_CPU For details see Measuring Communication and Computation Overlap Reported throughput None lallgatherv_pure The benchmark for the MPI_Iallgatherv function that measures pure communication time Every process inputs x bytes and receives the gathered X np bytes where np is the number of pr
lated by the following formula:

overlap = 100 * max(0, min(1, (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)))

See Also
IMB-NBC Benchmarks
Measuring Pure Communication Time

Measuring Pure Communication Time
To measure pure execution time of nonblocking collective operations, use the _pure flavor of the IMB-NBC benchmarks. The benchmark methodology is consistent with the one used for regular collective operations:
- Each rank performs the predefined amount of iterations and calculates the mean value.
- The basic MPI data type for all messages is MPI_BYTE for pure data movement functions and MPI_FLOAT for reductions.
- If the operation requires the root process to be specified, the root process is selected round-robin through iterations.

These benchmarks are not included into the default list of IMB-NBC benchmarks. To run a benchmark, specify the particular benchmark name or use the -include command-line parameter. For example:

mpirun -np 2 IMB-NBC Ialltoall_pure
mpirun -np 2 IMB-NBC -include Iallgather_pure Ialltoall_pure

Displaying Results
Pure nonblocking collective benchmarks show bare timing values. Since execution time may vary for different ranks, three timing values are shown: the maximum, minimum, and average time among all the ranks participating in the benchmark measurements.

See Also
IMB-NBC Benchmarks
Measuring Communication and Computation Overlap

Command Line Control
66. le 93 Intel R MPI Benchmarks User Guide Every line must be a comment beginning with or it must contain exactly one IMB benchmark name Window Unidir_Get Unidir_Put Bidir Get Bidir Put Accumulate With the help of this file the following command runs only Unidir_Get and Accumulate benchmarks of the IMB EXT component mpirun IMB EXT input IMB_SELECT_EXT msglen lt File gt Option Enter any set of non negative message lengths to an ASCII file line by line and call the Intel MPI Benchmarks with arguments msglen Lengths The Lengths value overrides the default message lengths For IMB IO the file defines the I O portion lengths map PxQ Option Numbers processes along rows of the matrix 0 P Q 2 P Q 1 P 1 P 1 2P 1 Q 1 P 1 QP 1 For example to run Multi PingPongbetween two nodes of size P with each process on one node communicating with its counterpart on the other call mpirun np lt 2P gt IMB MPIl map lt P gt x2 PingPong 94 Benchmark Methodology include benchmark1 benchmarkz Specifies the list of additional benchmarks to run For example to add PingPongSpecificSource and PingPingSpecificSource benchmarks call mpirun np 2 IMB MPI1 include PingPongSpecificSource PingPingSpecificSource exclude benchmark1 benchmark2 Specifies the list of benchmarks to be exclude from the run For example to exclude Allt
67. lementary I O action 56 MPI 2 Benchmarks nn MPI routines for the blocking mode PI_File_write_at MPI_File_read_at _ oem MPI routines for the nonblocking mode PI_File_iwrite_at MPI_File_iread_at O eee etype PI_BYTE nm File type MPI_BYTE AAA tee a MPI data type PI_BYTE t in usec as indicated in the figure I O benchmarks aggregation for output aggregate and non aggregate for the Write flavor Reported timings x t aggregate and non aggregate for the Reported throughput Write flavor P_ ACTION _shared Pattern MPI File ACTION shared X Nproc bytes 7 some order of blocks random shared file disjoint contiguous blocks 57 Intel R MPI Benchmarks User Guide P_ ACTION _ priv This pattern tests the case when all participating processes perform concurrent I O to different private files This benchmark is particularly useful for the systems that allow completely independent I O operations from different processes The benchmark pattern is expected to show parallel scaling and obtain optimum results See the basic definitions and a schematic view of the pattern below P_ ACTION _ priv Definition Property Measured pattern Elementary I O action MPI routines for the blocking mode MPI routines for the nonblocking mode etype File type MPI data type Reported timings Reported throughput P_ ACTION priv Patte
< N_BARR; i++) MPI_Barrier(MY_COMM);
/* Negligible integer (offset) calculations */
time = MPI_Wtime();
for (i = 0; i < n_sample; i++)
    execute elementary transfer;
assure completion of all transfers;
time = (MPI_Wtime() - time) / n_sample;

For non-aggregate benchmarks, every single transfer is safely completed:

for (i = 0; i < N_BARR; i++) MPI_Barrier(MY_COMM);
/* Negligible integer (offset) calculations */
time = MPI_Wtime();
for (i = 0; i < n_sample; i++)
{
    execute elementary transfer;
    assure completion of transfer;
}
time = (MPI_Wtime() - time) / n_sample;

Non-blocking I/O Benchmarks
A nonblocking benchmark has to provide three timings:
- t_pure: blocking pure I/O time
- t_ovrl: nonblocking I/O time concurrent with CPU activity
- t_CPU: pure CPU activity time

The actual benchmark consists of the following stages:
- Calling the equivalent blocking benchmark, as defined in Actual Benchmarking, and taking the benchmark time as t_pure
- Closing and re-opening the particular file(s)
- Re-synchronizing the processes
- Running the nonblocking case concurrent with CPU activity (exploiting t_CPU when running undisturbed), taking the effective time as t_ovrl

The desired CPU time to be matched (approximately) by t_CPU is set in IMB_settings_io.h:

#define TARGET_CPU_SECS 0.1 /* unit seconds */

Checking Results
To check whether your MPI implementa
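For concreteness, the following sketch contrasts the aggregate and non-aggregate kernels for an IMB-EXT style MPI_Put transfer, where MPI_Win_fence serves as the assured completion (an illustration only, assuming the window, buffer, and target rank are set up elsewhere; all processes in the window group take part in the fence):

    #include <mpi.h>

    /* Aggregate mode: issue n_sample puts, then ensure completion once. */
    double put_aggregate(void *buf, int x, int target, MPI_Win win, int n_sample)
    {
        double t = MPI_Wtime();
        int i;
        for (i = 0; i < n_sample; i++)
            MPI_Put(buf, x, MPI_BYTE, target, 0, x, MPI_BYTE, win);
        MPI_Win_fence(0, win);                 /* assured completion of all transfers */
        return (MPI_Wtime() - t) / n_sample;
    }

    /* Non-aggregate mode: ensure completion after every single transfer. */
    double put_non_aggregate(void *buf, int x, int target, MPI_Win win, int n_sample)
    {
        double t = MPI_Wtime();
        int i;
        for (i = 0; i < n_sample; i++) {
            MPI_Put(buf, x, MPI_BYTE, target, 0, x, MPI_BYTE, win);
            MPI_Win_fence(0, win);             /* assured completion of this transfer */
        }
        return (MPI_Wtime() - t) / n_sample;
    }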
69. marks all active processes Mfold MPI Accumulate x bytes gt rank 0 disjoint A t M MPI Win fence t t M A t M M Window This is the benchmark for measuring the overhead of an MPI_Win_create MPI_Win_fence MPI_Win_free combination In the case of an unused window a negligible non trivial action is performed inside the window It minimizes optimization effects of the MPI implementation The MPI_win_fence function is called to properly initialize an access epoch This is a correction as compared to earlier releases of the Intel MPI Benchmarks See the basic definitions and a schematic view of the pattern below Window Definition Property Description Measured pattern MPI _Win_create MPI_Win_fence MPI_Win_free t At M in psec as indicated in the figure Reported timings below Reported throughput None Window Pattern 49 Intel R MPI Benchmarks User Guide all active processes MPI Win create size X MPI Win fence MPI Put 1 byte Window MPI Win free IMB IO Blocking Benchmarks This section describes blocking I O benchmarks The benchmarks can run with varying transfer sizes X in bytes The timings are averaged over multiple samples The basic MPI data type for all data buffers is MPI_BYTE In the definitions below a single sample with a fixed I O size x is used Every benchmark contains an elementary I O action denoting a pure read or write Thus all benchmark fl
mem usage per process>]
[-msglen <Lengths_file>] [-map <PxQ>] [-input <filename>]
[-include benchmark1 [benchmark2 ...]] [-exclude benchmark1 [benchmark2 ...]]
[-msglog <minlog>:<maxlog>]
[benchmark1 [benchmark2 ...]]

The command line is repeated in the output. The options may appear in any order.

Examples
Use the following command line to get out-of-cache data for PingPong:

mpirun -np 2 IMB-MPI1 pingpong -off_cache -1

Use the following command line to run a very large configuration: restrict iterations to 20, maximum 1.5 seconds run time per message size, maximum 2 GBytes for message buffers:

mpirun -np 512 IMB-MPI1 -npmin 512 alltoallv -iter 20 -time 1.5 -mem 2

Other examples:

mpirun -np 8 IMB-IO
mpirun -np 10 IMB-MPI1 PingPing Reduce
mpirun -np 11 IMB-EXT -npmin 5
mpirun -np 14 IMB-IO P_Read_shared -npmin 7
mpirun -np 3 IMB-EXT -input IMB_SELECT_EXT
mpirun -np 14 IMB-MPI1 -multi 0 PingPong Barrier -map 2x7
mpirun -np 16 IMB-MPI1 -msglog 2:7 -include PingPongSpecificSource PingPingSpecificSource -exclude Alltoall Alltoallv
mpirun -np 4 IMB-MPI1 -msglog 16 PingPong PingPing PingPongSpecificSource PingPingSpecificSource
mpirun -np 16 IMB-NBC -include Ialltoall_pure Ibcast_pure
mpirun -np 8 IMB-RMA -multi 1 Put_local

Benchmark Selection Arguments
Benchmark selection arguments are a sequ
mem_info.h:
-off_cache -1

2.5 MB last-level cache, default line size:
-off_cache 2.5

16 MB last-level cache, line size 128:
-off_cache 16,128

The -off_cache mode might also be influenced by eventual internal caching with the Intel MPI Library. This could make results interpretation complicated.

Default: no cache control. Data may come out of cache.

-iter Option
Use this option to control the number of iterations.
By default, the number of iterations is controlled through the parameters MSGSPERSAMPLE, OVERALL_VOL, MSGS_NONAGGR, and ITER_POLICY defined in IMB_settings.h.
You can optionally add one or more arguments after the -iter flag to override the default values defined in IMB_settings.h. Use the following guidelines for the optional arguments:
- To override the MSGSPERSAMPLE value, use a single integer.
- To override the OVERALL_VOL value, use two comma-separated integers. The first integer defines the MSGSPERSAMPLE value. The second integer overrides the OVERALL_VOL value.
- To override the MSGS_NONAGGR value, use three comma-separated integers. The first integer defines the MSGSPERSAMPLE value. The second integer overrides the OVERALL_VOL value. The third overrides the MSGS_NONAGGR value.
- To override the -iter_policy argument, enter it after the integer arguments, or right after the -iter flag if you do not use any other argum
72. mple is expected to be exceeded Time limit is defined by variable SECS_PER_SAMPLE gt IMB_settings h or through the flag gt time Calling sequence was IMB MPIl pingping allreduce map 3x2 msglen Lengths multi 0 Message lengths were user defined MPI_Datatype MPI_BYTE MPI_Datatype for reductions MPI_FLOAT MPI_Op MPI_SUM List of Benchmarks to run Multi PingPing Multi Allreduce Benchmarking Multi PingPing 3 groups of 2 processes each running simultaneously Group 0 0 3 Group 1 1 4 109 Intel R MPI Benchmarks User Guide Group 2 2 5 bytes rep s t min usec t_max ysec t_avg usec Mbytes sec 0 1000 100 1000 1000 1000 10000 1000 100000 419 1000000 41 Benchmarking Multi Allreduce 3 groups of 2 processes each running simultaneously Group 0 0 3 Group 1 d 4 Group 2 2 5 tbytes repetitions t_min usec t_max usec t_avg psec 0 1000 100 1000 1000 1000 10000 1000 100000 419 1000000 41 Benchmarking Allreduce processes 4 rank order rowwise 110 Benchmark Methodology 2 additional processes waiting in MPI_Barrier bytes repetitions t min psec t_max usec t_avglusec 0 1000 100 1000 1000 1000 10000 1000 100000 419 1000000 41 j Benchmarking Allreduce processes 6 rank order rowwise 0 3 1 4 F F 2 5 j bytes repetitions t_min usec t
munication buffers are seen as typed arrays and initialized as follows:

((assign_type *) buffer)[i] = BUF_VALUE(rank, i);

where rank is the MPI rank of the calling process.

Warm-Up Phase (IMB-MPI1, IMB-EXT, IMB-NBC and IMB-RMA)
Before starting the actual benchmark measurement for IMB-MPI1, IMB-EXT, IMB-NBC, and IMB-RMA, the selected benchmark is executed N_WARMUP times with a sizeof(assign_type) message length. The N_WARMUP value is defined in IMB_settings.h (see Parameters Controlling Intel MPI Benchmarks for details). The warm-up phase eliminates the initialization overheads from the benchmark measurement.

Synchronization
Before the actual benchmark measurement is performed, the constant N_BARR is used to regulate calls to MPI_Barrier(MPI_COMM_WORLD). The N_BARR constant is defined in IMB_settings.h and IMB_settings_io.h with the current value of 2 (see the figure Control flow of IMB). This ensures that all processes are synchronized.

Actual Benchmarking
To reduce measurement errors caused by insufficient clock resolution, every benchmark is run repeatedly. The repetition count is as follows:
For IMB-MPI1, IMB-NBC, and aggregate flavors of the IMB-EXT, IMB-IO, and IMB-RMA benchmarks, the repetition count is MSGSPERSAMPLE. This constant is defined in IMB_settings.h / IMB_settings_io.h with the values 1000 and 50, respectively.
To avoid excessive run times for large tr
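A compact sketch of the buffer initialization, warm-up, and synchronization steps described above (illustrative only: assign_type, BUF_VALUE, N_WARMUP, and N_BARR stand for the definitions in IMB_settings.h, and the values shown here are placeholders, not the library's defaults):

    #include <mpi.h>
    #include <stddef.h>

    /* Stand-ins for the IMB_settings.h definitions (placeholder values). */
    typedef float assign_type;                    /* float for IMB-EXT, int for IMB-IO */
    #define BUF_VALUE(rank, i) (0.1f * ((rank) + 1) + (float)(i))
    #define N_WARMUP 2
    #define N_BARR   2

    /* Fill the communication buffer, run the warm-up, then synchronize. */
    void prepare_measurement(void *buffer, size_t n_items, int rank)
    {
        size_t i;
        int b;

        for (i = 0; i < n_items; i++)             /* buffer initialization */
            ((assign_type *) buffer)[i] = BUF_VALUE(rank, i);

        for (b = 0; b < N_WARMUP; b++) {
            /* execute the selected benchmark once with a
               sizeof(assign_type) message length -- omitted here */
        }

        for (b = 0; b < N_BARR; b++)              /* synchronization */
            MPI_Barrier(MPI_COMM_WORLD);
    }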
ne which benchmarks can run in the aggregate mode, see Classification of IMB-RMA Benchmarks.

Classification of IMB-RMA Benchmarks
All the IMB-RMA benchmarks fall into the following categories.

Single Transfer
In these benchmarks, one process accesses the memory of another process in a unidirectional or bidirectional manner. Single Transfer IMB-RMA benchmarks only run on two active processes. Throughput values are measured in MBps and can be calculated as follows:

throughput = X / 2^20 * 10^6 / time = X / (1.048576 * time)

where
- time is measured in µsec
- X is the length of a message, in bytes

Multiple Transfer
In these benchmarks, one process accesses the memory of several other processes. Throughput values are measured in MBps and can be calculated as follows:

throughput = X / 2^20 * 10^6 / time * N = X * N / (1.048576 * time)

where
- time is measured in µsec
- X is the length of a message, in bytes
- N is the number of target processes

NOTE: The final throughput value is multiplied by the number of target processes, since the transfer is performed to every process except the origin process itself.

Parallel Transfer
This class contains benchmarks that operate on several processes in parallel. These benchmarks show bare timing values: the maximum, minimum, and average time among all the ranks participating in the benchmark measurements.

The table below lists all IMB-RMA benchmarks and speci
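For illustration, the throughput formulas above translate directly into code (a helper sketch; time is the measured time in microseconds and X the message length in bytes):

    /* Throughput in MBps: bytes are converted to MB (2^20 bytes) and the
       time from microseconds to seconds, which yields the 1.048576 factor. */
    double throughput_mbps(double x_bytes, double time_usec)
    {
        return x_bytes / (1.048576 * time_usec);
    }

    /* Multiple Transfer variant: one origin accesses N target processes. */
    double multi_throughput_mbps(double x_bytes, double time_usec, int n_targets)
    {
        return (x_bytes * n_targets) / (1.048576 * time_usec);
    }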
... 11
Changes in Intel MPI Benchmarks 3.1  12
Changes in Intel MPI Benchmarks 3.0  13
Notational Conventions  13
Document Version Information  14
Related Information  16
Installation and Quick Start  17
Memory and Disk Space Requirements  17
Software Requirements  18
Installing Intel MPI Benchmarks  18
Building Intel(R) MPI Benchmarks  19
Building Intel MPI Benchmarks on Linux OS  19
Building Intel MPI Benchmarks on Windows OS  20
Running Intel MPI Benchmarks  22
Running Benchmarks in Multiple Mode  22
MPI-1 Benchmarks  24
Classification of MPI-1 Benchmarks  2
76. oall and Allgather call mpirun np 2 IMB MPI1 xclude Alltoall Allgather msglog lt minlog gt lt maxlog gt This option allows you to control the lengths of the transfer messages This setting overrides the MINMSGLOG and MAXMSGLOG values The new message sizes are O 2 M15103 ssy 2 MmaxLog For example try running the following command line mpirun np 2 IMB MPIl msglog 3 7 PingPong Intel MPI Benchmarks selects the lengths 0 8 16 32 64 128 as shown below Benchmarking PingPong processes 2 bytes repetitions t psec Mbytes sec 0 1000 0 70 0 00 8 1000 0313 10 46 16 1000 0 74 20 65 32 1000 0 94 32 61 64 1000 0 94 65 14 128 1000 1 06 115 16 Alternatively you can specify only the maxlog value Benchmarking PingPong 95 Intel R MPI Benchmarks User Guide processes 2 bytes repetitions t usec Mbytes sec 0 1000 0 69 0 00 1 1000 Oe Ja 1 33 2 1000 Osh 2009 4 1000 072 328 8 1000 0783 10 47 thread_level Option This option specifies the desired thread level for MPI_Init_thread See description of MPI_Init_thread for details The option is available only if the Intel MPI Benchmarks is built with the USE_MPI_INIT_THREAD macro defined Possible values for lt level gt are single funneled serialized and multiple Parameters Controlling Intel MPI Benchmarks Parameters controlling the default settings of the Intel MPI Benchmarks are
77. ocation within the Intel MPI Benchmarks directory structure As such solutions are highly version dependent see the information in the corresponding ReadMe txt files that unpack with the folder You are recommended to learn about the Microsoft Visual Studio philosophy and the run time environment of your Windows cluster Changes in Intel MPI Benchmarks 3 1 This release includes the following updates as compared to the Intel MPI Benchmarks 3 0 e New control flags e Better control of the overall repetition counts run time and memory exploitation e A facility to avoid cache re usage of message buffers as far as possible e A fix of IMB IO semantics e New benchmarks e Gather e Gatherv e Scatter Scatterv e New command line flags for better control e off_cache Use this flag when measuring performance on high speed interconnects or in particular across the shared memory within a node Traditional Intel MPI Benchmarks results included a very beneficial cache re usage of message buffers which led to idealistic results The flag off_cache allows avoiding cache effects and lets the Intel MPI Benchmarks use message buffers which are very likely not resident in cache e iter time Use these flags for enhanced control of the overall run time which is crucial for large clusters where collectives tend to run extremely long in the traditional Intel MPI Benchmarks settings CAUTI ON In the Intel MPI Benchmarks the
78. ocesses Unlike Tallgather_pure this benchmark shows whether MPI produces overhead Property Description Measured pattern MPI_Iallgatherv MPI_Wait 66 MPI 3 Benchmar ks _ nn MPI data type MPI_BYTE Pa Reported timings Bare time fa Reported throughput None lallreduce The benchmark for MPI_Iallreduce that measures communication and computation overlap mmm Property Description AAA Measured pattern MPI_Tallreduce IMB_cpu_exploit MPI_Wait ee MPI data type MP I_FLOAT MPI_SUM MPI operation __ _ E OVEL eo t_pure e E CPU Reported timings e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl min e pure E CPU For details see Measuring Communication and Computation Overlap Reported throughput None lallreduce_pure The benchmark for the MPI_Iallreduce function that measures pure communication time It reduces a vector of length L X sizeof float float items The MPI data type is MPI_FLOAT The MPI operation is MPI_SUM Property Description _ gt Measured pattern MPI_Iallreduce MPI_Wait cmm MPI data type MPI_FLOAT A MPI operation MPI_SUM m Reported timings Bare time 67 Intel R MPI Benchmarks User Guide Reported throughput None lalltoall The benchmark for MPI_Ialltoall that measures communication and computation overlap pm Property Description a Measured pattern MPI_Ialltoall IMB_cpu_exploit MPI_Wait
79. parameter value using the iter_policy Or iter flag The maximum repetition count for all IMB MPI 1 benchmarks You can override this parameter value using the iter flag The maximum repetition count for non aggregate benchmarks relevant only for IMB EXT You can override this parameter value using the time flag For all sizes smaller than OVERALL_VOL the repetition count is reduced so that not more than OVERALL_VOL bytes are processed all in all This permits you to avoid unnecessary repetitions for large message sizes Finally the real repetition count for message size X is MSGSPERSAMPLE X 0 min MSGSPERSAMPLE 97 Intel R MPI Benchmarks User Guide max 1 OVERALL_VOL X X gt 0 Note that OVERALL_VOL does not restrict the size of the maximum data transfer 2 scLoe OVERALL VOL You can override this parameter value using the mem flag Number of iterations is dynamically set so that SECS_PER_SAMPLE 10 this number of run time seconds is not exceeded per message length Number of MPI_Barrier N_BARR 2 2 Baer for synchronization CPU seconds as float to run concurrently with TARGET_CPU_SECS 0 01 seconds 0 1 seconds nonblocking benchmarks currently irrelevant for IMB MPI1 In the example below the IMB_settings_io h file has the IMB_OPTIONAL parameter enabled so that user defined parameters are u
80. pe MPI_BYTE _ __ Root i num_procs in iteration i o e E OVEL e t_pure e C CPU Reported timings e overlap 100 max 0 min 1 t_pure t_CPU t_ovrl mint pure E CPU For details see Measuring Communication and Computation Overlap SE Reported throughput None 76 MPI 3 Benchmarks Iscatterv_pure The benchmark for the MPI_Iscatterv function that measures pure communication time The root process inputs X np bytes x for each process All processes receive X bytes The root of the operation is changed round robin Property Description Measured pattern MPI_Iscatterv MPI_Wait MPI data type MPI_BYTE Root i num_procs in iteration i Reported timings Bare time Reported throughput None IMB RMA Benchmarks Intel MPI Benchmarks provides a set of remote memory access RMA benchmarks that use the passive target communication mode to measure one sided operations compliant with the MPI 3 standard IMB RMA Benchmark Modes When running the IMB RMA benchmarks you can choose between the following modes e Standard default or multiple mode You can enable the multiple mode for all IMB RMA benchmarks using the multi command line parameter For details see Running Benchmarks in Multiple Mode e Aggregate or non aggregate mode For details on these modes see the MPI 2 Benchmark Modes chapter Some IMB RMA benchmarks support the non aggregate mode only To determi
81. peration is changed round robin Property Description Measured pattern MPI_Gather MPI data type MPI_BYTE Root i num_procs in iteration i Reported timings Bare time Reported throughput None Gatherv The benchmark for the MPI_Gatherv function All processes input x bytes The root process receives X np bytes where np is the number of processes The root of the operation is changed round robin Property Description Measured pattern MPI_Gatherv 35 Intel R MPI Benchmarks User Guide gt MPI data type m Root EEE Reported timings q A gt zzIIIIIIIIIPI 2n EA A lt K gt Reported throughput Alltoall The benchmark for the MPI_Alltoall function In the case of np number of processes every process inputs X np bytes x for each process and receives X np bytes x from each process Ir mn IOIO Property Description _ _ lt Measured pattern MPI_Alltoall ft MPI data type lt i m Reported timings _ _ __Q nQ Q eQQQQQ e A Reported throughput Bcast The benchmark for MPI_Bcast The root process broadcasts X bytes to all other processes The root of the operation is changed round robin q m _ _ _ _____ __z__ Property Description DL Measured pattern MPI_Bcast MPI data type _ _ Reported timings 11 Reported throughput Barrier The benchmark for the MPI_Barrier function
82. poch and call the MPI_Barrier function Property Description Measured pattern MPI_Get MPI_Win_flush MPI data type MPI_BYTE origin and target 80 MPI 3 Benchmarks Reported timings Bare time Reported throughput MBps Bidir_put This benchmark measures the bidirectional MPI_Put operation in passive target communication mode The benchmark runs on two active processes These processes initiate an access epoch to each other using the MPI_Lock function transfer data close the access epoch and call the MPI_Barrier function Property Description Measured pattern MPI_Put MPI_Win_flush MPI data type MPI_BYTE origin and target Reported timings Bare time Reported throughput MBps Compare_and_swap This benchmark measures the MPI_Compare_and_swap operation in passive target communication mode The target process is waiting in the MPI_Barrier call Property Description Measured pattern MPI_Compare_and_swap MPI_Win_flush MPI data type MPI_INT origin and target Reported timings Bare time Reported throughput MBps Exchange_Get This benchmark tests the scenario when each process exchanges data with its left and right neighbor processes using the MPI_Get operation Property Description 81 Intel R MPI Benchmarks User Guide nn Measured pattern 2 MPI_Get 2 MPI_Win_flush MPI data type MPI_BYTE origin and target Reported timings Bare time o Reported throughput None Exchan
83. r call MPI data type MPI_BYTE origin and target Reported timings Bare time Reported throughput None Unidir_get This benchmark measures the MPI_Get operation in passive target communication mode The benchmark runs on two active processes The target process is waiting in the MPI_Barrier call Property Description 86 MPI 3 Benchmarks Measured pattern MPI_Get MPI_Win_flush MPI data type MPI_BYTE origin and target Reported timings Bare time o Reported throughput MBps Unidir_put This benchmark measures the MPI_Put operation in passive target communication mode The benchmark runs on two active processes The target process is waiting in the MPI_Barrier call Property Description _ _ __ ___ Measured pattern MPI_Put MPI_Win_flush _ _ o MPI data type MPI_BYTE origin and target a Reported timings Bare time m Reported throughput MBps 87 Benchmark Methodology This section describes e Different ways to manage Intel MPI Benchmarks control flow e Command line syntax for running the benchmarks e Sample output data you can receive Control Flow Intel MPI Benchmarks provides different ways to manage its control flow e Hard coded control mechanisms For example the selection of process numbers for running the central benchmarks See the Hard coded Settings section for details e Preprocessor parameters The required control parameters are ei
rary implementation.

You can measure a potential overlap of communication and computation using the IMB-NBC benchmarks. The general benchmark flow is as follows:
1. Measure the time needed for a pure communication call.
2. Start a nonblocking collective operation.
3. Start computation using the IMB_cpu_exploit function, as described in the IMB-IO Nonblocking Benchmarks chapter. To ensure correct measurement conditions, the computation time used by the benchmark is close to the pure communication time measured at step 1.
4. Wait for communication to finish using the MPI_Wait function.

Displaying Results
The timing values to interpret the overlap potential are as follows:
- t_pure is the time of a pure communication operation, non-overlapping with CPU activity.
- t_CPU is the time the IMB_cpu_exploit function takes to complete when run concurrently with the nonblocking communication operation.
- t_ovrl is the time the nonblocking communication operation takes to complete when run concurrently with a CPU activity.
- If t_ovrl = max(t_pure, t_CPU), the processes are running with a perfect overlap.
- If t_ovrl = t_pure + t_CPU, the processes are running with no overlap.

Since different processes in a collective operation may have different execution times, the timing values are taken for the process with the biggest t_ovrl execution time. The IMB-NBC result tables report the timings t_ovrl, t_pure, t_CPU, and the estimated overlap in percent, calcu
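A simplified sketch of this flow for one nonblocking collective (illustrative only; buffers, counts, and the communicator are assumed to be prepared elsewhere, t_pure is assumed to have been measured beforehand for the same message size, and IMB_cpu_exploit stands for the CPU kernel described in this guide, its return type assumed void here):

    #include <mpi.h>

    void IMB_cpu_exploit(float desired_time, int initialize);  /* IMB's CPU kernel */

    /* Sketch: overlap measurement for MPI_Iallreduce. */
    double iallreduce_overlap_sample(const float *sbuf, float *rbuf, int count,
                                     MPI_Comm comm, double t_pure)
    {
        MPI_Request req;
        double t_ovrl;

        IMB_cpu_exploit((float)t_pure, 1);     /* calibrate the CPU kernel to ~t_pure */

        t_ovrl = MPI_Wtime();
        MPI_Iallreduce(sbuf, rbuf, count, MPI_FLOAT, MPI_SUM, comm, &req);
        IMB_cpu_exploit((float)t_pure, 0);     /* compute while communication proceeds */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        return MPI_Wtime() - t_ovrl;           /* compare with t_pure and t_CPU */
    }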
re replaced for nonblocking flavors by IRead and IWrite, respectively. The definitions of blocking and nonblocking flavors are identical, except for their behavior in regard to:
- Aggregation. The nonblocking versions only run in the non-aggregate mode.
- Synchronism. Only the meaning of an elementary transfer differs from the equivalent blocking benchmark.

Basically, an elementary transfer looks as follows:

time = MPI_Wtime();
for (i = 0; i < n_sample; i++)
{
    Initiate transfer;
    Exploit CPU;
    Wait for the end of transfer;
}
time = (MPI_Wtime() - time) / n_sample;

The Exploit CPU section in the above example is arbitrary. Intel MPI Benchmarks exploits CPU as described below.

Exploiting CPU
Intel MPI Benchmarks uses the following method to exploit the CPU. A kernel loop is executed repeatedly. The kernel is a fully vectorizable multiplication of a 100x100 matrix with a vector. The function is scalable in the following way:

IMB_cpu_exploit(float desired_time, int initialize);

The input value of desired_time determines the time for the function to execute the kernel loop (with a slight variance). At the very beginning, the function is called with initialize=1 and an input value for desired_time. This determines an Mflop/s rate and a timing t_CPU, as close as possible to desired_time, obtained by running without any obstruction. During the actual benchmarking, IMB_cpu_exploit is
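As an illustration of the elementary transfer above for a nonblocking Write flavor (a sketch under assumptions: the file handle, offsets, and buffer are prepared elsewhere, target_cpu_secs plays the role of TARGET_CPU_SECS, and IMB_cpu_exploit's return type is assumed void):

    #include <mpi.h>

    void IMB_cpu_exploit(float desired_time, int initialize);  /* IMB's CPU kernel */

    /* One timed sample of a nonblocking write overlapped with CPU exploitation. */
    double iwrite_overlapped_sample(MPI_File fh, void *buf, int x_bytes,
                                    float target_cpu_secs, int n_sample)
    {
        MPI_Request req;
        MPI_Status  st;
        double t;
        int i;

        IMB_cpu_exploit(target_cpu_secs, 1);       /* calibrate the kernel once */

        t = MPI_Wtime();
        for (i = 0; i < n_sample; i++) {
            MPI_File_iwrite(fh, buf, x_bytes, MPI_BYTE, &req);  /* initiate transfer */
            IMB_cpu_exploit(target_cpu_secs, 0);   /* exploit CPU concurrently */
            MPI_Wait(&req, &st);                   /* wait for the end of transfer */
        }
        return (MPI_Wtime() - t) / n_sample;
    }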
recv benchmark is based on MPI_Sendrecv. In this benchmark, the processes form a periodic communication chain. Each process sends a message to the right neighbor and receives a message from the left neighbor in the chain. The turnover count is two messages per sample (one in, one out) for each process.

In the case of two processes, Sendrecv is equivalent to the PingPing benchmark of IMB 1.x. For two processes, it reports the bidirectional bandwidth of the system, as obtained by the optimized MPI_Sendrecv function.

Sendrecv Definition

Property               Description
Measured pattern       As symbolized in the figure below
MPI routines           MPI_Sendrecv
MPI data type          MPI_BYTE
Reported timings       time = Δt (in µsec), as indicated in the figure below
Reported throughput    2X / (1.048576 * time)

Sendrecv Pattern
(Figure: each process calls MPI_Sendrecv, passing X bytes to its right neighbor and receiving X bytes from its left neighbor in the periodic chain.)

Exchange
Exchange is a communication pattern that often occurs in grid-splitting algorithms (boundary exchanges). The group of processes is similar to a periodic chain, and each process exchanges data with both the left and right neighbor in the chain. The turnover count is four messages per sample (two in, two out) for each process. For the two Isend messages, separate buffers are used.

Exchange Definition

Property               Description
Measured pattern       As symbolized
87. rks see ReadMe_IMB txt See Also Building Intel MPI Benchmarks 18 Installation and Quick Start Building Intel R MPI Benchmarks This section describes how to build the Intel MPI Benchmarks on different operating systems Building Intel MPI Benchmarks on Linux OS To build the benchmarks for Linux OS do the following 1 Set the cc variable to point to the appropriate compiler wrapper mpiicc or mpicc 2 Run one or more makefile commands listed below Command Description Remove legacy binary object files and make clean i executable files Build the executable file for the IMB MPI1 make MPI1 component _ Build the executable file for one sided make EXT A i communications benchmarks make IO Build the executable file for O benchmarks Build the executable file for IMB NBC make NBC benchmarks clica Build the executable file for IMB RMA benchmarks make all Build all executable files available To build the benchmarks for Intel Many Integrated Core Architecture Intel MIC Architecture follow these steps 1 Build the Intel MPI Benchmarks for the host system host source lt install dir gt composer_xe bin compilervars sh intel64 host source lt install dir gt intel64 bin mpivars sh host cd lt path to IMB directory gt src host make f make_ict where lt install dir gt composer_xe refers to the Intel Composer XE installation directory and 19 Intel R MPI Benchmarks User
…Property                Description
Measured pattern        As symbolized in the figure "I/O benchmarks, aggregation for output" and in the figure below; in the figure, Nproc is the number of processes
MPI routines            MPI_File_write / MPI_File_read (blocking flavor); MPI_File_iwrite / MPI_File_iread (nonblocking flavor)
Reported timings        Δt (in usec), aggregate and non-aggregate for the Write flavor
Reported throughput     X/Δt, aggregate and non-aggregate for the Write flavor

[Figure: each process transfers X bytes with MPI_File_[ACTION] (X·Nproc bytes in total), using a private file for each process.]

C_[ACTION]_indv

C_[ACTION]_indv tests collective access from all processes to a common file, with an individual file pointer. Below, see the basic definitions and a schematic view of the pattern.

This benchmark is based on the following MPI routines:
• MPI_File_read_all / MPI_File_write_all for the blocking mode
• MPI_File_[ACTION]_all_begin / MPI_File_[ACTION]_all_end for the nonblocking mode

All other parameters and the measuring method are the same as for the P_[ACTION]_indv benchmark.

See Also
P_[ACTION]_indv

C_[ACTION]_expl

This pattern performs collective access from all processes to a common file, with an explicit file pointer.

This benchmark is based on the following MPI routines:
• MPI_File_read_at_all / MPI_File_write_at_all for the blocking mode
• MPI_File_[ACTION]_at_all_begin / MPI_File_[ACTION]_at_all_end for the nonblocking mode

All other parameters and the measuring method are the same as for the P_[ACTION]_expl benchmark.
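As an illustration of collective access to a common file with an individual file pointer (the C_[ACTION]_indv pattern), here is a minimal sketch. It is not the benchmark source: the file name, the transfer size, the seek-based positioning, and the omission of error handling and timing are simplifications.

#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    const int X = 1 << 16;              /* bytes per process (example value) */
    char *buf;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(X);
    memset(buf, rank, X);

    /* All processes open one common file ... */
    MPI_File_open(MPI_COMM_WORLD, "IMB_out_example",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* ... position their individual file pointers at disjoint offsets ... */
    MPI_File_seek(fh, (MPI_Offset)rank * X, MPI_SEEK_SET);

    /* ... and write collectively with the individual file pointer. */
    MPI_File_write_all(fh, buf, X, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}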
Allgatherv

The benchmark for the MPI_Allgatherv function. Every process inputs X bytes and receives the gathered X·np bytes, where np is the number of processes. Unlike Allgather, this benchmark shows whether MPI produces overhead.

Property                Description
Measured pattern        MPI_Allgatherv
MPI data type           MPI_BYTE
Reported timings        Bare time
Reported throughput     None

Scatter

The benchmark for the MPI_Scatter function. The root process inputs X·np bytes (X for each process); all processes receive X bytes. The root of the operation is changed round-robin.

Property                Description
Measured pattern        MPI_Scatter
MPI data type           MPI_BYTE
Root                    i % num_procs in iteration i
Reported timings        Bare time
Reported throughput     None

Scatterv

The benchmark for the MPI_Scatterv function. The root process inputs X·np bytes (X for each process); all processes receive X bytes. The root of the operation is changed round-robin.

Property                Description
Measured pattern        MPI_Scatterv
MPI data type           MPI_BYTE
Root                    i % num_procs in iteration i
Reported timings        Bare time
Reported throughput     None

Gather

The benchmark for the MPI_Gather function. The processes input X bytes each; the root process receives X·np bytes (X from each process). The root of the operation is changed round-robin.
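The round-robin change of the root, as described for Scatter above, can be sketched as follows. This is an illustrative example, not the benchmark source; the transfer size X and the repetition count n_sample are example values.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, num_procs, i;
    const int X = 4096;          /* bytes received by each process (example) */
    const int n_sample = 100;    /* repetitions (example) */
    char *sendbuf, *recvbuf;
    double t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    sendbuf = calloc((size_t)X * num_procs, 1);  /* X*np bytes, used at the root */
    recvbuf = calloc(X, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < n_sample; i++) {
        int root = i % num_procs;       /* root = i % num_procs in iteration i */
        MPI_Scatter(sendbuf, X, MPI_BYTE,
                    recvbuf, X, MPI_BYTE, root, MPI_COMM_WORLD);
    }
    t = (MPI_Wtime() - t) / n_sample;   /* bare time; no throughput is reported */

    if (rank == 0)
        printf("bare time %.2f usec\n", t * 1e6);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}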
1. Go to https://premier.intel.com.
2. Log in to the site. Note that your username and password are case-sensitive.
3. Click on the Submit Issue link in the left navigation bar.
4. Choose Development Environment (tools, SDV, EAP) from the Product Type drop-down list. If this is a software or license-related issue, choose the Intel® Cluster Studio XE Linux or the Intel® Cluster Studio XE Windows from the Product Name drop-down list.
5. Enter your question and complete the required fields to successfully submit the issue.

NOTE: Notify your support representative prior to submitting source code where access needs to be restricted to certain countries, to determine if this request can be accommodated.

Introduction

This guide presents the Intel MPI Benchmarks 4.0. The objectives of the Intel MPI Benchmarks are:
• Provide a concise set of benchmarks targeted at measuring the most important MPI functions.
• Set forth a precise benchmark methodology.
• Report bare timings rather than provide interpretation of the measured results. Show throughput values if and only if these values are well-defined.

Intel MPI Benchmarks is developed using ANSI C plus standard MPI. Intel MPI Benchmarks is distributed as an open-source project to enable use of the benchmarks across various cluster architectures and MPI implementations.

Intended Audience

This guide is intended for users who want to measure performance of MPI implem…
…I/O sizes of 32 and 64 MB and a smaller repetition count are selected, extending the standard mode tables. You can modify the optional values as required:

#define FILENAME "IMB_out"
#define IMB_OPTIONAL

#ifdef IMB_OPTIONAL
#define MINMSGLOG 25
#define MAXMSGLOG 26
#define MSGSPERSAMPLE 10
#define MSGS_NONAGGR 10
#define OVERALL_VOL 16*1048576
#define SECS_PER_SAMPLE 10
#define TARGET_CPU_SECS 0.1   /* unit seconds */
#define N_BARR 2
#else

/* Do not change anything below this line */

#define MINMSGLOG 0
#define MAXMSGLOG 24
#define MSGSPERSAMPLE 50
#define MSGS_NONAGGR 10
#define OVERALL_VOL 16*1048576
#define TARGET_CPU_SECS 0.1   /* unit seconds */
#define N_BARR 2

#endif

If IMB_OPTIONAL is deactivated, Intel MPI Benchmarks uses the default (standard mode) values.

Hard-Coded Settings

The sections below describe the Intel MPI Benchmarks hard-coded settings:
• Communicators, Active Processes
• Other Preparations for Benchmarking
• Message/I-O Buffer Lengths
• Buffer Initialization
• Warm-Up Phase (MPI-1, EXT)
• Synchronization
• Actual Benchmarking

Communicators, Active Processes

Communicator management is repeated in every "select MY_COMM" step. If it exists, the previous communicator is freed. When running Q < P processes, the first Q ranks of MPI_COMM_WORLD are…
…set by preprocessor definition in the files IMB_settings.h (for IMB-MPI1 and IMB-EXT benchmarks) and IMB_settings_io.h (for IMB-IO benchmarks). Both include files have identical structure, but differ in the predefined parameter values. To enable the optional mode, define the IMB_OPTIONAL parameter in IMB_settings.h / IMB_settings_io.h. After you change the settings in the optional section, you need to recompile the Intel MPI Benchmarks.

The following table describes the Intel MPI Benchmarks parameters and lists their values for the standard mode.

Parameter: USE_MPI_INIT_THREAD
  Value in IMB_settings.h: Not set    Value in IMB_settings_io.h: Not set
  Description: Set to initialize Intel MPI Benchmarks by MPI_Init_thread instead of MPI_Init.

Parameter: IMB_OPTIONAL
  Value in IMB_settings.h: Not set    Value in IMB_settings_io.h: Not set
  Description: Set to activate optional settings.

Parameter: MINMSGLOG
  Value in IMB_settings.h: 0          Value in IMB_settings_io.h: 0
  Description: The second smallest data transfer size is max(unit, 2^MINMSGLOG); the smallest size is always 0. Here unit = sizeof(float) for reductions and unit = 1 for all other cases. You can override this parameter value using the -msglog flag.

Parameter: MAXMSGLOG
  Value in IMB_settings.h: 22         Value in IMB_settings_io.h: 24
  Description: The largest message size used is 2^MAXMSGLOG. You can override this parameter value using the -msglog flag.

Parameter: ITER_POLICY
  Value in IMB_settings.h: imode_dynamic    Value in IMB_settings_io.h: imode_dynamic
  Description: The policy used for calculating the number of iterations. You can override this parameter value using the -iter_policy flag.

Parameter: MSGSPERSAMPLE
  Value in IMB_settings.h: 1000       Value in IMB_settings_io.h: 50

Parameter: MSGS_NONAGGR
  Value in IMB_settings.h: 100        Value in IMB_settings_io.h: 10

Parameter: OVERALL_VOL
  Value in IMB_settings.h: 40 Mbytes  Value in IMB_settings_io.h: 16*1048576
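To make the MINMSGLOG/MAXMSGLOG convention concrete, the small sketch below prints the resulting set of message lengths for the IMB_settings.h standard values, under the assumption unit = 1. It is an illustration only; the actual tables are built inside the benchmarks.

#include <stdio.h>

int main(void)
{
    const int minmsglog = 0, maxmsglog = 22;  /* IMB_settings.h standard values */
    const long unit = 1;                      /* sizeof(float) for reductions */
    long second_smallest;
    int k;

    printf("0\n");                            /* the smallest size is always 0 */

    /* The second smallest size is max(unit, 2^MINMSGLOG). */
    second_smallest = (1L << minmsglog) > unit ? (1L << minmsglog) : unit;
    printf("%ld\n", second_smallest);

    /* The sizes then double up to 2^MAXMSGLOG. */
    for (k = minmsglog + 1; k <= maxmsglog; k++)
        printf("%ld\n", 1L << k);

    return 0;
}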
…as indicated in the figure below: non-aggregate (M = 1) and aggregate (M = n_sample). For details, see Actual Benchmarking.
Reported throughput: X/t, aggregate and non-aggregate.

…with bidirectional transfers. Below, see the basic definitions and a schematic view of the pattern.

Property                Description
Measured pattern        As symbolized in the figure below. This benchmark runs on two active processes (Q = 2).
MPI routine             MPI_Get
MPI data type           MPI_BYTE, for both origin and target
Reported timings        t = t(M) (in usec), as indicated in the figure below: non-aggregate (M = 1) and aggregate (M = n_sample). For details, see Actual Benchmarking.
Reported throughput     X/t, aggregate and non-aggregate

[Figure: on each of the two processes, an M-fold MPI_Put of X bytes to disjoint sections of the partner's window, closed by MPI_Win_fence; Δt(M) is measured.]

Accumulate

This is the benchmark for the MPI_Accumulate function. It reduces a vector of length L = X/sizeof(float) of float items. The MPI data type is MPI_FLOAT. The MPI operation is MPI_SUM. See the basic definitions and a schematic view of the pattern below.

Accumulate Definition

Property                Description
Measured pattern        As symbolized in the figure below. This benchmark runs on two active processes (Q = 2).
MPI data type           MPI_FLOAT
MPI operation           MPI_SUM
Root                    0
Reported timings        t = t(M) (in usec), as indicated in the figure below: non-aggregate (M = 1) and aggregate (M = n_sample). For details, see Actual Benchmarking.
Reported throughput     None

Accumulate Pattern
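A minimal sketch of an accumulate pattern along these lines is shown below. It is illustrative only, not the IMB source: the vector length L and the aggregation count M are example values, window setup is simplified, and the program assumes at least two MPI processes.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, i;
    const int L = 1024;     /* number of float items (example value) */
    const int M = 10;       /* aggregation count (example value) */
    float *origin, *target;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    origin = calloc(L, sizeof(float));
    target = calloc(L, sizeof(float));

    /* Rank 1 exposes its buffer; other ranks attach a zero-sized window. */
    MPI_Win_create(target,
                   (MPI_Aint)((rank == 1) ? L * sizeof(float) : 0),
                   sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* start access/exposure epoch */
    if (rank == 0)
        for (i = 0; i < M; i++)         /* M-fold accumulate (aggregate mode) */
            MPI_Accumulate(origin, L, MPI_FLOAT, 1, 0, L, MPI_FLOAT,
                           MPI_SUM, win);
    MPI_Win_fence(0, win);              /* complete the epoch */

    MPI_Win_free(&win);
    free(origin); free(target);
    MPI_Finalize();
    return 0;
}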
…either command-line arguments or parameter selections in the settings.h / settings_io.h include files. See Parameters Controlling Intel MPI Benchmarks for details.

Intel MPI Benchmarks also offers different modes of control:
• Standard mode. In this mode, all configurable sizes are predefined and should not be changed. This ensures comparability for result tables.
• Optional mode. In this mode, you can set these parameters at your choice. You can use this mode to extend the result tables to larger transfer sizes.

The following graph shows the control flow inside the Intel MPI Benchmarks.

Control flow of Intel MPI Benchmarks:

For all selected benchmarks
    For all selected process numbers
        Select MPI communicator MY_COMM to run the benchmark
        For all selected transfer (message) sizes X
            Initialize message (resp. I/O) buffers
            Other preparations (MY_COMM)
            Synchronize processes of MY_COMM
            Execute benchmark (transfer size X)
        Output results

Command-line Control

You can control all the aspects of the Intel MPI Benchmarks through the command line. The general command-line syntax is the following:

IMB-MPI1 [-h{elp}]
         [-npmin <NPmin>]
         [-multi <MultiMode>]
         [-off_cache <cache_size[,cache_line_size]>]
         [-iter <msgspersample[,overall_vol[,msgs_nonaggr[,iter_policy]]]>]
         [-iter_policy <iter_policy>]
         [-time <max_runtime per sample>]
         [-mem <max…
…A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES, AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors, known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local…
…time flag has been implemented as default.
• -mem: Use this flag to determine an a priori maximum (per process) memory usage of the Intel MPI Benchmarks for the overall message buffers.

Miscellaneous Changes
• In the Exchange benchmark, the two buffers sent by MPI_Isend are separate.
• The command line is repeated in the output.
• Memory management is completely encapsulated in the functions IMB_v_alloc / IMB_v_free.

Changes in Intel MPI Benchmarks 3.0

This release includes the following updates as compared to the Intel MPI Benchmarks 2.3:
• A call to the MPI_Init_thread function to determine the MPI threading environment. The MPI threading environment is reported each time an Intel MPI Benchmarks application is executed.
• A call to the function MPI_Get_version to report the version of the Intel MPI library implementation that the three benchmark applications are linking to.
• New Alltoallv benchmark.
• New command-line flag -h[elp] to display the calling sequence for each benchmark application.
• Removal of the outdated Makefile templates. There are three complete makefiles called Makefile, make_ict, and make_mpich. The make_ict option uses the Intel Composer XE compilers. This option is available for both Intel and non-Intel microprocessors, but it may result in additional optimizations for Intel microprocessors.
• Better command-line argument checking: clean message and break on most invalid arguments.
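As a minimal sketch of the kind of environment reporting described above, the program below queries the MPI threading level via MPI_Init_thread and the MPI version via MPI_Get_version. It is illustrative only; the actual benchmark output format differs.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided, major, minor, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
    MPI_Get_version(&major, &minor);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("MPI version %d.%d, provided thread level %d\n",
               major, minor, provided);

    MPI_Finalize();
    return 0;
}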
…is working correctly, you can use the CPP flag -DCHECK. Activate the CPP flag -DCHECK through the CPPFLAGS variable and recompile the Intel MPI Benchmarks executable files. Every message-passing result from the Intel MPI Benchmarks is checked against the expected outcome. Output tables contain an additional column called Defects that displays the difference as floating-point numbers.

NOTE: The -DCHECK results are not valid as real benchmark data. Deactivate -DCHECK and recompile to get the proper results.

Output

The benchmark output includes the following information:
• General information: machine, system, release, and version are obtained by IMB_g_info.c.
• The calling sequence (command-line flags) is repeated in the output chart.
• Results for the non-multiple mode. After a benchmark completes, three time values are available, extended over the group of active processes:
  - Tmax: the maximum time
  - Tmin: the minimum time
  - Tavg: the average time
  The time unit is usec.

Single Transfer Benchmarks

Display: X (message size in bytes), T = Tmax (usec), bandwidth = X / 1.048576 / T.

Parallel Transfer Benchmarks

Display: X (message size), Tmax, Tmin, and Tavg; bandwidth based on time = Tmax.

NOTE: IMB-RMA benchmarks show only bare timings for Parallel Transfer benchmarks.

Collective Benchmarks

Display: X (message…
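The bandwidth figures above follow a simple arithmetic rule: bandwidth [MB/s] = X / 1.048576 / T, with X in bytes, T in usec, and 1 MB = 2^20 bytes. As a worked illustration (not part of the benchmarks), the helper below applies that formula:

#include <stdio.h>

static double bandwidth_mb_s(double bytes, double t_usec)
{
    /* X / 2^20 bytes per MB / (T * 1e-6 s) == X / 1.048576 / T */
    return bytes / 1.048576 / t_usec;
}

int main(void)
{
    /* Example: 4 MiB transferred in 1000 usec -> 4000 MB/s. */
    printf("%.1f MB/s\n", bandwidth_mb_s(4.0 * 1048576.0, 1000.0));
    return 0;
}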
…Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Notational Conventions

The following conventions are used in this document:

Style                   Description
This type style         Commands, arguments, options, file names
THIS_TYPE_STYLE         Environment variables
<this type style>       Placeholders for actual values
[items]                 Optional items
{item | item}           Selectable items separated by vertical bar(s)

Document Version Information

Document Number   Revision Number   Description                                Revision Date
320714-001        2.3               Initial version                            10/2004
320714-002        3.0               The following topics were added:           06/2006
                                    • Descriptions of environment amendments
                                    • The Alltoallv benchmark
                                    The following updates were added: …
…put into one group, and the remaining P - Q ranks get MPI_COMM_NULL. The group of MY_COMM is called the active processes group.

Other Preparations for Benchmarking

Window (IMB-EXT and IMB-RMA)
1. An Info is set and MPI_Win_create is called, creating a window of size X for MY_COMM.
2. For IMB-EXT, MPI_Win_fence is called to start an access epoch.

NOTE: IMB-RMA benchmarks do not require MPI_Win_fence since they use the passive target communication mode.

File (IMB-IO)

To initialize the IMB-IO file, follow these steps:
1. Select a file name. This parameter is located in the IMB_settings_io.h include file. In the case of a multi-<MPI command>, a suffix _g<groupid> is appended to the name. If the file name is per process, a second suffix _<rank> is appended.
2. Delete the file if it exists: open the file with MPI_MODE_DELETE_ON_CLOSE and close it.
3. Select a communicator to open the file: MPI_COMM_SELF for S_ benchmarks and P_[ACTION]_priv.
4. Select a mode: MPI_MODE_CREATE | MPI_MODE_RDWR.
5. Select an info routine, as explained below.

Info

Intel MPI Benchmarks uses an external function User_Set_Info, which you implement for the current system. The default version is:

#include "mpi.h"
void User_Set_Info(MPI_Info *opt_info)
{
#ifdef MPIIO
    /* Set info for all MPI_File_open calls */
    *opt_info = MPI_INFO_NULL;
#endif
…
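For reference, a site-specific User_Set_Info might pass file hints instead of MPI_INFO_NULL. The sketch below is an assumption-laden example, not an IMB default: the hint key cb_buffer_size (a standard MPI-IO reserved key) and its value are chosen purely for illustration, and the other branches of the default routine are omitted.

#include "mpi.h"

void User_Set_Info(MPI_Info *opt_info)
{
#ifdef MPIIO
    /* Create an info object and set a collective-buffering hint
     * that is applied to all MPI_File_open calls. */
    MPI_Info_create(opt_info);
    MPI_Info_set(*opt_info, "cb_buffer_size", "16777216");
#else
    *opt_info = MPI_INFO_NULL;
#endif
}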
…with the Visual Studio 2010, go to IMB-EXT_VS_2010.
3. Open the .vcproj or .vcxproj file in Visual Studio. The executable file for one of the Intel MPI Benchmarks components is created:
   • IMB-EXT.exe
   • IMB-IO.exe
   • IMB-MPI1.exe
   • IMB-NBC.exe
   • IMB-RMA.exe
4. From the Solution Platforms drop-down list, choose the required architecture (x64 or Win32).
5. From the Solution Configurations drop-down list, choose Release.
6. Highlight the project folder in the Solution Explorer.
7. Go to Project > Properties to open the Configuration Properties dialog box. Make sure you have something like the following settings:

Setting: Character Set
  Value: Use Multi-Byte Character Set
  Notes: General > Project Defaults

Setting: Debugger to launch
  Value: Local Windows Debugger
  Notes: Debugging

Setting: Command
  Value: x64: $(I_MPI_ROOT)\intel64\bin\mpiexec.exe
         IA-32: $(I_MPI_ROOT)\ia32\bin\mpiexec.exe
  Notes: Depending on your system configuration, you may select other debuggers

Setting: Command Arguments
  Value: -n 2 "$(TargetPath)"
  Notes: $(TargetPath) should be quoted, as in: -n 2 "$(TargetPath)"

Setting: Additional Include Directories
  Value: x64: $(I_MPI_ROOT)\intel64\include
         IA-32: $(I_MPI_ROOT)\ia32\include
  Notes: C/C++ > General

Setting: Warning Level
  Value: Level 1 (/W1)

Setting: Preprocessor Definitions
  Value: IMB_EXT, WIN_IMB, _CRT_SECURE…
  Notes: C/C++ > Preprocessor
