Mellanox OFED Linux User's Manual
Contents
1. The Manager and Runtime library are installed in the /opt/mellanox/fca directory.
   The FCA Manager will not be started automatically.
   To start the FCA Manager now, type:
       /etc/init.d/fca_managerd start
   There should be a single FCA Manager process running per fabric.
   To start the FCA Manager automatically after boot, type:
       /etc/init.d/fca_managerd install_service
   Check /opt/mellanox/fca/share/doc/fca/README.txt for quick start instructions.
   [Installer progress output: "Preparing..." lines for mxm, openshmem, mvapich2 (gcc), openmpi (gcc), mpitests and the mlnxofed docs]
   Mellanox Technologies Rev 2.0-2.0.5
   Device (05:00.0): 05:00.0
2. [Installer progress output: "Preparing..." progress bars for the following packages]
   libibcm, libibcm-devel, libibumad, libibumad-devel, libibumad-static,
   libibmad, libibmad-devel, libibmad-static, ibsim, ibacm, librdmacm,
   librdmacm-utils, librdmacm-devel, opensm-libs, opensm
3. 4.5.1 Enabling Time Stamping
   4.5.2 Getting Time Stamping
   4.6 Atomic Operations
   4.6.1 Enhanced Atomic Operations
   4.7 Ethernet Tunneling Over IPoIB Driver (eIPoIB)
   4.7.1 Enabling the eIPoIB
   4.7.2 Configuring the Ethernet Tunneling Over IPoIB Driver
   4.7.3 Setting Performance Tuning
   4.8 Contiguous Pages
   4.9 Shared Memory
   Chapter 5 Flow
   5.1 Flow Domains
   Chapter 6 Single Root IO Virtualization
   6.1 System Requirements
   6.2 Setting Up
   6.3 Enabling SR-IOV and Para Virtualization on the Same Setup
   6.4 Assigning a Virtual Function to a Virtual
   6.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
   6.5 Uninstalling SR-IOV
   6.6 Burning Firmware with
4. Table 29 - mstflint Switches

   Switch              Affected/Relevant   Description
                       Commands
   -h                                      Print the help menu.
   -hh                                     Print an extended help menu.
   -d[evice] <device>                      Specify the device to which the Flash is connected.
   -guid <GUID>        burn, sg            GUID base value. 4 GUIDs are automatically assigned
                                           to the following values:
                                             guid   -> node GUID
                                             guid+1 -> port1
                                             guid+2 -> port2
                                             guid+3 -> system image GUID
                                           Note: Port2 guid will be assigned even for a single
                                           port HCA; the HCA ignores this value.
   -guids <GUIDs...>   burn, sg            4 GUIDs must be specified here. The specified GUIDs
                                           are assigned the following values, respectively:
                                           node, port1, port2 and system image GUID.
                                           Note: Port2 guid must be specified even for a single
                                           port HCA; the HCA ignores this value. It can be set
                                           to 0x0.
   -mac <MAC>          burn, sg            MAC address base value. Two MACs are automatically
                                           assigned to the following values:
                                             mac   -> port1
                                             mac+1 -> port2
                                           Note: This switch is applicable only for Mellanox
                                           Technologies Ethernet products.
   -macs <MACs...>     burn, sg            Two MACs must be specified here. The specified MACs
                                           are assigned to port1 and port2, respectively.
                                           Note: This switch is applicable only for Mellanox
                                           Technologies Ethernet products.
   -blank_guids        burn                Burn the image with blank GUIDs and MACs (where app
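   The base-value behaviour of the guid and mac switches comes down to simple offsets. The following Python sketch is an illustration only and is not part of mstflint: the function names derive_guids, derive_macs and fmt are hypothetical, the offsets (+1, +2, +3) are taken from the switch descriptions above, and the sample base value is made up.

```python
# Illustrative only: mirrors the "base value" rule described for the
# guid and mac switches. Names and sample values are hypothetical.

def derive_guids(base_guid):
    """Return the four GUIDs implied by one base value (offsets 0..3)."""
    roles = ("node", "port1", "port2", "system_image")
    return {role: base_guid + offset for offset, role in enumerate(roles)}


def derive_macs(base_mac):
    """Return the two MACs implied by one base value (offsets 0..1)."""
    roles = ("port1", "port2")
    return {role: base_mac + offset for offset, role in enumerate(roles)}


def fmt(value, width):
    """Format an integer as zero-padded hexadecimal with a 0x prefix."""
    return "0x{:0{w}x}".format(value, w=width)


if __name__ == "__main__":
    # Made-up base GUID, used only to show the arithmetic.
    for role, guid in derive_guids(0x0002c90300001000).items():
        print(role, fmt(guid, 16))
```

   On a single-port HCA the port2 entry is still produced; according to the table, the HCA simply ignores it.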
5. Chapter 10 OpenSM Subnet Manager
   10.1 Overview
   10.2 opensm Description
   10.2.1 opensm Syntax
   10.2.2 Environment
   10.2.3 Signaling
   10.2.4 Running opensm
   10.3 osmtest Description
   10.3.2 Running osmtest
   10.4 Partitions
   10.4.1 File Format
   10.5 Routing Algorithms
   10.5.1 Effect of Topology Changes
   10.5.2 Min Hop Algorithm
   10.5.3 UPDN Algorithm
   10.5.4 Fat-tree Routing
   10.5.5 LASH Routing
   10.5.6 DOR Routing
   10.5.7 Torus-2QoS Routing
   10.6 Quality of Service Management in
   10.6.1
6. Appendix D mlx5 Module Parameters

   List of Figures
   Figure 1 Mellanox OFED Stack for ConnectX Family Adapter Cards
   Figure 2 Consolidation Over InfiniBand
   Figure 3 Example of a Virtual Network
   Figure 4 QoS Manager
   Figure 5 Example QoS Deployment on InfiniBand Subnet

   List of Tables
   Table 1 Document Revision History
   Table 2 Abbreviations and Acronyms
   Table 3 Glossary
   Table 4 Reference Documents
   Table 5 Software and Hardware Requirements
   Table 6 mlnxofedinstall Return Codes
   Table 7 Buffer Values
   Table 8 Parameters Used to Control Error Cases / Contiguity
   Table 9 Flow Specific Parameters
   Table 10 Useful MPI Links
   Table 11 ... Parameters
7. Chapter 7 HPC Features
   7.1 Shared Memory
   7.1.1 Mellanox ScalableSHMEM
   7.1.2 Running SHMEM with FCA
   7.1.3 Running ScalableSHMEM with
   7.1.4 Running SHMEM with Contiguous
   7.1.5 Running ScalableSHMEM Application
   7.2 Message Passing
   7.2.1 Overview
   7.2.2 Prerequisites for Running MPI
   7.2.3 MPI Selector - Which MPI Runs
   7.2.4 Compiling MPI Applications
   7.3 MellanoX Messaging
   7.3.1 Compiling OpenMPI with MXM
   7.3.2 Enabling MXM in OpenMPI
   7.3.3 Tuning MXM Settings
   7.3.4 Configuring Multi-Rail Support
   7.3.5 Configuring MXM over the Ethernet Fabric
   7.4 Fabric Collective Accelerator
   7.5 ScalableUPC
   7.5.1 Installing ScalableUPC
8. 10.6.2 Advanced QoS Policy File
   10.6.3 Simple QoS Policy
   10.6.4 Policy File Syntax Guidelines
   10.6.5 Examples of Advanced Policy
   10.6.6 Simple QoS Policy - Details and
   10.6.7 SL2VL Mapping and VL
   10.6.8 Deployment Example
   10.7 QoS Configuration Examples
   10.7.1 Typical HPC Example: MPI and
   10.7.2 EDC SOA (2-tier): IPoIB and
   10.7.3 EDC (3-tier): IPoIB, RDS, SRP
   10.8 Adaptive Routing
   10.8.1 Overview
   10.8.2 Installing the Adaptive Routing
   10.8.3 Running Subnet Manager with Adaptive Routing Manager
   10.8.4 Querying Adaptive Routing
   10.8.5 Adaptive Routing Manager Options File
   10.9 Congestion Control
   10.9.1 Congestion Control Overview
   10.9.2 Running OpenSM with Congestion Con
9. Installation
   [Installer progress output: "Preparing..." progress bars for cc_mgr, dump_pr, ar_mgr, ibdump, infiniband-diags, qperf and fca]
   INFO: updating
   IMPORTANT NOTE
10. If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image" on page 183.

    The following steps are also appropriate in case you wish to burn newer firmware that you have downloaded from the Mellanox Technologies Web site (http://www.mellanox.com > Support > Firmware Download).

    Step 1. Start mst.
        host1# mst start

    Step 2. Identify your target InfiniBand device for firmware update.
      1. Get the list of InfiniBand device names on your machine.
        host1# mst status
        MST modules:
            MST PCI module loaded
            MST PCI configuration module loaded
            MST Calibre (I2C) module is not loaded
        MST devices:
        /dev/mst/mt25418_pciconf0   - PCI configuration cycles access.
                                      bus:dev.fn=02:00.0 addr.reg=88 data.reg=92
                                      Chip revision is: A0
        /dev/mst/mt25418_pci_cr0    - PCI direct access.
                                      bus:dev.fn=02:00.0 bar=0xdef00000 size=0x100000
                                      Chip revision is: A0
        /dev/mst/mt25418_pci_msix0  - PCI direct access.
                                      bus:dev.fn=02:00.0 bar=0xdeefe000 size=0x2000
        /dev/mst/mt25418_pci_uar0   - PCI direct access.
                                      bus:dev.fn=02:00.0 bar=0xdc800000 size=0x800000
      2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.

    Step 3. Burn firmware.
      1. Burning a firmware binary image using mstflint that is already installed on your machine. Please refer to MSTFLINT_README.txt under
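    The device selection in Step 2 (pick the entry with the "_pci_cr0" postfix) can be expressed as a one-line filter. The Python sketch below is only an illustration: the helper name find_cr_devices is hypothetical, and the device paths are assumed to have the /dev/mst/<name> form shown in the mst status example.

```python
# Illustrative helper, not an mst tool: select the devices whose name
# ends with the "_pci_cr0" postfix named in Step 2.

def find_cr_devices(device_paths, suffix="_pci_cr0"):
    """Return the device paths that end with the given postfix."""
    return [path for path in device_paths if path.endswith(suffix)]


if __name__ == "__main__":
    # Device names as assumed from the mst status example above.
    devices = [
        "/dev/mst/mt25418_pciconf0",
        "/dev/mst/mt25418_pci_cr0",
        "/dev/mst/mt25418_pci_msix0",
        "/dev/mst/mt25418_pci_uar0",
    ]
    print(find_cr_devices(devices))
```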
11. Table 29 - mstflint Switches (continued)

    Switch         Affected/Relevant   Description
                   Commands
    -dual_image    burn                Make the burn process burn two images on Flash. The
                                       current default failsafe burn process burns a single
                                       image (in alternating locations).
    -v                                 Print version info.

    Table 30 - mstflint Commands

    Command                     Description
    b[urn]                      Burn Flash.
    q[uery]                     Query miscellaneous Flash/firmware characteristics.
    v[erify]                    Verify the entire Flash.
    bb                          Burn Block: burn the given image as is, without running
                                any checks.
    sg                          Set GUIDs.
    ri <out-file>               Read the firmware image on the Flash into the specified
                                file.
    dc <out-file>               Dump Configuration: print a firmware configuration file
                                for the given image to the specified output file.
    e[rase] <addr>              Erase sector.
    rw <addr>                   Read one DWORD from Flash.
    ww <addr> <data>            Write one DWORD to Flash.
    wwne <addr>                 Write one DWORD to Flash without sector erase.
    wbne <addr> size data       Write a data block to Flash without sector erase.
    rb <addr> size out-file     Read a data block from Flash.
    swreset                     SW reset the target InfiniScale IV device. This command
                                is supported only in the In-Band access method.

    Possible command return values are:
      0 - successful completion
      1 - error has occurred
12. Output Files
    Table 23 lists the various flags of the command.

    Table 23 - ibstatus Flags and Options

    Flag       Optional/Mandatory        Default (If Not        Description
                                         Specified)
    -h         Optional                                         Print the help menu.
    <device>   Optional                  All devices            Print information for the specified
                                                                device. May specify more than one
                                                                device.
    <port>     Optional, but requires    All ports of the       Print information for the specified
               specifying device name    specified device       port only (of the specified device).

    Examples
    1. List the status of all available InfiniBand devices and their ports.
       > ibstatus
       Infiniband device 'mlx4_0' port 1 status:
           default gid: fe80:0000:0000:0000:0000:0000:0007:3896
           base lid:    0x3
           sm lid:      0x3
           state:       4: ACTIVE
           phys state:  5: LinkUp
           rate:        20 Gb/sec (4X DDR)
       Infiniband device 'mlx4_0' port 2 status:
           default gid: fe80:0000:0000:0000:0000:0000:0007:3897
           base lid:    0x1
           sm lid:      0x1
           state:       4: ACTIVE
           phys state:  5: LinkUp
           rate:        20 Gb/sec (4X DDR)
       Infiniband device 'mthca0' port 1 status:
           default gid: fe80:0000:0000:0000:0002:c900:0101:d151
           base lid:    0x0
           sm lid:      0x0
           state:       3
           phys state:  5: LinkUp
           rate:        10 Gb/sec (4X)
       Infiniband device 'mthca0' port 2 status:
           default gid: fe80:0000:00
13. /proc/scsi_tgt/groups/Default/devices
    modprobe ib_srpt
    echo "add "mgmt"" > /proc/scsi_tgt/trace_level
    echo "add "mgmt_dbg"" > /proc/scsi_tgt/trace_level
    echo "add "out_of_mem"" > /proc/scsi_tgt/trace_level
    *********************** End srpt.sh ****************************

    B.3 How to Unload/Shutdown
    1. Unload ib_srpt:
        modprobe -r ib_srpt
    2. Unload scst and its dev_handlers first:
        modprobe -r scst_vdisk scst
    3. Unload ofed:
        /etc/rc.d/openibd stop

    Appendix C mlx4 Module Parameters
    In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
        options mlx4_core parameter=<value>
    and/or
        options mlx4_ib parameter=<value>
    The following sections list the available mlx4 parameters.

    mlx4_core Parameters
    Here are the updated parameters of the mlx4_core module:
        set_4k_mtu, debug_level, msi_x, enable_sys_tune, block_loopback, num_vfs,
        probe_vf, log_num_mgm_entry_size, high_rate_steer, fast_drop, enable_64b,
        log_num_mac, log_num_vlan, log_mtts_per_seg, port_type_array, log_num_qp,
        log_num_srq, log_rdmarc_per_qp, log_num_cq, log_num_mcg, log_num_mpt,
        log_num_mtt, enable_qos, internal_err_reset
    set_4k_mtu: (Obsolete) Attempt to set 4K MTU to all ConnectX port
14. service cpuspeed stop

    9.2.4.2 Kernel Idle Loop Tuning
    The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wake up time but may result in higher power consumption.
    To tune the kernel idle loop, set the following options in the /etc/modprobe.d/mlx4.conf file:
      - For MLNX_OFED 2.0.x:
          options mlx4_core enable_sys_tune=1
      - For MLNX_EN 1.5.10:
          options mlx4_en enable_sys_tune=1

    9.2.4.3 OS Controlled Power Management
    Some operating systems can override BIOS power management configuration and enable c-states by default, which results in a higher latency.
    To resolve the high latency issue, please follow the instructions below:
      1. Edit the /boot/grub/grub.conf file or any other bootloader configuration file.
      2. Add the following kernel parameters to the bootloader command:
          intel_idle.max_cstate=0 processor.max_cstate=1
      3. Reboot the system.
    Example:
        title RH6.2x64
        root (hd0,0)
        kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1

    9.2.5 Interrupt Moderation
    Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. T
15. LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:   2.5 Gbps or 5.0 Gbps
    LinkSpeedActive:    5.0 Gbps

    > ibportstate -C mthca0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:          Down
    PhysLinkState:      Polling
    LinkWidthSupported: 1X or 4X
    LinkWidthEnabled:   1X or 4X
    LinkWidthActive:    4X
    LinkSpeedSupported: 2.5 Gbps
    LinkSpeedEnabled:   2.5 Gbps
    LinkSpeedActive:    2.5 Gbps

    Change the speed of a port. First query for current configuration:
    > ibportstate -C mlx4_0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:          Initialize
    PhysLinkState:      LinkUp
    LinkWidthSupported: 1X or 4X
    LinkWidthEnabled:   1X or 4X
    LinkWidthActive:    4X
    LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:   2.5 Gbps or 5.0 Gbps
    LinkSpeedActive:    5.0 Gbps

    Now change the enabled link speed:
    > ibportstate -C mlx4_0 -D 0 1 speed 2
    ibportstate -C mlx4_0 -D 0 1 speed 2
    Initial PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkSpeedEnabled:   2.5 Gbps
    After PortInfo set:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkSpeedEnabled:   5.0 Gbps (IBA e
16. 7.5.2.2 Controlling FCA Offload in ScalableUPC using Environment Variables
    > To enable FCA module under ScalableUPC:
        % export GASNET_FCA_ENABLE_CMD_LINE=1
    > To set FCA verbose level:
        % export GASNET_FCA_VERBOSE_CMD_LINE=10
    > To set the minimal number of processes threshold to activate FCA:
        % export GASNET_NP_CMD_LINE=1
    ScalableUPC contains a modules configuration file (http://modules.sf.net) which can be found at /opt/mellanox/bupc/2.2/etc/bupc_modulefile.

    7.5.3 Various Executable Examples
    The following are various executable examples.
    > To run a ScalableUPC application without FCA support:
        % upcrun -np 128 -fca_enable 0 <executable filename>
    > To run UPC applications with FCA enabled for any number of processes:
        % export GASNET_FCA_ENABLE_CMD_LINE=1 GASNET_NP_CMD_LINE=0
        % upcrun -np 64 <executable filename>
    > To run a UPC application on 128 processes, verbose mode:
        % upcrun -np 128 -fca_enable 1 -fca_np 10 -fca_verbose 5 <executable filename>
    > To run a UPC application, offload to FCA Barrier and Broadcast only:
        % upcrun -np 128 -fca_ops <executable filename>

    8 Working With VPI
    VPI allows ConnectX ports to be independently configured as either IB
17. Table 2 - Abbreviations and Acronyms

    Abbreviation / Acronym   Whole Word / Description
    FW                       Firmware
    HCA                      Host Channel Adapter
    HW                       Hardware
    IB                       InfiniBand
    LSB                      Least significant byte
    lsb                      Least significant bit
    MSB                      Most significant byte
    msb                      Most significant bit
    NIC                      Network Interface Card
    SW                       Software
    VPI                      Virtual Protocol Interconnect
    IPoIB                    IP over InfiniBand
    PFC                      Priority Flow Control
    PR                       Path Record
    RDS                      Reliable Datagram Sockets
    RoCE                     RDMA over Converged Ethernet
    SDP                      Sockets Direct Protocol
    SL                       Service Level
    SRP                      SCSI RDMA Protocol
    MPI                      Message Passing Interface
    EoIB                     Ethernet over InfiniBand
    QoS                      Quality of Service
    ULP                      Upper Level Protocol
    VL                       Virtual Lane
    vHBA                     Virtual SCSI Host Bus adapter
    uDAPL                    User Direct Access Programming Library

    Glossary
    The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.

    Table 3 - Glossary

    Channel Adapter (CA),    An IB device that terminates an IB link and executes transport
    Host Channel Adapter     functions. This may be an HCA (Host CA) or a TCA (Target CA).
    (HCA)
    HCA Card                 A network adapter card based on an InfiniBand channel adapter
                             device.
    IB Devices               Integrated c
18. To enable SR-IOV and Para Virtualization on the same setup:

    Step 1. Create a bridge.
        vim /etc/sysconfig/network-scripts/ifcfg-bridge0
        DEVICE=bridge0
        TYPE=Bridge
        IPADDR=<bridge IP address>
        NETMASK=255.255.0.0
        BOOTPROTO=static
        ONBOOT=yes
        NM_CONTROLLED=no
        DELAY=0

    Step 2. Change the related interface (in the example below, bridge0 is created over eth5).
        DEVICE=eth5
        BOOTPROTO=none
        STARTMODE=on
        HWADDR=00:02:c9:2e:66:52
        TYPE=Ethernet
        NM_CONTROLLED=no
        ONBOOT=yes
        BRIDGE=bridge0

    Step 3. Restart the service network.

    Step 4. Attach a virtual NIC to
        ifconfig -a
        eth6  Link encap:Ethernet  HWaddr 52:54:00:E7:77:99
              inet addr:13.195.15.5  Bcast:13.195.255.255  Mask:255.255.0.0
              inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:481 errors:0 dropped:0 overruns:0 frame:0
              TX packets:450 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:22440 (21.9 KiB)  TX bytes:19232 (18.7 KiB)
              Interrupt:10 Base address:0xa000

    Step 5. Add the MAC 52:54:00:E7:77:99 to the /sys/class/net/eth5/fdb table on HV.
        Before:
        cat /sys/class/net/eth5/fdb
        33:33:00:00:02:02
        33:33:ff:2e:66:52
        01:00:5e:00:00:01
        33:33:00:00:00:01
        echo 52:54:00:E7:77:99 > /sys/class/net/eth5/fdb
        After:
        cat /sys/class/net/eth5/fdb
        52:54:00:e7:77:99
        33:33:00:00:02:02
        33:33:ff:2e:66:52
        01:00:5e:00:00:01
        33
19. 4.1.2 SRP Initiator
    This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from http://www.t10.org.
    The SRP Initiator supports:
      - Basic SCSI Primary Commands-3 (SPC-3)
        (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
      - Basic SCSI Block Commands-2 (SBC-2)
        (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
      - Basic functionality, task management and limited error handling

    4.1.2.1 Loading SRP Initiator
    To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes".
    For the changes to take effect, run: /etc/init.d/openibd restart
    When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).

    4.1.2.2 Manually Establishing an SRP Connection
    The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically.
      - Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
      - To establish a connection with an SRP Target and create an SRP (SCSI) device for that target u
20. [perfquery counter output; all counters read 0, including SymbolErrors, RcvRemotePhysErrors, XmtConstraintErrors, ExcBufOverrunErrors and RcvData]

    11.14 ibcheckerrs
    Validates an IB port (or node) and reports errors in counters above threshold.
    Check the specified port (or node) and report errors that surpassed their predefined threshold. The port address is a LID unless the -G option is used to specify a GUID address. The predefined thresholds can be dumped using the -s option, and a user-defined threshold file (using the same format as the dump) can be specified using the -T <file> option.

    Synopsis
        ibcheckerrs [-h] [-b] [-v] [-G] [-T <threshold_file>] [-s] [-N | -nocolor] [-C ca_name] [-P ca_port] [-t timeout_ms] <lid|guid> [<port>]

    Output Files
    Table 28 lists the various flags of the command.

    Table 28 - ibcheckerrs Flags and Options

    Flag          Optional/Mandatory   Default (If Not   Description
                                       Specified)
    -h(help)      Optional                               Print the help menu.
    -b            Optional                               Print in brief mode. Reduce the output to
                                                         show only if errors are present, not what
                                                         they are.
    -v(erbose)    Optional                               Increase verbosity level
21. extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
    - mlx4_ib driver:
        /lib/modules/<kernel_version>/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
    - mlx5_core driver:
        /lib/modules/<kernel_version>/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
    - mlx5_ib driver:
        /lib/modules/<kernel_version>/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
    - IPoIB:
        /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
    - iSER:
        /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/iser/ib_iser.ko
    - eIPoIB:
        /lib/modules/`uname -r`/updates/kernel/drivers/net/eipoib/eth_ipoib.ko
    - SRP:
        /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/srp/ib_srp.ko
    - RDS:
        /lib/modules/`uname -r`/updates/kernel/net/rds/rds.ko
        /lib/modules/`uname -r`/updates/kernel/net/rds/rds_rdma.ko
        /lib/modules/`uname -r`/updates/kernel/net/rds/rds_tcp.ko
    The location of the kernel modules may vary depending on the kernel's configuration. For example:
        /lib/modules/`uname -r`/extra/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4
    The include files of the kernel-ib-devel package are placed under /usr/src/ofa_kernel/include/. These include files should be used when building kernel modules that use the stack. (Note that the include files, if needed, are "backported" to your kernel.) Ther
22. ibsrpdm -c -d /dev/infiniband/umad0
    id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
    [root@lab104]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

    OR
      - You can edit /etc/infiniband/openib.conf to load the SRP driver and the SRP High Availability (HA) daemon automatically, that is, set "SRP_LOAD=yes" and "SRPHA_ENABLE=yes".
      - To set up and use the HA feature, you need the dm-multipath driver and the multipath tool.
      - Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable and use the HA feature.

    The following is an example of an SRP Target setup file:
    *********************** srpt.sh *******************************
    #!/bin/sh
    modprobe scst scst_threads=1
    modprobe scst_vdisk scst_vdisk_ID=100
    echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk3 3" >
23. port-group
        # using port GUIDs
        name: Storage
        # "use" is just a description that is used for logging
        # Other than that, it is just a comment
        use: SRP Targets
        port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
        port-guid: 0x1000000000FFFF
    end-port-group
    port-group
        name: Virtual Servers
        # The syntax of the port name is as follows:
        #   "node_description/Pnum".
        # node_description is compared to the NodeDescription of the node,
        # and "Pnum" is a port number on that node.
        port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
    end-port-group
    # using partitions defined in the partition policy
    port-group
        name: Partitions
        partition: Part1
        pkey: 0x1234
    end-port-group
    # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
    # or ALL (for all the nodes in the subnet)
    port-group
        name: CAs and SM
        node-type: CA, SELF
    end-port-group

        use: Storage targets
        destination: Storage
        service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
        qos-level-name: WholeSet
    end-qos-match-rule
    qos-match-rule
        source: Storage
        use: match by source group only
        qos-level-name: DEFAULT
    end-qos-match-rule
    qos-match-rule
        use: match by all parameters
        qos-class: 7-9,11
        source: Virtual Servers
        destination: Storage
        service-id: 0x0000000000010000-0x000000000001FFFF
        pkey: 0x0F00-0x0FFF
        qos-level-name: WholeSet
    end-qos-match-rule
    end-qos-mat
24. qos-ulps
        default                       # default SL
        sdp, port-num 30000           # SL for application running on top of SDP when a
                                      # destination TCP/IP port is 30000
        sdp, port-num 10000-20000
        sdp                           # default SL for any other application running on
                                      # top of SDP
        rds                           # SL for RDS traffic
        ipoib, pkey 0x0001            # SL for IPoIB on partition with pkey 0x0001
        ipoib                         # default IPoIB partition, pkey=0x7FFF
        any, service-id 0x6234        # match any PR/MPR query with a specific Service ID
        any, pkey 0x0ABC              # match any PR/MPR query with a specific PKey
        srp, target-port-guid 0x1234  # SRP when SRP Target is located on a specified
                                      # IB port GUID
        any, target-port-guid         # match any PR/MPR query with a specific target
                                      # port GUID
    end-qos-ulps

    As in the advanced policy definition, PR/MPR queries are matched in order of appearance in the QoS policy file, so the first match takes precedence. The exception is the "default" rule, which is applied only if the query did not match any other rule.
    All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, any query is matched first against the rules in the qos-match-rules section; only if there was no match is the query matched against the rules in the qos-ulps section.
    Note that some of these match rules may overlap, so in order to use the
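    The matching order described above (qos-match-rules first, then qos-ulps in order of appearance, then the default rule) can be sketched in a few lines of Python. This is only an illustration of the precedence logic, not OpenSM code: the function names first_match and resolve_sl, the query fields and every SL value below are hypothetical.

```python
# Illustrative precedence model for the QoS policy sections.
# Rules are (predicate, SL) pairs; all SL values here are made up.

def first_match(rules, query):
    """Return the SL of the first rule that accepts the query, else None."""
    for predicate, sl in rules:
        if predicate(query):
            return sl
    return None


def resolve_sl(match_rules, ulp_rules, default_sl, query):
    """Try qos-match-rules, then qos-ulps, then fall back to the default."""
    sl = first_match(match_rules, query)
    if sl is None:
        sl = first_match(ulp_rules, query)
    return default_sl if sl is None else sl


# Hypothetical rule sets; the specific rule comes before the general one.
ULP_RULES = [
    (lambda q: q.get("ulp") == "sdp" and q.get("port") == 30000, 0),
    (lambda q: q.get("ulp") == "sdp", 1),
    (lambda q: q.get("pkey") == 0x0ABC, 6),
]
MATCH_RULES = [
    (lambda q: q.get("source") == "Storage", 3),
]

if __name__ == "__main__":
    print(resolve_sl(MATCH_RULES, ULP_RULES, 7, {"ulp": "sdp", "port": 30000}))
```

    Because the first match wins, overlapping rules need to be listed from most specific to most general, as in the two sdp rules of the sketch.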
25. /tmp/initrd_ib/lib/modules

    Step 6. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
        host1# cp /sbin/insmod /tmp/initrd_ib/sbin/

    Step 7. If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
        host1# cp /sbin/ifconfig /tmp/initrd_ib/sbin

    Step 8. If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step.
    To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with.
    Copy the DHCP client v3.1.3 file and all the relevant files as described below.
        host1# cp <path to DHCP client v3.1.3>/dhclient /tmp/initrd_ib/sbin
        host1# cp <path to DHCP client v3.1.3>/dhclient-script /tmp/initrd_ib/sbin
        host1# mkdir -p /tmp/initrd_ib/var/state/dhcp
        host1# touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
        host1# cp /bin/uname /tmp/initrd_ib/bin
        host1# cp /usr/bin/expr /tmp/initrd_ib/bin
        host1# cp /sbin/ifconfig /tmp/initrd_ib/bin
        host1# cp /bin/hostname /tmp/initrd_ib/bin

    Step 9. Create a configuration file for the DHCP client (as described in Section 4.3.3.1) and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):
        dhclient.conf
        # The value indicates a hexadecimal number
        # For a ConnectX devic
26. A set of CAs, Routers or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.

    II. Fabric Setup
    Defines how the SL2VL and VLArb tables should be set up. In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).

    III. QoS-Levels Definition
    This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits. Path Bits are not implemented in OFED.

    IV. Matching Rules
    A list of rules that match an incoming PR/MPR request to a QoS-Level. The rules are processed in order, so the first match is applied. Each rule is built out of a set of match expressions, all of which must match for the rule to apply. The matching expressions are defined for the following fields:
      - SRC and DST to lists of port groups
      - Service-ID to a list of Service-ID values or ranges
      - QoS-Class to a list of QoS-Class values or ranges

    CMA Features
    The CMA interface supports Service-ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS-Class. The CMA uses the provided QoS-Class and Ser
27. 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    3 valid lids dumped

    4. Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016
    > ibroute -G 0x000b8cffff004016
    Unicast lids [0x0-0x8] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
    Lid    Out   Destination
           Port  Info
    0x0002 023 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003 000 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006 023 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007 020 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008 024 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
    5 valid lids dumped

    5. Dump all non-empty mlids of switch with Lid 3
    > ibroute -M 3
    Multicast mlids [0xc000-0xc3ff] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
    [per-port membership matrix]
    MLid
    0xc000
    0xc001
    0xc002
    0xc003
    0xc020
    0xc021
    0xc022
    0xc023
    0xc024
    0xc040
    0xc041
    0xc042
    12 valid mlids dumped

    11.12 smpquery
28. QoS support has to be turned on in order that SL/VL mappings are used.
    LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.
    For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order before the rest of the LASH algorithm runs.

    10.5.6 DOR Routing Algorithm
    The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the orderi
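    The idea of growing a path from the destination back to the source along the lowest available dimension can be illustrated on an open mesh with integer coordinates. The Python sketch below is a simplified model under stated assumptions and not the OpenSM implementation, which works on switch ports rather than coordinates; the function name dor_path and the coordinate representation are hypothetical.

```python
# Simplified dimension-order routing on an open (non-wrapping) mesh.
# Nodes are coordinate tuples of equal length; this is a teaching model only.

def dor_path(src, dst):
    """Return the node sequence from src to dst.

    The path is grown backwards from dst: at every step the walk moves
    one hop toward src along the lowest dimension that still differs.
    """
    current = list(dst)
    backwards = [tuple(current)]
    while tuple(current) != tuple(src):
        for dim in range(len(current)):
            if current[dim] != src[dim]:
                # One hop toward the source in this dimension.
                current[dim] += 1 if src[dim] > current[dim] else -1
                break
        backwards.append(tuple(current))
    return list(reversed(backwards))


if __name__ == "__main__":
    for hop in dor_path((0, 0), (2, 1)):
        print(hop)
```

    Every path produced this way has the minimum hop count (the sum of the per-dimension distances), and the choice among equal-length paths is fixed by the dimension order rather than spread across alternatives.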
29. Specify the expected link speed.
Skip the execution of the given stage. Applicable to the following stages: dup_guids | lids | links | sm | nodes_info | all (default = None).
-o|--output_path <out-dir>   Specify the directory where the output files will be placed (default = /var/tmp/ibdiagnet2).
--screen_num_errs            Specify the threshold for printing errors to screen (default = 5).
-h|--help                    Print this help message.
-V|--version                 Print the version of the tool.

Output Files

Table 19 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.

Table 19 - ibdiagnet (of ibutils2) Output Files
Output File             Description
ibdiagnet2.lst          Fabric links in LST format
ibdiagnet2.sm           Subnet Manager
ibdiagnet2.pm           Ports Counters
ibdiagnet2.fdbs         Unicast FDBs
ibdiagnet2.mcfdbs       Multicast FDBs
ibdiagnet2.nodes_info   Information on nodes
ibdiagnet2.db_csv       ibdiagnet internal database

An ibdiagnet run performs the following stages:
• Fabric discovery
• Duplicated GUIDs detection
• Links in INIT state and unresponsive links detection
• Counters fetch
• Error counters check
• Routing checks
• Link width and speed checks

Return Codes
0 - Success
1 - Failure (with description)

11.4 ibdiagnet (of ibutils) - IB Net Diagnostic

This version of ibdiagnet is included in the ibut
30. -d(ebug)        Optional   Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-G(uid)         Optional   Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-a              Optional   Apply query to all ports.
-l              Optional   Loop ports.
-r              Optional   Reset the counters after reading them.
-C <ca_name>    Optional   Use the specified channel adapter or router.
-P <ca_port>    Optional   Use the specified port.
-R              Optional   Reset the counters.
-t <timeout_ms> Optional   Override the default timeout for the solicited MADs [msec].
-V(ersion)      Optional   Show version info.
<lid|guid> [[port] [reset_mask]]   Optional   LID or GUID.

Examples:
perfquery -r 32 1          # read performance counters and reset
perfquery -e -r 32 1       # read extended performance counters and reset
perfquery -R 0x20 1        # reset performance counters of port 1 only
perfquery -e -R 0x20 1     # reset extended performance counters of port 1 only
perfquery -R -a 32         # reset performance counters of all ports
perfquery -R 32 2 0x0fff   # reset only error counters of port 2
perfquery -R 32 2 0xf000   # reset only non-error counters of port 2

1. Read local port's performance counters:
> perfquery
2. Read performance counters from LID 2, all ports.
3. Read then reset performance counters from LID 2, port 1.
31. mca scoll fca enable np 0

> To disable FCA:
-mca scoll_fca_enable 0 -mca coll_fca_enable 0

For more details on FCA installation and configuration, please refer to the FCA User Manual found on the Mellanox website.

7.1.3 Running ScalableSHMEM with MXM

The Mellanox Messaging (MXM) library provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
• Multiple transport support including RC, XRC and UD
• Proper management of HCA resources and memory structures
• Efficient memory registration
• One-sided communication semantics
• Connection management
• Receive side tag matching
• Intra-node shared memory communication

These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.

7.1.4 Running SHMEM with Contiguous Pages

Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

> To activate MLNX_OFED 2.0 and the contiguous pages allocator with SHMEM:
Run the following argument to enable compo
32. 7 HPC Features

7.1 Shared Memory Access

The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. SHMEM can be used either alone or in combination with MPI routines in the same parallel program.

The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabilities provide an excellent low-level interface for PGAS applications.

A SHMEM program is of a single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem, and periodically communicate with other PEs to exchange information on which the next communication phase depends.

The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).

SHMEM routines support remote data transfer through put operations da
33. Queries can be matched:
• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for destination port)
• PKey
• QoS class
• Service ID

To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.

10.6.3 Simple QoS Policy Definition

Simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion: its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and the QoS Level has only one constraint, Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below.

10.6.4 Policy File Syntax Guidelines

Leading and trailing blanks, as well as empty li
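The all-criteria matching behaviour described for PR/MPR queries in this excerpt can be illustrated with a minimal Python sketch. This is not OpenSM code: the dictionary representation, the field names (`service_id`, `pkey`, `qos_class`) and the function `rule_matches` are hypothetical, chosen only to model the rule that a query must satisfy every criterion a rule lists, while extra query fields are ignored.

```python
# Hypothetical model of QoS match rules; the names and data layout are
# illustrative assumptions, not OpenSM's internal representation.

def rule_matches(rule, query):
    """Return True only if every criterion listed in the rule is present
    in the query and the query's value is one the rule accepts.
    Query fields that the rule does not mention are ignored."""
    return all(
        field in query and query[field] in accepted
        for field, accepted in rule.items()
    )

# A rule with a single criterion (Service ID) ...
rule_service_only = {"service_id": {0x1234}}
# ... and a rule with an additional criterion (PKey).
rule_service_and_pkey = {"service_id": {0x1234}, "pkey": {0x7FFF}}

# A query with several component-mask fields set.
full_query = {"service_id": 0x1234, "pkey": 0x7FFF, "qos_class": 3}
# A query whose only component-mask field is Service ID.
service_only_query = {"service_id": 0x1234}

print(rule_matches(rule_service_only, full_query))              # True
print(rule_matches(rule_service_and_pkey, service_only_query))  # False
```

The second call returns False because the query does not carry a PKey at all, which mirrors the statement that a query with only Service ID set cannot match a rule that has other criteria besides Service ID.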
34. 00:11:22:33:44:55 loc 5 action 2
All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).

ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
All packets that contain the above source IP address and destination port are to be steered into rx-ring 2.

• ethtool -u eth5
Shows all of ethtool's steering rules.

When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table.

MLX4 Driver Support
The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure. The following are the flow-specific parameters:

Table 5 - Flow Specific Parameters
            ether   tcp4/udp4                                  ip4
Mandatory   dst     -                                          src-ip/dst-ip
Optional    vlan    src-ip, dst-ip, src-port, dst-port, vlan   src-ip, dst-ip, vlan

• RFS
RFS is an in-kernel logic responsible for load balancing between CPUs by attaching flows to CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic by implementing the ndo_rx_flow_steer, which, in turn, calls the underlying flow steering mechanism with the RFS domain.

• All of the rest
This domain is of the lowest priority. I
35. 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.

10.5.7.5 Operational Considerations

Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement.

If a change in fabric topology causes changes in path SL values required to route without credit loops, in general all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing by running applications is not required when the fabric is routed with torus-2QoS.

Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details.

Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If no fallback was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case all paths that do not tran
36. 5 and 0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y = 2) that is broken by a failure cannot contribute to a torus credit loop.

Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing.

In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case:

[Figure: master spanning tree on the 2D 6x5 torus with the failed switch at (3,2); axes x = 0..5, y = 0..4]

Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring for x = 3) that is broken by a failure, as in the above example.

10.5.7.3 Torus Topology Discovery

The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix
37. 7 - the burn command was aborted because firmware is current

Examples:

1. Find Mellanox Technologies's ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE.
> /sbin/lspci -d 15b3:634a
04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)

In the example above, 15b3 is Mellanox Technologies's vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn.

The PCI Device IDs of Mellanox Technologies' devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3.

2. Verify the ConnectX firmware using its ID (using the results of the example above).
> mstflint -d 04:00.0 v
ConnectX failsafe image. Start address: 0x80000. Chunk size 0x80000:
NOTE: The addresses below are contiguous logical addresses. Physical addresses on flash may be different, based on the image start address and chunk size.
/0x00000038-0x000010db (0x0010a4)/ (BOOT2) - OK
/0x000010dc-0x00004947 (0x00386c)/ (BOOT2) - OK
/0x00004948-0x000052c7 (0x000980)/ (Configuration) - OK
/0x000052c8-0x0000530b (0x000044)/ (GUID) - OK
/0x0000530c-0x0000542f (0x000124)/ (Image Info) - OK
0x00005430 0x0000634 0x000 20 DDR OK 0x000
38. Cases

Parameters                     Description
MLX_MR_ALLOC_TYPE              Configures the allocator type.
                               • ALL (Default): Uses all possible allocators and selects the most efficient allocator.
                               • ANON: Enables the usage of anonymous pages and disables the allocator.
                               • CONTIG: Forces the usage of the contiguous pages allocator. If contiguous pages are not available, the allocation fails.
MLX_MR_MAX_LOG2_CONTIG_BSIZE   Sets the maximum contiguous block size order.
                               • Values: 12-23
                               • Default: 23
MLX_MR_MIN_LOG2_CONTIG_BSIZE   Sets the minimum contiguous block size order.
                               • Values: 12-23
                               • Default: 12

4.9 Shared Memory Region

Shared Memory Region (MR) enables sharing MR among applications by implementing the "Register Shared MR" verb, which is part of the IB spec.

Sharing MR involves the following steps:

1. Request to create a shared MR
The application sends a request via the ibv_reg_mr API to create a shared MR. The application supplies the allowed sharing access to that MR, and if the MR was created successfully, a unique MR ID is returned as part of the struct ibv_mr, which can be used by other applications to register with that MR.
The underlying physical pages must not be Least Recently Used (LRU) or Anonymous. To disable that, you need to turn on the IBV_ACCESS_ALLOCATE_MR bit as part of the sharing bits.
Usage: Turns on via the ibv_reg_mr
39. Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
Link Width is not 8x
PCI Link Speed: 5Gb/s
Device (07:00.0):
07:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Link Width: 8x
PCI Link Speed: 5Gb/s
Installation finished successfully.
The firmware version on /dev/mst/mt26448_pci_cr0 - 2.9.1000 is up to date.
Note: To force firmware update use '--force-fw-update' flag.
The firmware version on /dev/mst/mt4099_pci_cr0 - 2.11.500 is up to date.
Note: To force firmware update use '--force-fw-update' flag.

In case your machine has the latest firmware, no firmware update will occur and the installation script will print, at the end of installation, a message similar to the following:
The firmware version on /dev/mst/mt26448_pci_cr0 - 2.9.1000 is up to date.
Note: To force firmware update use '--force-fw-update' flag.
The firmware version on /dev/mst/mt4099_pci_cr0 - 2.11.500 is up to date.
Note: To force firmware update use '--force-fw-update' flag.

In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
-I- Querying device
-E- Can't auto detect fw configuration file

4. In case the installation script performed firmware updates to your network adapter hardware, it will ask you to r
40. Example on Windows:
flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom

Removing the Expansion ROM Image
Remove the expansion ROM image. Run:
flint -dev <mst device name> drom
When removing the expansion ROM image, you also remove FlexBoot from the boot device list.

Preparing the DHCP Server in Linux Environment
The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB.

Installing the DHCP Server
To add IPoIB support in the DHCP client/server, please refer to docs/dhcp/README.

Configuring the DHCP Server
For ConnectX Family Devices
When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID (all separated by colons and represented in hexadecimal digits).

Extracting the Port GUID Method
To obtain the port GUID, run the following commands. The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
host1# mst start
host1# mst status
The device name will be of the form: /dev/mst/mt<dev_id>_pci
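As a hedged illustration of the client identifier layout described in this excerpt (a fixed prefix followed by the 8-byte port GUID, all colon-separated hexadecimal octets), the following Python sketch builds the string from a port GUID. The function name `dhcp_client_identifier` is hypothetical, and the sample GUID is simply one of the port GUIDs that appear in the ibroute output elsewhere in this manual.

```python
# Prefix quoted in the text for ConnectX family devices.
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"

def dhcp_client_identifier(port_guid):
    """Build a colon-separated client identifier from an 8-byte port GUID.

    Accepts forms such as '0x0002c90300001039' or '00:02:c9:03:00:00:10:39'.
    This helper is an illustrative assumption, not part of any Mellanox tool.
    """
    digits = port_guid.strip().lower().replace(":", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("port GUID must be exactly 8 bytes of hexadecimal")
    # Split the 16 hex digits into 8 two-digit octets.
    octets = [digits[i:i + 2] for i in range(0, 16, 2)]
    return CLIENT_ID_PREFIX + ":" + ":".join(octets)

print(dhcp_client_identifier("0x0002c90300001039"))
# ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39
```

The resulting 20-octet string is the kind of value a DHCP server configuration would match against for this client; how it is entered into a particular DHCP server's configuration file is outside the scope of this sketch.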
41. I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

10.5.7.2 Multicast Routing

Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root switch and every other switch is a non-root switch:

[Figure: full-fabric multicast spanning tree on the 2D 6x5 torus; axes x = 0..5, y = 0..4]

For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only pr
42. InfiniBand and RoCE interfaces
• MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
• OpenSM: InfiniBand Subnet Manager
• Utilities
  • Diagnostic tools
  • Performance tests
• Firmware tools (MFT)
• Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
• Documentation

1.2.3 Firmware

The ISO image includes the following firmware items:
• Firmware images (.mlx format) for ConnectX-3 network adapters
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX-3 HCA devices

1.2.4 Directory Structure

The ISO image of MLNX_OFED_LINUX contains the following files and directories:
• mlnxofedinstall: This is the MLNX_OFED_LINUX installation script.
• uninstall.sh: This is the MLNX_OFED_LINUX un-installation script.
• RPMS folders: Directory of binary RPMs for a specific CPU architecture.
• firmware: Directory of the Mellanox IB firmware images (including Boot-over-IB).
• src: Directory of the OFED source tarball.
• docs: Directory of Mellanox OFED related documentation.

1.3 Architecture

Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the versatility of markets that Mellanox OFED applies to.

Figu
43. May be used several times for additional verbosity (-vvv or -v -v -v).
-G(uid)   Optional   Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023

Table 28 - ibcheckerrs Flags and Options
Flag                  Optional / Mandatory       Description
-T <threshold_file>   Optional                   Use specified threshold file.
-s                    Optional                   Show the predefined thresholds.
-N | -nocolor         Optional                   Use mono mode rather than color mode (default: color mode).
-C <ca_name>          Optional                   Use the specified channel adapter or router.
-P <ca_port>          Optional                   Use the specified port.
-t <timeout_ms>       Optional                   Override the default timeout for the solicited MADs [msec].
<lid|guid>            Mandatory with -G flag     Use the specified port's or node's LID/GUID (with -G option).
<port>                Mandatory without -G flag  Use the specified port.

Examples:

1. Check aggregated node counter for LID 0x2.
> ibcheckerrs 2
#warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
#warn: counter LinkRecovers = 255 (threshold 10) lid 2 port 255
#warn: counter LinkDowned = 12 (threshold 10) lid 2 port 255
#warn: counter RcvErrors = 565 (threshold 10) lid 2 port 255
#warn: counter XmtDiscards = 441 (threshold 100) lid 2 port 255
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED

2. Check port coun
44. Operation

Prerequisites
• Make sure that your client is connected to the server(s).
• The FlexBoot image is already programmed on the adapter card (see Section A.2).
• For InfiniBand ports only: Start the Subnet Manager as described in Section A.4.
• The DHCP server should be configured and started (see Section 4.3.3.1, "IPoIB Configuration Based on DHCP", on page 46).
• Configure and start at least one of the services iSCSI Target (see Section A.10) and/or TFTP (see Section A.5).

Starting Boot
Boot the client machine and enter BIOS setup to configure MLNX FlexBoot to be the first on the boot device priority list (see Section A.6).

On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.

If MLNX FlexBoot/iPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by the Subnet Manager.

In case sensing the port protocol fails, the port will be configured as an InfiniBand port.

For ConnectX:
Mellanox ConnectX FlexBoot v3.3.400
iPXE 1.0.0 -- Open Source Network Boot Firmware
net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open) [Link
45. State University.

These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 6 lists some useful MPI links.

Table 6 - Useful MPI Links
MPI Standard    http://www-unix.mcs.anl.gov/mpi
Open MPI        http://www.open-mpi.org
MVAPICH 2 MPI   http://mvapich.cse.ohio-state.edu
MPI Forum       http://www.mpi-forum.org

This chapter includes the following sections:
• Section 7.2.2, "Prerequisites for Running MPI", on page 79
• Section 7.2.3, "MPI Selector - Which MPI Runs", on page 80
• Section 7.2.4, "Compiling MPI Applications", on page 81

7.2.2 Prerequisites for Running MPI

For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging in to and running commands on remote computers and/or servers.

7.2.2.1 SSH Configuration

The following steps describe how to configure password-less access over SSH:

1. Generate an ssh key on the initiator machine (host1).
host1$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
Enter passphrase (empty for
46. Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.

5. DOR Routing Algorithm
Based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh.

6. Torus-2QoS Routing Algorithm
Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Additionally, it can route around multiple failed fabric links or a single failed fabric switch without introducing deadlocks, and without changing path SL values granted before the failure.

OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches are going down, or one or more of these nodes are coming back after being down. A very common case that is ha
47. a new seed specification.

For maximum resiliency, no seed specification should share a switch with any other seed specification. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS can grant path SL values that are constant, regardless of which seed was used to initiate topology discovery.

portgroup_max_ports max_ports
This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails.

port_order p1 p2 p3 ...
This keyword specifies the order in which CA ports on a destination switch are visited when computing routes. When the fabric contains switches connected with multiple parallel links, routes are distributed in a round-robin fashion across such links, and so changing the order that CA ports are visited changes the distribution of routes across such links. This may be advantageous for some specific traffic patterns. The default is to visit CA ports in increasing port order on destination switches. Duplicate values in the list will be ignored.

EXAMPLE
# Look for a 2D (since x radix is one) 4x5 torus.
1 5 is
48. active port in the card:
mpirun x PORTS mlx4 0 1 <...>

7.4 Fabric Collective Accelerator

The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) process to the server CPUs. As a system-wide solution, FCA does not require any additional hardware. The FCA manager creates a topology-based collective tree, and orchestrates an efficient collective operation using the CPUs in the servers that are part of the collective operation. FCA accelerates MPI collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.

FCA is built on the following main principles:
• Topology-aware Orchestration
The MPI collective logical tree is matched to the physical topology. The collective logical tree is constructed to assure:
  • Maximum utilization of fast inter-core communication
  • Distribution of the results
• Communication Isolation
Collective communications are isolated from the rest of the traffic in the fabric using a private virtual network (VLane), eliminating contention with other types of traffic.

After MLNX_OFED installation, FCA can be found in the /opt/mellanox/fca folder. For further information on configuration instructions, please refer to the FCA User Manual.

7.5 ScalableUPC

Unified Parallel C (UPC) is an extension of the C prog
49. as follows:
BIT    LOG LEVEL ENABLED
0x01   ERROR (error messages)
0x02   INFO (basic messages, low volume)
0x04   VERBOSE (interesting stuff, moderate volume)
0x08   DEBUG (diagnostic, high volume)
0x10   FUNCS (function entry/exit, very high volume)
0x20   FRAMES (dumps all SMP and GMP frames)
0x40   ROUTING (dump FDB routing information)
0x80   currently unused
Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
--help   Display this usage info, then exit.

10.3.2 Running osmtest

To run osmtest in the default mode, simply enter:
host1# osmtest
The default mode runs all the flows except for the Quality of Service flow (see Section 10.6).

After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the following command in order to generate the inventory file:
host1# osmtest -f c
Immediately afterwards, run the following command to test opensm:
host1# osmtest -f a
Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.

10.4 Partitions

OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions con
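The log-level bits listed in this excerpt combine as an ordinary bitmask, which the following Python sketch makes concrete. The bit values and level names come from the table above; the helper names `encode_levels` and `decode_mask` are hypothetical and exist only for illustration (osmtest itself takes the numeric value directly via -vf).

```python
# Bit values and names as listed in the osmtest log-level table.
# 0x80 is documented as currently unused, so it is left out.
LOG_LEVEL_BITS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}

def decode_mask(mask):
    """Return the names of the log levels enabled by a -vf style mask."""
    return [name for bit, name in LOG_LEVEL_BITS.items() if mask & bit]

def encode_levels(*names):
    """Return the mask that enables exactly the named log levels."""
    by_name = {name: bit for bit, name in LOG_LEVEL_BITS.items()}
    mask = 0
    for name in names:
        mask |= by_name[name]
    return mask

print(decode_mask(0x3))                      # ['ERROR', 'INFO']
print(hex(encode_levels("ERROR", "DEBUG")))  # 0x9
```

The first call reproduces the documented default of ERROR + INFO for the value 0x3; a mask of 0 decodes to an empty list, matching the statement that -vf 0 disables all messages.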
50. /* ... at all */
HWTSTAMP_FILTER_NONE,
/* time stamp any incoming packet */
HWTSTAMP_FILTER_ALL,
/* return value: time stamp all packets requested plus some others */
HWTSTAMP_FILTER_SOME,
/* PTP v1, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
/* PTP v1, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
/* PTP v1, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
/* PTP v2, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
/* PTP v2, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
/* PTP v2, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
/* 802.AS1, Ethernet, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
/* 802.AS1, Ethernet, Sync packet */
HWTSTAMP_FILTER_PTP_V2_L2_SYNC,
/* 802.AS1, Ethernet, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ,
/* PTP v2/802.AS1, any layer, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_EVENT,
/* PTP v2/802.AS1, any layer, Sync packet */
HWTSTAMP_FILTER_PTP_V2_SYNC,
/* PTP v2/802.AS1, any layer, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_DELAY_REQ,

Note: for receive side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.

4.5.2 Getting Time Stamping

Once time stampi
51. condition: they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:

[Figure: 6x6 2D torus with two failed switches adjacent in the last DOR dimension; axes x = 0..5, y = 0..5]

Suppose switch T and a switch adjacent to it have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.

As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent in a dimension that is not the last dimension routed by DOR; here the failed switches are O and T:

[Figure: 6x6 2D torus with failed switches O and T; axes x = 0..5, y = 0..5]

In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n
52. d < torus_dimensions; d++)
    /* path_crosses_dateline(d) returns 0 or 1 */
    sl |= path_crosses_dateline(d) << d;

For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels.

Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows:

for (sl = 0; sl < 16; sl++)
    /* cdir(port) reports which torus coordinate direction a switch port
     * "points" in, and returns 0, 1, or 2 */
    sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport));

Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus.

Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]:

[Figure: 2D 6x5 torus; axes x = 0..5, y = 0..4]

For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has f
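The two bit manipulations described in this excerpt (one SL bit per torus dimension recording a dateline crossing, and an SL2VL map that extracts the bit for the direction an output port points in) can be tried out with a small Python sketch. This is a toy model under stated assumptions, not OpenSM code: `path_sl` and `sl2vl` are hypothetical names, the list of dateline crossings is supplied by hand, and the coordinate direction of the output port is passed in directly instead of being derived from a real port.

```python
def path_sl(crosses_dateline):
    """Encode dateline crossings into SL bits.

    crosses_dateline[d] is 0 or 1 for torus dimension d, so a 3D torus
    uses SL bits 0-2 and leaves one of the four SL bits free.
    """
    sl = 0
    for d, crosses in enumerate(crosses_dateline):
        sl |= (1 if crosses else 0) << d
    return sl

def sl2vl(sl, cdir_oport):
    """Select VL bit 0 from the SL bit of the direction the output port
    points in (cdir_oport is 0, 1 or 2 for x, y or z)."""
    return 0x1 & (sl >> cdir_oport)

# A path that crosses the x and z datelines but not the y dateline.
sl = path_sl([1, 0, 1])
print(sl)                                    # 5
print([sl2vl(sl, cdir) for cdir in range(3)])  # [1, 0, 1]
```

On links pointing in the x or z direction this path uses VL bit 0 = 1, and on links pointing in y it uses VL bit 0 = 0, which is how three SL bits collapse into a single VL bit per hop.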
53. device
• TSO through ConnectX LSO capability to fragment large datagrams into MTU quanta
• Dual operation mode: datagram and connected
• Large MTU support through connected mode

IPoIB also supports the following software-based enhancements:
• Giant Receive Offload
• NAPI
• Ethtool support

4.3.2 IPoIB Mode Setting

IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Datagram mode, except for the Connect-IB adapter card, which uses IPoIB with Connected mode as default. This can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting 'SET_IPOIB_CM=yes'.

The SET_IPOIB_CM parameter is set to "auto" by default to enable Connected mode for the Connect-IB card and Datagram mode for all other ConnectX cards.

After changing the mode, you need to restart the driver by running:
/etc/init.d/openibd restart

To check the current mode used for outgoing connections, enter:
cat /sys/class/net/ib<n>/mode

4.3.3 IPoIB Configuration

Unless you have run the installation script mlnxofedinstall with the flag '-n', IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The fi
54. docs

2. Burning a firmware image from a .mlx file using the mlxburn utility (that is already installed on your machine). The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2:

    host1$ mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408/fw-25408-rel.mlx

Step 4. Reboot your machine after the firmware burning is completed.

2.5 Uninstalling Mellanox OFED

Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.

3 Configuration Files

For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt.

4 Driver Features

4.1 SCSI RDMA Protocol

4.1.1 Overview

As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services.

Section 4.1.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target
55. down, TX:0 TXE:0 RX:0 RXE:0] [Link status: The socket is not connected] Waiting for link-up on net0... ok

After configuring the IB/ETH port, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.

For ConnectX (InfiniBand):

[Screen capture: Mellanox ConnectX FlexBoot v3.3.400 (iPXE 1.0.0, Open Source Network Boot Firmware). net0 (00:02:c9:03:00:0c:78:11) is opened with the link down; FlexBoot waits for link-up on net0, DHCP completes, and the lease reports next server 11.3.12.121 and filename pxelinux.0.]

Next, FlexBoot attempts to boot as directed by the DHCP server.

A.8 Command Line Interface (CLI)

A.8.1 Invoking the CLI

When the boot process begins, the computer starts its Power On Self Test (POST) sequence. Shortly after completion of the POST, the user will be prompted to press CTRL-B to invoke the Mellanox FlexBoot CLI. The user has a few seconds to press CTRL-B before the message disappears (see figure).

    Press Ctrl-B for the iPXE command line...

Alternatively, you may skip invoking the CLI right after POST and invoke it instead right after FlexBoot starts booting. Once the CLI is invoked, you will see the following prompt:

    iPXE>

A.8.2 Operation
56. (ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-e(rr_show) | Optional | | Show send and receive errors (timeouts and others)

Table 24 - ibportstate Flags and Options (Continued)
Columns: Flag | Optional/Mandatory | Default (If Not Specified) | Description

-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-V(ersion) | Optional | | Show version info
-D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' = self port; '0,1,2,1,4' = out via port 1, then 2, ...
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-s <smlid> | Optional | | Use <smlid> as the target LID for SM/SA queries
-C <ca_name> | Optional | | Use the specified channel adapter or router
-P <ca_port> | Optional | | Use the specified port
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
<dest dr_path | lid | guid> | Optional | | Destination's directed path, LID, or GUID
<portnum> | Optional | | Destination's port number
<op> [<value>] | Optional | query | Define the allowed port operations: enable, disable, reset, speed, and query

In case of multiple channel adapter
57. for the kernel version the diskless image will run.

Adding the Ethernet Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.

Step 2. Make a new working directory and change to it.

    host1$ mkdir /tmp/initrd_en
    host1$ cd /tmp/initrd_en

Step 3. Normally, the initrd image is zipped. Extract it using the following command:

    host1$ gzip -dc <initrd image> | cpio -id

The initrd files should now be found under /tmp/initrd_en.

Step 4. Create a directory for the ConnectX EN modules and copy them.

    host1$ mkdir -p /tmp/initrd_en/lib/modules/mlnx_en
    host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
    host1$ cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en

Step 5. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:

    host1$ cp /sbin/insmod /tmp/initrd_en/sbin/

Step 6. If you plan to give your Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.

    host1$ cp /sbin/ifconfig /tmp/initrd_en/sbin

Step 7. Now you can add the commands for loading the copied modules into
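The module-copy part of the procedure (Step 4) can be collected into one function, which makes it easier to repeat for several kernels. This is a minimal sketch under stated assumptions: the function name stage_initrd_modules is hypothetical, and the staging and driver directories are passed as arguments instead of being hard-coded so the function can be tried against scratch directories first.

```shell
#!/bin/sh
# stage_initrd_modules STAGE_DIR DRIVER_DIR
# Copies mlx4_core.ko and mlx4_en.ko from DRIVER_DIR/net/mlx4 into
# STAGE_DIR/lib/modules/mlnx_en (Step 4 of the procedure).
# Returns non-zero as soon as a directory cannot be created or a
# module cannot be copied.
stage_initrd_modules() {
    stage_dir=$1
    driver_dir=$2
    mkdir -p "$stage_dir/lib/modules/mlnx_en" || return 1
    for mod in mlx4_core.ko mlx4_en.ko; do
        cp "$driver_dir/net/mlx4/$mod" "$stage_dir/lib/modules/mlnx_en/" || return 1
    done
}

# Usage corresponding to the paths in the steps above:
#   stage_initrd_modules /tmp/initrd_en "/lib/modules/$(uname -r)/updates/kernel/drivers"
```

Stopping at the first failed copy avoids building an initrd that silently lacks one of the two modules.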
58. forwarding tables will be loaded.

--sadb_file, -S <file name>
    This option specifies the name of the SA DB dump file from where the SA database will be loaded.

--root_guid_file, -a <path to file>
    Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the GUIDs provided in the given file (one to a line).

--cn_guid_file, -u <path to file>
    Set the compute nodes for the Fat-Tree routing algorithm to the GUIDs provided in the given file (one to a line).

--io_guid_file, -G <path to file>
    Set the I/O nodes for the Fat-Tree routing algorithm to the GUIDs provided in the given file (one to a line).

--max_reverse_hops, -H <hop count>
    Set the maximum number of hops the wrong way around an I/O node is allowed to do (connectivity for I/O nodes on top switches).

--ids_guid_file, -m <path to file>
    Name of the map file with the set of IDs which will be used by the Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).

--guid_routing_order_file, -X <path to file>
    Set the order in which port GUIDs will be routed for the MinHop and Up/Down routing algorithms to the GUIDs provided in the given file (one to a line).

--torus_config <path to file>
    This option defines the file name for the extra configuration info needed for the torus-2QoS routing engine. The default name is /etc/opensm/torus-2QoS.conf.

--once, -o
    This option cau
59. identifier. The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier; see Section A.3.2, "Configuring the DHCP Server," on page 184.

4.3.3.1.1 DHCP Server

In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of the Mellanox OFED for Linux installation.

The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:

    dhcpd <IB network interface name> -d

Example:

    host1# dhcpd ib0 -d

4.3.3.1.2 DHCP Client (Optional)

A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux)".

In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:

    dhclient -cf <client conf file> <IB network interface name>

Exam
60. in the High or Low priority arbitration tables, and further, it can be listed in both tables.

The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes. A high_limit value of 255 indicates that the byte limit is unbounded.

Note: If the 255 value is used, the low-priority VLs may be starved.

A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.

Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64.

Below is an example of SL2VL and VL Arbitration configuration on a subnet:

    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on su
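The arithmetic behind high_limit and the "multiples of 64" advice can be checked with a short sketch. The function names below are hypothetical, and the credit size of 64 bytes is an assumption derived from the statement above that one 4KB packet requires 64 credits.

```shell
#!/bin/sh
# high_limit_bytes LIMIT
# Prints how much high-priority traffic may be sent before the
# low-priority table gets an opportunity: LIMIT x 4K bytes, with the
# special values 255 (unbounded) and 0 (a single packet).
high_limit_bytes() {
    case "$1" in
        255) echo "unbounded" ;;
        0)   echo "single packet" ;;
        *)   echo $(( $1 * 4096 )) ;;
    esac
}

# credits_for_mtu MTU_BYTES
# Prints the arbitration credits consumed by one MTU-sized packet,
# assuming one credit per 64 bytes.
credits_for_mtu() {
    echo $(( $1 / 64 ))
}

high_limit_bytes 6     # the qos_ca_high_limit value from the example: prints 24576
credits_for_mtu 4096   # prints 64
credits_for_mtu 2048   # prints 32
```

With a 4KB MTU, a VL weight below 64 cannot cover even one full packet, which is why the example weights (64, 128, 192) are all multiples of 64.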
61. in this mode, enter:

    host1# /etc/init.d/opensmd start

10.3 osmtest Description

osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it against a pre-saved one. See Section 10.3.2.

osmtest has the following test flows:
- Multicast Compliancy test
- Event Forwarding test
- Service Record registration test
- RMPP stress test
- Small SA Queries stress test

10.3.1 Syntax

    osmtest [OPTIONS]

where OPTIONS are:

-f, --flow
    This option directs osmtest to run a specific flow:

    Flow  Description
    c     create an inventory file with all nodes, ports and paths
    a     run all validation tests (expecting an input inventory)
    v     only validate the given inventory file
    s     run service registration, deregistration, and lease test
    e     run event forwarding test
    f     flood the SA with queries according to the stress mode
    m     multicast flow
    q     QoS info: dump VLArb and SLtoVL tables
    t     run trap 64/65 flow (this flow requires running of an external tool)

    Default: all flows except QoS

[Option headings spilled from the page margin: wait, debug, max_lid, guid, port]

This option speci
62. installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware. (1)

2.3.1 Pre-installation Notes

- The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

  Note: Pre-existing configuration files will be saved with the extension ".conf.rpmsave".

- If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).

- If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory.

  Usage:
      mlnx_add_kernel_support.sh -m|--mlnx_ofed <path to MLNX_OFED directory> [--make-iso|--make-tgz]
          [--make-iso]                    Create MLNX_OFED ISO image.
          [--make-tgz]                    Create MLNX_OFED tarball. (Default)
          [-t|--tmpdir <local work dir>]
          [--kmp]
          [-v|--verbose]

  Example: The following command will create a MLNX_OFED_LINUX ISO image for RedHat 5.6 under the /tmp directory.

(1) The firmware will not be updated if you run the install script with the --without-fw-update option.

MLNX_OFED_LINUX-1.5.3-rh
63. ments the interface for querying and manipulating subnet management data.

Subnet Manager (SM): One of several entities involved in the configuration and control of an IB fabric.

Unicast Linear Forwarding Tables (LFT): A table that exists in every switch, providing the port through which packets should be sent to each LID.

Virtual Protocol Interconnect (VPI): A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).

Related Documentation

Table 4 - Reference Documents
Columns: Document Name | Description

InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 | The InfiniBand Architecture Specification that is provided by IBTA
IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002), Document # PDF: SS94996 | Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gb/s Operation
Firmware Release Notes for Mellanox adapter devices | See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package
MFT User's Manual | Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package
MFT Release Notes | Release Notes for th
64. of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus. Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of inter-switch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations.

Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and report when it has insufficient configuration for a torus with radix-4 dimensions.

In the event the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed in
65. pcap
    Sniffer WQEs (max burst size): 4096
    Initiating resources ...
    searching for IB devices in host
    Port active_mtu=2048
    MR was registered with addr=0x60d850, lkey=0x28042601, rkey=0x28042601, flags=0x1
    QP was created, QP number=0x4004a
    Ready to capture (Press ^c to stop):

Appendix A: Mellanox FlexBoot

A.1 Overview

Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet.

Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products.

FlexBoot is based on the open source project iPXE, available at http://www.ipxe.org.

FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other service.

For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Li
66. radix-4 torus dimension, need both
    # ym_link and yp_link configuration.
    yp_link 0x200000 0x200005  # sw @ y=0,z=0 -> sw @ y=1,z=0
    ym_link 0x200000 0x20000f  # sw @ y=0,z=0 -> sw @ y=3,z=0

    # z is not radix-4 torus dimension, only need one of
    # zm_link or zp_link configuration.
    zp_link 0x200000 0x200001  # sw @ y=0,z=0 -> sw @ y=0,z=1

    next_seed

    yp_link 0x20000b 0x200010  # sw @ y=2,z=1 -> sw @ y=3,z=1
    ym_link 0x20000b 0x200006  # sw @ y=2,z=1 -> sw @ y=1,z=1
    zp_link 0x20000b 0x20000c  # sw @ y=2,z=1 -> sw @ y=2,z=2

    y_dateline -2  # Move the dateline for this seed
    z_dateline -1  # back to its original position.

    # If OpenSM failover is configured, for maximum resiliency
    # one instance should run on a host attached to a switch
    # from the first seed, and another instance should run
    # on a host attached to a switch from the second seed.
    # Both instances should use this torus-2QoS.conf to ensure
    # path SL values do not change in the event of SM failover.

    # port_order defines the order on which the ports would be
    # chosen for routing.
    port_order 7 10 8 11 9 12 25 28 26 29 27 30

10.6 Quality of Service Management in OpenSM

10.6.1 Overview

When Quality of Service (QoS) in OpenSM is enabled (using the -Q or --qos flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabri
67. the devices of this host returning to the original state (prior to the failed path).

Manual Activation of High Availability

Initialization (execute after each boot of the driver):

1. Execute modprobe dm-multipath
2. Execute modprobe ib_srp
3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and each HCA:

       srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>

   This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.

Now it is possible to access the SRP LUNs on /dev/mapper/.

Note: It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.

Note: It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the "blacklist" section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed.

Automatic Activation of High Availability

Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".

For the changes in openib.conf to take effect, run:

    /etc/init.d/openibd restart

From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/.

It is possible that regular (not SRP) LUNs ma
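Because step 4 of the manual activation has to be repeated for every HCA and port, it can help to generate the command list first and review it before running anything. The sketch below only prints the commands; it does not execute them. The function name print_srp_ha_commands and the HCA:PORT argument format are assumptions made for illustration, and the HCA names in the usage line are placeholders.

```shell
#!/bin/sh
# print_srp_ha_commands HCA:PORT [HCA:PORT ...]
# Prints (does not run) the manual High Availability activation
# commands: the two modprobe calls, then one srp_daemon invocation
# per HCA/port pair.
print_srp_ha_commands() {
    echo "modprobe dm-multipath"
    echo "modprobe ib_srp"
    for entry in "$@"; do
        hca=${entry%%:*}     # text before the first colon
        port=${entry##*:}    # text after the last colon
        echo "srp_daemon -c -e -R 300 -i $hca -p $port"
    done
}

print_srp_ha_commands mlx4_0:1 mlx4_0:2
```

The printed lines can be pasted into a root shell once /etc/udev/rules.d/91-srp.rules is in place, or the whole step can be left to srp_daemon.sh as noted above.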
68. their values.

Output Files

Table 20 - ibdiagnet (of ibutils) Output Files
Columns: Output File | Description

ibdiagnet.log | A dump of all the application reports generated according to the provided flags
ibdiagnet.lst | List of all the nodes, ports, and links in the fabric
ibdiagnet.fdbs | A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs | A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks | In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm | List of all the SMs (state and priority) in the fabric
ibdiagnet.pm | A dump of the PM counter values of the fabric links
ibdiagnet.pkey | A dump of the existing partitions and their member host ports
ibdiagnet.mcg | A dump of the multicast groups, their properties, and member host ports
ibdiagnet.db | A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option

In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible p
69. /usr/share/man

Firmware

The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:

1. You run the installation script in default mode; that is, without the option --without-fw-update.
2. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.

Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.

In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.

Error message:

    -I- Querying device ...
    -E- Can't auto detect fw configuration file: ...

Post-installation Notes

Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.

The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.

2.4 Updating Firmware After Installation

In case you ran the mlnxofedinstall script with the --without-fw-update option and now you wish to (manually) update firmware on your adapter card(s), you need to perform the following steps
70. The CLI resembles a Linux shell, where the user can run commands to configure and manage one or more PXE port network interfaces. Each port is assigned a network interface called net<i>, where i is 0, 1, 2, ..., <# of interfaces>. Some commands are general and are applied to all network interfaces. Other commands are port specific; therefore, the relevant network interface is specified in the command.

A.8.3 Command Reference

A.8.3.1 ifstat

Displays the available network interfaces (in a similar manner to Linux's ifconfig).

[Screen capture of "iPXE> ifstat" output: net0 (00:02:c9:03:00:0c:78:11) on PCI02:00.0 is closed with the link down and lists errors such as "The socket is not connected", "No such file or directory", and "Operation canceled"; net1 on PCI02:00.0 is open with the link up.]

A.8.3.2 ifopen

Opens the network interface net<x>. The list of network interfaces is available via the ifstat command.

Example:

    iPXE> ifopen net1

A.8.3.3 ifclose

Closes the network interface net<x>. The list of network interfaces is available via the ifstat command.

Example:

    iPXE> ifclose net1

A.8.3.4 autoboot

Starts the boot process from the device(s).

A.8.3.5 sanboot

Starts the boot process of an iSCSI target.

Example:

    iPXE> sanboot iscsi:11.4.3.7::::iqn.2007-08.7
71. v2.9.1000
    Firmware Check on CA #0 (NIC) .......... PASS
    [...]
    Firmware Check on CA #1 (NIC) .......... PASS
    Host Driver [...] ...................... PASS
    Number of CA Ports Active .............. 4
    Port State of Port #1 on CA #0 (NIC) ... UP 1X QDR (Ethernet)
    Port State of Port #2 on CA #0 (NIC) ... UP 1X QDR (Ethernet)
    Port State of Port #1 on CA #1 (NIC) ... UP 1X QDR (Ethernet)
    Port State of Port #2 on CA #1 (NIC) ... UP 1X QDR (Ethernet)
    Error Counter Check on CA #0 (NIC) ..... NA (Eth ports)
    Error Counter Check on CA #1 (NIC) ..... NA (Eth ports)
    [...] .................................. PASS
    Node GUID on CA #0 (NIC) ............... [...]
    Node GUID on CA #1 (NIC) ............... [...]

Note: After the installer completes, information about the Mellanox OFED installation, such as prefix, kernel version, and installation parameters, can be retrieved by running the command /etc/infiniband/info.

2.3.4 Installation Results

Software

- The OFED and MFT packages are installed under the /usr directory.
- The kernel modules are installed under:
  - InfiniBand subsystem: /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/
  - mlx4_core driver: /lib/modules/<kernel version>
72. 00
    inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
    UP BROADCAST MULTICAST MTU:65520 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:128
    RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).

4.3.4 IPoIB MTU

IPoIB MTU is defined by the OpenSM configuration. Currently there are two MTU options, 2K and 4K. The physical port MTU indicates the port capability (default is 4K), while the OpenSM default value for all IPoIB ports in the fabric is 2K. Applications that do not use the OpenSM MTU (e.g., a private application that uses ib verbs) can use the 4K MTU. IPoIB receives its port MTU (logical MTU) from OpenSM, so its default value is 2K, as set by OpenSM.

To change the MTU to 4K, edit the OpenSM partition file in the section of the IPoIB setting:

    Default=0xffff, ipoib, mtu=5 : ALL=full;

mtu=5 indicates that all IPoIB ports in the fabric are using 4K MTU (mtu=4 indicates 2K MTU).

For more details on OpenSM configuration, please refer to Section 10, "OpenSM - Subnet Manager," on page 101.

4.3.5 Subinterfaces

You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Pa
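The mtu= values in the partition file entry of Section 4.3.4 are InfiniBand MTU codes, not byte counts. A minimal sketch of the mapping follows; it assumes the standard InfiniBand encoding in which code 1 is 256 bytes and each further code doubles the size, which agrees with the two values stated in the text (4 is 2K, 5 is 4K). The function names are hypothetical.

```shell
#!/bin/sh
# ib_mtu_bytes CODE
# Converts an InfiniBand MTU code (1..5) to a size in bytes.
# Prints an error on stderr and returns 1 for any other code.
ib_mtu_bytes() {
    if [ "$1" -ge 1 ] && [ "$1" -le 5 ]; then
        echo $(( 128 << $1 ))
    else
        echo "invalid MTU code: $1" >&2
        return 1
    fi
}

# ipoib_partition_line CODE
# Prints a default-partition entry in the form shown in Section 4.3.4,
# using the given MTU code.
ipoib_partition_line() {
    echo "Default=0xffff, ipoib, mtu=$1 : ALL=full;"
}

ib_mtu_bytes 4           # prints 2048
ib_mtu_bytes 5           # prints 4096
ipoib_partition_line 5   # prints the 4K entry shown above
```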
73. 00:0000:0002:c900:0101:d152
    base lid:    0x0
    sm lid:      0x0
    phys state:  5: LinkUp
    rate:        10 Gb/sec (4X)

2. List the status of specific ports of specific devices:

    > ibstatus mthca0:1 mlx4_0:2
    Infiniband device 'mthca0' port 1 status:
        default gid: fe80:0000:0000:0000:0002:c900:0101:d151
        base lid:    0x0
        sm lid:      0x0
        state:
        phys state:  5: LinkUp
        rate:        10 Gb/sec (4X)
    Infiniband device 'mlx4_0' port 2 status:
        default gid: fe80:0000:0000:0000:0000:0000:0007:3897
        base lid:    0x1
        sm lid:      0x1
        state:       4: ACTIVE
        phys state:  5: LinkUp
        rate:        20 Gb/sec (4X DDR)

11.10 ibportstate

Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port.

If the queried port is a switch port, then ibportstate can be used to:
- disable, enable, or reset the port
- validate the port's link width and speed against the peer port

Synopsis

    ibportstate [-d] [-e] [-v] [-V] [-D] [-G] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid>] <portnum> [<op> [<value>]]

Output Files

Table 24 lists the various flags of the command.

Table 24 - ibportstate Flags and Options
Columns: Flag | Optional/Mandatory | Default (If Not Specified) | Description

-h(help) | Optional | | Print the help menu
-d
74. 06350 0x0000 29b 0 008 4 (DDR) - OK
    0x0000 29c - 0x0004749b (0x038200) (DDR) - OK
    0x0004749c - 0x0005913 (0x011ca4) (DDR) - OK
    0x00059140 - 0x0007a123 (0x020fe4) (DDR) - OK
    0x0007a124 - 0x0007bdff (0x001cdc) (DDR) - OK
    0x0007be00 - 0x0007eb97 (0x002d98) (DDR) - OK
    0x0007eb98 - 0x0007f0af (0x000518) (Configuration) - OK
    0x0007 0b0 - 0x0007f0fb (0x00004c) (Jump addresses) - OK
    0x0007 0fc - 0x0007f2a7 (0x0001ac) (FW Configuration) - OK
    FW image verification succeeded. Image is bootable.

11.16 ibv_asyncwatch

Display asynchronous events forwarded to userspace for an InfiniBand device.

Synopsis

    ibv_asyncwatch

Examples

1. Display asynchronous events:

    > ibv_asyncwatch
    mlx4_0: async event FD 4

11.17 ibdump

Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX/ConnectX-2/ConnectX-3 adapters' InfiniBand ports. The dump file can be loaded by the Wireshark tool for graphical traffic analysis.

The following describes a work flow for local HCA (adapter) sniffing:
- Run ibdump with the desired options
- Run the application whose traffic you wish to analyze
- Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode)
- Open Wireshark and load the generated file

How to Get Wireshark

Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt
75. 0x2134af2306;

    YetAnotherOne = 0x300 : SELF=full ;
    YetAnotherOne = 0x300 : ALL=limited ;

    ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
    # 0x123453, 0x123454 will be limited
    ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
    # 0x123456, 0x123457 will be limited
    ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
    ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
    ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

The following rule is equivalent to how OpenSM used to run prior to the partition manager:

    Default=0x7fff,ipoib:ALL=full;

10.5 Routing Algorithms

OpenSM offers six routing engines:

1. Min Hop Algorithm
   Based on the minimum hops to each node, where the path length is optimized.

2. UPDN Algorithm
   Based on the minimum hops to each node, but constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree and a deadlock may occur due to a loop in the subnet.

3. Fat-tree Routing Algorithm
   This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.

4. LASH Routing Algorithm
76. 1. Add armgr --conf_file <ar-mgr-options-file-name> to the event_plugin_options option in the file:

       # Options string that would be passed to the plugin(s)
       event_plugin_options armgr --conf_file <ar-mgr-options-file-name>

2. Run Subnet Manager with the new options file:

       opensm -F <options-file-name>

See an example of an AR Manager options file with all the default values in "Example of Adaptive Routing Manager Options File" on page 145.

10.8.3.2 Disabling Adaptive Routing

There are two ways to disable Adaptive Routing Manager:

1. By disabling it explicitly in the Adaptive Routing configuration file.
2. By removing the armgr option from the Subnet Manager options file.

Note: The Adaptive Routing mechanism is automatically disabled once the switch receives the setting of the usual linear routing table (LFT). Therefore, no action is required to clear the Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.

10.8.4 Querying Adaptive Routing Tables

When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid; thus, the standard tools that query the LFT (e.g., smpquery, dump_lfts.sh, and others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the smparquery tool that is installed as a part of the Adaptive Routing Manager package. To see its usage de
77. 2 The utilities can be found under /usr/sbin/, and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages.

Below, several usage scenarios for these utilities are presented.

ibsrpdm

ibsrpdm is used for the following tasks:

1. Detecting reachable targets

   a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:

          ibsrpdm

      This command will output information on each SRP Target detected, in human-readable form.

      Sample output:

          IO Unit Info:
              port LID:        0103
              port GID:        fe800000000000000002c90200402bd5
              change ID:       0002
              max controllers: 0x10
          controller[  1]
              GUID:      0002c90200402bd4
              vendor ID: 0002c9
              device ID: 005a44
              IO class : 0100
              ID:        LSI Storage Systems SRP Driver 200400a0b81146a1
              service entries: 1
                  service[  0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1

   b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:

          ibsrpdm -d <umad device>

2. Assistance in creating an SRP connection

   a. To generate output suitable for utilization in the "echo" command of Section 4.1.2.2, add the -c option to ibsrpdm:

          ibsrpdm -c

      Sample output:

          id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=f
78. 2.4 Automatic Discovery and Connection to Targets
Make sure that the ib_srp module is loaded, that the SRP Initiator can reach an SRP Target, and that an SM is running.
To connect to all the existing Targets in the fabric, run srp_daemon -e -o. This utility scans the fabric once, connects to every Target it detects, and then exits.
srp_daemon follows the configuration it finds in /etc/srp_daemon.conf. Thus, it ignores a target that is disallowed in the configuration file.
To connect to all the existing Targets in the fabric and to connect to new targets that join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric).
To execute SRP daemon as a daemon, you may run run_srp_daemon (found under /usr/sbin/), providing it with the same options used for running srp_daemon. Make sure only one instance of run_srp_daemon runs per port.
To execute SRP daemon as a daemon on all the ports, run srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log.
It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availabil
79. 2:c9:03:00:00:10:39
For a ConnectX device with ports configured as Ethernet, comment out the following line:
hardware ethernet 00:02:c9:00:00:bb

A.11 WinPE
Mellanox FlexBoot enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe

Appendix B: SRP Target Driver
The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many IO modes on real or virtual devices in the back end:
1. scst_vdisk: fileio and blockio modes. This allows turning software RAID volumes, LVM volumes, IDE disks, block devices and normal files into SRP LUNs.
2. NULLIO mode allows measuring the performance without sending IOs to real devices.

B.1 Prerequisites and Installation
1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target. On distribution default kernels you can run scst_vdisk blockio mode to obtain good performance.
2. Download and install the SCST driver. The supported version is 1.0.1.1.
a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloa
80. 3 4 11 iscsiboot
echo
Echoes an environment variable.
Example:
iPXE> echo ${root-path}

A.8.3.7 dhcp
A network interface attempts to open the network interface and then tries to connect to and communicate with the DHCP server to obtain the IP address and filepath from which the boot will occur.
Example:
iPXE> dhcp net1

A.8.3.8 help
Displays the available list of commands.

A.8.3.9 exit
Exits from the command line interface.

A.9 Diskless Machines
Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver. This can be achieved by adding the device driver module into the initrd image and loading it.
The initrd image of some Linux distributions, such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux, cannot be edited prior to or during the installation process. If you need to install Linux distributions over FlexBoot, please replace your initrd images with the images found at: www.mellanox.com > Products > Adapter IB/VPI SW > FlexBoot (Download Tab).

A.9.1 Case 1: InfiniBand Ports
The IB driver requires loading the following modules in the specified order (see Section A.9.1.1 for an example):
ib_addr.ko
ib_core.ko
ib_mad.ko
ib_sa.ko
ib_cm.ko
ib_uverbs.ko
ib_ucm.ko
ib_umad.ko
iw
81. 33 00 00 00 01

6.4 Assigning a Virtual Function to a Virtual Machine
This section describes a mechanism for adding an SR-IOV VF to a Virtual Machine.

6.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
Step 1. Run the virt-manager.
Step 2. Double-click on the virtual machine and open its Properties.
Step 3. Go to Details > Add hardware > PCI host device.
[Figure: virt-manager "Add new virtual hardware" dialog, with the hardware type set to Physical Host Device]
Step 4. Choose a Mellanox virtual function according to its PCI device (e.g. 00:03.1).
Step 5. If the Virtual Machine is up, reboot it; otherwise, start it.
Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
lspci | grep Mellanox
00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)

6.5 Uninstalling SR-IOV Driver
To uninstall the SR-IOV driver, perform the following:
1. For Hypervisors, detach all the Virtual Functions (VF) from all the Virtual Machines (VM) or stop the Virtual Mac
82. 4 time=0.079 ms
64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
--- 11.4.3.176 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2

4.3.7 Bonding IPoIB
To create an interface configuration script for the ibX and bondX interfaces, use the standard syntax for your OS. Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver.
Network script files for IPoIB slaves are named after the IPoIB interfaces (e.g. ifcfg-ib0).
The only meaningful bonding policy in IPoIB is High Availability (bonding mode number 1, or active-backup).
The bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence the only supported value is the default: 0 (or "none" in SLES11).
For a persistent bonding IPoIB network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:
In the bonding master configuration file (e.g. ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520
65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (See Section 4.3.2, "IPoIB Mode Setti
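To make the bonding master settings above concrete, here is a minimal sketch of an ifcfg-bond0 file in Red Hat style network-script syntax. Only the file name, the active-backup policy and MTU=65520 come from the text; the IP address, netmask, miimon value and the remaining keys are illustrative assumptions and should be checked against your distribution's bonding documentation.

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (path assumes a Red Hat style layout)
DEVICE=bond0
IPADDR=192.168.1.1          # hypothetical address
NETMASK=255.255.255.0       # hypothetical netmask
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
# active-backup is bonding mode 1, the only meaningful policy for IPoIB;
# miimon=100 is an assumed link-monitoring interval, not taken from the manual
BONDING_OPTS="mode=active-backup miimon=100"
# valid only if all IPoIB slaves operate in Connected mode
MTU=65520
```

If any slave runs in Datagram mode, the MTU line would need a smaller value.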
83. 7.5.2 Runtime Parameters
7.5.3 Various Executable
Chapter 8 Working With
8.1 Port Type
8.2 Auto Sensing
8.2.1 Enabling Auto Sensing
Chapter 9
9.1 General System
9.1.1 PCI Express (PCIe) Capabilities
9.1.2 Memory
9.1.3 Recommended BIOS
9.2 Performance Tuning for Linux
9.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
9.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
9.2.3 Preserving Your Performance Settings after a
9.2.4 Tuning Power Management
9.2.5 Interrupt Re
9.2.6 Tuning for NUMA Architecture
9.2.7 IRQ Affinity
9.2.8 Tuning Multi Threaded IP
84. [Installer output: "Preparing..." progress bars for the packages libmlx4-devel, libmlx5, libmlx5-devel, libmverbs-devel, libmge, libmge-devel]
85. CIe function numa node
Example for a supported system:
cat /sys/class/net/eth3/device/numa_node
0
Example for an unsupported system:
cat /sys/class/net/ib0/device/numa_node
-1

9.2.6.1.1 Improving Application Performance on Remote NUMA Node
Verbs API applications that mostly use polling are impacted when running on the remote NUMA node. libmlx4 has a built-in enhancement that recognizes an application that is pinned to a remote NUMA node and activates a flow that improves the out-of-the-box latency and throughput. However, the NUMA node recognition must be enabled as described in section "Tuning for Intel Sandy Bridge Platform" on page 96.
In systems which do not support SLIT, the following environment variable should be applied:
MLX4_LOCAL_CPUS=0x[bit mask of local NUMA node]
Example for a local NUMA node whose cores are 0-7:
MLX4_LOCAL_CPUS=0xff
This feature can be tuned further by changing the following environment variable:
MLX4_STALL_NUM_LOOP=[integer] (default: 400)
The default value is optimized for most applications. However, several applications might benefit from increasing or decreasing this value.

9.2.6.2 Tuning for AMD Architecture
On AMD architecture there is a difference between a 2-socket system and a 4-socket system.
With a 2-socket system, the PCIe adapter will be connected to socket 0 (nodes 0,1).
With a 4 so
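The bit mask for MLX4_LOCAL_CPUS can be derived from the range of local cores rather than written by hand. The following is a small sketch; the helper name cpu_range_mask is hypothetical (it is not part of MLNX_OFED), and it relies on the shell's 64-bit arithmetic, so it assumes core numbers below 63.

```shell
# cpu_range_mask FIRST LAST
# Print a hexadecimal mask with bits FIRST..LAST (inclusive) set.
# Hypothetical helper, not shipped with MLNX_OFED.
cpu_range_mask() {
    first=$1
    last=$2
    mask=0
    i=$first
    while [ "$i" -le "$last" ]; do
        mask=$(( mask | (1 << i) ))   # set the bit for core i
        i=$(( i + 1 ))
    done
    printf '0x%x\n' "$mask"
}
```

For a local NUMA node with cores 0-7, `MLX4_LOCAL_CPUS=$(cpu_range_mask 0 7)` yields the same value as the manual's example.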
86. D for SM/SA queries
-V(ersion): Optional. Show version info.
-C <ca_name>: Optional. Use the specified channel adapter or router.
-P <ca_port>: Optional. Use the specified port.
-t <timeout_ms>: Optional. Override the default timeout for the solicited MADs (msec).
<op>: Mandatory. Supported operations: nodeinfo <addr>, nodedesc <addr>, portinfo <addr> [<portnum>], switchinfo <addr>, pkeys <addr> [<portnum>], sl2vl <addr> [<portnum>], vlarb <addr> [<portnum>], guids <addr>, mepi <addr> [<portnum>]
<dest dr_path | lid | guid>: Optional. Destination's directed path, LID, or GUID.

Examples:
1. Query PortInfo by LID, with port modifier:
> smpquery portinfo 1 1
# Port info: Lid 1 port 1
Mkey: 0x0000000000000000
GidPrefix: 0xfe80000000000000
Lid: 0x0001
SMLid: 0x0001
CapMask: 0x251086a
[...]
2. Query SwitchInfo by GUID:
> smpquery -G switchinfo 0x000b8cffff004016
# Switch info: Lid 3
LinearFdbCap: 49152
RandomFdbCap: 0
McastFdbCap: 1024
LinearFdbTop: 8
DefPort: 0
DE
87. File
Option File / Description / Values
max_errors, error_window: When the number of send/receive errors or timeouts exceeds max_errors in less than error_window seconds, the CC MGR will abort and will allow OpenSM to proceed. Values: max_errors = 0: zero tolerance, abort configuration on first error; error_window = 0: mechanism disabled, no error checking. Default: 5
cc_statistics_cycle: Enables CC MGR to collect statistics from all nodes every cc_statistics_cycle seconds. When the value is set to 0, no statistics are collected. Default: 0

11 InfiniBand Fabric Diagnostic Utilities
11.1 Overview
The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric.

11.2 Utilities Usage
This section first describes common configuration, interface and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.

11.2.1 Common Configuration, Interface and Addressing
Topology File (Optional)
An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is
88. [Installer output: "Preparing..." progress bars for the packages libibverbs, libibverbs-devel, libibverbs-devel-static, libibverbs-utils, libmverbs, libmlx4]
89. Interleaving: Disabled / NUMA
Channel Interleaving: Enabled
Thermal Mode: Performance
1. Hyper-Threading can increase message rate for multi-process applications by providing more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when Hyper-Threading is enabled.

9.1.3.3 Intel Nehalem/Westmere Processors
The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.
Configuring the Completion Queue Stall Delay
Table 11: Recommended BIOS Settings for Intel Nehalem/Westmere Processors
BIOS Option: Values
General
  Operating Mode / Power profile: Maximum Performance
Processor
  C-States: Disabled
  Turbo mode: Disabled
  Hyper-Threading (1): Disabled. Recommended for latency and message rate sensitive applications.
  CPU frequency select: Max performance
Memory
  Memory speed: Max performance
  Memory channel mode: Independent
  Node Interleaving: Disabled / NUMA
  Channel Interleaving: Enabled
  Thermal Mode: Performance
1. Hyper-Threading can increase message rate for multi-process applications by providing more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when Hyper-Threading is enabled.

9.1.3.4 AMD Processors
The following table displays the recommended BIOS settin
90. L PS MSC
11.16.1 asyncwatchi
11.17 abdump
Appendix A Mellanox FlexBoot
A.1 Overview
A.2 Burning the Expansion ROM Image
A.3 Preparing the DHCP Server in Linux Environment
A.4 Subnet Manager OpenSM
A.5
A.6 BIOS
A.7
A.8 Command Line Interface
A.9 Diskless
A.10 iSCSI Boot
A.11 WinPE
Appendix B SRP Target
B.1 Prerequisites and
B.2 How to Run
B.3 How to Unload/Shutdown
Appendix C mlx4 Module Parameters
C.1 mlx4_core Parameters
C.2 Parameters
91. LEIO mode
Using md0 device and file 10G-file:
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 10G-file" > /proc/scsi_tgt/vdisk/vdisk
e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
2. Run:
For all distributions except SLES 11: > modprobe ib_srpt
For SLES 11: > modprobe -f ib_srpt
For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt into the SLES 11 distribution's kernel:
ib_srpt: no symbol version for scst_unregister
ib_srpt: Unknown symbol scst_unregister
ib_srpt: no symbol version for scst_register
ib_srpt: Unknown symbol scst_register
ib_srpt: no symbol version for scst_unregister_target_template
ib_srpt: Unknown symbol scst_unregister_target_template

B. On Initiator Machines
On Initiator machines, manually perform the following steps:
1. Run modprobe ib_srp
2. Run ibsrpdm -d /dev/infiniband/umadX (to discover a new SRP target)
umad0: port 1 of the first HCA
umad1: port 2 of the first HCA
umad2: port 1 of the second HCA
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered SCSI disks)
Example:
Assume that you use port 1 of the first HCA in the system, i.e. mthca0:
[root@lab104
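Steps 2 and 3 on the initiator can be chained, so that each target description is written to add_target without copying it by hand. The sketch below assumes the one-line-per-target output of ibsrpdm -c described elsewhere in this manual; the function name add_srp_targets is hypothetical, and the sysfs path in the usage note is the one for port 1 of mthca0.

```shell
# add_srp_targets ADD_TARGET_FILE
# Read target descriptions from stdin, one per line, and write each one
# to ADD_TARGET_FILE with a separate write. Hypothetical helper.
add_srp_targets() {
    add_target_file=$1
    while IFS= read -r target_info; do
        [ -n "$target_info" ] || continue        # skip empty lines
        echo "$target_info" > "$add_target_file"  # one write per target
        echo "added: $target_info"
    done
}
```

A possible invocation, run as root: `ibsrpdm -c -d /dev/infiniband/umad0 | add_srp_targets /sys/class/infiniband_srp/srp-mthca0-1/add_target`, followed by fdisk -l to list the new disks.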
92. MAGE.
Mellanox Technologies
350 Oakmead Parkway, Suite 100, Sunnyvale, CA 94085, U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
Beit Mellanox, PO Box 586, Yokneam 20692, Israel
www.mellanox.com
Tel: +972 (0)74 723 7200
Fax: +972 (0)4 959 3245

Copyright 2013. Mellanox Technologies. All Rights Reserved.
Mellanox, Mellanox logo, BridgeX, ConnectX, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, MLNX-OS, PhyX, SwitchX, UFM, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd.
Connect-IB, FabricIT, Mellanox Open Ethernet, Mellanox Virtual Modular Switch, MetroX, MetroDX, ScalableHPC, Unbreakable-Link are trademarks of Mellanox Technologies, Ltd.
All other trademarks are property of their respective owners.
Document Number: 2877

Table of Contents
Table of Contents
List of Figures
List of Tables
Chapter 1 Mellanox OFED
1.1 Introduction to Mellanox OFED
1.2 Mellanox OFED Package
1.2.1 ISO Image
1.2.2 Software Components
1.2.3 Firmware
93. MXM Settings
The default MXM settings are already optimized. To check the available MXM parameters and their default values, run the /opt/mellanox/mxm/bin/mxm_dump_config utility, which is part of the MXM RPM.
MXM parameters can be modified using one of the following methods:
Modifying the default MXM parameter value as part of the mpirun command:
mpirun -x MXM_UD_RX_MAX_BUFFERS=128000 <...>
Modifying the default MXM parameter value from the shell:
export MXM_UD_RX_MAX_BUFFERS=128000
mpirun <...>

7.3.4 Configuring Multi-Rail Support
Multi-Rail support enables the user to use more than one of the active ports on the card, making better use of the resources. It provides a combined throughput among the used ports.
To configure dual rail support:
Specify the list of ports you would like to use to enable multi-rail support:
-x [...]PORTS=cardName:portNum
mpirun -x [...]PORTS=mlx4_0:1,mlx4_0:2 <...>

7.3.5 Configuring MXM over the Ethernet Fabric
To configure MXM over the Ethernet fabric:
1. Make sure the Ethernet port is active:
ibv_devinfo
ibv_devinfo displays the list of cards and ports in the system. Please make sure in the ibv_devinfo output that the desired port has Ethernet in the link_layer field and that its state is PORT_ACTIVE.
2. Specify the ports you would like to use, if there is a non Ethernet
94. Mellanox Technologies
Mellanox OFED for Linux User Manual
Rev 2.0-2.0.5
Last Updated: June 18, 2013
www.mellanox.com

NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DA
95. [Installer output: "Preparing..." progress bars for the packages opensm (including devel and static), compat-dapl (including devel and static), perftest, mft, mstflint, srptools, rds-tools, rds-devel, ibutils]
96. NCE 0
DefMcastNotPrimPort: 0
LifeTime: 18
StateChange: 0
[...]
PartEnforceCap: 32
[...]
3. Query NodeInfo by direct route:
> smpquery -D nodeinfo 0
# Node info: DR path slid 65535; dlid 65535; 0
BaseVers: 1
ClassVers: 1
NodeType: Channel Adapter
NumPorts: 2
SystemGuid: 0x0002c9030000103b
Guid: 0x0002c90300001038
PortGuid: 0x0002c90300001039
PartCap: 128
DevId: 0x634a
Revision: 0x000000a0
LocalPort: 1
VendorId: 0x0002c9

11.13 perfquery
Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them, or simply reset them.
Synopsis:
perfquery [-h] [...] [-C <ca_name>] [-P <ca_port>] [...] [-t <timeout_ms>] [-V] [<lid|guid> [...] [reset_mask]]
Output Files
Table 27 lists the various flags of the command.
Table 27: perfquery Flags and Options
Flag / Optional or Mandatory / Default If Not Specified / Description
-h(help): Optional. Print the help menu.
97. Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.
Synopsis:
smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]
Output Files
Table 26 lists the various flags of the command.
Table 26: smpquery Flags and Options
Flag / Optional or Mandatory / Default If Not Specified / Description
-h(help): Optional. Print the help menu.
-d(ebug): Optional. Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-e(rr_show): Optional. Show send and receive errors (timeouts and others).
-v(erbose): Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-D(irect): Optional. Use directed path address arguments. The path is a comma separated list of out ports. Examples: '0' (self port), '0,1,2,1,4' (out via port 1, then 2, ...).
-G(uid): Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'.
-s <smlid>: Optional. Use <smlid> as the target LI
98. SA queries. Without -s, stress testing is not performed.
-M, --Multicast_Mode
This option specifies the length of the Multicast test:
OPT: Description
-M1: Short Multicast Flow (default), single mode
-M2: Short Multicast Flow, multiple mode
-M3: Long Multicast Flow, single mode
-M4: Long Multicast Flow, multiple mode
Single mode: Osmtest is tested alone, with no other apps that interact with OpenSM MC.
Multiple mode: Could be run with other apps using MC with OpenSM.
Without -M, default flow testing is performed.
-t, --timeout
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
-l, --log_file
This option defines the log to be the given file. By default the log goes to /var/log/osm.log. For the log to go to standard output use -f stdout.
-v, --verbose
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.
-V
This option sets the maximum verbosity level and forces log flushing. -V is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.
This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level
99. a FECN. Values: 0 - 0xffff. Default:
packet_size: Any packet less than this size (bytes) will not be marked with FECN. Values: 0 - 0x3fc0. Default: 0x200

Table 17: Congestion Control Manager CA Options File
Option File / Description / Values
port_control: Specifies the Congestion Control attribute for this port. Values: 0 = QP based congestion control, 1 = SL/Port based congestion control. Default: 0
ca_control_map: An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified. Values: 0xffff
ccti_increase: Sets the CC Table Index (CCTI) increase. Default: 1
trigger_threshold: Sets the trigger threshold. Default: 2
ccti_min: Sets the CC Table Index (CCTI) minimum. Default: 0
cct: Sets all the CC table entries to a specified value. The first entry will remain 0, whereas the last value will be set to the rest of the table. Values: <comma-separated list>. Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
ccti_timer: Sets the given ccti timer for all SLs. Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.

Table 18: Congestion Control Manager CC MGR Options
100. a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.
For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).
To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option '-t <topology file name>'
2. Define the environment variable IBDIAG_TOPO_FILE
To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option '-s <local system name>'
2. Define the environment variable IBDIAG_SYS_NAME

11.2.2 InfiniBand Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option '-p <local port number>' (see below)
2. Define the environment variable IBDIAG_PORT_NUM
In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as w
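As an illustration of the environment-variable route, the three variables named above could be set once per shell session instead of repeating -t, -s and -p on every command line. This is only a sketch: the topology file path is hypothetical, and using the node name as the system name assumes the topology file refers to this host by that name.

```shell
# Hypothetical values; adjust them to your fabric.
export IBDIAG_TOPO_FILE="$HOME/fabric.topo"   # hypothetical topology file path
export IBDIAG_SYS_NAME="$(uname -n)"          # assumes the topology file uses this node name
export IBDIAG_PORT_NUM=1                      # local HCA port used to send MADs
```

Options given on the command line remain available for one-off runs against a different port or topology file.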
101. ad, librdmacm, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, MVAPICH, Open MPI, MPI tests, MPI selector, dynamic libraries
--basic: Install all kernel modules, libibverbs, libibumad, mft, mstflint, dynamic libraries
--msm: Install all kernel modules, libibverbs, libibumad, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, dynamic libraries. NOTE: With the --msm flag, the OpenSM daemon is configured to run upon boot.
--vma: Install packages required by VMA to support both IB and Ethernet
--vma-ib: Install packages required by VMA to work over InfiniBand
--vma-eth: Install packages required by VMA to work over Ethernet
-v|-vv|-vvv: Set verbosity level
-q: Set quiet mode; no messages will be printed
--umad-dev-rw: Grant non-root users read/write permission for umad devices instead of default
--hugepages-overcommit: Set 80% of MAX_MEMORY as overcommit for huge page allocation

2.3.2.1 mlnxofedinstall Return Codes
Table 2 lists the mlnxofedinstall script return codes and their meanings.
Table 2: mlnxofedinstall Return Codes
Return Code: Meaning
0: The installation ended successfully
1: The installation failed
2: No firmware was found for the adapter device
22: Invalid parameter
28: Not enough free space
171: Not applicable to this system configuration. This can occur when the require
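When the installer is driven from a script, the return codes in Table 2 can be mapped back to readable messages. The sketch below uses only the codes listed in the table; the function name describe_install_rc is hypothetical and is not part of the installer.

```shell
# describe_install_rc CODE
# Print the Table 2 meaning of an mlnxofedinstall return code.
# Hypothetical helper for wrapper scripts.
describe_install_rc() {
    case "$1" in
        0)   echo "installation ended successfully" ;;
        1)   echo "installation failed" ;;
        2)   echo "no firmware was found for the adapter device" ;;
        22)  echo "invalid parameter" ;;
        28)  echo "not enough free space" ;;
        171) echo "not applicable to this system configuration" ;;
        *)   echo "unrecognized return code $1" ;;
    esac
}
```

A wrapper might run `./mlnxofedinstall --basic; describe_install_rc $?` and log the resulting message.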
102. affinity of a single interrupt vector:
> echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity
Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

IRQ Affinity Configuration
It is recommended to set each IRQ to a different core.
For Sandy Bridge or AMD systems, set the IRQ affinity to the adapter's NUMA node:
For optimizing single-port traffic, run:
set_irq_affinity_bynode.sh <numa node> <interface>
For optimizing dual-port traffic, run:
set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
To show the current affinity settings, run:
show_irq_affinity.sh <interface>

9.2.7.2 Auto Tuning Utility
MLNX_OFED 2.0.x introduces a new affinity tool called mlnx_affinity. This tool can automatically adjust your affinity settings for each network interface according to the system architecture.
Usage:
Start: mlnx_affinity start
Stop: mlnx_affinity stop
Restart: mlnx_affinity restart
mlnx_affinity can also be started by driver load/unload.
To enable mlnx_affinity by default:
Add the line below to the /etc/infiniband/openib.conf file:
RUN_AFFINITY_TUNER=yes

9.2.7.3 Tuning for Multiple Adapters
When optimizing the system performance for using more than one adapter, it is recommended to separate the adapter's core utilization so there will be no in
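Because bit i of the mask selects core i, the value written to smp_affinity for a single core is simply 1 shifted left by the core number. A minimal sketch follows; the helper name core_mask is hypothetical, and it assumes core numbers below 32, since larger systems express the mask as comma-separated 32-bit groups.

```shell
# core_mask CORE
# Print the hexadecimal smp_affinity mask that selects only CORE.
# Hypothetical helper; assumes CORE is below 32.
core_mask() {
    printf '%x\n' $(( 1 << $1 ))
}
```

For example, `echo "$(core_mask 4)" > /proc/irq/<irq vector>/smp_affinity` would pin that vector to core 4 (mask 10).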
103. ailed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, 5 and 0) can be ignored for path segments on that ring.
One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics.
Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn that would otherwise be "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it. In the above example, suppose switch T has fail
104. an Megatrends, Inc. (end of BIOS setup screen capture)

Step 2. Enable Intel Virtualization Technology.
[Figure: BIOS Setup Utility screen with Intel Virtualization Technology set to Enabled]

Step 3. Install the hypervisor that supports SR-IOV.

Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel. For example, to Intel systems, add:

    default=0
    timeout=5
    splashimage=(hd0,0)/grub/splash.xpm.gz
    hiddenmenu
    title Red Hat Enterprise Linux Server (2.6.32-36.x86-64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on (1)
        initrd /initrd-2.6.32-36.x86-64.img

(1) Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.

Step 5. Install the MLNX_OFED driver for Linux that supports SR-IOV.

Step 6. Verify the HCA is configured to support SR-IOV:

    [root@selene ~]# mstflint -dev <PCI Device> dc

Verify that the following fields appear in the [HCA] section (1) (2):

    [HCA]
    num_pfs = 1
    total_vfs = 5
    sriov_en = true

(1) If the fields in the example above do not appear in the [HCA] section, SR-IOV is not supported in the used INI.
(2) If SR-IOV is supported but not enabled, it is sufficient to set "sriov_en = true" in the INI.
105. an be changed on the fly.
Values for both options: [0-0xffff].
MAX_ERRORS = 0: zero tolerance, abort configuration on first error. Default: 10.
ERROR_WINDOW = 0: mechanism disabled, no error checking. Default: 5.

Option: LOG_FILE <full path>
Description: AR Manager log file. This option can be changed on the fly.
Values: Default: /var/log/armgr.log

Option: LOG_SIZE <size in MB>
Description: This option defines the maximal AR Manager log file size in MB. The log file will be truncated and restarted upon reaching this limit. This option cannot be changed on the fly.
Values: 0: unlimited log file size. Default: 5.

10.8.5.1.1 Per-switch AR Options

A user can provide per-switch configuration options with the following syntax:

    SWITCH <GUID> {
        <switch option 1>;
        <switch option 2>;
    }

The following are the per-switch options:

Table 14 - Adaptive Routing Manager Per-Switch Options File

Option: ENABLE <true|false>
Description: Allows you to enable/disable the AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. This option can be changed on the fly.
Values: Default: true

Option: AGEING_TIME <usec>
Description: Applicable to bounded AR mode only. Specifies how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port
Values: Default: 30
106. at id_rsa.pub >> authorized_keys2

Step 5. Test.

    host1$ ssh host2 uname
    Linux

7.2.3 MPI Selector - Which MPI Runs

Mellanox OFED contains a simple mechanism for system administrators and end users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details.

Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new shell (e.g., log out and log in again). Other packages (such as environment modules) provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality.

The MPI selector functionality can be invoked in one of two ways:

1. The mpi-selector-menu command.
This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also sh
107. ata to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
• guid2lid - stores the LID range assigned to each GUID

10.2.3 Signaling

When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.
Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.

10.2.4 Running opensm

The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.

To run opensm in the default mode, simply enter:

    host1# opensm

Note that opensm needs to be run on at least one machine in an IB subnet.

By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

If a fatal error occurs, opensm exits.

10.2.4.1 Running OpenSM As Daemon

OpenSM can also run as daemon. To run OpenSM
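The "SUBNET UP" check described for the OpenSM log files can be scripted. The following is a minimal sketch under stated assumptions: subnet_is_up is a hypothetical helper name, and the default path is the /var/log/opensm.log file named in the text.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the given OpenSM log
# contains the "SUBNET UP" message, fail otherwise.
# Assumption: the log is a plain-text file readable by the caller.
subnet_is_up() {
    grep -q 'SUBNET UP' "${1:-/var/log/opensm.log}" 2>/dev/null
}

# Possible use in a monitoring script:
#   subnet_is_up || echo "opensm has not reported SUBNET UP yet"
```

A missing or unreadable log file is treated the same as a log without the message, which errs on the side of reporting a problem.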
108. aw package (un-backported) source files are placed under /usr/src/ofa_kernel-<ver>/

• The script openibd is installed under /etc/init.d/. This script can be used to load and unload the software stack.
• The script connectx_port_config is installed under /sbin. This script can be used to configure the ports of ConnectX network adapter cards to Ethernet and/or InfiniBand. For details on this script, please see Section 8.1, "Port Type Management".
• The directory /etc/infiniband is created with the files info, openib.conf and connectx.conf. The info script can be used to retrieve Mellanox OFED installation information. The openib.conf file contains the list of modules that are loaded when the openibd script is used. The connectx.conf file saves the ConnectX adapter card's ports configuration to Ethernet and/or InfiniBand. This file is used at driver start/restart (/etc/init.d/openibd start).
• The file 90-ib.rules is installed under /etc/udev/rules.d/
• If OpenSM is installed, the daemon opensmd is installed under /etc/init.d/ and opensm.conf is installed under /etc.
• If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine.
• The installation process unlimits the amount of memory that can be pinned by a user space application. See Step 5.
• Man pages will be installed under
109. ax_backlog=250000

• Increase the TCP maximum and default buffer sizes using setsockopt():
    sysctl -w net.core.rmem_max=4194304
    sysctl -w net.core.wmem_max=4194304
    sysctl -w net.core.rmem_default=4194304
    sysctl -w net.core.wmem_default=4194304
    sysctl -w net.core.optmem_max=4194304
• Increase memory thresholds to prevent packet dropping:
    sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
• Enable low latency mode for TCP:
    sysctl -w net.ipv4.tcp_low_latency=1

9.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance

The following changes are recommended for improving IPv6 traffic performance.
• Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0
• Enable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=1

9.2.3 Preserving Your Performance Settings after a Reboot

To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:

    <sysctl name1>=<value1>
    <sysctl name2>=<value2>
    <sysctl name3>=<value3>
    <sysctl name4>=<value4>

For example, "Tuning the Network Adapter for Improved IPv4 Traffic Performance" on page 93 lists the following setting to disable the TCP timestamps option:

    sysctl -w net.ipv4.tcp_timestamps=0

In o
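Turning the "sysctl -w" commands listed in this excerpt into /etc/sysctl.conf entries is a mechanical rewrite: drop the command prefix and keep the name=value pair. The following is a minimal sketch; sysctl_to_conf is a hypothetical filter name, and it assumes each command sits on its own line in the "sysctl -w <name>=<value>" form shown in the text.

```shell
#!/bin/sh
# Hypothetical filter: read "sysctl -w <name>=<value>" commands on stdin and
# print the matching "<name>=<value>" lines suitable for /etc/sysctl.conf.
# Lines that are not "sysctl -w" commands are dropped; double quotes around
# multi-value settings such as tcp_rmem are removed.
sysctl_to_conf() {
    sed -n 's/^[[:space:]]*sysctl[[:space:]]\{1,\}-w[[:space:]]\{1,\}//p' | tr -d '"'
}

# Possible use (review the output before appending it as root):
#   sysctl_to_conf < my_tuning_commands.txt
```

The file name my_tuning_commands.txt is only a placeholder for wherever the commands were saved.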
110. bnet VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such configuration would suit a VL that needs low latency and uses small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.

10.6.8 Deployment Example

Figure 5 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

Figure 5: Example QoS Deployment on InfiniBand Subnet
[Figure labels: Traffic class SDP, service level 2, policy min 20% BW; Traffic class Partition A, service level 0, policy min 40%; Traffic class SRP, service level 1, policy min 30% BW; Traffic class IPoIB, service level 3; App A Server; App B Server; Virtual Server; Service Access Points]

10.7 QoS Configuration Examples

The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.

10.7.1 Typical HPC Example: MPI and Lustre

Assignment of QoS Levels
• MPI
  • Separate from I/O load
  • Min BW of 70%
• Storage Control (Lustre MDS)
  • Low latency
• Storage Data (Lustre OST)
  • Min BW 30%

Administration
• MPI is assigned an SL
111. c elements and enforces the provided policy on client requests. The overall flow for such requests is as follows:
• The request is matched against the defined matching rules such that the QoS Level definition is found.
• Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.

Figure 4: QoS Manager
[Figure labels: Administrator; QoS Policy Config File; OSM QoS Manager; InfiniBand subnet with OFED 1.3 based nodes]

There are two ways to define QoS policy:
• Advanced - the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to enforce various QoS constraints on the requested PR/MPR.
• Simple - the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs.

10.6.2 Advanced QoS Policy File

The QoS policy file has the following sections:

I) Port Groups (denoted by port-groups)
This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to the partition with a given PKey belong to this port group
• Partition name, which means tha
112. ce

4.4.1 Quality of Service Overview

Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.

Figure 2: I/O Consolidation Over InfiniBand
[Figure labels: Servers; IB-Ethernet Gateway; IB-Fibre Channel Gateway; Block Storage]

QoS over Mellanox OFED for Linux is discussed in Chapter 10, "OpenSM - Subnet Manager".

The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources.

The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
• Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
• Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served
• Packets carry class of service marking in the range 0 to 15 in their header SL field
• Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table VL = SL-to-VL-MAP(in-port, out-port, SL)
• The Subnet Administrator controls the parame
113. ch rules

10.6.6 Simple QoS Policy - Details and Examples

Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.

Match rules include:
• Default match rule that is applied to PR/MPR queries that didn't match any of the other match rules
• SDP
• SDP application with a specific target TCP/IP port range
• SRP with a specific target IB port GUID
• RDS
• IPoIB with a default PKey
• IPoIB with a specific PKey
• Any ULP/application with a specific Service ID in the PR/MPR query
• Any ULP/application with a specific PKey in the PR/MPR query
• Any ULP/application with a specific target IB port GUID in the PR/MPR query

Since any section of the policy file is optional, as long as basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file.

The shortest policy file in this case would be as follows:

    qos-ulps
        default : 0 # default SL
    end-qos-ulps

It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all. Below is an example of simple QoS policy with all the possible keywords:
114. ch the routing tables

10.5.4.1 Routing between non-CN Nodes

The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.

[Figure: fat-tree scheme with Spine1, Spine2 and Spine3, the switches below them, and the non-CN nodes; bottom label: Going down to compute nodes]

1. Ports that are connected to the same remote switch are referenced as 'port group'.
2. List of compute nodes (CNs) can be specified by the '-u' or '--cn_guid_file' OpenSM options.

To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.

Using max
115. cket system the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).

9.2.6.3 Recognizing NUMA Node Cores

To recognize NUMA node cores, run the following command:

    cat /sys/devices/system/node/node[X]/cpulist | cpumap

Example:

    # cat /sys/devices/system/node/node1/cpulist
    1,3,5,7,9,11,13,15
    # cat /sys/devices/system/node/node1/cpumap
    0000aaaa

9.2.6.3.1 Running an Application on Certain NUMA Node

In order to run an application on a certain NUMA node, the process affinity should be set either in the command line or with an external tool. For example, if the adapter's NUMA node is 1 and NUMA 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.

To run an application, run the following commands:

    taskset -c 8-15 ib_write_bw
or:
    taskset 0xff00 ib_write_bw -a

9.2.7 IRQ Affinity

The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.

The following command turns off the IRQ balancer:

    /etc/init.d/irqbalance stop

The following command assigns the
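A cpumap value such as 0000aaaa, or a taskset mask such as 0xff00, is a hexadecimal bit mask in which bit i stands for core i. The following is a minimal sketch of decoding such a mask into a core list; mask_to_cores is a hypothetical helper name, and it assumes a single hexadecimal word without the comma separators that larger systems print in cpumap.

```shell
#!/bin/sh
# Hypothetical helper: list the core numbers whose bits are set in a
# hexadecimal mask. An optional leading "0x" is accepted.
# Assumption: the mask is one hexadecimal word that fits in shell arithmetic.
mask_to_cores() (
    hex=${1#0x}
    mask=$(( 0x$hex ))
    core=0
    cores=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            cores="${cores:+$cores }$core"
        fi
        mask=$(( mask >> 1 ))
        core=$(( core + 1 ))
    done
    printf '%s\n' "$cores"
)

# mask_to_cores 0000aaaa   lists the odd cores from 1 to 15
# mask_to_cores 0xff00     lists cores 8 to 15, matching "taskset -c 8-15"
```

Decoding the mask before passing it to taskset is a cheap way to confirm that it really selects the cores of the adapter's NUMA node.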
116. cm.ko
• rdma_cm.ko
• rdma_ucm.ko
• mlx4_core.ko
• mlx4_ib.ko
• ib_mthca.ko
• ipoib_helper.ko - this module is not required for all OS kernels. Please check the release notes.
• ib_ipoib.ko

A.9.1.1 Example: Adding an IB Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine.
3. An initrd file.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.

Adding the IB Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it.
    host1$ mkdir /tmp/initrd_ib
    host1$ cd /tmp/initrd_ib
Step 3. Normally, the initrd image is zipped. Extract it using the following command:
    host1$ gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_ib.
Step 4. Create a directory for t
117. command above as follows to disable LRO:

    /sbin/insmod /lib/modules/ib/ib_ipoib.ko lro=0

Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example:

    /sbin/dhclient -cf /sbin/dhclient.conf ib1

Save the init file.

Close initrd:

    host1$ cd /tmp/initrd_ib
    host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
    host1$ gzip /tmp/new_initrd_ib.img

At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_initrd_ib.img.gz. Copy it to the original initrd location and rename it properly.

A.9.2 Case II: Ethernet Ports

The Ethernet driver requires loading the following modules in the specified order (see the example below):
• mlx4_core.ko
• mlx4_en.ko

A.9.2.1 Example: Adding an Ethernet Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1 on page 46, and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for
118. cted device.

This utility also has a non-interactive mode:

    /sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]

8.2 Auto Sensing

Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet).

For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically configure the first port as InfiniBand and the second as Ethernet.

8.2.1 Enabling Auto Sensing

Upon driver start up:
1. Sense the adapter card's port type:
   If a valid cable or module is connected (SFP, or SFP with EEPROM in the cable/module):
   • Set the port type to the sensed link type (IB/Ethernet)
   Otherwise:
   • Set the port type as default (Ethernet)

During driver run time:
• Sense a link every 3 seconds if no link is sensed/detected
• If sensed, set the port type as sensed

9 Performance

9.1 General System Configurations

The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.

9.1.1 PCI Express (PCIe) Capabilities

Table 9 - Recommended PCIe Config
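The start-up rule for Auto Sensing amounts to a two-way decision: use the sensed link type when a valid cable or module reports one, otherwise fall back to Ethernet. The following is a purely illustrative model of that rule; the real logic lives in the driver, and the function name choose_port_type and the "ib"/"eth" labels are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical model of the start-up decision, not driver code.
# $1 is the sensed link type ("ib" or "eth"); it is empty or anything else
# when no valid cable or module was detected.
choose_port_type() {
    case "$1" in
        ib|eth) printf '%s\n' "$1" ;;   # use the sensed link type
        *)      printf 'eth\n' ;;       # default port type is Ethernet
    esac
}

# choose_port_type ib    prints the sensed type
# choose_port_type ""    prints the Ethernet default
```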
119. cted route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted as it may change.

ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in the -p option. Otherwise, an error is reported.

When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error.

Moreover, the tool allows omitting the source node in LID-route addressing, in which case the local port on the machine running the tool is assumed to be the source.

Synopsis

    ibdiagpath {-n <[src-name,]dst-name>|-l <[src-li
120. d,]dst-lid>|-d <p1,p2,p3,...>} [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-pm] [-P <<PM counter>=<Trash Limit>>]

Options

-n <[src-name,]dst-name>
    Names of the source and destination ports (as defined in the topology file; source may be omitted -> local port is assumed to be the source)
-l <[src-lid,]dst-lid>
    Source and destination LIDs (source may be omitted -> the local port is assumed to be the source)
-d <p1,p2,p3,...>
    Directed route from the local node (which is the source) and the destination node
-c <count>
    The minimal number of packets to be sent across each link (default = 100)
-v
    Enable verbose mode
-t <topo-file>
    Specifies the topology file name
-s <sys-name>
    Specifies the local system name. Meaningful only if a topology file is specified
-i <dev-index>
    Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
-p <port-num>
    Specifies the local device's port number used to connect to the IB fabric
-o <out-dir>
    Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>
    Specifies the expected link width
-ls <2.5|5|10>
    Specifies the e
121. d hardware is not present on the system
172   Prerequisites are not met. For example, missing the required software installed or the hardware is not configured correctly.
173   Failed to start the mst driver

2.3.3 Installation Procedure

Step 1. Log in to the installation machine as root.
Step 2. Mount the ISO image on your machine:
    host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt
Step 3. Run the installation script:

    [root@swl014 MLNX_OFED_LINUX-2.0-2.0.0-rhel6.3-x86_64]# ./mlnxofedinstall
    This program will install the MLNX_OFED_LINUX package on your machine.
    Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to continue?[y/N]:y
    Uninstalling the previous version of MLNX_OFED_LINUX
    Starting MLNX_OFED_LINUX-2.0-2.0.0 installation ...
    Installing mlnx-ofa_kernel RPM
    Preparing...        ########################################
    mlnx-ofa_kernel     ########################################
    Installing kmod-mlnx-ofa_
122. define a mapping between ports and dimension order, for controlling Dimension Order Routing (DOR).

--honor_guid2lid, -x
    This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE.
--log_file, -f <log file name>
    This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output, use -f stdout.
--log_limit, -L <size in MB>
    This option defines the maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.
--erase_log_file, -e
    This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.
--Pconfig, -P <partition config file>
    This option defines the optional partition configuration file. The default name is /etc/opensm/partitions.conf.
--no_part_enforce, -N
    This option disables partition enforcement on switch external ports.
(option name not recoverable from the source)
    This option enables Adaptive Routing Manager in OpenSM.
--ar_config_file <path to file>
    This option specifies the optional Adaptive Routing config file. The default name is /etc/opensm/osm_ar.conf.
--qos, -Q
    This option enables QoS setup.
--qos_policy_file, -Y <QoS policy file>
    This option defines the optional QoS policy file. The defau
123. determine how they are reported.

To enable time stamping for a net device:

An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:

Send side time stamping:
• Enabled by ifreq.hwtstamp_config.tx_type, where the possible values for hwtstamp_config->tx_type are:

    enum hwtstamp_tx_types {
        /*
         * No outgoing packet will need hardware time stamping;
         * should a packet arrive which asks for it, no hardware
         * time stamping will be done.
         */
        HWTSTAMP_TX_OFF,

        /*
         * Enables hardware time stamping for outgoing packets;
         * the sender of the packet decides which are to be
         * time stamped by setting %SOF_TIMESTAMPING_TX_SOFTWARE
         * before sending the packet.
         */
        HWTSTAMP_TX_ON,

        /*
         * Enables time stamping for outgoing packets just as
         * HWTSTAMP_TX_ON does, but also enables time stamp insertion
         * directly into Sync packets. In this case, transmitted Sync
         * packets will not receive a time stamp via the socket error
         * queue.
         */
        HWTSTAMP_TX_ONESTEP_SYNC,
    };

Note: for send side time stamping, currently only HWTSTAMP_TX_OFF and HWTSTAMP_TX_ON are supported.

Receive side time stamping:
• Enabled by ifreq.hwtstamp_config.rx_filter, where the possible values for hwtstamp_config->rx_filter are:

    enum hwtstamp_rx_filters {
        /* time stamp no incoming packet
124. ding invalid entries

-v(erbose)      Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-V(ersion)      Optional. Show version info
-a(ll)          Optional. Show all LIDs in range, including invalid entries
-n(o_dests)     Optional. Do not try to resolve destinations
-D(irect)       Optional. Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port), '0,1,2,1,4' (out via port 1, then 2, ...)
-G(uid)         Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-M(ulticast)    Optional. Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
-s <smlid>      Optional. Use <smlid> as the target LID for SM/SA queries
-C <ca_name>    Optional. Use the specified channel adapter or router
-P <ca_port>    Optional. Use the specified port
-t <timeout_ms> Optional. Override the default timeout for the solicited MADs [msec]
<dest dr_path | lid | guid>   Optional. Destination's directed path, LID, or GUID
<startlid>      Optional. Starting LID in an MLID range
<endlid>        Optional. Ending LID in an MLID range

Examples:

1. Dump all LIDs with valid out ports of the switch with LID 2:

    > ibroute 2
    Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
    Lid Out D
125. ds.html
b. Untar scst-1.0.1.1:
    $ tar zxvf scst-1.0.1.1.tar.gz
    $ cd scst-1.0.1.1
c. Install scst-1.0.1.1 as follows:
    $ make && make install

B.2 How to Run

A. On an SRP Target machine:

1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).

Note: Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0 (it is not required to have the lun numbers in ascending order, except that the first lun must always be 0).

Note: Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the srpt module but does not load scst nor its dev_handlers.

Note: The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.

Example 1: Working with VDISK BLOCKIO mode
(Using the md0 device, sda, and cciss/c1d0)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
f. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
g. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
h. echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices

Example 2: Working with scst_vdisk FI
126. e 13 - Adaptive Routing Manager Options File

Option: ENABLE <true|false>
Description: Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as a device that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was updated to support AR, the AR Manager will need to be restarted (by restarting Subnet Manager) to allow it to configure AR on this switch. This option can be changed on the fly.
Values: Default: true

Option: AR_MODE <bounded|free>
Description: Adaptive Routing Mode:
  free: no constraints on output port selection.
  bounded: the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
This option can be changed on the fly.
Values: Default: bounded

Option: AGEING_TIME <usec>
Description: Applicable to bounded AR mode only. Specifies how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value). This option can be changed on the fly.
Values: Default: 30

Option: MAX_ERRORS <N>, ERROR_WINDOW <N>
Description: When the number of send/receive errors or timeouts exceeds MAX_ERRORS in less than ERROR_WINDOW seconds, the AR Manager will abort, returning control back to the Subnet Manager. This option c
127. e Mellanox Firmware Tools: See under the docs/ folder of the installed package.

Support and Updates Webpage

Please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.

1 Mellanox OFED Overview

1.1 Introduction to Mellanox OFED

Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack based on the OpenFabrics Enterprise Distribution (OFED) Linux stack, and operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10Gb/s and 40Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers.

All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions.

Mellanox OFED is certified with the following products:
• Mellanox Messaging Accelerator (VMA) software: Multicast socket acceleration library that performs OS bypass for standard socket-based applications.
• Mellanox Unified Fabric Manager (UFM) software: Powerful platform for managing demanding scale-out computing fabric environments, built on top of the OpenSM industry standard routing engine.
• Fabric Collective Accelerator (FCA): FCA is a Mellanox MPI-integrated s
128. e Notes file.
Installer Privileges: The installation requires administrator privileges on the target machine.

2.2 Downloading Mellanox OFED
Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX entries in the display. The following example shows a system with an installed Mellanox HCA:
host1# lspci -v | grep Mellanox
02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
Step 2. Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 3. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page:
host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

2.3 Installing OFED
Mellanox OFED includes an installation script called mlnxofedinstall, which performs the following:
- Discovers the currently installed kernel
- Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
- Identifies the currently
129. e interface ib0 send dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39
Step 10. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded.
Note: The order of the following commands (for loading modules) is critical.
echo "loading ipv6"
/sbin/insmod /lib/modules/ipv6.ko
echo "loading IB driver"
/sbin/insmod /lib/modules/ib/ib_addr.ko
/sbin/insmod /lib/modules/ib/ib_core.ko
/sbin/insmod /lib/modules/ib/ib_mad.ko
/sbin/insmod /lib/modules/ib/ib_sa.ko
/sbin/insmod /lib/modules/ib/ib_cm.ko
/sbin/insmod /lib/modules/ib/ib_uverbs.ko
/sbin/insmod /lib/modules/ib/ib_ucm.ko
/sbin/insmod /lib/modules/ib/ib_umad.ko
/sbin/insmod /lib/modules/ib/iw_cm.ko
/sbin/insmod /lib/modules/ib/rdma_cm.ko
/sbin/insmod /lib/modules/ib/rdma_ucm.ko
/sbin/insmod /lib/modules/ib/mlx4_core.ko
/sbin/insmod /lib/modules/ib/mlx4_ib.ko
/sbin/insmod /lib/modules/ib/ib_mthca.ko
Note: The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release notes.
/sbin/insmod /lib/modules/ib/ipoib_helper.ko
/sbin/insmod /lib/modules/ib/ib_ipoib.ko
In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last
130. e management tools for a single InfiniBand node. MFT can be used for:
- Generating a standard or customized Mellanox firmware image
- Querying for firmware information
- Burning a firmware image to a single InfiniBand node
MFT includes the following tools:
- mlxburn: provides the following functions:
  - Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format
  - Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device
  - Querying the firmware version loaded on an HCA board
  - Displaying the VPD (Vital Product Data) of an HCA board
- flint: burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file.
- spark: burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
- Debug utilities: a set of debug utilities (e.g., itrace, mstdump, isw, and
Footnote 1: OpenSM is disabled by default. See Chapter 10, OpenSM - Subnet Manager, for details on enabling it.
131. e second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. Note that when a fatal and non-recoverable error occurs, opensm will exit. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

10.2.1 opensm Syntax
opensm [OPTIONS]
where OPTIONS are:

--version
Prints the OpenSM version and exits.

--config, -F <file-name>
The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists).

--create-config, -c <file-name>
OpenSM will dump its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template.

--guid, -g <GUID in hex>
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.

--lmc, -l <LMC>
This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between port
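The 2^LMC relationship described for the LMC option can be made concrete with a short Python sketch. It is an illustration only: the function name is hypothetical, and the assumption that a port's LIDs are consecutive starting at a base LID whose low LMC bits are zero follows general InfiniBand convention rather than anything stated in this manual.

```python
def lids_for_port(base_lid, lmc):
    """List the LIDs a port answers to for a given base LID and LMC.

    Assumptions (not from this manual): LIDs are consecutive and the
    base LID is aligned to 2**lmc.
    """
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    count = 1 << lmc  # 2^LMC LIDs per port
    if base_lid % count != 0:
        raise ValueError("base LID must be aligned to 2^LMC")
    return list(range(base_lid, base_lid + count))


if __name__ == "__main__":
    for lmc in (0, 2, 7):
        print(lmc, len(lids_for_port(128, lmc)))
```

An LMC of 0 yields a single LID per port, while the maximum of 7 yields 128, which is why larger values only make sense when the topology really offers that many distinct paths.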
132. e800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
b. To establish a connection with an SRP Target using the output from the 'ibsrpdm -c' example above, execute the following command:
echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the 'fdisk -l' command.

srp_daemon
The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:
- Establish an SRP connection by itself (without the need to issue the echo command described in Section 4.1.2.2)
- Continue running in background, detecting new targets and establishing SRP connections with them (daemon mode)
- Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N> where <N> is a digit
- Enable High Availability operation (together with Device-Mapper Multipath)
- Have a configuration file that determines the targets to connect to
1. srp_daemon commands equivalent to ibsrpdm:
"srp_daemon -a -o" is equivalent to "ibsrpdm"
"srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
These srp_daemon commands can behave dif
133. eboot your machine.
Step 5. The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI:
* soft memlock unlimited
* hard memlock unlimited
These settings unlimit the amount of memory that can be pinned by a user space application. If desired, tune the value unlimited to a specific amount of RAM.
Step 6. For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 10, OpenSM - Subnet Manager.
Step 7. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as:
- HCA firmware version
- Kernel architecture
- Driver version
- Number of active HCA ports along with their states
- Node GUID
Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.
hca_self_test.ofed
Performing Adapter Device Self Test
Number of CAs Detected ...
... Check ... PASS
... x86_64
... MLNX_OFED_LINUX-2.0-2.0.0 (OFED-2.0-2.0.0) 2.6.32-279.el6.x86_64
... PASS
... CA
134. ed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch.
For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.
Extending this argument shows that in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the
135. either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases both the positive and negative coordinate directions must be specified.
Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration.

x_dateline position
y_dateline position
z_dateline position
In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior.
The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinate radix-1 for that dimension. By default, the common switch in a torus seed is taken as the origin of the coordinate system used to describe switch location. The position parameter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed.

next_seed
If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to
136. el5.6-x86_64/docs/mlnx_add_kernel_support.sh -i /mnt/MLNX_OFED_LINUX-1.5.3-rhel5.6-x86_64.iso
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue? [y/N]: y
Removing OFED RPMs...
Running mkisofs...
Created /tmp/MLNX_OFED_LINUX-1.5.3-rhel5.6-x86_64.iso

2.3.2 Installation Script
The usage of the installation script is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure," on page 26.
Usage: /mnt/mlnxofedinstall [OPTIONS]
Options:
-c|--config <packages config_file>: Example of the configuration file can be found under docs
-n|--net <network config_file>: Example of the network configuration file can be found under docs
-p|--print-available: Print available packages for the current platform and create a corresponding ofed.conf file. The installation script exits after creating ofed.conf.
--without-32bit: Skip 32-bit libraries installation
--without-depcheck: Skip Distro's libraries check
--without-fw-update: Skip firmware update
--force-fw-update: Force firmware update
--force: Force installation (without querying the user)
--all: Install all kernel modules, libibverbs, libibumad, librdmacm, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, MVAPICH, Open MPI, MPI tests, MPI selector, perftest, sdpnetstat and libsdp, srptools, rds-tools, static and dynamic libraries
--hpc: Install all kernel modules, libibverbs, libibum
137. ell. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: -i <index of local device>
2. Define the environment variable IBDIAG_DEV_IDX

11.2.3 Addressing
This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports:
- Using a Directed Route to the destination (tool option -d): this option defines a directed route of output port numbers from the local port to the destination.
- Using port LIDs (tool option -l): in this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
- Using port names defined in the topology file (tool option -n): this option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the -l option.

11.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
This version of ibdiagnet is included in the ibutils2 package, and it is run by default after i
138. emonstrate setup and configuration of SR-IOV in a Red Hat Linux environment using the Mellanox ConnectX VPI adapter cards family.

6.1 System Requirements
To set up an SR-IOV environment, the following is required:
- MLNX_OFED Driver
- A server/blade with an SR-IOV-capable motherboard BIOS
- Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
- Mellanox ConnectX VPI Adapter Card family with SR-IOV capability

6.2 Setting Up SR-IOV
Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.
Step 1. Enable "SR-IOV" in the system BIOS.
[Figure: BIOS Setup Utility, Advanced > Advanced PCI/PnP Settings screen]
139. erations (FCA), providing an unprecedented level of scalability for SHMEM programs running over InfiniBand. The latest ScalableSHMEM software can be downloaded from the Mellanox website.

7.1.2 Running SHMEM with FCA
The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager, and MPI nodes without requiring additional hardware. The FCA manager creates a topology-based collective tree and orchestrates an efficient collective operation using the switch-based CPUs on the MPI/ScalableSHMEM nodes.
FCA accelerates MPI/ScalableSHMEM collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.
FCA is disabled by default and must be configured prior to using it from ScalableSHMEM.
To enable FCA by default in ScalableSHMEM:
1. Edit the /opt/mellanox/openshmem/2.2/etc/openmpi-mca-params.conf file.
2. Set the scoll_fca_enable parameter to 1:
scoll_fca_enable=1
3. Set the scoll_fca_np parameter to 0:
scoll_fca_np=0
To enable FCA in the shmemrun command line, add the following:
-mca scoll_fca_enable 1
140. estination Port Info
0x0002  000  Switch portguid 0x0002c902fffff00a  MT47396 Infiniscale-III Mellanox Technologies
0x0003  021  Switch portguid 0x000b8cffff004016  MT47396 Infiniscale-III Mellanox Technologies
0x0006  007  Channel Adapter portguid 0x0002c90300001039  sw137 HCA-1
0x0007  021  Channel Adapter portguid 0x0002c9020025874a  sw157 HCA-1
0x0008  008  Channel Adapter portguid 0x0002c902002582cd  sw136 HCA-1
5 valid lids dumped

2. Dump all Lids with valid out ports of the switch with Lid 2:
> ibroute 2
Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
Lid     Out Port  Destination Info
0x0002  000  Switch portguid 0x0002c902fffff00a  MT47396 Infiniscale-III Mellanox Technologies
0x0003  021  Switch portguid 0x000b8cffff004016  MT47396 Infiniscale-III Mellanox Technologies
0x0006  007  Channel Adapter portguid 0x0002c90300001039  sw137 HCA-1
0x0007  021  Channel Adapter portguid 0x0002c9020025874a  sw157 HCA-1
0x0008  008  Channel Adapter portguid 0x0002c902002582cd  sw136 HCA-1
5 valid lids dumped

3. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2:
> ibroute 2 3 7
Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
Lid     Out Port  Destination Info
0x0003  021  Switch portguid 0x000b8cffff004016
141. et according to the following enum:
enum ibv_flow_attr_type {
    /* steering according to rule specifications */
    IBV_FLOW_ATTR_NORMAL,
    /* default unicast and multicast rule: receive all Eth traffic which isn't steered to any QP */
    IBV_FLOW_ATTR_ALL_DEFAULT,
    /* default multicast rule: receive all Eth multicast traffic which isn't steered to any QP */
    IBV_FLOW_ATTR_MC_DEFAULT,
    /* sniffer rule: receive all port traffic */
    IBV_FLOW_ATTR_MIRROR
};
When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule specifications. The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the Ethernet link type, since InfiniBand link type packets always include a QP number.
Output parameters: Returns a pointer to the created flow, or NULL if the request fails.

ibv_destroy_flow
int ibv_destroy_flow(struct ibv_flow *flow_id)
ibv_destroy_flow is called to detach the flow from the QP.
Input parameters: ibv_destroy_flow requires struct ibv_flow *flow_id, which is the return value of ibv_create_flow in case of success.
Output parameters: Returns 0 on success, or the value of errno on failure.

- Ethtool
The Ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool manpage for all the ways to specify a flow.
Examples:
ethtool -U eth5 flow-type ether dst
142. eters
The following parameters can be passed to upcrun in order to change FCA support behavior:

Table 7 - Runtime Parameters
Parameter | Description
-fca_enable <0|1> | Disables/enables FCA support at runtime (default: disable).
-fca_np <value> | Enables FCA support for collective operations if the number of processes in the job is greater than the fca_np value (default: 64).
-fca_verbose <level> | Sets the verbosity level for the FCA modules.
-fca_ops <+/->[op_list] | op_list is a comma-separated list of collective operations. -fca_ops <+/->[op_list] enables/disables only the specified operations. -fca_ops <+/-> enables/disables all operations. By default all operations are enabled. Allowed operation names are: barrier (br), bcast (bt), reduce (rc), allgather (ag). Each operation can also be enabled/disabled via an environment variable: GASNET_FCA_ENABLE_BARRIER, GASNET_FCA_ENABLE_BCAST, GASNET_FCA_ENABLE_REDUCE. Note: the operations are enabled by default.

7.5.2.4 Enabling FCA Operations through Environment Variables in ScalableUPC
This method can be used to control UPC FCA offload from the environment using the job scheduler srun utility. The valid values are: 1 (enable), 0 (disable).
To enable a specific operation with shell environment variables in ScalableUPC:
export GASNET_FCA_ENABLE_BARRIER=1
export GASNET_FCA_ENABLE_BCAST=1
export GASNET_FCA_ENABLE_REDUCE=1
143. eth0 (each '*' will be replaced with a corresponding octet from eth0):
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
Based on the first eth<n> interface that is found (for n=0,1,...); each '*' will be replaced with a corresponding octet from eth<n>:
LAN_INTERFACE_ib0=
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1

4.3.3.3 Manually Configuring IPoIB
This manual configuration persists only until the next reboot or driver restart.
To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:
Step 1. To configure the interface, enter the ifconfig command with the following items:
- The appropriate IB interface (ib0, ib1, etc.)
- The IP address that you want to assign to the interface
- The netmask keyword
- The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The following example shows how to verify the configuration:
host1$ ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00
144. OpenSM - Subnet Manager
The user can override the node list manually. If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process: all root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.
At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Note: Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.

10.5.3.1 UPDN Algorithm Usage
Activation through OpenSM:
- Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
- Use '-a <root_guid_file>' for adding an UPDN guid file that contains the root
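The ranking stage described above (root switches get rank 0, and a BFS ranks the remaining switches incrementally) can be sketched in Python as follows. This is a simplified illustration, not OpenSM's implementation: the function name, the adjacency-dictionary representation, and the switch names are hypothetical.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """Assign rank 0 to root switches and BFS ranks to the rest.

    adjacency maps a switch name to the switches it is cabled to.
    Switches unreachable from any root are left unranked.
    """
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


if __name__ == "__main__":
    # Hypothetical two-level fabric: two spines, two leaves.
    fabric = {
        "spine1": ["leaf1", "leaf2"],
        "spine2": ["leaf1", "leaf2"],
        "leaf1": ["spine1", "spine2"],
        "leaf2": ["spine1", "spine2"],
    }
    print(rank_switches(fabric, ["spine1", "spine2"]))
```

In Up/Down terms, a hop toward a lower rank is "up" and a hop toward a higher rank is "down"; the ranks are what allow the routing stage to reject paths that go up again after going down.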
145. f crO conf0
Use this device name to obtain the Port GUID via the following query command:
flint -d <MST_DEVICE_NAME> q
Example with ConnectX-2 QDR (MHJH29B-XTR Dual 4X IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R6 HCA Card, CX4 Connectors) as the adapter device:
Image type: ConnectX
FW Version: 2.9.1000
Rom Info: type=PXE version=3.3.400 devid=26428 proto=VPI
Device ID: 26428
Description: Node Port1 Port2 Sys image
GUIDs: 0002c9030005cffa 0002c9030005cffb 0002c9030005cffc 0002c9030005...
MACs: 0002c905cffa 0002c905cffb
Board ID: (MT_0DD0110009)
VSD:
PSID: MT_0DD0110009
Assuming that FlexBoot is connected via Port 1, then the Port GUID is 00:02:c9:03:00:05:cf:fb.

Extracting the Port GUID - Method II
An alternative method for obtaining the port GUID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 8 bytes can be captured from the boot session as shown in the figure below.
Mellanox ConnectX FlexBoot v3.3.400
iPXE 1.0.0+ -- Open Source Network Boot Firmware
net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: The socket is not connected]
Waiting for link-up on net0... ok

Placing Client Identifiers in /etc/dhcpd.conf
The following is an excerpt of a /etc/dhcpd.conf example file showing the format of re
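Converting the Port GUID reported by flint into the colon-separated form used above is a purely mechanical step, sketched here in Python. The helper name is hypothetical and is not part of MFT or FlexBoot; the sample value is the Port1 GUID from the flint output above.

```python
def guid_to_colon_hex(guid):
    """Format a 64-bit GUID given as 16 hex digits as xx:xx:...:xx."""
    digits = guid.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("expected 16 hexadecimal digits")
    # Split into 8 bytes of two hex digits each.
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


if __name__ == "__main__":
    print(guid_to_colon_hex("0002c9030005cffb"))
```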
146. ferently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.
2. srp_daemon extensions to ibsrpdm:
- To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num> (and to generate output suitable for 'echo'), you may execute:
host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'.
- To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
- Executing srp_daemon over a port without the -a option will only display the reachable targets via the port to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
- It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. (See Section 4.1.2.5 for more details.)
- srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
- A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 4.1.2.4.
4.1
147. fies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec.

This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:
OPT  Description
-d0  Ignore other SM nodes
-d1  Force single threaded dispatching
-d2  Force log flushing after each log message
-d3  Disable multicast support

This option specifies the maximal LID number to be searched for during inventory file build. Default: 100.

This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.

This option displays a menu of possible local port GUID values with which osmtest could bind.

inventory
This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.

stress
This option runs the specified stress test instead of the normal test suite. Stress test options are as follows:
OPT  Description
-s1  Single-MAD response SA queries
-s2  Multi-MAD (RMPP) response SA queries
-s3  Multi-MAD (RMPP) Path Record
148. figuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the '--Pconfig' or '-P' flags.
The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed.
The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end ports are assigned partial membership.

File Format
Notes:
- Line content following a '#' character is a comment and is ignored by the parser.

General File Format
<Partition Definition>:<PortGUIDs list> ;

Partition Definition:
[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where:
PartitionName: string, will be used with logging. When omitted, an empty string will be used.
PKey: P_Key value for this partition. Only the low 15 bits will be used. When omitted, the P_Key will be autogenerated.
flag: used to indicate IPoIB capability of this partition.
defmember=full|limited: specifies default membership for the port guid list. Default is limited.

Currently recognized flags are:
ipoib: indicates that this partition may be used for IPoIB; as a result, an IPoIB-capable MC group will be created.
rate=<val>: specifies rate for this IPoIB MC group (default is 3 (10GBps)).
mtu=<val>: specifies MTU for this IPoIB MC group (default is 4 (2048)).
sl=<val>: specifies SL fo
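The statement that only the low 15 bits of a P_Key are used can be illustrated with a brief Python sketch. The helper and constant names are hypothetical. The use of the top bit as the membership bit (set for full members) is general InfiniBand convention and is an assumption here, since this excerpt does not spell it out.

```python
PKEY_BASE_MASK = 0x7FFF   # low 15 bits identify the partition
FULL_MEMBER_BIT = 0x8000  # assumption: top bit marks full membership


def pkey_with_membership(value, full_member):
    """Combine a partition's 15-bit base value with a membership bit."""
    base = value & PKEY_BASE_MASK
    return base | FULL_MEMBER_BIT if full_member else base


if __name__ == "__main__":
    # Default partition (0x7fff) as seen by a full and a limited member.
    print(hex(pkey_with_membership(0x7FFF, True)))
    print(hex(pkey_with_membership(0x7FFF, False)))
```

Under that assumption, a full member of the default partition carries 0xffff, which matches the pkey=ffff value that appears in the SRP target strings elsewhere in this manual.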
149. file for more details.
Note: Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system.

Synopsis
ibdump [options]

Output Files
Table 31 lists the various flags of the command.

Table 31 - ibdump Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h, --help | Optional | | Print the help menu
-d, --ib-dev=<dev> | Optional | First device found | Use IB device <dev>
-i, --ib-port=<port> | Optional | 1 | Use port <port> of IB device
-o, --output=<file> | Optional | sniffer.pcap | Dump file name
-b, --max-burst=<log2 burst> | Optional | 12 (4096 entries) | log2 of the maximal burst size that can be captured with no packet loss. Each entry takes MTU bytes of memory
--mem-mode <size> | Optional | | When specified, packets are written to the dump file only after the capture is stopped. It is faster than the default mode (less chance for packet loss), but it uses more memory. In this mode, ibdump stops after <size> bytes are captured
--decap | Optional | | Decapsulate port mirroring headers. Should be used when capturing RSPAN traffic

Examples
1. Run ibdump:
> ibdump
IB device : mlx4_0
IB port : 1
Dump file : sniffer
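Because the max-burst value is a base-2 logarithm and each entry takes MTU bytes, the memory needed for the capture buffer follows directly. The Python sketch below only shows that arithmetic. The function name is hypothetical and the 2048-byte MTU in the example is an assumed value, not one taken from this table.

```python
def capture_buffer_bytes(log2_burst, mtu_bytes):
    """Memory needed to hold 2**log2_burst entries of mtu_bytes each."""
    if log2_burst < 0 or mtu_bytes <= 0:
        raise ValueError("log2_burst must be >= 0 and mtu_bytes > 0")
    entries = 1 << log2_burst
    return entries * mtu_bytes


if __name__ == "__main__":
    # Default burst setting (12 -> 4096 entries) with an assumed 2048-byte MTU.
    size = capture_buffer_bytes(12, 2048)
    print(size, "bytes =", size // (1024 * 1024), "MiB")
```

Each increment of the max-burst value doubles the buffer, so raising it to tolerate larger bursts has a direct memory cost.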
150. finiBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).

10.2 opensm Description
opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet.
opensm also provides an experimental version of a performance manager.
opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.
By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas th
151. for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
QoS Levels
- Management traffic (ssh)
  - IPoIB management VLAN (partition A)
  - Min BW 10%
- Application traffic
  - IPoIB application VLAN (partition B)
  - Isolated from storage and database
  - Min BW of 30%
- Database Cluster traffic
  - RDS
  - Min BW of 30%
- SRP
  - Min BW 30%
  - Bottleneck at storage nodes
Administration
- OpenSM QoS policy file
- OpenSM options file
- Partition configuration file

10.8 Adaptive Routing
10.8.1 Overview
Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
- Free AR: No constraints on output port selection.
- Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches.
Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select output port ou
152. for the next transmission burst (32-bit value). In the per-switch options file, this option refers to the particular switch only. This option can be changed on the fly.

10.8.5.1.2 Example of Adaptive Routing Manager Options File

  ENABLE: true;
  LOG_FILE: /tmp/ar_mgr.log;
  LOG_SIZE: 100;
  MAX_ERRORS: 10;
  ERROR_WINDOW: 5;
  SWITCH 0x12345 {
    ENABLE: true;
    AGEING_TIME: 77;
  }
  SWITCH 0x0002c902004050f8 {
    AGEING_TIME: 44;
  }
  SWITCH 0xabcde {
    ENABLE: false;
  }

10.9 Congestion Control

10.9.1 Congestion Control Overview

Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation.

The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate at which packets are sent. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).

10.9.2 Running OpenSM with Congestion Control Manager

Congestion Control (CC) Manager can be enabled or disabled through the SM options file. To do so, perform the following:
  1. Create the file. Ru
153. g an iSCSI Target

Linux Environment

Prerequisites

Step 1. Make sure that an iSCSI Target is installed on your server side. You can download and install an iSCSI Target from the following location: http://sourceforge.net/projects/iscsitarget/files/iscsitarget/
Step 2. Dedicate a partition on your iSCSI Target on which you will later install the operating system.
Step 3. Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:
    Lun 0 Path=/dev/sda5,Type=fileio
  Example of an iSCSI Target iqn line:
    Target iqn.2007-08.7.3.4.10:iscsiboot
Step 4. Start your iSCSI Target. Example:
    host1# /etc/init.d/iscsitarget start

Configuring the DHCP Server to Boot From an iSCSI Target

Configure DHCP as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP". Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target:
    Filename "";
    option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn";

The following is an example for configuring an IB/ETH device to boot from an iSCSI target:
    host host1 {
    filename "";
    # For a ConnectX device with ports configured as InfiniBand, comment out the following line
    option dhcp-client-identifier = 00:00:00:00:00:02:00:00:02:c9:00:00:0
154. gle VM. However, the number of VFs varies upon the working mode requirements.

The protocol types are:
  Port 1 = IB, Port 2 = Ethernet
  port_type_array=2,2 (Ethernet, Ethernet)
  port_type_array=1,1 (IB, IB)
  port_type_array=1,2 (VPI: IB, Ethernet)
  NO port_type_array module parameter: ports are IB

Step 9. Reboot the server.
  If SR-IOV is not supported by the server, the machine might not come out of boot/load.

Step 10. Load the driver and verify that SR-IOV is supported. Run:
    lspci | grep Mellanox
    03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
    03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
  Where:
    03:00.0 represents the Physical Function
    03:00.1 through 03:00.5 represent the Virtual Functions connected to the Physical Function

6.3 Enabling SR-IOV and Para Virtualization on the Same Setup
155. group
  c. If none, prefer those which go through another NodeGuid
  d. Fall back to the number of paths method (if all go to same node)

10.5.1 Effect of Topology Changes

OpenSM will preserve existing routing in any case where there is no change in the fabric switches, unless the -r (--reassign_lids) option is specified.

  -r, --reassign_lids
    This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.

If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.

When file-based routing is used, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to the real topology. As a result, it cannot recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches are skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).

10.5.2 Min Hop Algorithm

The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'. The Min Hop alg
156. gs in machines with AMD-based processors.

Table 12 - Recommended BIOS Settings for AMD Processors

  Group      BIOS Option                     Values
  General    Operating Mode/Power profile    Maximum Performance
  Processor  C-States                        Disabled
             Turbo mode                      Disabled
             HPC Optimizations               Enabled
             CPU frequency select            Max performance
  Memory     Memory speed                    Max performance
             Memory channel mode             Independent
             Node Interleaving               Disabled / NUMA
             Channel Interleaving            Enabled
             Thermal Mode                    Performance

9.2 Performance Tuning for Linux

You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results depend significantly on the CPU and chipset efficiency.

9.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance

The following changes are recommended for improving IPv4 traffic performance:
  Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0
  Enable the TCP selective acks option for better throughput:
    sysctl -w net.ipv4.tcp_sack=1
  Increase the maximum length of processor input queues:
    sysctl -w net.core.netdev_m
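Each sysctl key corresponds to a file under /proc/sys, with every dot in the key becoming a path separator. The following is a minimal Python sketch of that mapping, assuming a standard Linux procfs layout. The helper names sysctl_path and sysctl_command are hypothetical, and net.ipv4.tcp_sack is the kernel's key for TCP selective acknowledgements. The sketch only builds strings and does not change any system setting.

```python
from pathlib import PurePosixPath

# Two of the IPv4 settings described above (hypothetical constant name).
RECOMMENDED = {
    "net.ipv4.tcp_timestamps": 0,  # disable TCP timestamps
    "net.ipv4.tcp_sack": 1,        # enable TCP selective acks
}


def sysctl_path(key: str) -> str:
    """Return the /proc/sys file that backs a dotted sysctl key."""
    return str(PurePosixPath("/proc/sys", *key.split(".")))


def sysctl_command(key: str, value: int) -> str:
    """Return the equivalent `sysctl -w` command line as a string."""
    return f"sysctl -w {key}={value}"


if __name__ == "__main__":
    for key, value in RECOMMENDED.items():
        print(sysctl_command(key, value), "->", sysctl_path(key))
```

Reading the mapped file before and after running a command is a simple way to confirm that a setting took effect.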
157. he InfiniBand modules and them
    host1$ mkdir -p /tmp/initrd_ib/lib/modules/ib
    host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1$ cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib

  Loading these modules requires an IPv6 module. If you do not have it in your initrd, please add it using the following command:
    host1$ cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko
158. he algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.

    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: on TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 16
    rx-frames: 88
    rx-usecs-irq: 0
    rx-frames-irq: 0

    > ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0

    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: off TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 0
    rx-frames: 0
    rx-usecs-irq: 0
    rx-frames-irq: 0

Tuning for NUMA Architecture

Tuning for Intel® Sandy Bridge Platform

The Intel Sandy Bridge processor has an integrated PCI express controller. Thus every PCIe adapter is connected directly to a NUMA node. On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected. In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.

To see if your system supports PCIe adapter's NUMA node detection:
    cat /sys/class/net/[interface]/device/numa_node
    cat /sys/devices/[PCI root]/[P
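The NUMA node lookup can also be scripted. Below is a minimal Python sketch that reads the same sysfs attribute as the first cat command. The function name numa_node and its sysfs_root parameter are hypothetical and exist only so the sketch can be exercised against a scratch directory. The kernel reports -1 in this file when it does not know the node, which the sketch returns unchanged.

```python
from pathlib import Path
from typing import Optional


def numa_node(interface: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the NUMA node of a network interface's PCIe device.

    Returns None when the attribute is missing or unreadable, and the
    raw kernel value (possibly -1, meaning "unknown") otherwise.
    """
    path = Path(sysfs_root) / "class" / "net" / interface / "device" / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # "eth1" is only an example interface name taken from the text above.
    print(numa_node("eth1"))
```

A result of None or -1 suggests that the platform does not expose the adapter's NUMA node, in which case pinning decisions cannot be based on this attribute.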
159. hile eth0.ipoib interface will use ibX.Y interfaces.

Figure 3: An Example of a Virtual Network

4.7.3 Setting Performance Tuning

  Use larger IPoIB RX/TX rings (dom0).
    Reload the IPoIB driver with larger send_queue_size and recv_queue_size values by setting the following ib_ipoib module parameters:
      send_queue_size=1024 recv_queue_size=1024
  Use Jumbo Frames (JF), up to 64K (domU).
    In UD mode, the maximum MTU value is 4092 bytes.
    In CM mode, the maximum MTU value is 65520 bytes.
    Make sure that all interfaces, including the guest interface and its virtual bridge, have the same MTU value. For further information on MTU and JF settings, please refer to the Hypervisor User Manual.
  Tune the TCP/IP stack using sysctl (dom0/domU):
      /sbin/sysctl_perf_tuning
  Enable irqbalancer (dom0/domU):
      /etc/init.d/irqbalance start

Other performance tuning for the KVM environment, such as vCPU pinning and NUMA tuning, may apply. For further information, please refer to the Hypervisor User Manual.

4.8 Contiguous Pages

Contiguous Pages improves performance by allocating user memory regions ove
160. hines that use the Virtual Functions.
  Please be aware: stopping the driver when there are VMs that use the VFs will cause the machine to hang.

Step 2. Run the script below. Please be aware: uninstalling the driver deletes the entire driver's files, but does not unload the driver.
    [root@swl022 ~]# /usr/sbin/ofed_uninstall.sh
    This program will uninstall all OFED packages on your machine.
    Do you want to continue?[y/N]:y
    Running /usr/sbin/vendor_pre_uninstall.sh
    Removing OFED Software installations
    Running /bin/rpm -e --allmatches kernel-ib kernel-ib-devel libibverbs libibverbs-devel libibverbs-devel-static libibverbs-utils libmlx4 libmlx4-devel libibcm libibcm-devel libibumad libibumad-devel libibumad-static libibmad libibmad-devel libibmad-static librdmacm librdmacm-utils librdmacm-devel ibacm opensm-libs opensm-devel perftest compat-dapl compat-dapl-devel dapl dapl-devel dapl-devel-static dapl-utils srptools infiniband-diags-guest ofed-scripts opensm-devel
    warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
    Running /tmp/2818-ofed_vendor_post_uninstall.sh

Step 3. Restart the server.

6.6 Burning Firmware with SR-IOV

The following procedure explains how to create a binary image with SR-IOV enabled that has 63 VFs. However, the number of VFs varies according to the working mode requirements.

To burn the firmware:
Step 1. Verify you have MFT installed in your
161. i2c
For additional details, please refer to the MFT User's Manual (docs/).

1.4 Quality of Service

Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 10, "OpenSM - Subnet Manager".

2 Installation

This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.

2.1 Hardware and Software Requirements

2.1.1 Software and Hardware Requirements

Table 1 - Software and Hardware Requirements

  Requirements: Platforms
  Description:  A server platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
                  MT27508 ConnectX-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
                  MT4113 Connect-IB (IB) (firmware: fw-Connect-IB)
                For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file.

  Requirements: Required Disk Space for Installation
  Description:  1GB

  Requirements: Device ID
  Description:  For the latest list of device IDs, please visit the Mellanox website.

  Requirements: Operating System
  Description:  Linux operating system. For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Releas
162. ice ID>_ROM-<version>.mrom
     where the number after the "ConnectX_FlexBoot" prefix indicates the corresponding PCI Device ID of the ConnectX / ConnectX-2 / ConnectX-3 device.
  2. Additional documents under docs/dhcp

A.2 Burning the Expansion ROM Image

A.2.1 Burning the Image on ConnectX / ConnectX-2 / ConnectX-3

This section is valid for ConnectX / ConnectX-2 devices with firmware versions 2.8.0600 or later, and ConnectX-3 firmware de

Prerequisites
  1. Expansion ROM Image
     The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file: FlexBoot_release_notes.txt.
  2. Firmware Burning Tools
     You need to install the Mellanox Firmware Tools (MFT) package (version 2.7.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Downloads.

Image Burning Procedure
To burn the composite image, perform the following steps:
  1. Obtain the MST device name. Run:
       mst start
       mst status
     The device name will be of the form: mt<dev_id>_pci{_cr0|conf0} (1)
  2. Create and burn the composite image. Run:
       flint -dev <mst device name> brom <expansion ROM image>
     Example on Linux:
       flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom

  (1) Depending on the OS, the device name may be preceded by a prefix.
163. ils package and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils package, you need to specify the full path: /opt/bin/ibdiagnet

ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).

Synopsis
    ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>]
              [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt]
              [-pm] [-pc] [-P <<PM>=<Value>>] [-lw <1x|4x|12x>]
              [-ls <2.5|5|10>] [-skip <ibdiag_check/s>] [-load_db <db_file>]

Options
  -c <count>       Min number of packets to be sent across each link (default = 10)
  -v               Enable verbose mode
  -r               Provides a report of the fabric qualities
  -t <topo-file>   Specifies the topology file name
  -s <sys-name>    Specifies the local system name. Meaningful only if a topology file is specified
  -i <dev-index>   Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
  -p <port-num>    Specifies the local device's port num used to connect to the IB fabric
  -o <out-dir>     Specifies the directo
164. ion SR-IOV

  Parameter    Recommended Value
  num_pfs      1
               Note: This field is optional and might not always appear.
  total_vfs    63
  sriov_en     true

  If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.com

Step 7. Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist, otherwise delete its contents.

Step 8. Insert an "option" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical function driver (probe_vf):
    options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1

  Parameter: num_vfs
  Recommended Value:
    Absent or zero: The SR-IOV mode is not enabled in the driver, hence no VFs will be available.
    Its value is a single number in the range of 0-63. The driver will enable the num_vfs VFs on the HCA and this will be applied to all ConnectX® HCAs on the host.
    Its format is a string which allows the user to specify the num_vfs parameter separately per installed HCA. Its format is: "bb:dd.f-v,bb:dd.f-v,…"
      bb:dd.f = bus:device.function of the PF of the HCA
      v = number of VFs to enable for that HCA
    This parameter can be set in one of the following ways. For example:
      num_vfs=5
        The driver will enable 5 VFs on the HCA and this will be applied to all ConnectX® HCAs on the host.
      num_vfs=00:04.0-5,00:07.0-8
        The driver wil
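The "option" line of Step 8 can be assembled programmatically when several hosts need different values. The following is a minimal Python sketch for the single-number form of num_vfs only. The function name mlx4_core_options is hypothetical, the 0-63 range check comes from the parameter description above, and the per-HCA "bb:dd.f-v" string form is deliberately not handled.

```python
from typing import Optional, Sequence


def mlx4_core_options(num_vfs: int,
                      port_type_array: Optional[Sequence[int]] = None,
                      probe_vf: Optional[int] = None) -> str:
    """Build one modprobe 'options' line for mlx4_core.

    Only the parameters named in the text are emitted; optional ones
    are skipped when left as None.
    """
    if not 0 <= num_vfs <= 63:
        raise ValueError("num_vfs must be in the range 0-63")
    parts = [f"num_vfs={num_vfs}"]
    if port_type_array is not None:
        # One protocol type per port, e.g. (1, 2).
        parts.append("port_type_array=" + ",".join(str(t) for t in port_type_array))
    if probe_vf is not None:
        parts.append(f"probe_vf={probe_vf}")
    return "options mlx4_core " + " ".join(parts)


if __name__ == "__main__":
    # Reproduces the example line from Step 8.
    print(mlx4_core_options(5, port_type_array=(1, 2), probe_vf=1))
```

The returned string is what would be written to /etc/modprobe.d/mlx4_core.conf; the sketch itself does not touch the file.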
165. ircuit implementing InfiniBand-compliant communication.

  IB Cluster/Fabric/Subnet
    A set of IB devices connected by IB cables.
  In-Band
    A term assigned to administration activities traversing the IB connectivity only.
  Local Identifier (ID)
    An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
  Local Device/Node/System
    The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.
  Local Port
    The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
  Master Subnet Manager
    The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.

Table 3 - Glossary (Sheet 2 of 2)

  Multicast Forwarding Tables
    A table that exists in every switch providing the list of ports to forward received multicast packet. The table is organized by MLID.
  Network Interface Card (NIC)
    A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
  Standby Subnet Manager
    A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
  Subnet Administrator (SA)
    An application (normally part of the Subnet Manager) that imple
166. ity that has some more features (see Section 4.1.2.6). For the changes in openib.conf to take effect, run:
    /etc/init.d/openibd restart

4.1.2.5 Multiple Connections from Initiator IB Port to the Target

Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA.

In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command.

Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, a different initiator_ext value is needed on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path.

If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:
    id_ext=200500A0B81146A1,ioc_guid=0002c90200402...,dgid=fe800000000000000002c90200402bed,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200

Notes:
  1. It is recommended to use the -n flag for all srp_daemon invocations.
  2. ibsrpdm does not have a corresponding option.
  3. srp_daemon.sh always uses the -n option (whether invoked manually by the u
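In the example above, the initiator_ext value equals the lower 64 bits of the dgid (the target port GUID) with its bytes reversed. That relationship is an observation drawn from this single example, not a rule stated in the text, so the following Python sketch should be read as an illustration under that assumption. The function names are hypothetical.

```python
def port_guid_from_dgid(dgid_hex: str) -> str:
    """Return the lower 64 bits of a 128-bit GID given as 32 hex digits."""
    if len(dgid_hex) != 32:
        raise ValueError("dgid must be 32 hexadecimal digits")
    return dgid_hex[16:].lower()


def initiator_ext_from_dgid(dgid_hex: str) -> str:
    """Byte-reverse the port GUID, as the example values suggest."""
    guid = bytes.fromhex(port_guid_from_dgid(dgid_hex))
    return guid[::-1].hex()


if __name__ == "__main__":
    dgid = "fe800000000000000002c90200402bed"  # value from the example above
    print(port_guid_from_dgid(dgid))
    print(initiator_ext_from_dgid(dgid))
```

For the dgid shown in the example, the sketch yields the same initiator_ext string that appears in the connection command.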
167. ize bytes will not be marked with FECN. To do so, set the following parameter:
    packet_size
  The values are: [0-0x3fc0]. The default is 0x200.

When the number of errors exceeds 'max_errors' of send/receive errors or timeouts in less than 'error_window' seconds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters:
    max_errors
    error_window
  The values are:
    max_errors = 0: zero tolerance, abort configuration on first error
    error_window = 0: mechanism disabled, no error checking
    [0-48K]
  The default is 5.

10.9.4.1 Congestion Control Manager Options File

Table 15 - Congestion Control Manager General Options File

  Option File: enable
    Description: Enables/disables the Congestion Control mechanism on the fabric nodes.
    Values: <TRUE | FALSE>. Default: True
  Option File: num_hosts
    Description: Indicates the number of nodes. The CC table values are calculated based on this number.
    Values: [0-48K]. Default: 0 (base on the CCT calculation on the current subnet size)

Table 16 - Congestion Control Manager Switch Options File

  Option File: threshold
    Description: Indicates how aggressive the congestion marking should be.
    Values: [0-0xf]. 0 = no packet marking, 0xf = very aggressive. Default: 0xf
  Option File: marking_rate
    Description: The mean number of packets between marking eligible packets with
168. kernel RPM
    Preparing...
    kmod-mlnx-ofa_kernel
    Installing mlnx-ofa_kernel-devel RPM
    Preparing...
    mlnx-ofa_kernel-devel
    Installing kernel-mft RPM
    Preparing...
    kernel-mft
    Installing knem RPM
    Preparing...
    knem
    Installing mpi-selector RPM
    Preparing...
    mpi-selector
    Installing user level RPMs:
    Preparing...
    ofed-scripts
    Preparing...
    libibverbs
169. l enable 5 VFs on the HCA positioned in BDF 00:04.0 and 8 on the one in 00:07.0.
    Note: PFs not included in the above list will not have SR-IOV enabled.

  Parameter: port_type_array
  Recommended Value:
    Specifies the protocol type of the ports. It is either one array of 2 port types 't1,t2' for all devices, or a list of BDF to port_type_array 'bb:dd.f-t1;t2,...' (string).

  Parameter: probe_vf
  Recommended Value:
    Absent or zero: No VFs will be used by the PF driver.
    Its value is a single number in the range of 0-63. The Physical Function driver will use probe_vf VFs and this will be applied to all ConnectX® HCAs on the host.
    Its format is a string which allows the user to specify the probe_vf parameter separately per installed HCA. Its format is: "bb:dd.f-v,bb:dd.f-v,…"
      bb:dd.f = bus:device.function of the PF of the HCA
      v = number of VFs to use in the PF driver for that HCA
    This parameter can be set in one of the following ways. For example:
      probe_vf=5
        The PF driver will probe 5 VFs on the HCA and this will be applied to all ConnectX® HCAs on the host.
      probe_vf=00:04.0-5,00:07.0-8
        The PF driver will probe 5 VFs on the HCA positioned in BDF 00:04.0 and 8 for the one in 00:07.0.
    Note: PFs not included in the above list will use any of their VFs in the PF driver.

The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per a sin
170. lanox OFED
Chapter 3 Configuration Files
Chapter 4 Driver Features
  4.1 SCSI RDMA Protocol
    4.1.1
    4.1.2 SRP Initiator
  4.2 iSCSI Extensions for RDMA (iSER)
    4.2.1
    4.2.2 iSER Initiator
  4.3 IP over InfiniBand
    4.3.1 Introduction
    4.3.2 IPoIB Mode Setting
    4.3.3 IPoIB Configuration
    4.3.4
    4.3.5 Subinterfaces
    4.3.6 Verifying IPoIB Functionality
    4.3.7 Bonding IPoIB
  4.4 Quality of Service
    4.4.1 Quality of Service Overview
    4.4.2 QoS Architecture
    4.4.3 Supported Policy
    4.4.4 CMA Features
    4.4.5 OpenSM Features
  4.5 Time-Stamping
171. licable. These values can be set later using the sg command (see Table 30 below).

  Switch: -clear_semaphore
    Affected commands: No commands allowed
    Description: Force clear the Flash semaphore on the device. No command is allowed when this switch is used. Warning: May result in system instability or Flash corruption if the device or another application is currently using the Flash.
  Switch: -i[mage] <image>
    Affected commands: burn, verify
    Description: Binary image file.
  Switch: -qq
    Affected commands: burn, query
    Description: Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
  Switch: -nofs
    Affected commands: burn
    Description: Burn image in a non-failsafe manner.
  Switch: -skip_is
    Affected commands: burn
    Description: Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant sector difference is detected.
  Switch: -byte_mode
    Affected commands: burn, write
    Description: Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
  Switch: -s[ilent]
    Affected commands: burn
    Description: Do not print burn progress messages.
  Switch: -y[es]
    Affected commands: All
    Description: Non-interactive mode: assume the answer is "yes" to all questions.
  Switch: -no
    Affected commands: All
    Description: Non-interactive mode: assume the answer is "no" to all questions.
  Switch: -vsd <string>
    Affected commands: burn
    Description: Write this string, of up to 208 characters, to VSD upon a burn command.
  Switch: -use_image_ps
    Affected commands: burn
    Description: Burn vsd as it appears in the given image; do not keep existing VSD on Flash
172. lt name is /etc/opensm/qos-policy.conf

  --stay_on_fatal, -y
    This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated guids or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.

  --daemon, -B
    Run in daemon mode. OpenSM will run in the background.

  --inactive, -i
    Start SM in inactive rather than normal init SM state.

  --prefix_routes_file <path to file>
    This option specifies the prefix routes file. Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. Default file is /etc/opensm/prefix-routes.conf

  --consolidate_ipv6_snm_req
    Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.

  --log_prefix <prefix text>
    Prefix to syslog messages from OpenSM.

  --verbose, -v
    This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.

  -V
    This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.

  -D <flags>
    This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:
      BIT   LOG LEVEL ENABLED
      0
173. lt_routing
    This option prevents OpenSM from falling back to default routing if none of the provided engines was able to configure the subnet.

  --do_mesh_analysis
    This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing.

  --lash_start_vl <vl number>
    Sets the starting VL to use for the lash routing algorithm. Defaults to 0.

  --sm_sl <sl number>
    Sets the SL to use to communicate with the SM/SA. Defaults to 0.

  --connect_roots, -z
    This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases this can violate the "pure" deadlock-free algorithm, so use it carefully.

  --ucast_cache, -A
    This option enables the unicast routing cache to prevent routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. in case of host reboot. This option becomes very handy when the cluster size is thousands of nodes.

  --lid_matrix_file, -M <file name>
    This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.

  --lfts_file, -U <file name>
    This option specifies the name of the LFTs file from where switch
174. machine.

Step 2. Enter the firmware directory, according to HCA type (e.g. ConnectX-3). The path is: /mlnx_ofed/firmware/<device>/<FW version>

Step 3. Find the ini file that contains the HCA's PSID. Run:
    ibv_devinfo | grep board_id
    board_id: MT_1090120019
  If such an ini file cannot be found in the firmware directory, you may want to dump the configuration file using mstflint. Run:
    mstflint -dev <PCI device> dc > <ini device file>

Step 4. Edit the ini file that you found in the previous step, and add the following lines to the [HCA] section in order to support 63 VFs (1):
    ;; SRIOV enable
    total_vfs = 63
    num_pfs = 1
    sriov_en = true

  (1) Some servers might have issues accepting 63 Virtual Functions or more. In such a case, please set the number of "total_vfs" to any required value.

Step 5. Create a binary image using the modified ini file. Run:
    mlxburn -fw ./<fw name>.mlx -conf <modified ini file> -wrimage <file name>.bin
  The file <file name>.bin is a firmware binary file with SR-IOV enabled that has 63 VFs. It can be spread across all machines and can be burnt using mstflint, which is part of the bundle, using the following command:
    mstflint -dev <PCI device> -image <file name>.bin b

  After burning the firmware, the machine must be rebooted. If the driver is only restarted, the machine may hang and a reboot using power OFF/ON might be required.
175. mple, if the guest has a VIF that is connected to the Virtual Bridge br0, then enslave the ethX interface to br0 by running:
    brctl addif br0 ethX

In the RHEL KVM environment, there are other methods to create/configure your virtual network (e.g. macvtap). For additional information, please refer to the Red Hat User Manual.

The IPoIB daemon (ipoibd) detects the new VIFs and creates new IPoIB instances. As a result, a number of IPoIB interfaces (ibX.Y) are shown as being created/destroyed, and are enslaved to the corresponding ethX interface to serve any active VIF in the system according to the set configuration. This process is done automatically by the ipoibd service.

To see the list of IPoIB interfaces enslaved under the eth_ipoib interface:
    cat /sys/class/net/ethX/eth/vifs
  For example:
    cat /sys/class/net/eth5/eth/vifs
    SLAVE=ib0.1 MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
    SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
    SLAVE=ib0.3 MAC=52:54:00:60:55:89 VLAN=N/A

  Each ethX interface has at least one ibX.Y slave to serve the PIF itself. In the VIFs list of ethX, you will notice that ibX.1 is always created to serve applications running from the Hypervisor on top of the ethX interface directly.

For InfiniBand applications that require native IPoIB interfaces (e.g. CMA), the original IPoIB interfaces ibX can still be used. For example, CMA and ethX drivers can co-exist and make use of IPoIB ports; CMA can use ib0, w
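The vifs listing can be turned into structured records for scripting. The following is a minimal Python sketch that assumes each line consists of whitespace-separated KEY=VALUE tokens (SLAVE, MAC, VLAN), as the example output suggests. That line format is an assumption based on the example, and the function name parse_eth_vifs is hypothetical.

```python
from typing import Dict, List


def parse_eth_vifs(text: str) -> List[Dict[str, str]]:
    """Parse the contents of an eth/vifs file into one dict per slave."""
    entries = []
    for line in text.splitlines():
        # Keep only KEY=VALUE tokens; anything else on the line is ignored.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if "SLAVE" in fields:
            entries.append(fields)
    return entries


if __name__ == "__main__":
    sample = (
        "SLAVE=ib0.1 MAC=9a:c2:1f:d7:3b:63 VLAN=N/A\n"
        "SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A\n"
    )
    for entry in parse_eth_vifs(sample):
        print(entry["SLAVE"], entry["MAC"], entry["VLAN"])
```

On a live host the input would be the text read from /sys/class/net/ethX/eth/vifs for the interface of interest.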
176. n: opensm <options-file-name>
  2. Find the 'event_plugin_name' option in the file, and add 'ccmgr' to it:
       # Event plugin name(s)
       event_plugin_name ccmgr
  3. Run the SM with the new options file:
       opensm -F <options-file-name>

  Once Congestion Control is enabled on the fabric nodes, you must actively turn it off to disable it completely. Running the SM without the CC Manager is not sufficient, as the hardware continues to function in accordance with the previous CC configuration. For further information on how to turn CC OFF, please refer to Section 10.9.3, "Configuring Congestion Control Manager," on page 146.

10.9.3 Configuring Congestion Control Manager

Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and CC Manager behavior by modifying some of the options. To do so, perform the following:
  1. Find the 'event_plugin_options' option in the SM options file, and add the following:
       conf_file <cc-mgr-options-file-name>
     For example:
       # Options string that would be passed to the plugin(s)
       event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>
  2. Run the SM with the new options file:
       opensm -F <options-file-name>

  To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.

For the full list of CC Manager options
177. n DST/SRC will be the reverse of the route SRC/DST.
2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.
3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.
Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available.
In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies, it is topology agnostic, and it fares well in the face of faults.
It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path.
The algorithm was developed by Simula Research Laboratory.
Use the '-R lash -Q' option to activate the LASH algorithm.
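The layer assignment in step 2 can be illustrated with a small sketch. It is not the opensm implementation: it assumes routes are given as lists of switch names, models a channel as a directed (from, to) link, and treats a layer as deadlock free when its channel dependency graph has no cycle. The function names are hypothetical, and the balancing pass of step 3 is omitted.

```python
from collections import defaultdict


def route_dependencies(route):
    """Return the channel dependencies created by one route.

    A route [A, B, C] uses channels (A, B) and (B, C) and makes
    (A, B) depend on (B, C).
    """
    channels = list(zip(route, route[1:]))
    return set(zip(channels, channels[1:]))


def has_cycle(edges):
    """Detect a cycle in a directed graph given as a set of (a, b) edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every node starts WHITE

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: cycle found
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))


def assign_layers(routes):
    """Assign each route to the first layer (SL) that stays cycle free."""
    layers = []      # one set of dependency edges per layer
    assignment = []  # layer index chosen for each route, in order
    for route in routes:
        deps = route_dependencies(route)
        for sl, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps
                assignment.append(sl)
                break
        else:
            # No existing layer can take the route: open a new one.
            layers.append(deps)
            assignment.append(len(layers) - 1)
    return assignment


if __name__ == "__main__":
    # Four two-hop routes around a ring of four switches.
    ring_routes = [["A", "B", "C"], ["B", "C", "D"],
                   ["C", "D", "A"], ["D", "A", "B"]]
    print(assign_layers(ring_routes))
```

In the ring example the fourth route would close a dependency cycle in the first layer, so it is the one that forces a second layer to be opened.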
178. n the ibv_shared_mr sample program can be found in the ibv_shared_mr man page.
5 Flow Steering
Flow Steering is currently at beta level. Please be aware that the content below is subject to change.
Flow steering is a new model on top of which all unicast and multicast attaches work. It is disabled by default. To enable it, set the module parameter log_num_mgm_entry_size to -1. Over Ethernet it is configured via ethtool and via user verbs; over InfiniBand it is configured via user verbs.
The main purposes of flow steering are:
- Replaces the existing steering model
- Allows more features, such as priorities and L3/L4 attaches
At the lowest level, it allows the user to attach a QP to an L2/L3/L4 flow. Any packet that matches the specified flow is steered into the specified QP. A flow specification at the lowest level is a subset of L2-L4 attributes.
5.1 Flow Domains and Priorities
Flow steering defines the concept of domain and priority. Each domain represents a user agent that can attach a flow. The domains are prioritized: a higher-priority domain will always supersede a lower-priority domain when their flow specifications overlap. Setting a lower priority value will result in higher priority.
In addition to the domain, there is a priority within each of the domains. Each domain can have multiple priorities in accordance with its needs.
The following are the domains at a descending order of p
179. nder /dev, use the following command:
    echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target
See Section 4.1.2.3 for instructions on how the parameters in this echo command may be obtained.
Notes:
- Execution of the above "echo" command may take some time
- The SM must be running while the command executes
- It is possible to include additional parameters in the echo command:
  - max_cmd_per_lun - Default: 63
  - max_sect (short for max_sectors) - sets the request size of a command
  - io_class - Default: 0x100 as in rev 16A of the specification (in rev 10 the default was 0xff00)
  - initiator_ext - Please refer to Section 9 (Multiple Connections)
- To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
  - Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
  - Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.
4.1.2.3 SRP Tools - ibsrpdm and srp_daemon
To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
- Detect targets on the fabric reachable by the Initiator (for Step 1)
- Output target attributes in a format suitable for use in the above "echo" command (Step
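When targets are added from a script, the comma-separated parameter string for the echo command above can be assembled programmatically. This is a sketch only: it assumes the key=value,key=value layout and the srp-mthca<hca>-<port> directory naming described above, the function names are hypothetical, and the GUID and GID values in the usage example are placeholders rather than real devices.

```python
def srp_target_string(id_ext, ioc_guid, dgid, service_id, pkey="ffff", **extra):
    """Build the parameter string written to the add_target file.

    Optional parameters named in the notes (for example max_sect or
    initiator_ext) can be passed as keyword arguments.
    """
    fields = [
        ("id_ext", id_ext),
        ("ioc_guid", ioc_guid),
        ("dgid", dgid),
        ("pkey", pkey),
        ("service_id", service_id),
    ]
    fields += sorted(extra.items())  # stable order for optional parameters
    return ",".join(f"{key}={value}" for key, value in fields)


def add_target_path(hca_number, port_number, driver="mthca"):
    """Return the sysfs file the string is written to (assumed naming)."""
    return (f"/sys/class/infiniband_srp/"
            f"srp-{driver}{hca_number}-{port_number}/add_target")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    target = srp_target_string(
        id_ext="0002c90200402bd4",
        ioc_guid="0002c90200402bd4",
        dgid="fe800000000000000002c90200402bd5",
        service_id="0002c90200402bd4",
        max_sect=128,
    )
    print(target)
    print(add_target_path(0, 1))
```

Writing the string to the returned path has the same effect as the echo command; as noted above, the SM must be running while the write executes.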
180. ndled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below).
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a relaxation algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would go up after a down step was used.
2. Once MinHop matrices exist, each switch is visited and, for each target LID, a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned target LIDs is selected.
If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop
b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC
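The two stages can be sketched on a toy topology. The sketch below is a simplification and not the OpenSM code: targets are switches rather than LIDs, hop counts are computed with a plain breadth-first search instead of the relaxation algorithm, and LMC handling is left out. The adjacency layout (switch name mapped to a dictionary of port number to neighbour) and the function names are assumptions made for the example.

```python
from collections import deque


def min_hops(adjacency, target):
    """Stage 1: hop count from every reachable switch to the target."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node].values():
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops


def assign_ports(adjacency, targets):
    """Stage 2: pick an output port per (switch, target).

    Among the ports that lie on a minimum-hop path, the port with the
    fewest targets already routed through it is chosen; ties go to the
    lowest port number.
    """
    load = {sw: {port: 0 for port in ports} for sw, ports in adjacency.items()}
    tables = {sw: {} for sw in adjacency}
    for target in targets:
        hops = min_hops(adjacency, target)
        for sw, ports in adjacency.items():
            if sw == target or sw not in hops:
                continue
            candidates = [port for port, neighbour in ports.items()
                          if hops.get(neighbour) == hops[sw] - 1]
            port = min(candidates, key=lambda p: (load[sw][p], p))
            load[sw][port] += 1
            tables[sw][target] = port
    return tables


if __name__ == "__main__":
    # A and B are joined by two parallel links; C hangs off B.
    fabric = {
        "A": {1: "B", 2: "B"},
        "B": {1: "A", 2: "A", 3: "C"},
        "C": {1: "B"},
    }
    print(assign_ports(fabric, ["B", "C"]))
```

In this example switch A reaches both targets over its two parallel links to B, and the per-port counter spreads the two targets across ports 1 and 2.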
181. nes are ignored, so the indentation in the example is just for better readability.
- Comments are started with the pound sign (#) and terminated by EOL.
- Any keyword should be the first non-blank in the line, unless it's a comment.
- Keywords that denote section/subsection start have matching closing keywords.
- Having a QoS Level named "DEFAULT" is a must: it is applied to PR/MPR requests that didn't match any of the matching rules.
- Any section/subsection of the policy file is optional.
10.6.5 Examples of Advanced Policy File
As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here's an example of the shortest policy file:
    qos-levels
        qos-level
            name: DEFAULT
            sl: 0
        end-qos-level
    end-qos-levels
The port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, and there is no need to define them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.
The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the following example. They explain different keywords and their meaning.
    port-groups
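The syntax rules listed above (comments, matching closing keywords, mandatory DEFAULT level) are simple enough to check before handing a policy file to OpenSM. The following is a rough sketch of such a check and not the OpenSM parser. It assumes the hyphenated keyword spelling (qos-levels ... end-qos-levels), that every line without a colon opens or closes a section, and that "key: value" lines never open one; check_policy is a hypothetical helper name.

```python
def check_policy(text):
    """Check section nesting and report whether a DEFAULT QoS level exists.

    Raises ValueError when a closing keyword does not match the section
    that is currently open, or when a section is left unclosed.
    """
    open_sections = []
    level_names = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and indentation
        if not line:
            continue                          # empty lines are ignored
        if ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            if key == "name" and open_sections and open_sections[-1] == "qos-level":
                level_names.append(value)
        elif line.startswith("end-"):
            if not open_sections or open_sections.pop() != line[len("end-"):]:
                raise ValueError(f"unexpected closing keyword: {line}")
        else:
            open_sections.append(line)
    if open_sections:
        raise ValueError(f"unclosed section: {open_sections[-1]}")
    return "DEFAULT" in level_names


SHORTEST_POLICY = """
qos-levels
    qos-level
        name: DEFAULT   # the only mandatory part
        sl: 0
    end-qos-level
end-qos-levels
"""

if __name__ == "__main__":
    print(check_policy(SHORTEST_POLICY))
```

A return value of False means the file is well nested but lacks the mandatory DEFAULT level.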
182. nformation that is available for use from user space.
Synopsis
    ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]
Output Files
Table 22 lists the various flags of the command.
Table 22 - ibv_devinfo Flags and Options
Flag                             Optional/Mandatory  Default (If Not Specified)  Description
-d <device>, --ib-dev=<device>   Optional            First found device          Run the command for the provided IB device 'device'
-i <port>, --ib-port=<port>      Optional            All device ports            Query the specified device port <port>
-l, --list                       Optional            Inactive                    Only list the names of InfiniBand devices
-v, --verbose                    Optional            Inactive                    Print all available information about the InfiniBand device(s)
Examples
1. List the names of all available InfiniBand devices.
    > ibv_devinfo -l
    2 HCAs found:
        mthca0
        mlx4_0
2. Query the device mlx4_0 and print user-available information for its Port 2.
    > ibv_devinfo -d mlx4_0 -i 2
    hca_id: mlx4_0
        fw_ver:             2.5.944
        node_guid:          0000:0000:0007:3895
        sys_image_guid:     0000:0000:0007:3898
        vendor_id:          0x02c9
        vendor_part_id:     25418
        hw_ver:             0xA0
        board_id:           MT_04A0140005
        phys_port_cnt:      2
            port: 2
                state:      PORT_ACTIVE (4)
                max_mtu:    2048 (4)
                active_mtu: 2048 (4)
                sm_lid:     1
                port_lid:   1
                port_lmc:   0x00
11.8 ibde
183. ng on page 45 and are configured with the same value.
- For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do not set the MTU at all, the performance of the interface might decrease.
- In the bonding slave configuration file (e.g. ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
- In the bonding slave configuration file (e.g. ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key).
- For RHEL users: in /etc/modprobe.d/bond.conf, add the following line:
    alias bond0 bonding
- For SLES users: it is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
It is possible to have multiple IPoIB bonding masters, and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
Restarting openibd does not keep the bonding configuration set via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running:
    /etc/init.d/network restart
4.4 Quality of Servi
184. ng is enabled, the time stamp is placed in the socket ancillary data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached. It can be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the original outgoing packet data, including all headers prepended down to and including the link layer, the scm_timestamping control message, and a sock_extended_err control message with ee_errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.
When time stamping is enabled, VLAN stripping is disabled. For more information, please refer to Documentation/networking/timestamping.txt in kernel.org.
4.6 Atomic Operations
4.6.1 Enhanced Atomic Operations
ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations are the same as those for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).
4.6.1.1 Masked Compare and Swap (MskCmpSwap)
The MskCmpSwap atomic operation is an ex
185. ng necessary to avoid deadlock.
When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.
Use the '-R dor' option to activate the DOR algorithm.
10.5.7 Torus-2QoS Routing Algorithm
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:
- Routing that is free of credit loops
- Two levels of QoS, assuming switches support 8 data VLs
- Ability to route around a single failed switch, and/or multiple failed links, without introducing credit loops or changing path SL values
- Very short run times, with good scaling properties as fabric size increases
10.5.7.1 Unicast Routing
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses, as follows: gil e gr for d 0
186. no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/<username>/.ssh/id_rsa.
Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
The key fingerprint is:
38 10 29 0 4 08 00 4 0 50 0 05 44 7 9 05 <username>@host1
Step 2. Check that the public and private keys have been generated.
    host1$ cd /home/<username>/.ssh/
    host1$ ls
    host1$ ls -la
    total 40
    drwx------  2 root root 4096 Mar 5 04:57 .
    drwxr-x--- 13 root root 4096 Mar 4 18:27 ..
    -rw-------  1 root root 1675 Mar 5 04:57 id_rsa
    -rw-r--r--  1 root root  404 Mar 5 04:57 id_rsa.pub
Step 3. Check the public key.
    host1$ cat id_rsa.pub
    ssh-rsa AAAAB3NzaClyc2EAAAABIWAAAQEAlzVY8VBHQh90kZN70A11bUQ74RXm4 zHeczyVxpYHaDPyDmqezbYMKrCIVz d10bHeZkCOrpLYviUOoUHd3fvNTfMs0gcGg08PysUf 12FyYjira2Plxyg6mkHLGGqVutfEMmABZ3wNCUg6J2X 3G uiuSWXeubZmbXcMrP wAIWByfH8ajwo6A5SWioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorX NCOZBe4kTnUqm63nQ2zi1qVMdL9FrCmalxIOu9 SQUAjwONevaMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrod ZlfCQ <username>@host1
Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine.
    host1$ cat id_rsa.pub | xargs ssh host2 \
        "echo >>/home/<username>/.ssh/authorized_keys2"
    <username>@host2's password:    // Enter password
For a local machine, simply add the key to authorized_keys2.
    host1$ c
187. nodes for ranking. If the option is not used, OpenSM uses its auto-detect root nodes algorithm.
Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
10.5.4 Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees: it handles non-constant K, cases where not all leafs (CAs) are present, and any Constant Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks.
If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:
- Tree rank should be between two and eight (inclusively)
- Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all
- Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches
- Switches of the same rank should have the same
188. nos teaser inact Avie sg
1.2.4 Directory Structure
1.3 Architecture
1.3.1 mlx4 VPI Driver
1.3.2 mlx5 Driver
1.3.3 Mid-layer Core
1.3.4 ULPs
1.3.5 MPI
1.3.6 InfiniBand Subnet Manager
1.3.7 Diagnostic Utilities
1.3.8 Mellanox Firmware Tools
1.4 Quality of Service
Chapter 2 Installation
2.1 Hardware and Software Requirements
2.1.1 Software and Hardware
2.2 Downloading Mellanox OFED
2.3 Installing Mellanox OFED
2.3.1 Pre-installation Notes
2.3.2 Installation Script
2.3.3 Installation Procedure
2.3.4 Installation
2.3.5 Post-installation Notes
2.4 Updating Firmware After
2.5 Uninstalling Mel
189. ns in the network, alleviating bottlenecks within the parallel communication libraries.
The latest MXM software can be downloaded from the Mellanox website.
7.3.1 Compiling OpenMPI with MXM
Step 1. Install MXM from RPM:
    % rpm -ihv mxm-x.y.z-1.x86_64.rpm
MXM will be installed automatically in the /opt/mellanox/mxm folder.
Step 2. Enter the OpenMPI source directory and run:
    % cd $OMPI_HOME
    % ./configure --with-mxm=/opt/mellanox/mxm <... other configure parameters ...>
    % make all && make install
MLNX_OFED v2.0 or later comes with a pre-installed version of MXM v1.1 and OpenMPI compiled with MXM v1.1.
To upgrade MLNX_OFED v2.0 or later with a newer MXM:
Step 1. Remove MXM v1.1:
    % rpm -e mxm
Step 2. Remove the pre-compiled OpenMPI:
    % rpm -e mlnx-openmpi_gcc
Step 3. Install the new MXM and compile OpenMPI with it.
To run OpenMPI without MXM, run:
    % mpirun -mca mtl ^mxm <...>
When upgrading to MXM v1.5, OMPI compiled with MXM v1.1 should be recompiled with MXM v1.5.
7.3.2 Enabling MXM in OpenMPI
MXM is automatically selected by OpenMPI when the Number of Processes (NP) is higher than or equal to 128. To enable MXM for any NP, use the following OpenMPI parameter:
    -mca mtl_mxm_np <number>
To activate MXM for any NP, run:
    % mpirun -mca mtl_mxm_np 0 <... other mpirun parameters ...>
7.3.3 Tuning
190. nstalling Mellanox OFED use this ibdiagnet version run ibdiagnet Please see ibutils2 release notes txt for additional information and known issues ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices It then produces the following files in the output directory which is defined by the o option described below Synopsis ibdiagnet i lt dev name gt p lt port num gt pm pc P lt lt PM gt lt Value gt gt ex feu lw lt 1x 4x 8x 12x gt 1s lt 2 5 5 10 gt skip ibdiag stage gt o lt out dir gt h V Mellanox Technologies 151 Rev 2 0 2 0 5 Options i device lt dev name gt p port lt port num gt pm Dump all pm pc Specify the name of the device of the port used to connect to the IB fabric in case of multiple devices on the local system Specify the local device s port number used to connect to the IB fabric Counters values into ibdiagnet pm Reset all the fabric links pmCounters P counter lt lt PM gt lt Value gt gt r routing u fat tree lw 1x 4x 8x 12x ls lt 2 5 5 10 gt Skip ibdiag check Print any provided pm that is greater than its provided value Provide a report of the fabric qualities Indicate that UpDown credit loop checking Should be done against automatically determined roots Specify the expected link width
191. number of ports in each UP-going port group
- Switches of the same rank should have the same number of ports in each DOWN-going port group
- All the CAs have to be at the same tree level (rank)
If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules:
- Tree rank should be between two and eight (inclusively)
- All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks.
Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree.
Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.
The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will mat
192. nux software package (see www.mellanox.com > Products > Software > InfiniBand/VPI Drivers). The binary code is exported by the device as an expansion ROM image.
A.1.1 Supported Mellanox Adapter Devices and Firmware
The package supports all ConnectX, ConnectX-2 and ConnectX-3 network adapter devices and cards. Specifically, adapter products responding to the following PCI Device IDs are supported:
ConnectX / ConnectX-2 / ConnectX-3 devices:
- Decimal 25408 (Hexadecimal: 6340)
- Decimal 25418 (Hexadecimal: 634a)
- Decimal 25448 (Hexadecimal: 6368)
- Decimal 26418 (Hexadecimal: 6732)
- Decimal 26428 (Hexadecimal: 673c)
- Decimal 26438 (Hexadecimal: 6746)
- Decimal 26448 (Hexadecimal: 6750)
- Decimal 25458 (Hexadecimal: 6372)
- Decimal 26458 (Hexadecimal: 675a)
- Decimal 26468 (Hexadecimal: 6764)
- Decimal 26478 (Hexadecimal: 676e)
- Decimal 4099 (Hexadecimal: 1003)
A.1.2 Tested Platforms
See the Mellanox FlexBoot Release Notes (FlexBoot_release_notes.txt).
A.1.3 FlexBoot in Mellanox OFED
The FlexBoot package is provided as a tarball (.tgz extension) containing the files specified in Appendix A.1.1, "Supported Mellanox Adapter Devices and Firmware":
1. A PXE ROM image file for each of the supported Mellanox network adapter devices. Specifically, the following images are included:
ConnectX / ConnectX-2 / ConnectX-3 images:
- ConnectX FlexBoot PCI Dev
193. of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port GUID only) should be placed at the end of the qos-ulps match rules.
10.6.6.5 MPI
SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on the MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section.
10.6.7 SL2VL Mapping and VL Arbitration
The OpenSM cached options file has a set of QoS-related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
- Max VLs: the maximum number of VLs that will be on the subnet
- High limit: the limit of the High Priority component of the VL Arbitration table (IBA 7.6.9)
- VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template
- VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template
- SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15. (Note that VL15 used here means drop this SL.)
There are separate sets of QoS configuration parameters for various target types: CAs, routers, switch external ports, and the switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently suppor
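The SL2VL template described above, one VL per SL for SLs 0-15 with VL15 meaning "drop", can be sanity-checked with a few lines of code. This sketch assumes the list is written as comma-separated integers; parse_sl2vl is a hypothetical helper and not an OpenSM function.

```python
def parse_sl2vl(value):
    """Turn an SL2VL template string into a mapping {SL: VL or None}.

    None marks an SL whose traffic is dropped (mapped to VL15).
    """
    vls = [int(token) for token in value.split(",")]
    if len(vls) != 16:
        raise ValueError(f"expected 16 entries (SL0..SL15), got {len(vls)}")
    if any(not 0 <= vl <= 15 for vl in vls):
        raise ValueError("VL values must be in the range 0..15")
    return {sl: (None if vl == 15 else vl) for sl, vl in enumerate(vls)}


if __name__ == "__main__":
    mapping = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15")
    dropped = [sl for sl, vl in mapping.items() if vl is None]
    print("SLs that are dropped:", dropped)
```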
194. oftware package that utilizes CORE-Direct technology for implementing the MPI collectives communications.
1.2 Mellanox OFED Package
1.2.1 ISO Image
Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images, one per supported Linux distribution and CPU architecture, that include source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that performs the necessary steps to accomplish the following:
- Discover the currently installed kernel
- Uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor's commercial stack
- Install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
- Identify the currently installed InfiniBand HCAs and perform the required firmware updates
1.2.2 Software Components
MLNX_OFED_LINUX contains the following software components:
- Mellanox Host Channel Adapter Drivers
  - mlx5, mlx4 (VPI), which is split into multiple modules:
    - mlx4_core (low-level helper)
    - mlx4_ib (IB)
- Mid-layer core
  - Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
- Upper Layer Protocols (ULPs)
  - IPoIB, RDS, SRP Initiator and SRP
  NOTE: RDS was not tested by Mellanox Technologies.
- MPI
  - Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces
  - OSU MVAPICH stack supporting the
195. one or more of the sharing access bits. The sharing bits are part of the ibv_reg_mr man page.
- Turns on the IBV_ACCESS_ALLOCATE_MR bit
Step 2. Request to register to a shared MR
A new verb called ibv_reg_shared_mr is added to enable sharing an MR. To use this verb, the application supplies the MR ID that it wants to register for and the desired access mode to that MR. The desired access is validated against its given permissions, and upon successful creation, the physical pages of the original MR are shared by the new MR. Once the MR is shared, it can be used even if the original MR was destroyed.
The request to share the MR can be repeated numerous times, and an arbitrary number of Memory Regions can potentially share the same physical memory locations.
Usage:
- Uses the "handle" field that was returned from ibv_reg_mr as the mr_handle
- Supplies the desired "access mode" for that MR
- Supplies the address field, which can be either NULL or any hint as to the required output. The address and its length are returned as part of the ibv_mr struct. To achieve high performance, it is highly recommended to supply an address that is aligned as the original memory region address. Generally, it may be an alignment to a 4M address.
For further information on how to use the ibv_reg_shared_mr verb, please refer to the ibv_reg_shared_mr man page and/or to the ibv_shared_mr sample program, which demonstrates a basic usage of this verb.
Further information o
196. or Eth
8.1 Port Type Management
ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded.
Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices.
Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".
Possible port types are:
- eth - Ethernet
- ib - InfiniBand
- auto - Link sensing mode: detect the port type based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.
Table 8 lists the ConnectX port configurations supported by VPI.
Table 8 - Supported ConnectX Port Configurations
Port 1 Configuration   Port 2 Configuration
ib                     ib
ib                     eth
eth                    eth
Note that the configuration Port1 = eth and Port2 = ib is not supported.
The port link type can be configured for each device in the system at run time using the "/sbin/connectx_port_config" script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically).
In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the sele
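A provisioning script can reject an unsupported combination before invoking connectx_port_config. The check below is a sketch that encodes only the three rows of Table 8; it does not model the auto (link sensing) type, and the names SUPPORTED_CONFIGS and is_supported are hypothetical.

```python
# The (Port 1, Port 2) combinations listed in Table 8.
SUPPORTED_CONFIGS = {("ib", "ib"), ("ib", "eth"), ("eth", "eth")}


def is_supported(port1, port2):
    """Return True if the pair of port types appears in Table 8."""
    return (port1.lower(), port2.lower()) in SUPPORTED_CONFIGS


if __name__ == "__main__":
    for pair in [("ib", "eth"), ("eth", "ib")]:
        status = "supported" if is_supported(*pair) else "not supported"
        print(f"Port1={pair[0]}, Port2={pair[1]}: {status}")
```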
197. orithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
    -i <equalize-ignore-guids-file>
    -ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by GUIDs) that will be ignored by the link load equalization algorithm.
LMC awareness routes based on a (remote) system or switch basis.
10.5.3 UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high pressure).
The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop number vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.
198. ovide 3 of the 4 turns needed for the loop. In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.
Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.
In the presence of link or switch failures that result in a fabric for which torus-2QoS can generate credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, consider that same 2D 6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:
[Figure: master spanning tree on the 2D torus; x axis 0-5, y axis 0-4]
Two things are notable about this master spanning tree. First, assuming the x dateline was between
199. ows what the current selections are. This command is recommended for all users.
2. The mpi-selector command
This command is a CLI equivalent of mpi-selector-menu, allowing for the same functionality as mpi-selector-menu but without the interactive menus and prompts. It is suitable for scripting.
7.2.4 Compiling MPI Applications
Compiling MVAPICH Applications
Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html.
To review the default configuration of the installation, check the default configuration file:
    /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf
Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps.
7.3 MellanoX Messaging
MellanoX Messaging provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
- Multiple transport support, including RC and UD
- Proper management of HCA resources and memory structures
- Efficient memory registration
- One-sided communication semantics
- Connection management
- Receive-side tag matching
- Intra-node shared memory communication
These enhancements significantly increase the scalability and performance of message communicatio
200. p new carry if bit add res atomic response bit position carry new carry && MASK IS SET compare add mask bit position return atomic response
4.7 Ethernet Tunneling Over IPoIB Driver (eIPoIB)
Ethernet Tunneling Over IPoIB Driver is currently at alpha level. Please be aware that the content below is subject to change.
The eth_ipoib driver provides a standard Ethernet interface to be used as a Physical Interface (PIF) into the Hypervisor virtual network, and serves one or more Virtual Interfaces (VIFs). This driver supports L2 Switching (Direct Bridging) as well as other L3 Switching modes (e.g. NAT). This document explains the configuration and driver behavior when configured in Bridging mode.
4.7.1 Enabling the Driver
Once the mlnx_ofed driver installation is completed, perform the following:
Step 1. Open the /etc/infiniband/openib.conf file and include:
    E_IPOIB_LOAD=yes
Step 2. Restart the InfiniBand drivers:
    /etc/init.d/openibd restart
4.7.2 Configuring the Ethernet Tunneling Over IPoIB Driver
When eth_ipoib is loaded, a number of PIFs are created with the following default naming scheme: ethX, where X represents the ETH port available on the system.
For example, on a system with a dual-port HCA, the following two interfaces might be created: eth6 and ethe
These interfaces can be used to configure the network for the guest. For exa
201. p of this partition. 4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened. 5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service ID and use it while obtaining a PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE, and Lifetime. 6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection provide the CMA with the target IP and port number. ULPs might also provide a QoS Class. The CMA then creates a Service ID for the ULP and passes this ID and the optional QoS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP. PathRecord and MultiPathRecord Enhancement for QoS: As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided. A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for existing access layer and ULPs by not introducing a new set of PR/MPR attributes. 4.4.3 4.4.4 4.4.4.1 Supported Policy. The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections: 1. Port Group
202. pecification is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack. For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org 1.3.5 MPI. Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand: Open MPI, an open source MPI-2 implementation by the Open MPI Project; OSU MVAPICH, an MPI-1 implementation by Ohio State University. Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta. 1.3.6 InfiniBand Subnet Manager. InfiniBand-compliant ULPs require proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 10, "OpenSM Subnet Manager". 1.3.7 Diagnostic Utilities. Mellanox OFED includes the following two diagnostic packages for use by network and data center managers: ibutils (Mellanox Technologies diagnostic utilities) and infiniband-diags (OpenFabrics Alliance InfiniBand diagnostic tools). 1.3.8 Mellanox Firmware Tools. The Mellanox Firmware Tools (MFT) package is a set of firmwar
203. ple of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf. The value indicates a hexadecimal number. interface "ib1" { send dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; } Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf. The value indicates a hexadecimal number. interface "ib1" { send dhcp-client-identifier 20:00:55:04:01:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92; } In order to use the configuration file, run: host1# dhclient -cf dhclient.conf ib1 4.3.3.2 Static IPoIB Configuration. If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the -n option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface: a static IPoIB configuration; an IPoIB configuration based on an Ethernet configuration. See your Linux distribution documentation for additional information about configuring IP addresses. The following code lines are an excerpt from a sample IPoIB configuration file: # Static settings; all values provided by this file: IPADDR_ib0=11.4.3.175 NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.0.0 BROADCAST_ib0=11.4.255.255 ONBOOT_ib0=1 # Based on
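The static settings in the excerpt above are internally related: the NETWORK and BROADCAST values follow from IPADDR and NETMASK. A minimal Python sketch of that derivation is given here as an illustration only. The helper name `ipoib_static_settings` is hypothetical, and the `KEY_<interface>=value` spelling is an assumption about how the keys looked before punctuation was lost in this excerpt.

```python
import ipaddress


def ipoib_static_settings(ifname, ipaddr, netmask):
    """Build the static key/value pairs for one IPoIB interface.

    Hypothetical helper: the KEY_<ifname> naming is an assumed
    reconstruction of the sample configuration file.
    """
    iface = ipaddress.ip_interface(f"{ipaddr}/{netmask}")
    net = iface.network
    return {
        f"IPADDR_{ifname}": str(iface.ip),
        f"NETMASK_{ifname}": str(net.netmask),
        # Network and broadcast addresses are derived, not typed by hand.
        f"NETWORK_{ifname}": str(net.network_address),
        f"BROADCAST_{ifname}": str(net.broadcast_address),
        f"ONBOOT_{ifname}": "1",
    }


if __name__ == "__main__":
    settings = ipoib_static_settings("ib0", "11.4.3.175", "255.255.0.0")
    for key, value in settings.items():
        print(f"{key}={value}")
```

Deriving the two dependent values avoids the common mistake of editing the address or mask and leaving a stale broadcast address behind.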
204. presenting a client machine for the DHCP server: host host1 { next-server 11.4.3.7; filename "pxelinux.0"; fixed-address 11.4.3.130; option dhcp-client-identifier = 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; } A.4 Subnet Manager (OpenSM). This section applies to ports configured as InfiniBand only. FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM Subnet Manager" on page 101. To use OpenSM caching for large InfiniBand clusters (>100 nodes), it is recommended to use the OpenSM options described in Section 10.2.1, "opensm Syntax", on page 101. TFTP Server. When you set the filename parameter in your DHCP configuration file to a non-empty file name, the client will ask for this file to be passed through TFTP. For this reason, you need to install a TFTP server. BIOS Configuration. The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add "MLNX FlexBoot <ver>" to the list of boot devices for a ConnectX device. The priority of this list can be modified through BIOS setup
205. r physical contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr. Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers to the Contiguous Pages. To activate, set the below environment variables with values of PREFER_CONTIG or CONTIG. For QP: MLX_QP_ALLOC_TYPE. For CQ: MLX_CQ_ALLOC_TYPE. The following are all the possible values that can be allocated to the buffer. Table 3: Buffer Values (Possible Value: Description). ANON: use current pages (ANON small ones), default value. HUGE: force huge pages. CONTIG: force contiguous pages. PREFER_CONTIG: try contiguous, fall back to ANON small pages. PREFER_HUGE: try huge, fall back to ANON small pages. ALL: try huge, fall back to contiguous; if that fails, fall back to ANON small pages. Values are NOT case sensitive. Usage: The application calls the ibv_reg_mr API, which turns on the IBV_ACCESS_ALLOCATE_MR bit and sets the input address to NULL. Upon success, the address field of the struct ibv_mr will hold the address of the allocated memory block. This block will be freed implicitly when ibv_dereg_mr is called. The following are environment variables that can be used to control error cases/contiguity. Table 4: Parameters Used to Control Error
206. r this IPoIB MC group (default is 0). scope=<val>: specifies scope for this IPoIB MC group (default is 2, link local). Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048). PortGUIDs list: PortGUID: GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too. full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed. There are two useful keywords for PortGUID definition: 'ALL' means all end ports in this subnet; 'SELF' means the subnet manager's port. An empty list means that there are no ports in this partition. Notes: White space is permitted between delimiters ('=', ',', ':', ';'). The line can be wrapped after ':' following a Partition Definition and between. PartitionName does not need to be unique, but PKey does need to be unique. If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also next note). It is possible to split a partition configuration in more than one definition, but then the PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions). Examples: Default=0x7fff : ALL, SELF=full ; NewPartition , ipoib : 0x123456=full, 0x3456789034=limi
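Two rules in the excerpt above lend themselves to a small worked example: the encoded mtu value (4 stands for 2048 bytes) and the membership default (omitted or unrecognized means limited). The Python sketch below illustrates both. It is not OpenSM's parser: the helper names are hypothetical, the `=` separator is assumed from the examples, and the code table lists the five MTU encodings as commonly given for the IBTA specification.

```python
# Encoded MTU values; the excerpt itself confirms only "4 for 2048".
IB_MTU_CODES = {256: 1, 512: 2, 1024: 3, 2048: 4, 4096: 5}


def mtu_code(mtu_bytes):
    """Translate an MTU in bytes into its encoded value."""
    try:
        return IB_MTU_CODES[mtu_bytes]
    except KeyError:
        raise ValueError(f"no encoding for MTU {mtu_bytes}") from None


def parse_member(entry):
    """Parse one PortGUID entry such as '0x123456=full'.

    Returns (guid, membership). The keywords ALL and SELF are kept as
    strings; a missing or unrecognized membership becomes 'limited'.
    """
    guid_text, _, membership = entry.partition("=")
    guid_text = guid_text.strip()
    membership = membership.strip().lower()
    if membership not in ("full", "limited"):
        membership = "limited"
    if guid_text in ("ALL", "SELF"):
        return guid_text, membership
    # Hexadecimal numbers start with 0x; plain decimals are accepted too.
    base = 16 if guid_text.lower().startswith("0x") else 10
    return int(guid_text, base), membership


if __name__ == "__main__":
    print(mtu_code(2048))
    print(parse_member("0x123456=full"))
    print(parse_member("0x3456789034"))
```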
207. ramming language designed for high performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared, partitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor. In order to express parallelism, UPC extends ISO C 99 with the following constructs: an explicitly parallel execution model; a shared address space; synchronization primitives and a memory consistency model; memory management primitives. The UPC language evolved from experiences with three other earlier languages that proposed parallel extensions to ISO C 99: AC, Split-C, and Parallel C Preprocessor (PCP). UPC is not a superset of these three languages, but rather an attempt to distill the best characteristics of each. UPC combines the programmability advantages of the shared memory programming paradigm and the control over data layout and performance of the message passing programming paradigm. Mellanox ScalableUPC is based on the Berkeley UPC package (see http://upc.lbl.gov) and con
208. rder to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf: net.ipv4.tcp_timestamps=0 9.2.4 Tuning Power Management. Check that the output CPU frequency for each core is equal to the maximum supported and that all core frequencies are consistent. Check the maximum supported CPU frequency: # cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq Check that core frequencies are consistent: # cat /proc/cpuinfo | grep "cpu MHz" Check that the output frequencies are the same as the maximum supported. If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in section "Recommended BIOS Settings" (page 90) to verify that power state is disabled. Check the current CPU frequency to see whether it is configured to the maximum available frequency: # cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq 9.2.4.1 Setting the Scaling Governor. If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance: freq_table; acpi_cpufreq (this module is architecture dependent). It is also recommended to disable the module cpuspeed; this module is also architecture dependent. To set the scaling mode to performance, use: # echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor To disable cpuspeed, use
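The consistency check described above (every core should report the same frequency) can be automated instead of eyeballing grep output. The Python sketch below parses the "cpu MHz" lines of /proc/cpuinfo text. The function names and the 1 MHz tolerance are illustrative assumptions, not part of the manual.

```python
def cpu_mhz_values(cpuinfo_text):
    """Extract every 'cpu MHz' reading from /proc/cpuinfo-style text."""
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            values.append(float(value))
    return values


def frequencies_consistent(values, tolerance_mhz=1.0):
    """True when all cores report the same frequency within a tolerance.

    The tolerance is an assumed value; an empty list counts as a failure
    because nothing could be verified.
    """
    return bool(values) and (max(values) - min(values)) <= tolerance_mhz


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            readings = cpu_mhz_values(handle.read())
    except OSError:
        readings = []  # not a Linux host, or /proc is unavailable
    print("readings:", readings)
    print("consistent:", frequencies_consistent(readings))
```

Comparing the readings against the value in cpuinfo_max_freq (reported in kHz by the kernel) would be a natural next step on a real system.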
209. re 1 Mellanox OFED Stack for ConnectX Family Adapter Cards [stack diagram; recoverable labels: uDAPL, uverbs, Sockets Layer, TCP, Mid-Layer, IP, Netdevice, SRP, iSER, eIPoIB, mlx4_en, verbs, ib_core, mlx4_ib, IB and RoCE Adapter, Mellanox VPI Device (HCA/NIC)]. The following subsections briefly describe the various components of the Mellanox OFED stack. 1.3.1 mlx4 VPI Driver. mlx4 is the low-level driver implementation for the ConnectX, ConnectX-2, and ConnectX-3 adapters designed by Mellanox Technologies. ConnectX/ConnectX-2/ConnectX-3 can operate as an InfiniBand adapter or as an Ethernet NIC. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into the following modules: mlx4_core: handles low-level functions like device initialization and firmware command processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other. mlx4_ib: handles InfiniBand-specific functions and plugs into the InfiniBand midlayer. mlx4_en: a 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer. 1.3.2 ml
210. reverse hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high-bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops. 10.5.4.2 Activation through OpenSM. Use the '-R ftree' option to activate the fat-tree algorithm. LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead. 10.5.5 LASH Routing Algorithm. LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest-path routing algorithm that enables topology-agnostic, deadlock-free routing within communication networks. When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock. Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link from HCA between and switch does not need virtual layers, as deadlock will not arise between switch and HCA. In more detail, the algorithm works as follows: 1. LASH determines the shortest path between all pairs of source/destination switches. Note, LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a give
211. riority. User Verbs allows a user application QP to be attached into a specified flow when using the ibv_create_flow and ibv_destroy_flow verbs. ibv_create_flow: struct ibv_flow *ibv_create_flow(struct ibv_qp *qp, struct ibv_flow_attr *flow) Input parameters: struct ibv_qp: the attached QP. struct ibv_flow_attr: pointer to the flow specification structs that contain mandatory control parameters and optional L2, L3, and L4 headers. The optional headers are detected by setting the size and num_of_specs fields in the ibv_flow_attr struct, which is a mandatory control struct: struct ibv_flow_attr { uint32_t comp_mask; enum ibv_flow_attr_type type; uint16_t size; uint16_t priority; uint8_t num_of_specs; uint8_t port; uint32_t flags; /* Following are the optional layers according to user request: struct ibv_flow_spec_xxx, struct ibv_flow_spec_yyy */ }; struct ibv_flow_attr can be followed by the optional flow header structs: struct ibv_flow_spec_ib, struct ibv_flow_spec_eth, struct ibv_flow_spec_ipv4, struct ibv_flow_spec_tcp_udp. Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter. The supported masks are: all-one mask: include the parameter value in the attached rule; all-zero mask: ignore the parameter value in the attached rule. The flow type 1 s
212. roblematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output. After scanning the fabric, if the x option is provided, a full report of the fabric qualities is displayed. This report includes: SM report; number of nodes and systems; hop-count information (maximal hop count, an example path, and a hop-count histogram); paths traced; credit loop report; mgid-mlid-HCAs multicast group and report; partitions report; IPoIB report. In case the IB fabric includes only one CA, then CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports. Error Codes: 1. Failed to fully discover the fabric. 2. Failed to parse command line options. 3. Failed to interact with IB fabric. 4. Failed to use local device or local port. 5. Failed to use Topology File. 6. Failed to load required Package. 11.5 ibdiagpath (IB diagnostic path). ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path. The way ibdiagpath operates depends on the addressing mode used on the command line. If dire
213. rst port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on. An IPoIB configuration can be based on DHCP (Section 4.3.3.1) or on a static configuration (Section 4.3.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3.3). 4.3.3.1 IPoIB Configuration Based on DHCP. Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line: For RedHat: BOOTPROTO=dhcp. For SLES: BOOTPROTO=dhcp. If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine, or /etc/sysconfig/network/ on a SuSE machine. A patch for DHCP is required for supporting IPoIB. For further information, please see the README, which is available under the docs/dhcp/ directory. Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client
214. rt and GRH headers in Ethernet packets bearing a dedicated ether type. iSER: iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies. For further information, please refer to Chapter 4.2, "iSCSI Extensions for RDMA (iSER)". SRP: SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 4.1, "SCSI Protocol", and Appendix B, "SRP Target Driver". uDAPL: User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects (InfiniBand and RoCE). The uDAPL interface is defined by the DAT collaborative. This release of the uDAPL reference implementation package for both DAT 1.2 and 2.0 s
215. rtition Key (PKey) ff:ff applies to the primary (parent) interface. This section describes how to: create a subinterface (Section 4.3.5.1); remove a subinterface (Section 4.3.5.2). 4.3.5.1 Creating a Subinterface. In the following procedure, ib0 is used as an example of an IB subinterface. To create a child interface (subinterface), follow this procedure: Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 0 will give a PKey with the value 0x8000. Step 2. Create a child interface by running: host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child Example: host1$ echo 0 > /sys/class/net/ib0/create_child This will create the interface ib0.8000. Step 3. Verify the configuration of this interface by running: host1$ ifconfig <subinterface>.<subinterface PKey> Using the example of Step 2: host1$ ifconfig ib0.8000 ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Step 4. As can be seen, the interface does not have IP or network addresse
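The PKey rule in Step 1 (the value written to create_child has its most significant bit set before use) is easy to get wrong when scripting, so here is a minimal Python sketch of the arithmetic. The helper names are hypothetical, and the `<parent>.<pkey in hex>` naming is inferred from the single example in this excerpt (a value of 0 on ib0 gives ib0.8000).

```python
PKEY_MSB = 0x8000  # most significant bit of a 16-bit PKey


def effective_pkey(value):
    """Return the PKey actually used for a requested 16-bit value."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("PKey must be a 16-bit unsigned value")
    return value | PKEY_MSB


def child_interface_name(parent, value):
    """Name of the resulting child interface (assumed naming pattern)."""
    return f"{parent}.{effective_pkey(value):04x}"


if __name__ == "__main__":
    print(hex(effective_pkey(0)))
    print(child_interface_name("ib0", 0))
```

The same effective value, for example 0x8000, is what has to be written to delete_child when the subinterface is later removed.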
216. ry where the output files will be placed (default = /tmp). -lw <1x|4x|12x>: Specifies the expected link width. -ls <2.5|5|10>: Specifies the expected link speed. -pm: Dump all the fabric links' pm Counters into ibdiagnet.pm. -pc: Reset all the fabric links' pmCounters. -P <PM=<Trash>>: If any of the provided pm is greater than its provided value, print it to screen. -skip <skip-option(s)>: Skip the executions of the selected checks. Skip options (one or more can be specified): dup_guids, zero_guids, pm, logical_state, part, ipoib, all. -wt <file-name>: Write out the discovered topology into the given file. This flag is useful if you later want to check for changes from the current state of the fabric. A directory named ibdiag_ibnl is also created by this option, and holds the IBNL files required to load this topology. To use these files, you will need to set the environment variable named IBDM_IBNL_PATH to that directory. The directory is located in /tmp or in the output directory provided by the -o flag. -load_db <file-name>: Load subnet data from the given .db file, and skip the subnet discovery stage. Note: Some of the checks require actual subnet discovery, and therefore would not run when load_db is specified. These checks are: duplicated/zero guids, link state, SMs status. -h|--help: Prints the help page information. -V|--version: Prints the version of the tool. --vars: Prints the tool's environment variables and
217. Table 31: perfquery Flags and. Table 32: ibcheckerrs Flags and Options. Table 33: mstflint Switches. Table 34: mstflint Commands. Table 35: ibdump Options. Document Revision History. Table 1: Document Revision History (Release, Date, Description): 2.0-2.0.5, April 2013, Initial release. About this Manual. This Preface provides general information concerning the scope and organization of this User's Manual. Intended Audience. This manual is intended for system administrators responsible for the installation, configuration, management, and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers. Common Abbreviations and Acronyms. Table 2: Abbreviations and Acronyms (Abbreviation/Acronym, Whole Word/Description). B: (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes). b: (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits
218. s CAs or multiple ports without a CA port being specified a port is chosen by the utility according to the following criteria 1 The first ACTIVE port that is found 2 Ifnot found the first port that is UP physical link state is LinkUp Examples 1 Query the status of Port 1 of CA mlx4_0 using ibstatus and use its output the LID 3 in this case to obtain additional link information using ibportstate gt ibstatus mlx4 0 1 Infiniband device mlx4 0 port 1 status default gid e80 0000 0000 0000 0000 0000 9289 3895 base lid 0x3 Mellanox Technologies 163 Rev 2 0 2 0 5 2 Query the status of two channel adapters using directed paths 3 sm lid 0x3 Sieg 23 UNE phys state 5 LinkUp rate 20 Gb sec 4X DDR gt ibportstate C mlx4 0 3 1 query PortInfo Port info Lid 3 port 1 Initialize DhysluinkStatec te T E LinkUp mss opens e 1X or 4X 1X or 4 baal E ood Gon oe AX LinkSpeedSupported 2 5 Gbps or 5 0 Gbps E E T T T 2 5 Gbps or 5 0 Gbps TT 5 0 Gbps gt ibportstate C 1 4 0 D 01 PortInfo Port info DR path slid 65535 dlid 65535 0 port 1 THINKS Initialize DhyshinkSbatec e TE LinkUp LinkWidthSupported 1X or 4X niga 5 ooo ooo so0can oco 1X or 4X sod BOE INCL cci ocio 4X LinkSpeedSupported
219. s. To configure those, you should follow the manual configuration procedure described in Section 4.3.3.3. Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, be recognized (see Chapter 10, "OpenSM Subnet Manager"). 4.3.5.2 Removing a Subinterface. To remove a child interface (subinterface), run: echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child Using the example of Step 2: echo 0x8000 > /sys/class/net/ib0/delete_child Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8000 in the example above). 4.3.6 Verifying IPoIB Functionality. To verify your configuration and your IPoIB functionality, perform the following steps: Step 1. Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176: host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0 host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0 Step 2. Enter the ping command from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command: host1# ping -c 5 11.4.3.176 PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data. 64 bytes from 11.4.3.176: icmp_seq=0 ttl=6
220. s (i.e., multiple interconnects between switches). Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports. --priority, -p <PRIORITY>: This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. Range goes from 0 (lowest priority) to 15 (highest). --smkey, -k <SM_Key>: This option specifies the SM's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in a host byte order; it is fixed now, but you may need this option to interoperate with an old OpenSM running on a little-endian machine. --reassign_lids, -r: This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID. --routing_engine, -R <engine name>: This option chooses routing engine(s) to use instead of the default Min Hop algorithm. Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. If all configured routing engines fail, OpenSM will always attempt to route with Min Hop unless 'no_fallback' is included in the list of routing engines. Supported engines: updn, file, ftree, lash, dor, torus-2QoS. defau
221. s int Enable debug tracing if 0 int Attempt to use MSI X if nonzero int Tune the cpu s for better performance default 0 int Block multicast loopback packets if 0 default 1 int Is either one num vfs val for all devices or list of BDF to num vfs bb dd f val in hex string Is either one val num vfs to probe by the pf driver for all devices or list of BDF to probe vf bb dd f val in hex string log mgm size that defines the num of qp per mcg for example 10 gives 248 range 7 log num mgm entry size 12 To activate device managed flow steering when available set to 1 int Enable steering mode for higher packet rate default off int Enable fast packet drop when no recieve WQEs are posted int Enable 64 byte CQEs EQEs when the FW supports this if non zero default 1 int Log2 max number of MACs per ETH port 1 7 int Obsolete Log2 max number of VLANs per ETH port 0 7 int Log2 number of MTT entries per segment 0 7 default 0 int Is either one array of 2 port types t1 t2 for all devices or disg or wBDE to oore in hex string log maximum number of QPs per HCA default 19 int log maximum log number of RDMARC buffers per QP default 4 log maximun log maximun int number of SRQs per HCA default 16 int int number of CQs per HCA default 16 int number of multicast groups per HCA default 13 in
222. Table 12: Supported ConnectX Port Configurations. Table 13: Recommended PCIe Configuration. Table 14: Recommended BIOS Settings for Intel Sandy Bridge Processors. Table 15: Recommended BIOS Settings for Intel Nehalem/Westmere Processors. Table 16: Recommended BIOS Settings for AMD Processors. Table 17: Adaptive Routing Manager Options File. Table 18: Adaptive Routing Manager Pre-Switch Options. Table 19: Congestion Control Manager General Options File. Table 20: Congestion Control Manager Switch Options. Table 21: Congestion Control Manager CA Options File. Table 22: Congestion Control Manager CC MGR Options File. Table 23: ibdiagnet (of ibutils2) Output Files. Table 24: ibdiagnet (of ibutils) Output Files. Table 25: ibdiagpath Output Files. Table 26: devinfo Flags and Options. Table 27: ibstatus Flags and Options. Table 28: ibportstate Flags and Options. Table 29: ibportstate Flags and Options. Table 30: smpquery Flags and Option
223. ser or automatically at startup by setting SRPHA_ENABLE to yes. 4.1.2.6 High Availability. Overview: High Availability works using the Device-Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices. Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well. Operation: When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA). When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on
224. ses OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state. --sweep, -s <interval>: This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds. --timeout, -t <milliseconds>: This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds. --retries <number>: This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions. --maxsmps, -n <number>: This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs. --console, -q [off|local]: This option activates the OpenSM console (default off). --ignore-guids, -i <equalize-ignore-guids-file>: This option provides the means to define a set of ports (by guid) that will be ignored by the link load equalization algorithm. --hop_weights_file, -w <path to file>: This option provides the means to define a weighting factor per port for customizing the least weight hops for the routing. --dimn_ports_file, -O <path to file>: This option provides the means to
225. simple QoS definition effectively, it is important to understand how each of the ULPs is matched. 10.6.6.1 IPoIB. An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent: ipoib : <SL> ; ipoib, pkey 0x7fff : <SL> ; any, pkey 0x7fff : <SL>. 10.6.6.2 SDP. An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent: sdp : <SL> ; any, service-id 0x0000000000010000-0x000000000001ffff : <SL>. 10.6.6.3 RDS. Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes a default Service ID 0x00000000010648CA. The following two match rules are equivalent: rds : <SL> ; any, service-id 0x00000000010648CA : <SL>. 10.6.6.4 SRP. The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent: srp, target-port-guid 0x1234 : <SL> ; any, target-port-guid 0x1234 : <SL>. Note that any
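The Service ID layouts described above (a fixed prefix followed by the 16-bit remote port) can be checked with a few lines of arithmetic. The Python sketch below composes the SDP and RDS Service IDs from a port number; the function names are hypothetical and the code only restates the two bit patterns given in the text.

```python
SDP_SERVICE_ID_BASE = 0x0000000000010000  # 0x000000000001PPPP
RDS_SERVICE_ID_BASE = 0x0000000001060000  # 0x000000000106PPPP
RDS_DEFAULT_PORT = 0x48CA


def _check_port(port):
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return port


def sdp_service_id(port):
    """Service ID matched for an SDP connection to the given port."""
    return SDP_SERVICE_ID_BASE | _check_port(port)


def rds_service_id(port=RDS_DEFAULT_PORT):
    """Service ID matched for an RDS connection to the given port."""
    return RDS_SERVICE_ID_BASE | _check_port(port)


def format_service_id(service_id):
    """Render a Service ID as a 64-bit hexadecimal literal."""
    return f"0x{service_id:016X}"


if __name__ == "__main__":
    print(format_service_id(rds_service_id()))
    print(format_service_id(sdp_service_id(0)))
    print(format_service_id(sdp_service_id(0xFFFF)))
```

The first and last SDP values printed are the two ends of the service-id range used in the equivalent "any" match rule.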
226. sit the failed components will continue to work, and the subset of paths that are still operational will continue to remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored.
In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath.
To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.
10.5.7.6 Torus-2QoS Configuration File Syntax
The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is # are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.
    torus|mesh x_radi
227. t log maximum HCA defau
    log maximum number of memory protection table entries per HCA (default: 19) (int)
    log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs for register all of the host memory limited to 30)) (int)
    Enable Quality of Service support in the HCA (default: off) (bool)
    Reset device on internal errors if non-zero (default 1, in SRIOV mode default is 0) (int)
C.2 mlx4_ib Parameters
    sm_guid_assign: Enable SM alias GUID assignment if sm_guid_assign > 0 (Default: 1) (int)
    dev_assign_str: Map all device function numbers to IB device numbers following the pattern bb:dd.f-0,bb:dd.f-1,... (all numbers are hexadecimals). Max supported devices: 32 (string)
Appendix D mlx5 Module Parameters
The mlx5_ib module supports a single parameter used to select the profile, which defines the number of resources supported. The parameter name for selecting the profile is prof_sel. The supported values for profiles are:
    0 - low number of resources
    1 - medium number of resources
    2 - large number of resources (default)
Using the default profile increases the number of times the driver is loaded/unloaded.
228. t all the ports in the subnet that belong to partition with a given name belong to this port group
    Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port).
QoS Setup (denoted by qos-setup)
This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
QoS Levels (denoted by qos-levels)
Each QoS Level defines a Service Level (SL) and a few optional fields:
    MTU limit
    Rate limit
    PKey
    Packet lifetime
When a path(s) search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is the DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Like any other QoS Level, it can also be explicitly referred to by any match rule.
IV. QoS Matching Rules (denoted by qos-match-rules)
Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, so the first match takes precedence. Each rule has the name of the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule.
229. t mostly serves the following users:
    The mlx4 Ethernet driver attaches its unicast and multicast MAC addresses to its QP using L2 flow specifications.
    The mlx4 driver when it attaches its QP to its configured GIDs.
Fragmented UDP traffic cannot be steered. It is treated as "other" protocol by hardware (from the first packet) and is not considered UDP traffic.
6 Single Root IO Virtualization (SR-IOV)
Single Root IO Virtualization (SR-IOV) is currently at beta level. Please be aware that the content below is subject to change.
Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. In ConnectX-3 adapter cards, Mellanox adapters are capable of exposing 63 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals those of the Physical Function.
SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual machines direct hardware access to network resources, hence increasing their performance. In this chapter we will d
230. t of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. It especially suits 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis.
If some switches do not support AR, they will slow down the AR Manager, as it may get timeouts on the AR-related queries to these switches.
10.8.2 Installing the Adaptive Routing
Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of the Mellanox OFED installation.
10.8.3 Running Subnet Manager with Adaptive Routing Manager
Adaptive Routing (AR) Manager can be enabled/disabled through the SM options file.
10.8.3.1 Enabling Adaptive Routing
To enable Adaptive Routing, perform the following:
1. Create the Subnet Manager options file. Run:
    opensm -c <options-file-name>
2. Add armgr to the event_plugin_name option in the file:
    # Event plugin name(s)
    event_plugin_name armgr
3. Run Subnet Manager with the new options file:
    opensm -F <options-file-name>
Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To provide an alternative location, please perform the following:
231. ta transfer to a different PE), get operations (data transfer from a different PE), and remote pointers, allowing direct references to data objects owned by another PE. Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object.
SHMEM libraries implement active messaging. The sending of data involves only one CPU, where the source processor puts the data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this.
7.1.1 ScalableSHMEM
The ScalableSHMEM programming library is a one-sided communications library that supports a unique set of parallel programming features, including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application.
Mellanox ScalableSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED), and also has the ability to utilize Mellanox Messaging libraries as well as Mellanox Fabric Collective Accel
232. tails, run smparquery -h.
10.8.5 Adaptive Routing Manager Options File
The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following:
1. Add armgr --conf_file <ar-mgr-options-file-name> to the event_plugin_options option in the file:
    # Options string that would be passed to the plugin(s)
    event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
2. Run Subnet Manager with the new options file:
    opensm -F <options-file-name>
The AR Manager options file contains two types of parameters:
1. General options: options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric.
2. Per-switch options: options which describe specific switch behavior.
Note the following:
    The Adaptive Routing configuration file is case sensitive.
    You can specify options for a nonexistent switch GUID. These options will be ignored until a switch with a matching GUID is added to the fabric.
    The Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is executed at every heavy sweep of the Subnet Manager.
    If the AR Manager fails to parse the options file, default settings for all the options will be used.
10.8.5.1 General AR Manager Options
Tabl
233. tains the following enhancements:
    The GasNet library used within UPC is integrated with Mellanox FCA, which offloads collective operations from UPC. For further information on FCA, please refer to the Mellanox website.
    The GasNet library contains an MXM conduit, which offloads from UPC all P2P operations as well as some synchronization routines. For further information on MXM, please refer to the Mellanox website.
Mellanox OFED 1.8 includes ScalableUPC 2.1, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as source RPM as well and can be downloaded from the Mellanox website.
Installing ScalableUPC
Mellanox ScalableUPC is installed as part of the MLNX_OFED package.
Mellanox OFED 1.8.5 includes ScalableUPC Rev 2.2, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8.5, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as source RPM as well and can be downloaded from the Mellanox website.
Please note that the binary distribution of ScalableUPC is compiled with the following defaults:
    FCA support: FCA is disabled at runtime by default and must be configured prior to using it from ScalableUPC. For further information, please refer to the FCA User Manual.
    support enabled by default
7.5.2 Runtime Param
234. ted here is subject to change.
Time Stamping is currently supported in ConnectX-3 cards only.
Time stamping is the process of keeping track of the creation of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed on the PCI, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.
4.5.1 Enabling Time Stamping
Time stamping is off by default and should be enabled before use.
To enable time stamping for a socket:
Call setsockopt() with SO_TIMESTAMPING and with the following flags:
    SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
    SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software
    SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp as generated by the hardware
    SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or fails, then do it in software
    SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
    SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base
    SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in software
SOF_TIMESTAMPING_TX/RX determine how time stamps are generated. SOF_TIMESTAMPING_RAW/SYS
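The setsockopt() call described above takes a bit mask built from the SOF_TIMESTAMPING_* flags. The following Python sketch shows one way to compose that mask. The numeric values are assumptions: they are copied from the Linux header linux/net_tstamp.h as commonly shipped, and 37 is the usual SO_TIMESTAMPING value on x86; verify both against the headers on the target system. The function names are hypothetical.

```python
import socket

# Assumed values, mirroring linux/net_tstamp.h; check the local headers.
SOF_TIMESTAMPING = {
    "TX_HARDWARE": 1 << 0,
    "TX_SOFTWARE": 1 << 1,
    "RX_HARDWARE": 1 << 2,
    "RX_SOFTWARE": 1 << 3,
    "SOFTWARE": 1 << 4,
    "SYS_HARDWARE": 1 << 5,
    "RAW_HARDWARE": 1 << 6,
}

# Use Python's constant when the build exposes it; 37 is an assumed fallback.
SO_TIMESTAMPING = getattr(socket, "SO_TIMESTAMPING", 37)


def timestamping_mask(*names):
    """OR together the named SOF_TIMESTAMPING_* flags."""
    mask = 0
    for name in names:
        mask |= SOF_TIMESTAMPING[name]
    return mask


def enable_timestamping(sock, mask):
    """Request time stamps on an open socket (Linux only; may raise OSError)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, mask)
```

For example, timestamping_mask("TX_HARDWARE", "RX_HARDWARE", "RAW_HARDWARE") requests hardware time stamps in both directions and asks for the raw hardware value to be reported.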
235. ted sets:
    qos_ca_* - QoS configuration parameters set for CAs
    qos_rtr_* - parameters set for routers
    qos_sw0_* - parameters set for switches' port 0
    qos_swe_* - parameters set for switches' external ports
Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):
    qos_ca_max_vls 15
    qos_ca_high_limit 0
    qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 0
    qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15 or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.
Note that the same VLs may be listed multiple times
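To make the VL/Weight pair rules concrete, here is a minimal Python sketch that parses a VL arbitration list and converts each weight into bytes per arbitration turn. It assumes the comma-separated VL:weight form used in the OpenSM options file; the function names are hypothetical and the code is not taken from OpenSM.

```python
def parse_vlarb(spec):
    """Parse 'VL:weight,VL:weight,...' into a list of (vl, weight) tuples.

    Enforces the ranges given in the text: VL 0-14, weight 0-255.
    """
    entries = []
    for item in spec.split(","):
        vl_text, weight_text = item.split(":")
        vl, weight = int(vl_text), int(weight_text)
        if not 0 <= vl <= 14:
            raise ValueError("VL out of range: %d" % vl)
        if not 0 <= weight <= 255:
            raise ValueError("weight out of range: %d" % weight)
        entries.append((vl, weight))
    return entries


def bytes_per_turn(entries):
    """Weights count 64-byte units; entries with weight 0 are skipped."""
    return [(vl, weight * 64) for vl, weight in entries if weight > 0]


if __name__ == "__main__":
    table = parse_vlarb("0:4,1:0,2:0")
    print(bytes_per_turn(table))
```

With the sample list, only VL0 takes part in arbitration and may send 256 bytes on its turn, because the other entries carry a weight of 0.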
236. tension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64-bit target data for the "compare" check, as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:
    atomic_response = *va
    if (!((compare_add ^ *va) & compare_add_mask)) then
        *va = (*va & ~(swap_mask)) | (swap & swap_mask)
    return atomic_response
The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.
4.6.1.2 Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:
    bit_adder(ci, b1, b2, *co)
    {
        value = ci + b1 + b2
        *co = !!(value & 2)
        return value & 1
    }
    #define MASK_IS_SET(mask, attr) (!!((mask) & (attr)))
    bit_position = 1
    carry = 0
    atomic_response = 0
    for i = 0 to 63
    {
        if (i != 0)
            bit_position = bit_position << 1
        bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position), MASK_IS_SET(compare_add, bit_position), &
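The two masked atomic operations can be modeled in software to check one's understanding of the masks. The Python sketch below is an illustration only, not the hardware or driver implementation. It assumes that a set bit in the boundary mask marks the most significant bit of a field, so that the carry out of that bit is dropped; the function names and the (old, new) return convention are choices made for this sketch.

```python
MASK64 = (1 << 64) - 1


def masked_cmp_swap(value, compare, compare_mask, swap, swap_mask):
    """Model MskCmpSwap on a 64-bit word; returns (old_value, new_value)."""
    old = value
    # Compare only the bits selected by compare_mask.
    if ((compare ^ value) & compare_mask) == 0:
        # Replace only the bits selected by swap_mask.
        value = (value & ~swap_mask & MASK64) | (swap & swap_mask)
    return old, value


def masked_fetch_add(value, add, boundary_mask):
    """Model MFetchAdd: add bit by bit, dropping the carry at field boundaries.

    Returns (old_value, new_value).
    """
    result = 0
    carry = 0
    for i in range(64):
        bit = 1 << i
        total = carry + ((value >> i) & 1) + ((add >> i) & 1)
        if total & 1:
            result |= bit
        # Assumption: a set boundary bit ends the field, so no carry leaves it.
        carry = 0 if (boundary_mask & bit) else (total >> 1)
    return value, result
```

With a boundary at bit 7, adding 1 to 0x00FF wraps the low byte to 0x00 without touching the next field, whereas with no boundary the same addition gives 0x0100.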
237. terleaving between interfaces. The following script can be used to separate each adapter's IRQs to a different set of cores:
    set_irq_affinity_cpulist.sh <cpu list> <interface>
<cpu list> can be either a comma-separated list of single core numbers (0,1,2,3) or core groups (0-3).
Example: If the system has 2 adapters on the same NUMA node (0-7), each with 2 interfaces, run the following:
    /etc/init.d/irqbalancer stop
    set_irq_affinity_cpulist.sh 0-1 eth2
    set_irq_affinity_cpulist.sh 2-3 eth3
    set_irq_affinity_cpulist.sh 4-5 eth4
    set_irq_affinity_cpulist.sh 6-7 eth5
9.2.8 Tuning Multi-Threaded IP Forwarding
To optimize NIC usage as IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
    For MLNX_OFED-2.0.x:
        options mlx4_en inline_thold=0
        options mlx4_core high_rate_steer=1
    For MLNX_EN-1.5.10:
        options mlx4_en num_lro=0 inline_thold=0
        options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
    set_irq_affinity_bynode.sh <numa node> <interface>
4. Forwarding from one interface to another:
    set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
5. Disable adaptive interrupt moderation and set status values, using:
    ethtool -C adaptive-rx off
10 OpenSM Subnet Manager
10.1 Overview
OpenSM is an In
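The cpu list argument of the IRQ affinity script in the excerpt above accepts either single core numbers or core groups. The Python sketch below expands such a list into individual core numbers. It is an illustration under the assumption that groups are written as first-last (for example 0-3); it does not reproduce the script's own parsing, and the function name is hypothetical.

```python
def parse_cpu_list(text):
    """Expand '0,1,2,3' or '0-3' (or a mix such as '0-1,4') into core numbers."""
    cores = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            cores.extend(range(int(first), int(last) + 1))
        else:
            cores.append(int(part))
    return cores


if __name__ == "__main__":
    # The four interfaces of the example each get two cores of NUMA node 0-7.
    for cpus, iface in [("0-1", "eth2"), ("2-3", "eth3"), ("4-5", "eth4"), ("6-7", "eth5")]:
        print(iface, parse_cpu_list(cpus))
```

Expanding the lists this way makes it easy to confirm that no core is assigned to two interfaces.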
238. ters for LID 2 Port 1:
    > ibcheckerrs -v 2 1
    Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK
3. Check the LID 2 Port 1 using the specified threshold file:
    > cat thresh1
    SymbolErrors=10
    LinkRecovers=10
    LinkDowned=10
    RcvErrors=10
    RcvRemotePhysErrors=100
    RcvSwRelayErrors=100
    XmtDiscards=100
    XmtConstraintErrors=100
    RcvConstraintErrors=100
    LinkIntegrityErrors=10
    ExcBufOverrunErrors=10
    VL15Dropped=100
    > ibcheckerrs -v -T thresh1 2 1
    Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK
11.15 mstflint
Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.
If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.
To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.
Synopsis
    mstflint [switches] <command> [parameters...]
Output Files
Table 29 lists the various switches of the utility, and Table 30 lists its commands.
Table 29 - mstflint Switches (Sheet 1 of 3)
239. ters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.
DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.
4.4.2 QoS Architecture
QoS functionality is split between the SM/SA, CMA, and the various ULPs. We take the "chronology approach" to describe how the overall system works.
1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS Levels. The policy also defines how to decide which QoS Level each application, ULP, or service uses.
2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS Class, Service ID, and PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE, and Packet Lifetime available on the multicast group which forms the broadcast grou
240. the content below is subject to change.
4.2.1 Overview
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies.
4.2.2 iSER Initiator
The iSER initiator is controlled through the iSCSI interface available from the iscsi-initiator-utils package. Target settings such as timeouts and retries are set the same way as for any other iSCSI target.
If targets are set to auto-connect on boot and the targets are unreachable, the boot process may take a long time to continue if timeouts and max retries are set too high.
Example for discovering and connecting targets over iSER:
    iscsiadm -m discovery -o new -o old -t st -I iser -p <ip:port> -l
iSER also supports RoCE without any additional configuration required. To bond the RoCE interfaces, set the fail_over_mac option in the bonding driver.
4.3 IP over InfiniBand
4.3.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following ConnectX/ConnectX-2/ConnectX-3 capabilities:
    Uses any ConnectX IB ports (one or two)
    Inserts IP/UDP/TCP checksum on outgoing packets
    Calculates checksum on received packets
    Support net
241. the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded.
The order of the following commands (for loading modules) is critical.
    echo "loading Mellanox ConnectX EN driver"
    /sbin/insmod /lib/modules/mlnx_en/mlx4_core.ko
    /sbin/insmod /lib/modules/mlnx_en/mlx4_en.ko
Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.
Step 9. Save the init file.
Step 10. Close initrd.
    host1$ cd /tmp/initrd_en
    host1$ find . | cpio -H newc -o > /tmp/new_initrd_en.img
    host1$ gzip /tmp/new_initrd_en.img
At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it properly.
A.10 iSCSI Boot
Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd (Linux). There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.
If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver as described in Section A.8.
A.10.1 Configurin
242. to the torus
10.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines.
Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations. For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS.
Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated.
Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the range 0-3, and also in the range
243. trol
10.9.3 Configuring Congestion Control
10.9.4 Configuring Congestion Control Manager Main Settings
Chapter 11 InfiniBand Fabric Diagnostic Utilities
11.1 Overview
11.2 Utilities Usage
11.2.1 Common Configuration, Interface and Addressing
11.2.2 InfiniBand Interface
11.2.3 Addressing
11.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
11.4 ibdiagnet (of ibutils) - IB Net Diagnostic
11.5 ibdiagpath - IB diagnostic
11.6 ibv_devices
11.7 ibv_devinfo
11.8 ibdev2netdev
11.9 ibstatus
11.10 ibportstate
11.11 ibroute
11.12 smpquery
11.13 perfquery
11.14 ibcheckerrs
244. und pages with SHMEM:
    /opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5
If using compound pages is not possible, then the user will fall back to the regular hugepages mechanism.
To force use of the compound pages allocator, run the following command:
    /opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5 -x CONTIG_PAGES=1
For further information on Contiguous Pages, please refer to Section 4.8, "Contiguous Pages".
7.1.5 Running ScalableSHMEM Application
The ScalableSHMEM framework contains the shmemrun utility, which launches the executable from a service node to compute nodes. This utility accepts the same command line parameters as mpirun from the OpenMPI package. For further information, please refer to the OpenMPI MCA parameters documentation at http://www.open-mpi.org/faq/?category=running.
Run shmemrun --help to obtain the ScalableSHMEM job launcher runtime parameters.
ScalableSHMEM contains support for the environment module system (http://modules.sf.net). The modules configuration file can be found at:
    /opt/mellanox/openshmem/2.2/etc/shmem_modulefile
7.2 Message Passing Interface
7.2.1 Overview
Mellanox OFED for Linux includes the following Message Passing Interface (MPI) implementations over InfiniBand:
    Open MPI 1.4.6 & 1.6.1 - an open source MPI-2 implementation by the Open MPI Project
    OSU MVAPICH2 1.7 - an MPI-1 implementation by Ohio
245. uration
    PCIe Generation: 3.0
    Speed: 8GT/s
    Width: x8 or x16
    Max Payload size: 256
    Max Read Request: 4096
For ConnectX3-based network adapters, 40GbE Ethernet adapters, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.
9.1.2 Memory Configuration
For high performance, it is recommended to use the highest memory speed with the fewest DIMMs, and to populate all memory channels for every CPU installed. For further information, please refer to your vendor's memory configuration instructions or the memory configuration tool available online.
9.1.3 Recommended BIOS Settings
These performance optimizations may result in higher power consumption.
9.1.3.1 General
Set BIOS power management to Maximum Performance.
9.1.3.2 Intel Sandy Bridge Processors
The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors.
Table 10 - Recommended BIOS Settings for Intel Sandy Bridge Processors
    BIOS Option                                Values
    General    Operating Mode/Power profile    Maximum Performance
    Processor  C-States                        Disabled
               Turbo mode                      Enabled
               Hyper-Threading                 HPC: disabled; Data Centers: enabled
               CPU frequency select            Max performance
    Memory     Memory speed                    Max performance
               Memory channel mode             Independent
               Node
246. v2netdev
ibdev2netdev enables association between IB devices and ports and the associated net device. Additionally, it reports the state of the net device link.
Synopsis
    ibdev2netdev [-v] [-h]
Options
    -v    Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
    -h    Print help messages.
Example
    sw417:BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> eth5 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> ib0 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 2 (DOWN) ==> ib1 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 1 (DOWN) ==> eth2 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN) ==> eth3 (Down)
    sw417:BXOFED-1.5.2-20101128-1524 # ibdev2netdev
    mlx4_0 port 1 ==> eth5 (Down)
    mlx4_0 port 1 ==> ib0 (Down)
    mlx4_0 port 2 ==> ib1 (Down)
    mlx4_1 port 1 ==> eth2 (Down)
    mlx4_1 port 2 ==> eth3 (Down)
11.9 ibstatus
Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width, and port rate.
Synopsis
    ibstatus [-h] [<device name>[:<port>
247. via the command line:
    mpirun -sl 0
OpenSM QoS policy file
In the following policy file example, replace OST* and MDS* with the real port GUIDs.
    qos-ulps
        default                                         : 0 # default SL (for MPI)
        any, target-port-guid OST1,OST2,OST3,OST4       : 1 # SL for Lustre OST
        any, target-port-guid MDS1,MDS2                 : 2 # SL for Lustre MDS
    end-qos-ulps
OpenSM options file
    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 2:1
    qos_vlarb_low 0:96,1:224
    qos_sl2vl 0,1,2,3,15,15,15,15,15,15,15,15,15,15,15,15
10.7.2 EDC SOA (2-tier): IPoIB and SRP
The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.
QoS Levels
    Application traffic
        IPoIB (UD and CM) and SDP
        Isolated from storage
        Min BW of 50%
    SRP
        Min BW 50%
        Bottleneck at storage nodes
Administration
OpenSM QoS policy file
In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.
    qos-ulps
        default                                         : 0
        ipoib                                           : 1
        sdp                                             : 1
        srp, target-port-guid SRPT1,SRPT2,SRPT3         : 2
    end-qos-ulps
OpenSM options file
    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:32
    qos_vlarb_low 0:1
    qos_sl2vl 0,1,2,3,15,15,15,15,15,15,15,15,15,15,15,15
10.7.3 EDC (3-tier): IPoIB, RDS, SRP
The following is an example of QoS configuration
248. vice ID in the sent PR/MPR.
IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE, and Packet Lifetime available on the multicast group which forms this broadcast group.
4.4.4.2 SRP
The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service-ID in the PR/MPR by itself and uses that information in setting up the QP. The SRP Service-ID is defined by the SRP target I/O Controller (it also complies with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
4.4.5 OpenSM Features
The QoS-related functionality that is provided by OpenSM, the Subnet Manager described in Chapter 10, can be split into two main parts:
I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.
II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first, the request is matched against the defined match rules such that the target QoS Level definition is found. Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.
4.5 Time-Stamping Service
Time Stamping is currently at beta level. Please be aware that everything lis
249. with all the default values. See "Configuring Congestion Control Manager".
For further details on the list of CC Manager options, please refer to the IB spec.
10.9.4 Configuring Congestion Control Manager Main Settings
To fine-tune the CC mechanism and CC Manager behavior, and to set the CC Manager main settings, perform the following:
    To enable/disable the Congestion Control mechanism on the fabric nodes, set the following parameter:
        enable
        The values are: <TRUE | FALSE>. The default is: True.
    CC Manager configures CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive the CC mechanism is in its response to congestion. To manually modify CC Manager behavior by providing it with an arbitrary fabric size, set the following parameter:
        num_hosts
        The values are: [0-48K]. The default is: 0 (base on the CCT calculation on the current subnet size).
    The smaller the number value of the parameter, the faster HCAs will respond to the congestion and will throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter:
        marking_rate
        The values are: [0-0xffff]. The default is: 0xa.
    You can set the minimal packet size that can be marked with FECN. Any packet less than this s
250. x m M t T y_radix[m|M|t|T] z_radix[m|M|t|T]: Either torus or mesh must be the first keyword in the configuration, and sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology. Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8). xp_link sw0_GUID sw1_GUID, yp_link sw0_GUID sw1_GUID, zp_link sw0_GUID sw1_GUID, xm_link sw0_GUID sw1_GUID, ym_link sw0_GUID sw1_GUID, zm_link sw0_GUID sw1_GUID: These keywords are used to seed the torus/mesh topology. For example, "xp_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the positive x direction, while "xm_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch. In general it is not necessary to configure both the positive and negative directions for a given coordinate
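The suffix rule described in the excerpt above can be illustrated with a small sketch. It is not OpenSM's parser: `parse_topology` is a hypothetical helper, and it assumes the topology line consists of the leading keyword followed by exactly three radix specifications. It resolves each dimension to a (radix, looped) pair, where the leading keyword supplies the default and an m/M or t/T suffix overrides it.

```python
def parse_topology(line):
    """Parse a 'torus|mesh x y z' line into (radix, looped) per dimension.

    Hypothetical helper for illustration only; not part of OpenSM.
    """
    keyword, *dims = line.split()
    if keyword not in ("torus", "mesh"):
        raise ValueError("first keyword must be 'torus' or 'mesh'")
    if len(dims) != 3:
        raise ValueError("expected x, y and z radix specifications")

    # The leading keyword sets the default for dimensions without a suffix.
    default_looped = keyword == "torus"

    result = []
    for spec in dims:
        suffix = spec[-1]
        if suffix in "mM":      # mesh: open dimension
            radix, looped = int(spec[:-1]), False
        elif suffix in "tT":    # torus: looped dimension
            radix, looped = int(spec[:-1]), True
        else:                   # no suffix: inherit from the keyword
            radix, looped = int(spec), default_looped
        result.append((radix, looped))
    return result


# The two lines quoted in the text resolve to the same per-dimension layout.
print(parse_topology("mesh 3T 4 5") == parse_topology("torus 3 4M 5M"))  # True
```

Both lines resolve to a looped x dimension of radix 3 and open y and z dimensions of radix 4 and 5, which is why the text calls them the same topology.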
251. x01 ERROR (error messages); 0x02 INFO (basic messages, low volume); 0x04 VERBOSE (interesting stuff, moderate volume); 0x08 DEBUG (diagnostic, high volume); 0x10 FUNCS (function entry/exit, very high volume); 0x20 FRAMES (dumps all SMP and GMP frames); 0x40 ROUTING (dump FDB routing information); 0x80 (currently unused). Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option. --debug, -d <number>: This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows: -d0: Ignore other SM nodes; -d1: Force single-threaded dispatching; -d2: Force log flushing after each log message; -d3: Disable multicast support; -d10: Put OpenSM in testability mode. Without -d, no debug options are enabled. F: Display this usage info, then exit. 10.2.2 Environment Variables: The following environment variables control opensm behavior. OSM_TMP_DIR: Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, opensm.mcfdbs. By default, this directory is /var/log. OSM_CACHE_DIR: opensm stores certain d
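Because the -D value is a bitmask, it can help to see how a given number decomposes into log classes. The following is a minimal sketch using the bit values listed above; the names `VERBOSITY_FLAGS` and `decode_verbosity` are illustrative and are not part of OpenSM.

```python
# Bit values as listed for the OpenSM -D option; 0x80 is currently unused.
VERBOSITY_FLAGS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_verbosity(mask):
    """Return the names of the log classes enabled by a -D bitmask."""
    return [name for bit, name in VERBOSITY_FLAGS.items() if mask & bit]


print(decode_verbosity(0x03))  # ['ERROR', 'INFO'], the default without -D
print(decode_verbosity(0x00))  # [], all messages disabled
```

Going the other way, a mask is built by OR-ing the bits together, for example `0x01 | 0x02 | 0x40` to add routing dumps to the default.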
252. x5 Driver: mlx5 is the low-level driver implementation for the Connect-IB adapters designed by Mellanox Technologies. Connect-IB operates as an InfiniBand adapter. The mlx5 driver comprises the following kernel modules: mlx5_core: Acts as a library of common functions (e.g., initializing the device after reset) required by the Connect-IB adapter card. mlx5_ib: Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer. 1.3.3 Mid-layer Core: Core services include the management interface (MAD), the connection manager (CM) interface, and the Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user mode for verbs, CM, and management. 1.3.4 ULPs: IPoIB: The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB prepends an encapsulation header to the IP datagrams and sends the outcome over the InfiniBand transport service. The transport service is Unreliable Datagram (UD) by default, but it may also be configured to be Reliable Connected (RC). The interface supports unicast, multicast, and broadcast. For details, see Chapter 4.3, "IP over InfiniBand". RoCE: RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over an Ethernet network. It encapsulates IB transpo
253. xpected link speed. -pm: Dump all the fabric links' PM counters into ibdiagnet.pm. -pc: Reset all the fabric links' PM counters. -P <PM <Trash>>: If any of the provided PM counters is greater than its provided value, print it to screen. -h, --help: Prints the help page information. -V, --version: Prints the version of the tool. --vars: Prints the tool's environment variables and their values. Output Files: Table 21, ibdiagpath Output Files (Output File: Description). ibdiagpath.log: A dump of all the application reports generated according to the provided flags. ibdiagnet.pm: A dump of the Performance Counters values of the fabric links. Error Codes: 1: The path traced is unhealthy. 2: Failed to parse command line options. 3: More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port. 4: Unable to traverse the LFT data from source to destination. 5: Failed to use Topology File. 6: Failed to load required Package. 11.6 ibv_devices: Lists InfiniBand devices available for use from userspace, including node GUIDs. Synopsis: ibv_devices. Examples: 1. List the names of all available InfiniBand devices: > ibv_devices. Output (device, node GUID): mthca0 0002c9000101d150; mlx4_0 0000000000073895. 11.7 ibv_devinfo: Queries InfiniBand devices and prints about them i
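A wrapper script may want to turn the numeric ibdiagpath error codes listed above into readable messages. The sketch below is only a lookup table built from that list. The names `IBDIAGPATH_ERRORS` and `describe_error` are hypothetical, and how the code reaches the caller (for instance as the process exit status) is an assumption the excerpt does not confirm.

```python
# Error codes as listed in the ibdiagpath excerpt above.
IBDIAGPATH_ERRORS = {
    1: "The path traced is unhealthy",
    2: "Failed to parse command line options",
    3: ("More than 64 hops are required for traversing the local port "
        "to the Source port and then to the Destination port"),
    4: "Unable to traverse the LFT data from source to destination",
    5: "Failed to use Topology File",
    6: "Failed to load required Package",
}


def describe_error(code):
    """Map a listed ibdiagpath error code to its description.

    Codes outside the documented list get a generic message rather
    than a guess at their meaning.
    """
    return IBDIAGPATH_ERRORS.get(code, "Unknown error code %d" % code)


print(describe_error(4))
```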
254. xtension. Show the new configuration: > ibportstate -C mlx4_0 -D 0 1. PortInfo: Port info: DR path slid 65535; dlid 65535; 0 port 1. LinkState: Initialize. PhysLinkState: LinkUp. LinkWidthSupported: 1X or 4X. LinkWidthEnabled: 1X or 4X. LinkWidthActive: 4X. LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps. LinkSpeedEnabled: 5.0 Gbps (IBA extension). LinkSpeedActive: 5.0 Gbps. 11.11 ibroute: Uses SMPs to display the forwarding tables (unicast: LinearForwardingTable or LFT; multicast: MulticastForwardingTable or MFT) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop. Synopsis: ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout>] [<dest dr_path | lid | guid> [<startlid> [<endlid>]]]. Output Files: Table 25 lists the various flags of the command. Table 25, ibportstate Flags and Options (Flag; Optional/Mandatory; Default If Not Specified; Description). -h(help): Optional. Print the help menu. -d(ebug): Optional. Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d). -a(ll): Optional. Show all LIDs in range, inclu
255. y also be present; the SRP LUNs may be identified by their name. It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log. 4.1.2.7 Shutting Down SRP: SRP can be shut down by using rmmod ib_srp, by stopping the OFED driver (/etc/init.d/openibd stop), or as a by-product of a complete system shutdown. Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases: 1. Without High Availability: When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP. 2. After Manual Activation of High Availability: If you manually activated SRP High Availability, perform the following steps: a. Unmount all SRP partitions that were mounted. b. Kill the SRP daemon instances. c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them. d. Run: multipath -F. 3. After Automatic Activation of High Availability: If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown (/etc/init.d/openibd stop), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown. 4.2 iSCSI Extensions for iSER: iSCSI Extensions for iSER is currently at beta level. Please be aware that