
Mellanox OFED Linux User's Manual


Contents

1. Installation
   IMPORTANT NOTE: The FCA Manager and FCA MPI Runtime library are installed in the /opt/mellanox/fca directory. The FCA Manager will not be started automatically.
   To start the FCA Manager now, type: /etc/init.d/fca_managerd start
   There should be a single FCA Manager process running per fabric.
   To start the FCA Manager automatically after boot, type: /etc/init.d/fca_managerd install_service
   Check /opt/mellanox/fca/share/doc/fca/README.txt for quick start instructions.
   [Installer progress output: "Preparing..." lines for dapl, dapl-devel, dapl-devel-static, dapl-utils, perftest, mstflint, mft, srptools, rds-tools, rds-devel, ibutils2, ibutils, cc_mgr, dump_pr, ar_mgr, ibdump, infiniband-diags-compat]
2. In case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.
   4.5.4 RoCE Quality of Service Mapping
   Applications use the RDMA-CM API to create and use QPs. The following is the RoCE QoS mapping flow:
   1. The application sets the ToS of the QP using the rdma_set_option option (RDMA_OPTION_ID_TOS, value).
   2. ToS is translated into the Socket Priority (sk_prio) using a fixed translation:
      TOS 0 <=> sk_prio 0
      TOS 8 <=> sk_prio 2
      TOS 24 <=> sk_prio 4
      TOS 16 <=> sk_prio 6
   3. The Socket Priority is mapped to the User Priority (UP) using the tc command. In case of a VLAN device, the parent real device is used for the purpose of this mapping.
   4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
   With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.
   4.5.5 Raw Ethernet QP Quality of Service Mapping
   Applications open a Raw Ethernet QP using VERBs directly. The following is the Raw Ethernet QP QoS mapping flow:
   1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP:
      - Sets qp_attrs.ah_attrs.sl = up
      - Calls modify_qp with IB_QP_AV set in the mask
   2. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
   When using Raw Ethernet QP mapping the T
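The fixed ToS to sk_prio translation in step 2 can be sketched as a small lookup table. This is an illustrative sketch only: the names TOS_TO_SK_PRIO and tos_to_sk_prio are hypothetical, and the second entry is read as ToS 8, which agrees with the Linux kernel's own ToS-to-priority table. The later stages (sk_prio to UP, UP to TC) depend on tc and mlnx_qos configuration and are not modeled here.

```python
# Hypothetical helper names; the four pairs are the ones listed in the text.
TOS_TO_SK_PRIO = {
    0: 0,   # best effort
    8: 2,
    24: 4,
    16: 6,
}


def tos_to_sk_prio(tos):
    """Return the socket priority for one of the four predefined ToS values."""
    if tos not in TOS_TO_SK_PRIO:
        # Only four ToS values take part in the QoS mapping described above.
        raise ValueError("ToS %r is not one of %s" % (tos, sorted(TOS_TO_SK_PRIO)))
    return TOS_TO_SK_PRIO[tos]


if __name__ == "__main__":
    for tos in sorted(TOS_TO_SK_PRIO):
        print("TOS %2d <=> sk_prio %d" % (tos, tos_to_sk_prio(tos)))
```

A table lookup is used rather than arithmetic because the mapping is not monotonic (ToS 16 maps to a higher sk_prio than ToS 24).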
3. [Garbled example output listing tos 24, tos 16 and UP entries; not recoverable]
   4.5.8.3 Additional Tools
   - The tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is a part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.
   - mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
   - tc_wrap.py (package: ofed-scripts) requires python >= 2.5
   4.6 Ethernet Time-Stamping
   4.6.1 Ethernet Time-Stamping Service
   Time Stamping is currently supported in ConnectX-3/ConnectX-3 Pro adapter cards only.
   Time stamping is the process of keeping track of the creation of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed on the PCI, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.
   4.6.1.1 Enabling Time Stamping
   Time-stamping is off by default and should be enabled before use.
   To enable time stamping for a socket:
   - Call setsockopt() with SO_TIMESTAMPING and with the following flags:
     SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
     SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software
     SOF T
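As a rough illustration of the setsockopt() call described above, the sketch below combines the two transmit flags named in the text and applies them to a socket. The flag bit values are taken from the Linux header linux/net_tstamp.h. The fallback value 37 for SO_TIMESTAMPING is an assumption that holds on common Linux architectures, and the helper names are hypothetical. It is not a complete time-stamping setup.

```python
import socket
import struct

# Bit values as defined in <linux/net_tstamp.h>.
SOF_TIMESTAMPING_TX_HARDWARE = 1 << 0
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1

# Assumption: 37 is the SO_TIMESTAMPING option number on common Linux
# architectures; use the socket module's constant when it provides one.
SO_TIMESTAMPING = getattr(socket, "SO_TIMESTAMPING", 37)


def tx_timestamp_flags(hardware=True, software_fallback=True):
    """Build the flag word: hardware send time stamps, optional software fallback."""
    flags = 0
    if hardware:
        flags |= SOF_TIMESTAMPING_TX_HARDWARE
    if software_fallback:
        flags |= SOF_TIMESTAMPING_TX_SOFTWARE
    return flags


def enable_tx_timestamping(sock, flags):
    """Apply the flag word to an existing socket (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, struct.pack("I", flags))
```

A caller would typically create a datagram socket, call enable_tx_timestamping(sock, tx_timestamp_flags()), and then read the time stamps from the socket's error queue.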
4. Table 26 - ibdiagnet (of ibutils2) Output Files
   Output File            Description
   ibdiagnet2.lst         Fabric links in LST format
   ibdiagnet2.sm          Subnet Manager
   ibdiagnet2.pm          Ports Counters
   ibdiagnet2.fdbs        Unicast FDBs
   ibdiagnet2.mcfdbs      Multicast FDBs
   ibdiagnet2.nodes_info  Information on nodes
   ibdiagnet2.db_csv      ibdiagnet internal database
   An ibdiagnet run performs the following stages:
   - Fabric discovery
   - Duplicated GUIDs detection
   - Links in INIT state and unresponsive links detection
   - Counters fetch
   - Error counters check
   - Routing checks
   - Link width and speed checks
   - Alias GUIDs check
   - Subnet Manager check
   - Partition keys check
   - Nodes information
   Return Codes
   0 - Success
   1 - Failure (with description)
   9.4 ibdiagnet (of ibutils) - IB Net Diagnostic
   This version of ibdiagnet is included in the ibutils package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils2 package, you need to specify the full path: /opt/bin/ibdiagnet
   ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
   Synopsis
   ibdiagnet c
5. [Table of contents excerpt]
   2.5.2 Installing MLNX_OFED using the YUM Tool
   2.5.3 Updating Firmware After Installation
   2.6 Uninstalling Mellanox OFED
   2.7 Uninstalling Mellanox OFED using the YUM Tool
   Chapter 3 Configuration Files
   3.1 Persistent Naming for Network Interfaces
   Chapter 4 Driver Features
   4.1 SCSI RDMA Protocol
   4.1.1 Overview
   4.1.2 SRP Initiator
   4.2 iSCSI Extensions for RDMA (iSER)
   4.2.1 Overview
   4.2.2 iSER Initiator
   4.3 IP over InfiniBand
   4.3.1 Introduction
   4.3.2 IPoIB Mode Setting
   4.3.3 IPoIB Configuration
   4.3.4 Subinterfaces
   4.3.5 Verifying IPoIB Functionality
   4.3.6 Bonding IPoIB
   4.4 Quality of Service InfiniBand
   4.4.1 Quality of Service Overview
   4.4.2 QoS Architecture
6. Output Files
   Table 33 lists the various flags of the command.
   Table 33 - smpquery Flags and Options
   Flag              Optional/Mandatory  Description
   -h(help)          Optional   Print the help menu
   -d(ebug)          Optional   Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
   -e(rr_show)       Optional   Show send and receive errors (timeouts and others)
   -v(erbose)        Optional   Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
   -D(irect)         Optional   Use directed path address arguments. The path is a comma separated list of out ports. Examples: '0' (self port), '0,1,2,1,4' (out via port 1, then 2, ...)
   -G(uid)           Optional   Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
   -s <smlid>        Optional   Use <smlid> as the target LID for SM/SA queries
   -V(ersion)        Optional   Show version info
   -C <ca_name>      Optional   Use the specified channel adapter or router
   -P <ca_port>      Optional   Use the specified port
   -t <timeout_ms>   Optional   Override the default timeout for the solicited MADs [msec]
   <op>              Mandatory  Supported operations: nodeinfo <addr>, nodedesc <addr>, po
7. 5 valid lids dumped
   2. Dump all Lids with valid out ports of the switch with Lid 2:
   > ibroute 2
   Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
     Lid    Out   Destination
            Port  Info
   0x0002 000 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
   0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
   0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
   0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
   0x0008 008 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
   5 valid lids dumped
   3. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2:
   > ibroute 2 3 7
   Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
     Lid    Out   Destination
            Port  Info
   0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
   0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
   0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
   3 valid lids dumped
   4. Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016:
   > ibroute -G 0x000b8cffff004016
   Unicast lids [0x0-0x8] of switch Lid 3
8. 6.2.1 Enabling Auto Sensing
   Upon driver start up:
   1. Sense the adapter card's port type:
      If a valid cable or module is connected (QSFP, SFP+, or SFP with EEPROM in the cable/module):
      - Set the port type to the sensed link type (IB/Ethernet)
      Otherwise:
      - Set the port type as default (Ethernet)
   During driver run time:
   - Sense a link every 3 seconds if no link is sensed/detected
   - If sensed, set the port type as sensed
   7 Performance
   7.1 General System Configurations
   The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.
   7.1.1 PCI Express (PCIe) Capabilities
   Table 16 - Recommended PCIe Configuration
   PCIe Generation    3.0
   Speed              8GT/s
   Width              x8 or x16
   Max Payload size   256
   Max Read Request   4096
   For ConnectX3 based network adapters, 40GbE Ethernet adapters, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.
   7.1.2 Memory Configuration
   For high performance it is recommended to use the highest memory speed with fewest DIMMs and populate all memory channels for every CPU installed.
   For further information, please refer to your vendor's memory configuration instructions or memory configuration tool available online.
   7.1.3
9. Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.
   # hca_self_test.ofed
   ---- Performing Adapter Device Self Test ----
   Number of CAs Detected ................. 1
   PCI Device Check ....................... PASS
   Kernel Arch ............................ x86_64
   Host Driver Version .................... MLNX_OFED_LINUX-2.1-1.0.0 (OFED-2.1-1.0.0): 3.0.76-0.11-default
   Host Driver RPM Check .................. PASS
   Firmware on CA #0 VPI .................. v2.30.8000
   Firmware Check on CA #0 (VPI) .......... PASS
   Host Driver Initialization ............. PASS
   Number of CA Ports Active .............. 1
   Port State of Port #1 on CA #0 (VPI) ... UP 4X FDR (InfiniBand)
   Port State of Port #2 on CA #0 (VPI) ... UP 4X FDR (InfiniBand)
   Error Counter Check on CA #0 (VPI) ..... PASS
   Kernel Syslog Check .................... PASS
   Node GUID on CA #0 (VPI) ............... 00:02:c9:03:00:30:0e:60
   ------------------ DONE ---------------------
   After the installer completes, information about the Mellanox OFED installation such as prefix, kernel version, and installation parameters can be retrieved by running the command /etc/infiniband/info.
   2.3.4 Installation Results
   Software
   - Most of the MLNX_OFED packages are installed under the /usr directory, except for the following packages, which are installed under the /opt directory: openshmem, bupc, fca and ibutils.
   - The kernel modules are installed under /lib/modules/`uname -r`/updates on SLES and F
10. Table 29 - ibv_devinfo Flags and Options
    Flag            Optional/Mandatory  Default (If Not Specified)  Description
    -l (--list)     Optional   Inactive   Only list the names of InfiniBand devices
    -v (--verbose)  Optional   Inactive   Print all available information about the InfiniBand device(s)
    Examples
    1. List the names of all available InfiniBand devices:
    > ibv_devinfo -l
    2 HCAs found:
        mthca0
        mlx4_0
    2. Query the device mlx4_0 and print user-available information for its Port 2:
    > ibv_devinfo -d mlx4_0 -i 2
    hca_id: mlx4_0
        fw_ver:          2.5.944
        node_guid:       0000:0000:0007:3895
        sys_image_guid:  0000:0000:0007:3898
        vendor_id:       0x02c9
        vendor_part_id:  25418
        hw_ver:          0x40
        board_id:        MT_04A0140005
        phys_port_cnt:   2
            port: 2
                state:       PORT_ACTIVE (4)
                max_mtu:     2048 (4)
                active_mtu:  2048 (4)
                sm_lid:      1
                port_lid:    1
                port_lmc:    0x00
    9.8 ibdev2netdev
    ibdev2netdev enables association between IB devices and ports and the associated net device. Additionally, it reports the state of the net device link.
    Synopsis
    ibdev2netdev [-v] [-h]
    Options
    -v  Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
    -h  Print help messages.
    Example
    sw417 BXOFED 1 5 2 20101128 1524 ibdev2netdev -v
    mlx4_0 MT26428 MT1
11. qos-setup
    # This section of the policy file describes how to set up SL2VL and VL
    # Arbitration tables on various nodes in the fabric.
    # However, this is not supported in OFED - the section is parsed
    # and ignored. SL2VL and VLArb tables should be configured in the
    # OpenSM options file (by default /var/cache/opensm/opensm.opts).
    end-qos-setup
    qos-levels
    # Having a QoS Level named "DEFAULT" is a must - it is applied to
    # PR/MPR requests that didn't match any of the matching rules.
    qos-level
    name: DEFAULT
    use: default QoS Level
    sl: 0
    end-qos-level
    # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
    qos-level
    name: WholeSet
    8.6.6 Simple QoS Policy - Details and Examples
    Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.
    Match rules include:
    - Default match rule that is applied to a PR/MPR query that didn't match any of the other match rules
    - SDP
    - SDP application with a specific target TCP/IP port range
    - SRP with a specific target IB port GUID
    - RDS
    - IPoIB with a default PKey
    - IPoIB with a specific PKey
    - Any ULP/application with a specific Service ID in the PR/MPR query
    - Any ULP/application
12. Device (06:00.0):
        06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Link Width: 8x
        PCI Link Speed: 5Gb/s
    Installation finished successfully.
    Attempting to perform Firmware update...
    Querying Mellanox devices firmware ...
    Device #1:
      Device:       0000:06:00.0
      Part Number:  MCX354A-FCB_A1
      Description:  ConnectX-3 VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6
      PSID:         MT_1090110019
      Versions:     Current      Available
         FW         2.30.7384    2.30.8000
         PXE        3.4.0146     3.4.0146
      Status:       Update required
    Found 1 device(s) requiring firmware update.
    Device #1: Updating FW ... Done
    A restart is needed for updates to take effect.
    Log File: /tmp/MLNX_OFED_LINUX-2.1-0.0.9.10740.logs/fw_update.log
    In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following:
    Device #1:
      Device:       0000:06:00.0
      Part Number:  MCX354A-FCB_A1
      Description:  ConnectX-3 VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6
      PSID:         MT_1090110019
      Versions:     Current      Available
         FW         [garbled]    2.30.8000
         PXE        3.4.0146     3.4.0146
      Status:       Up to date
    In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Pl
13. Sandy Bridge Platform" on page 134.
    In systems which do not support SLIT, the following environment variable should be applied:
    MLX4_LOCAL_CPUS=0x[bit mask of local NUMA node]
    Example for a local NUMA node whose cores are 0-7:
    MLX4_LOCAL_CPUS=0xff
    Additional modification can apply to impact this feature by changing the following environment variable:
    MLX4_STALL_NUM_LOOP=[integer] (default: 400)
    The default value is optimized for most applications. However, several applications might benefit from increasing/decreasing this value.
    7.2.6.2 Tuning for AMD Architecture
    On AMD architecture there is a difference between a 2 socket system and a 4 socket system.
    - With a 2 socket system the PCIe adapter will be connected to socket 0 (nodes 0,1).
    - With a 4 socket system the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).
    7.2.6.3 Recognizing NUMA Node Cores
    To recognize NUMA node cores, run the following command:
    # cat /sys/devices/system/node/node[X]/cpulist | cpumap
    Example:
    # cat /sys/devices/system/node/node1/cpulist
    1,3,5,7,9,11,13,15
    # cat /sys/devices/system/node/node1/cpumap
    0000aaaa
    7.2.6.3.1 Running an Application on a Certain NUMA Node
    In order to run an application on a certain NUMA node, the process affinity should be set either in the command line or by an external tool. Fo
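The relation between a core list, the cpumap value, and the MLX4_LOCAL_CPUS bit mask can be checked with a few lines of Python. This is a sketch with hypothetical helper names. It assumes the cpumap value is a hexadecimal mask in which bit N stands for core N, possibly split into comma-separated groups on larger systems.

```python
def cores_to_mask(cores):
    """Build a bit mask with bit N set for every core number N."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def parse_cpumap(text):
    """Parse a cpumap string such as '0000aaaa' into an integer mask."""
    return int(text.strip().replace(",", ""), 16)


def mask_to_cores(mask):
    """List the core numbers whose bits are set in the mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    # Cores 0-7 give the mask used in the MLX4_LOCAL_CPUS example.
    print("MLX4_LOCAL_CPUS=%s" % hex(cores_to_mask(range(8))))
    # The cpumap example expands to the odd-numbered cores 1..15.
    print(mask_to_cores(parse_cpumap("0000aaaa")))
```

The same helpers can turn any node's cpulist into the mask to export before starting the application.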
14. - Section 1.3.3, "Mid-layer Core," on page 23
    - Section 4.8, "Ethernet Tunneling Over IPoIB Driver (eIPoIB)," on page 82
    - Section 8.2.1, "opensm Syntax," on page 139
    - Appendix C, "mlx4 Module Parameters," on page 240
    Added the following sections:
    - Section 1.5, "RDMA over Converged Ethernet (RoCE)," on page 25
    - Section 4.5, "Quality of Service Ethernet," on page 67 and its subsections
    - Section 4.11, "XRC - eXtended Reliable Connected Transport Service for InfiniBand," on page 88
    - Section 4.13.7, "Configuring Pkeys and GUIDs under SR-IOV," on page 99 and its subsections
    - Section 4.15, "Ethtool," on page 105
    - Appendix E, "Lustre Compilation over MLNX_OFED," on page 243
    2.0-2.0.5, April 2013: Initial release
    About this Manual
    This Preface provides general information concerning the scope and organization of this User's Manual.
    Intended Audience
    This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.
    Common Abbreviations and Acronyms
    Table 2 - Abbreviations and Acronyms (Sheet 1 of 2)
    Abbreviation / Acronym   Whole Word / Description
    B                        (Capital) 'B' is used to indicate size in bytes or mul
15. - Section 1.5, "RDMA over Converged Ethernet (RoCE)," on page 25
    - Section 2.3.3, "Installation Procedure," on page 32
    - Section 4.13.2, "Setting Up SR-IOV," on page 92
    - Section 5.3.1, "Compiling OpenMPI with MXM," on page 120
    - Section 5.3.2, "Enabling MXM in OpenMPI," on page 121
    - Section 5.3.4, "Configuring Multi-Rail Support," on page 122
    - Section 4.8.4, "Setting Performance Tuning," on page 86
    - Section 8.4.1, "File Format," on page 151
    - Appendix C.2, "mlx4_core Parameters," on page 240
    - Section 4.1.2.2, "Manually Establishing an SRP Connection," on page 43
    - Section 4.1.2.3, "SRP Tools - ibsrpdm, srp_daemon and srpd Service Script," on page 45
    - Section 4.1.2.4, "Automatic Discovery and Connection to Targets," on page 47
    Table 1 - Document Revision History (columns: Release, Date, Description)
    - Section 4.1.2.5, "Multiple Connections from Initiator InfiniBand Port to the Target," on page 48
    - Section 4.1.2.6, "High Availability (HA)," on page 48
    - Section 4.1.2.7, "Shutting Down SRP," on page 49
    - Section 4.15, "Ethtool," on page 105
    2.0-3.0.0, October 2013: Removed section "Command Line Interface (CLI)". Updated the following sections:
    - Appendix E, "Lustre Compilation over MLNX_OFED," on page 243
    August 2013: Updated the following sections:
    - Section 1.3.4, "ULPs," on page 23
    - Section 4.12, "Flow Steering," on page 89 and its subsections
16. [Option descriptions from the ibdiagnet (of ibutils2) flags table; the option names are missing from this excerpt]
    - Specifies the opensm path records dump file path (src-dst to SL mapping generated by the SM plugin). ibdiagnet will use this mapping for MADs sending and for the credit loop check (if the -r option is selected).
    - Provides a report of the fabric qualities.
    - Indicates that the UpDown credit loop checking should be done against automatically determined roots.
    - Specifies the directory where the output files will be placed (default: /var/tmp/ibdiagnet2).
    - Skip the executions of the given stage. Applicable skip stages: all | dup_guids | dup_node_desc | lids | links | sm | pm | nodes_info | speed_width_check | pkey | aguid.
    - Skip the load of the given library name. Applicable skip plugins: libibdiagnet_cable_diag_plugin | libibdiagnet_cable_diag_plugin-2.1.1.
    - Reset all the fabric PM counters.
    - If any of the provided PM counters is greater than its provided value, then print it.
    - Specifies the seconds to wait between the first counters sample and the second counters sample. If the seconds given is 0, then no second counters sample will be done (default: 1).
    BER Test
    - Provides a BER test for each port. Calculates the BER for each port and checks that no BER value has exceeded the BER threshold (default threshold: 10^-12).
    - ber_use_data: Indicates that the BER test will use the received data for calculation.
    - ber_thresh <value>: Specifies the threshold value fo
17. enable
    - The values are: <TRUE | FALSE>. The default is: True.
    - CC manager configures CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive the CC mechanism is in its response to congestion. To manually modify CC manager behavior by providing it with an arbitrary fabric size, set the following parameter:
      num_hosts
      The values are: 0 to 48K. The default is: 0 (based on the CCT calculation on the current subnet size).
    - The smaller the number value of the parameter, the faster HCAs will respond to the congestion and will throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter:
      marking_rate
      The values are: 0 to 0x[upper bound garbled in source]. The default is: 0xa.
    - You can set the minimal packet size that can be marked with FECN. Any packet less than this size [bytes] will not be marked with FECN. To do so, set the following parameter:
      packet_size
      The values are: 0 to 0x3[?]c0 (one digit garbled in source). The default is: 0x200.
    - When the number of errors exceeds 'max_errors' of send/receive errors or timeouts in less than 'error_window' seconds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters:
      max_errors
      error_window
      The values are: max_errors = 0: ze
18. [Table of contents excerpt]
    8.9.4 Configuring Congestion Control Manager Main Settings
    Chapter 9 InfiniBand Fabric Diagnostic Utilities
    9.1 Overview
    9.2 Utilities Usage
    9.2.1 Common Configuration, Interface and Addressing
    9.2.2 InfiniBand Interface Definition
    9.2.3 Addressing
    9.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
    9.4 ibdiagnet (of ibutils) - IB Net Diagnostic
    9.5 ibdiagpath - IB diagnostic path
    9.6 ibv_devices
    9.7 ibv_devinfo
    9.8 ibdev2netdev
    9.9 ibstatus
    9.10 ibportstate
    9.11 ibroute
    9.12 smpquery
    9.13 perfquery
    9.14 ibcheckerrs
    9.15 mstflint
    9.16 ibv_asyncwatch
    9.17 ibdump
    Appendix A Mellanox FlexBoot
    A.1 Overview
    A.2 FlexBoot Package
19. [Installer progress output: "Preparing..." lines for libibumad-devel, libibumad-static, libibmad, libibmad-devel, libibmad-static, ibsim, ibacm, librdmacm, librdmacm-utils, librdmacm-devel, opensm-libs, opensm, opensm-devel, opensm-static, infiniband-diags, fca]
20. on page 119
    - Section 5.2.4, "Compiling MPI Applications," on page 120
    Prerequisites for Running MPI
    For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging in and running commands on remote computers and/or servers.
    SSH Configuration
    The following steps describe how to configure password-less access over SSH:
    Step 1. Generate an ssh key on the initiator machine (host1).
    host1$ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/<username>/.ssh/id_rsa.
    Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
    The key fingerprint is:
    38 1b 29 df 4 08 00 4a 0e 50 0 05 44 e7 9 05 <username>@host1
    Step 2. Check that the public and private keys have been generated.
    host1$ cd /home/<username>/.ssh/
    host1$ ls
    host1$ ls -la
    total 40
    [permissions garbled] 2 root root 4096 Mar 5 04:57 .
    drwxr-x--- 13 root root 4096 Mar 4 18:27 ..
    [permissions garbled] 1 root root 1675 Mar 5 04:57 id_rsa
    -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub
    Step 3. Check
21. Installing user level RPMs:
    [Installer progress output: "Preparing..." lines for ofed-scripts, libibverbs, libibverbs-devel, libibverbs-devel-static, libibverbs-utils, libmlx4, libmlx4-devel, libmlx5, libmlx5-devel, libcxgb3, libcxgb3-devel, libcxgb4, libcxgb4-devel, libnes, libnes-devel-static, libipathverbs, libipathverbs-devel, libibcm, libibcm-devel, libibumad, libibumad-devel]
22. [...]2QoS will allocate routes across such links in a round robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.
    Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e., not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it. In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
    Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch. For any given path, only the hops after a turn that is illegal under DOR can contribute to a cred
23. ATTR{address}=="00:02:c9:e9:56:a1", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a2", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"
    Example for IPoIB interfaces:
    [Two garbled udev rules; the recoverable fields are type 32 with NAME="ib0" and NAME="ib1"]
    4 Driver Features
    4.1 SCSI RDMA Protocol
    4.1.1 Overview
    As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol off-load and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services.
    Section 4.1.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target.
    4.1.2 SRP Initiator
    This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D available from http://www.t10.org.
    The SRP Initiator supports:
    - Basic SCSI Primary Commands-3 (SPC-3)
24. Applications use the VERBs API to transmit using a Raw Ethernet QP.
    4.5.3 Plain Ethernet Quality of Service Mapping
    Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver. The following is the Plain Ethernet QoS mapping flow:
    1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
    2. ToS is translated into the sk_prio using a fixed translation:
       TOS 0 <=> sk_prio 0
       TOS 8 <=> sk_prio 2
       TOS 24 <=> sk_prio 4
       TOS 16 <=> sk_prio 6
    3. The Socket Priority is mapped to the UP:
       - If the underlying device is a VLAN device, egress_map is used, controlled by the vconfig command. This is a per-VLAN mapping.
       - If the underlying device is not a VLAN device, the tc command is used. In this case, even though the tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio to UP mapping.
         Mapping the sk_prio to the UP is done by using tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7
    4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
    Note: Socket applications can use setsockopt (SK_PRIO, value) to directly set the sk_prio of the socket. In this case the ToS to sk_prio fixed mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS.
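The two socket-level knobs described above can be exercised from Python on Linux. This is a minimal sketch with hypothetical helper names. It assumes the option the text calls SK_PRIO corresponds to the standard Linux SO_PRIORITY socket option, and it only sets and reads back socket options; it does not configure the sk_prio to UP mapping.

```python
import socket


def set_tos(sock, tos):
    """Set the IP ToS byte of the socket (step 1 of the flow above)."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)


def set_sk_prio(sock, prio):
    """Set the socket priority directly, bypassing the ToS translation.

    Assumption: SK_PRIO in the text maps to Linux SO_PRIORITY.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, prio)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    set_sk_prio(s, 5)
    print("sk_prio:", s.getsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY))
    set_tos(s, 24)
    print("tos:", s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))
    s.close()
```

On Linux, setting IP_TOS also updates the socket priority through the fixed translation, so an application that wants a directly chosen sk_prio should set it after any ToS change.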
25. [Table of contents excerpt]
    8.6.1 Overview
    8.6.2 Advanced QoS Policy File
    8.6.3 Simple QoS Policy Definition
    8.6.4 Policy File Syntax Guidelines
    8.6.5 Examples of Advanced Policy File
    8.6.6 Simple QoS Policy - Details and Examples
    8.6.7 SL2VL Mapping and VL Arbitration
    8.6.8 Deployment Example
    8.7 QoS Configuration Examples
    8.7.1 Typical HPC Example: MPI and Lustre
    8.7.2 EDC SOA (2-tier): IPoIB and SRP
    8.7.3 EDC (3-tier): IPoIB, RDS, SRP
    8.8 Adaptive Routing
    8.8.1 Overview
    8.8.2 Installing the Adaptive Routing
    8.8.3 Running Subnet Manager with Adaptive Routing Manager
    8.8.4 Querying Adaptive Routing Tables
    8.8.5 Adaptive Routing Manager Options File
    8.9 Congestion Control
    8.9.1 Congestion Control Overview
    8.9.2 Running OpenSM with Congestion Control Manager
    8.9.3 Configuring Congestion Control Manager
26. [option name garbled in source]
        This option indicates the partition enforcement type (for switches). Enforcement type can be outbound only (out), inbound only (in), both or disabled (off). Default is both.
    --allow_both_pkeys, -W
        This option indicates whether both full and limited membership on the same partition can be configured in the PKeyTable. Default is not to allow both pkeys.
    --qos, -Q
        This option enables QoS setup.
    [option name garbled in source]
        This option defines the optional QoS policy file. The default name is /etc/opensm/qos-policy.conf.
    --congestion_control
        (EXPERIMENTAL) This option enables congestion control configuration.
    [option name garbled in source]
        (EXPERIMENTAL) This option configures the CCkey to use when configuring congestion control.
    --stay_on_fatal, -y
        This option will cause the SM not to exit on fatal initialization issues: if the SM discovers duplicated guids or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.
    --daemon, -B
        Run in daemon mode - OpenSM will run in the background.
    [option name garbled in source]
        Start the SM in inactive rather than normal init SM state.
    --perfmgr
        Start with PerfMgr enabled.
    --perfmgr_sweep_time_s <sec>
        PerfMgr sweep interval in seconds.
    --prefix_routes_file <path to file>
        This option specifies the prefix routes file. Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. Default file is /etc/opensm/prefix-routes.conf.
27. Script required to rebuild MLNX_OFED_LINUX for a customized kernel version on a supported Linux distribution. docs: Directory of Mellanox OFED related documentation. 1.3 Architecture. Figure 1 shows a diagram of the Mellanox OFED stack and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the versatility of markets that Mellanox OFED applies to. [Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards. Components shown include MPI, uverbs, rdmacm, the sockets layer (TCP/UDP/ICMP, IP, netdevice), the SCSI mid-layer, SRP, eIPoIB, IPoIB, mlx4_en, verbs/CMA (ib_core), mlx5_ib (IB), mlx4_ib (IB and RoCE), and the adapter drivers mlx5_core and mlx4_core.] The following sub-sections briefly describe the various components of the Mellanox OFED stack. 1.3.1 mlx4 VPI Driver. mlx4 is the low level driver implementation for the ConnectX family adapters designed by Mellanox Technologies. ConnectX family adapters can operate as an InfiniBand adapter or as an Ethernet NIC. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into the following modules: mlx4_core: Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions
28. [Diagram: spanning tree on a torus, x coordinates 0 to 5] Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example. 8.5.7.3 Torus Topology Discovery. The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus. Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of interswitch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations. Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and report when it has insufficient configuration for a torus with
29. also called a child interface, has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface. This section describes how to: create a subinterface (Section 4.3.4.1); remove a subinterface (Section 4.3.4.2). 4.3.4.1 Creating a Subinterface. In the following procedure, ib0 is used as an example of an IB subinterface. To create a child interface (subinterface), follow this procedure: Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 1 will give a PKey with the value 0x8001. Step 2. Create a child interface by running: host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child Example: host1$ echo 1 > /sys/class/net/ib0/create_child This will create the interface ib0.8001. Step 3. Verify the configuration of this interface by running: host1$ ifconfig <subinterface>.<subinterface PKey> Using the example of Step 2: host1$ ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overr
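The PKey arithmetic in Step 1 can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `child_pkey` and `child_interface_name` are hypothetical and not part of any Mellanox tool, and the interface naming rule (parent name, a dot, then the PKey as four hex digits) is inferred from the ib0.8001 example in the excerpt.

```python
def child_pkey(value: int) -> int:
    """Return the 16-bit PKey with the most significant bit set."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("PKey value must fit in 16 bits")
    return value | 0x8000


def child_interface_name(parent: str, value: int) -> str:
    """Name assumed for the child interface, e.g. ib0.8001 (inferred from the example)."""
    return f"{parent}.{child_pkey(value):04x}"


print(hex(child_pkey(1)))              # 0x8001
print(child_interface_name("ib0", 1))  # ib0.8001
```

A value that already has the top bit set (for example 0x8001) maps to itself, so the helper is safe to apply to either form.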
30. assuming switches support 8 data VLs. Ability to route around a single failed switch, and/or multiple failed links, without introducing credit loops or changing path SL values. Very short run times, with good scaling properties as fabric size increases. Unicast Routing: Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses as follows: sl = 0; for (d = 0; d < torus_dimensions; d++) /* path_crosses_dateline(d) returns 0 or 1 */ sl |= path_crosses_dateline(d) << d; For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows: for (sl = 0; sl < 16; sl++) /* cdir(port) reports which torus coordinate direction a switch port "points" in, and returns 0, 1, or 2 */ sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport)); Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free ro
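The two encodings described in this excerpt (dateline crossings packed into SL bits, and one VL bit selected by the output port's coordinate direction) can be sketched in Python as follows. This is a simplified model, not OpenSM code: the function names are hypothetical, the dateline flags are passed in directly rather than computed from a path, and the extra QoS bit is ignored.

```python
def path_sl(crosses_dateline):
    """Pack per-dimension dateline flags (0 or 1) into SL bits 0..n-1."""
    sl = 0
    for d, crossed in enumerate(crosses_dateline):
        sl |= (1 if crossed else 0) << d
    return sl


def sl2vl(sl, cdir_oport):
    """VL bit 0 for an output port pointing in coordinate direction cdir_oport (0, 1 or 2)."""
    return 0x1 & (sl >> cdir_oport)


# A path that crosses the x and z datelines but not the y dateline.
sl = path_sl([1, 0, 1])
print(sl)                                # 5
print([sl2vl(sl, c) for c in range(3)])  # [1, 0, 1]
```

The usage lines show that a switch port pointing in the y direction sees VL bit 0 for this path, while ports pointing in x or z see VL bit 1, which is how three SL bits collapse into one VL bit per port.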
31. comment out the following line: hardware ethernet 00:02:c9:00:00:bb Appendix B: SRP Target Driver. The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many I/O modes on real or virtual devices in the back end: 1. scst_vdisk: fileio and blockio modes. This allows turning software RAID volumes, LVM volumes, IDE disks, block devices and normal files into SRP LUNs. 2. NULLIO mode allows measuring the performance without sending I/Os to real devices. B.1 Prerequisites and Installation. 1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target. On distribution default kernels you can run scst_vdisk blockio mode to obtain good performance. 2. Download and install the SCST driver. The supported version is 1.0.1.1. a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloads.html b. Untar scst-1.0.1.1: $ tar zxvf scst-1.0.1.1.tar.gz $ cd scst-1.0.1.1 c. Install scst-1.0.1.1 as follows: $ make && make install B.2 How to Run. A. On an SRP Target machine: 1. Please refe
32. e80 0000 0000 0000 0000 0000 0007 3897 base lid 0x1 sm lid 0x1 state 4 ACTIVE phys state 5 LinkUp rate 20 Gb sec 4X DDR 9 10 bportstate Enables querying the logical link and physical port states of an InfiniBand port It also allows adjusting the link speed that is enabled on any InfiniBand port If the queried port is a switch port then ibportstate can be used to disable enable or reset the port validate the port s link width and speed against the peer port Synopsis ibportstate Se ssl EN 1 EN e es lt smlicha A C lt ca_name gt P lt ca_port gt t timeout ne gt lt dest dr path lid guid gt lt portnum gt lt op gt lt value gt Output Files Table 31 lists the various flags of the command Table 31 ibportstate Flags and Options Default Flag Eo 1 If Not Description ad Specified h help Optional Print the help menu d ebug Optional Raise the IB debug level May be used several times for higher debug levels ddd or d d d e rr_show Optional Show send and receive errors time outs and others Mellanox Technologies 203 Rev 2 1 1 0 0 InfiniBand Fabric Diagnostic Utilities Table 31 ibportstate Flags and Options Continued Optional Detant Flag E dur If Not Description BEY Specified v erbose Optional Increase verbosity level May be used several times for addi
33. fn=02:00.0 bar=0xdef00000 size=0x100000 Chip revision is: A0 /dev/mst/mt25418_pci_msix0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdeefe000 size=0x2000 /dev/mst/mt25418_pci_uar0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdc800000 size=0x800000 2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0. Step 3. Burn firmware. Burn a firmware image from a .mlx file using the mlxburn utility (that is already installed on your machine). The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2: > flint -d /dev/mst/mt25418_pci_cr0 -i fw-25408-2_1_8000-MCX353A-FCA_A1.bin burn Step 4. Reboot your machine after the firmware burning is completed. 2.5 Installing MLNX_OFED using YUM 2.5.1 Setting up MLNX_OFED YUM Repository. Step 1. Download the tarball to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.tgz. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers. Step 2. Extract the MLNX_OFED tarball package to a shared location in your network: tar xzf MLNX_OFED_LINUX-<MLNX_OFED version>-rhel6.4-x86_64.tgz Step 3. Download and install Mellanox Technologies GPG-KEY. The key can be downloaded via the following link: http://www.mellanox.com/downloads
34. lt count gt v r o lt out dir gt t lt topo file gt s lt sys name gt i lt dev index gt p lt port num gt wt pm pc P lt lt PM gt lt Value gt gt lw 1x 4x 12x ls 2 99 157 skip lt ibdiag check s gt load db db file gt 194 Mellanox Technologies m 2 1 1 0 0 Options c count Min number of packets to be sent across each link default 10 V Enable verbose mode i Provides a report of the fabric qualities t lt topo file gt Specifies the topology file name s lt sys name gt Specifies the local system name Meaningful only if a topology file is specified i lt dev index gt Specifies the index of the device of the port used to connect to the IB fabric in case of multiple devices on the local system tao ct p lt port num gt Specifies the local device s port num used to connect to the IB fabric ONO te Clint Specifies the directory where the output files will be lw lt 1x 4x 12x gt Specifies the expected link width lg 2555 105 Specifies the expected link speed pm Dump all the fabric links pm Counters into ibdiagnet pm e Reset all the fabric links pmCounters P lt PM lt Trash gt gt If any of the provided pm is greater then its provided value print it to screen skip lt skip option s gt Skip the executions of the selected checks Skip options one or more can be specified dup_guids zero guids pm logical state part ipoib all wt
35. lt file name gt Write out the discovered topology into the given file This flag is useful if you later want to check for changes from the current state of the fabric A directory named ibdiag ibnl is also created by this option and holds the IBNL files required to load this topology To use these files you will need to set the environment variable named IBDM IBNL PATH to that directory The directory is located in tmp or in the output directory provided by the o flag load db lt file name gt gt Load subnet data from the given db file and skip subnet discovery stage Note Some of the checks require actual subnet discovery and therefore would not run when load db is specified These checks are Duplicated zero guids link state SMs placed default tmp status h help Prints the help page information V version Prints the version of the tool vars Prints the tool s environment variables and their values Output Files Table 27 ibdiagnet of ibutils Output Files Output File Description ibdiagnet log A dump of all the application reports generate according to the provided flags ibdiagnet lst List of all the nodes ports and links in the fabric Mellanox Technologies 195 Rev 2 1 1 0 0 Table 27 ibdiagnet of ibutils Output Files Output File Description ibdiagnet fdbs A dump of the unicast forwarding tables of the fabric switches ibdiagnet mcfdbs A dump of the m
36. mlx4_core. This parameter may be used to set the port type uniformly for all installed ConnectX HCAs, or it may specify an individual configuration for each HCA. This parameter should be specified as an options line in the file /etc/modprobe.d/mlx4_core.conf. For example, to configure all HCAs to have Port1 as IB and Port2 as ETH, insert the following line: options mlx4_core port_type_array=1,2 To set HCAs individually, you may use a string of Domain:bus:device.function=x;y For example, if you have a pair of HCAs whose PFs are 0000:04:00.0 and 0000:05:00.0, you may specify that the first will have both ports as IB and the second will have both ports as ETH, as follows: options mlx4_core port_type_array='0000:04:00.0-1;1,0000:05:00.0-2;2' Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF. 4.13.7.2 Virtual Function InfiniBand Ports. Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments. Sharing the same physical port(s) among multiple vHCAs is achieved as follows: Each vHCA port presents its own virtual GID table. The virtual GID table for
37. mstdump, isw, and i2c. For additional details, please refer to the MFT User's Manual (docs/). 1.4 Quality of Service. Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB and Eth network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM Subnet Manager". 1.5 RDMA over Converged Ethernet (RoCE). RoCE allows InfiniBand (IB) transport applications to work over an Ethernet network. RoCE encapsulates the InfiniBand transport and the GRH headers in Ethernet packets bearing a dedicated ether type (0x8915). Thus, any VERB application that works in an InfiniBand fabric can work in an Ethernet fabric as well. RoCE is enabled only for drivers that support VPI (currently, only mlx4). When working with RDMA applications over Ethernet link layer, the following points should be noted: The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API but only the actions, such as joining a multicast group, that need to be taken when using the API. Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is displayed as zero when querying the port. With RoCE, the alternate path is not set for RC QP and therefore AP
38. with old OpenSM running on a little endian machine. --reassign_lids, -r: This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID. --routing_engine, -R <engine name>: This option chooses routing engine(s) to use instead of the default Min Hop algorithm. Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. If all configured routing engines fail, OpenSM will always attempt to route with Min Hop unless 'no_fallback' is included in the list of routing engines. Supported engines: updn, file, ftree, lash, dor, torus-2QoS. --do_mesh_analysis: This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing. --lash_start_vl <vl number>: Sets the starting VL to use for the lash routing algorithm. Defaults to 0. --sm_sl <sl number>: Sets the SL to use to communicate with the SM/SA. Defaults to 0. --connect_roots, -z: This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases this can violat
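The routing engine fallback rule described for the routing engine option (try engines in the order listed, then fall back to Min Hop unless no fallback is requested) can be modeled in a few lines of Python. This is a behavioral sketch only, not OpenSM's implementation: `choose_routing`, the `engines` mapping of name to callable, and the `minhop` callable are all hypothetical stand-ins, each callable returning True when routing succeeds.

```python
def choose_routing(engine_names, engines, minhop):
    """Return the name of the first engine that succeeds, 'minhop' on fallback, else None."""
    allow_fallback = "no_fallback" not in engine_names
    for name in engine_names:
        if name == "no_fallback":
            continue  # a marker in the list, not an engine
        if engines[name]():
            return name
    if allow_fallback and minhop():
        return "minhop"
    return None


# Hypothetical engines: ftree fails on this fabric, updn succeeds.
engines = {"ftree": lambda: False, "updn": lambda: True}
print(choose_routing(["ftree", "updn"], engines, lambda: True))         # updn
print(choose_routing(["ftree"], engines, lambda: True))                 # minhop
print(choose_routing(["ftree", "no_fallback"], engines, lambda: True))  # None
```

The third call shows the effect of listing no_fallback: once every configured engine has failed, no Min Hop attempt is made.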
39. www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf; Basic SCSI Block Commands-2 (SBC-2): www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf; Basic functionality, task management and limited error handling. 4.1.2.1 Loading SRP Initiator. To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes". For the changes to take effect, run: /etc/init.d/openibd restart When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12). 4.1.2.1.1 SRP Module Parameters. When loading the SRP module, the following parameters can be set (viewable by the "modinfo ib_srp" command): cmd_sg_entries, allow_ext_sg, topspin_workarounds, reconnect_delay, fast_io_fail_tmo, dev_loss_tmo. cmd_sg_entries: Default number of gather/scatter entries in the SRP command (default is 12, max 255). allow_ext_sg: Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false). topspin_workarounds: Enable workarounds for Topspin/Cisco SRP target bugs. reconnect_delay: Time between successive reconnect attempts. Time between successive reconnect attempts of SRP initiator to a disconnected target until dev_loss_tmo timer expires (if enabled), after that the SCSI target wi
40. xp_link 0x2000 0x2001 specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the positive x direction, while xm_link 0x2000 0x2001 specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch. In general, it is not necessary to configure both the positive and negative directions for a given coordinate; either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases both the positive and negative coordinate directions must be specified. Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration. x_dateline position, y_dateline position, z_dateline position: In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior. The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinate radix-1 for that dimension. By default, the common switch in a torus seed is take
41. In order to use the configuration file, run: host1# dhclient -cf dhclient.conf ib1 4.3.3.2 Static IPoIB Configuration. If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the '-n' option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface: a static IPoIB configuration; an IPoIB configuration based on an Ethernet configuration. See your Linux distribution documentation for additional information about configuring IP addresses. The following code lines are an excerpt from a sample IPoIB configuration file: # Static settings; all values provided by this file IPADDR_ib0=11.4.3.175 NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.0.0 BROADCAST_ib0=11.4.255.255 ONBOOT_ib0=1 # Based on eth0; each '*' will be replaced with a corresponding octet from eth0. LAN_INTERFACE_ib0=eth0 IPADDR_ib0=11.4.'*'.'*' NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.0.0 BROADCAST_ib0=11.4.255.255 ONBOOT_ib0=1 # Based on the first eth<n> interface that is found (for n=0,1,...); each '*' will be replaced with a corresponding octet from eth<n>. LAN_INTERFACE_ib0= IPADDR_ib0=11.4.'*'.'*' NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.0.0
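The "based on eth0" rule in the sample file, where placeholder octets in the IPoIB address are taken from the Ethernet interface's address, can be illustrated with a short Python sketch. It assumes the placeholder is written as '*' and that substitution is positional, octet by octet; `fill_from_eth` is a hypothetical helper, not part of the installation script, and the Ethernet address below is made up for the example.

```python
def fill_from_eth(template: str, eth_addr: str) -> str:
    """Replace each '*' octet in template with the octet at the same position in eth_addr."""
    t_octets = template.split(".")
    e_octets = eth_addr.split(".")
    if len(t_octets) != 4 or len(e_octets) != 4:
        raise ValueError("expected dotted-quad IPv4 strings")
    return ".".join(e if t == "*" else t for t, e in zip(t_octets, e_octets))


# eth0 is assumed to have address 192.168.3.175 in this example.
print(fill_from_eth("11.4.*.*", "192.168.3.175"))  # 11.4.3.175
```

With this reading, the template 11.4.*.* keeps the fixed IPoIB prefix and reuses the host-specific low octets of the Ethernet address, so each host gets a distinct but predictable IPoIB address.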
42. ibv_asyncwatch: Display asynchronous events forwarded to userspace for an InfiniBand device. Synopsis: ibv_asyncwatch Examples: 1. Display asynchronous events: > ibv_asyncwatch mlx4_0: async event FD 4 ibdump: Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX family adapters' InfiniBand ports. The dump file can be loaded by the Wireshark tool for graphical traffic analysis. The following describes a workflow for local HCA (adapter) sniffing: Run ibdump with the desired options. Run the application whose traffic you wish to analyze. Stop ibdump (CTRL-C) or wait for the data buffer to fill (in mem-mode). Open Wireshark and load the generated file. How to Get Wireshark: Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details. Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system. Synopsis: ibdump [options] Output Files: -d, --ib-dev=<dev>: use RDMA device <dev> (default: first device found). The relevant devices can be listed by running the 'ibv_devinfo' command. -i, --ib-port=<port>: use port <port> of IB device (default 1). -w, --write=<file>: dump file name (default "sniffer.pcap"); '-' stands for stdout, which enables piping to tcpdump or tshark. -o
43. 1.2.4 Directory Structure; 1.3 Architecture; 1.3.1 mlx4 VPI Driver; 1.3.2 mlx5 Driver; 1.3.3 Mid-layer Core; 1.3.4 ULPs; 1.3.5 MPI; 1.3.6 InfiniBand Subnet Manager; 1.3.7 Diagnostic Utilities; 1.3.8 Mellanox Firmware Tools; 1.4 Quality of Service; 1.5 RDMA over Converged Ethernet (RoCE); Chapter 2 Installation; 2.1 Hardware and Software Requirements; 2.2 Downloading Mellanox OFED; 2.3 Installing Mellanox OFED; 2.3.1 Pre-installation Notes; 2.3.2 Installation Script; 2.3.3 Installation Procedure; 2.3.4 Installation Results; 2.3.5 Post-installation Notes; 2.3.6 Installation Logging; 2.4 Updating Firmware After Installation; 2.5 Installing MLNX_OFED using YUM; 2.5.1 Setting up MLNX_OFED YUM Repository
44. If such an ini file cannot be found in the firmware directory, you may want to dump the configuration file using mstflint. Run: mstflint -dev <PCI device> dc > <ini device file> Step 4. Edit the ini file that you found in the previous step, and add the following lines to the [HCA] section in order to support 63 VFs: ;; SRIOV enable total_vfs = 63 num_pfs = 1 sriov_en = true Some servers might have issues accepting 63 Virtual Functions or more. In such a case, please set the number of total_vfs to any required value. Step 5. Create a binary image using the modified ini file. Run: mlxburn -fw <fw name>.mlx -conf <modified ini file> -wrimage <file name>.bin The file <file name>.bin is a firmware binary file with SR-IOV enabled that has 63 VFs. It can be spread across all machines and can be burnt using mstflint, which is part of the bundle, using the following command: mstflint -dev <PCI device> -image <file name>.bin b After burning the firmware, the machine must be rebooted. If the driver is only restarted, the machine may hang and a reboot using power OFF/ON might be required. 4.13.7 Configuring Pkeys and GUIDs under SR-IOV 4.13.7.1 Port Type Management. Port Type management is static when enabling SR-IOV (the connectx_port_config script will not work). The port type is set on the Host via a module parameter, port_type_array, in
45. 145 Rev 2 1 1 0 0 OpenSM Subnet Manager consolidate ipv6 snm req Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P Key consolidate ipv4 mask Use mask for IPv4 multicast groups multiplexing per MGID scope and P Key pid file path to file Specifies the file that contains the process ID of the opensm daemon The de max seq redisc fault is var run opensm pid Specifies the maximum number of failed discovery loops done by the SM befo mc secondary root guid lt G fines This option de mc primary root guid GUI This option de guid routing order no sca Don t use sca tte UID tter r fo re completing the w in hex gt the guid of the mul D in hex gt fines the guid of the mul sa pr full world queries allowed hole ticas ticas heavy sweep cycle secondary root switch primary roo r ports defined in guid routing orde This option allows OpenSM to respond full World Path Reco path record for each pair of ports in a fabric enable crashd t switch r file rd queries This option causes OpenSM to run Crash Daemon child process that allows backtrace dump in case of fatal terminating signals log prefix prefix text Prefix to syslog messages from Open M verbose v This option increases the log verbosity level The v option may be specified multiple times to furt
46. 3:96,4:96 qos_vlarb_low 0:1 cos elevil 0 1 291 370 Un Ldn ES Wap Loy 197137 Ly 19 Partition configuration file: Default=0x7fff, ipoib : ALL=full; PartA=0x8001, sl=1, ipoib : ALL=full; 8.8 Adaptive Routing 8.8.1 Overview. Adaptive Routing is at beta stage. Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes: Free AR: No constraints on output port selection. Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets. Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches. Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select the output port out of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. It especially suits 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis. If some switches do not support AR, they will slow down the AR Manager as it may get timeouts on the AR-related queries to these switches.
47. 4.10 Shared Memory Region; 4.11 XRC: eXtended Reliable Connected Transport Service for InfiniBand; 4.12 Flow Steering; 4.12.1 Enable/Disable Flow Steering; 4.12.2 Flow Domains and Priorities; 4.13 Single Root IO Virtualization (SR-IOV); 4.13.1 System Requirements; 4.13.2 Setting Up SR-IOV; 4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup; 4.13.4 Assigning a Virtual Function to a Virtual Machine; 4.13.5 Uninstalling SR-IOV Driver; 4.13.6 Burning Firmware with SR-IOV; 4.13.7 Configuring Pkeys and GUIDs under SR-IOV; 4.14 CORE-Direct; 4.14.1 CORE-Direct Overview; 4.15 Ethtool; 4.16 Dynamically Connected Transport Service; 4.17 PeerDirect; 4.18 Inline-Receive; 4.18.1 Querying Inline-Receive Capability; 4.18.2 Activating Inline-Receive; 4.19 Ethernet Performance Counters; 4.20 Memory Window
48. BROADCAST_ib0=11.4.255.255 ONBOOT_ib0=1 4.3.3.3 Manually Configuring IPoIB. This manual configuration persists only until the next reboot or driver restart. To manually configure IPoIB for the default IB partition (VLAN), perform the following steps: Step 1. To configure the interface, enter the ifconfig command with the following items: the appropriate IB interface (ib0, ib1, etc.); the IP address that you want to assign to the interface; the netmask keyword; the subnet mask that you want to assign to the interface. The following example shows how to configure an IB interface: host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0 Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The following example shows how to verify the configuration: host1$ ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0 UP BROADCAST MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Step 3. Repeat Step 1 and Step 2 on the remaining interface(s). 4.3.4 Subinterfaces. You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface
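The network and broadcast addresses that go with the example address 11.4.3.175 and netmask 255.255.0.0 can be derived rather than typed by hand. The following sketch uses Python's standard `ipaddress` module purely as a cross-check of the example values; it is not part of the IPoIB configuration procedure.

```python
import ipaddress

# Address and netmask taken from the ifconfig example in the excerpt.
iface = ipaddress.ip_interface("11.4.3.175/255.255.0.0")

network = str(iface.network.network_address)
broadcast = str(iface.network.broadcast_address)

print(network)    # 11.4.0.0
print(broadcast)  # 11.4.255.255
```

These match the network and broadcast values used for ib0 in the sample configuration and the broadcast address reported in the verification output.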
49. Default 5 restarted upon reaching this limit This option cannot be changed on the fly 8 8 5 1 1 Per switch AR Options A user can provide per switch configuration options with the following syntax 184 Mellanox Technologies m 2 1 1 0 0 SWITCH lt GUID gt switch option 1 gt switch option 2 gt The following are the per switch options Table 21 Adaptive Routing Manager Pre Switch Options File Option File Description Values ENABLE Allows you to enable disable the AR on this Default true lt true false gt switch If the general ENABLE option value is set to false then this per switch option is ignored This option can be changed on the fly AGEING_TIME Applicable to bounded AR mode only Specifies Default 30 lt usec gt how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst 32 bit value In the pre switch options file this option refers to the particular switch only This option can be changed on the fly 8 8 5 1 2 Example of Adaptive Routing Manager Options File ENABLE true LOG FILE tmp ar_mgr log LOG SIZE 100 MAX ERRORS 10 ERROR WINDOW Sp SWITCH 0x12345 ENABLE true AGEING TIME 77 SWITCH 0x0002c902004050 8 AGEING TIME 44 SWITCH Oxabcde ENABLE false Mellanox Technologie
50. HCAs, multicast group and report; Partitions report; IPoIB report. In case the IB fabric includes only one CA, then CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports. Error Codes: 1: Failed to fully discover the fabric. 2: Failed to parse command line options. 3: Failed to interact with IB fabric. 4: Failed to use local device or local port. 5: Failed to use Topology File. 6: Failed to load required Package. 9.5 ibdiagpath: IB diagnostic path. ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path. The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along th
51. 8.8.2 Installing the Adaptive Routing Manager
Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of the Mellanox OFED installation.
8.8.3 Running Subnet Manager with Adaptive Routing Manager
Adaptive Routing (AR) Manager can be enabled or disabled through the SM options file.
8.8.3.1 Enabling Adaptive Routing
To enable Adaptive Routing, perform the following:
1. Create the Subnet Manager options file. Run:
opensm -c <options-file-name>
2. Add armgr to the event_plugin_name option in the file:
# Event plugin name(s)
event_plugin_name armgr
3. Run Subnet Manager with the new options file:
opensm -F <options-file-name>
Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To provide an alternative location, perform the following:
1. Add armgr --conf_file <ar-mgr-options-file-name> to the event_plugin_options option in the file:
# Options string that would be passed to the plugin(s)
event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
2. Run Subnet Manager with the new options file:
opensm -F <options-file-name>
See an example of AR Manager options file with a
52. Name / Description
• Firmware Release Notes for Mellanox adapter devices: See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.
• MFT User's Manual: Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.
• MFT Release Notes: Release Notes for the Mellanox Firmware Tools. See under the docs/ folder of the installed package.
Support and Updates Webpage
Please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.
1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers. All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions. Mellanox OFED is certified with the following products:
• Mellanox Messaging Accelerator (VMA™) software: Socket acceleration library that performs OS bypass for standard socket-based applications.
• Mellanox Unified Fabr
53. (The counter listing that ends the previous example is not recoverable from this extraction.)
2. Read performance counters from LID 2, all ports:
> perfquery -a 2
# Port counters: Lid 2 port 255
(counter names and values are not recoverable from this extraction)
3. Read then reset performance counters from LID 2, port 1:
> perfquery -r 2 1
# Port counters: Lid 2 port 1
(counter names and values are not recoverable from this extraction)
9.14 ibcheckerrs
Validates an IB port or node a
54. Recommended BIOS Settings
These performance optimizations may result in higher power consumption.
7.1.3.1 General
Set BIOS power management to Maximum Performance.
7.1.3.2 Intel® Sandy Bridge Processors
The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors.
Table 17: Recommended BIOS Settings for Intel Sandy Bridge Processors
General
• Operating Mode / Power profile: Maximum Performance
Processor
• C-States: Disabled
• Turbo mode: Enabled
• Hyper-Threading (see note 1): HPC: disabled; Data Centers: enabled
• CPU frequency select: Max performance
Memory
• Memory speed: Max performance
• Memory channel mode: Independent
• Node Interleaving: Disabled / NUMA
• Channel Interleaving: Enabled
• Thermal Mode: Performance
Note 1: Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process, due to lower frequency of a single logical core when hyper-threading is enabled.
7.1.3.3 Intel Nehalem/Westmere Processors
The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.
Table 18: Recommended BIOS Settings for Intel Nehalem/Westmere
55. Settings on page 128 to verify that power state is disabled.
• Check the current CPU frequency to verify that it is configured to the maximum available frequency:
cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
7.2.4.1 Setting the Scaling Governor
If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance:
• freq_table
• acpi_cpufreq: this module is architecture dependent.
It is also recommended to disable the module cpuspeed; this module is also architecture dependent.
To set the scaling mode to performance, use:
echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
To disable cpuspeed, use:
service cpuspeed stop
7.2.4.2 Kernel Idle Loop Tuning
The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wake-up time but may result in higher power consumption. To tune the kernel idle loop, set the following options in the /etc/modprobe.d/mlx4.conf file:
• For MLNX_OFED 2.0.x: options mlx4_core enable_sys_tune=1
• For MLNX_EN 1.5.10: options mlx4_en enable_sys_tune=1
7.2.4.3 OS Controlled Power Management
Some operating systems can override BIOS power management configuration and enable C-states by default, which results in a higher latency. To resolve the high laten
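The governor change described in this excerpt applies to one CPU at a time. As a minimal sketch (not part of MLNX_OFED), the same write can be repeated for every CPU. The function name `set_scaling_governor` and the `cpu_root` parameter are assumptions introduced for illustration; the `cpu<N>/cpufreq/scaling_governor` layout is the standard Linux cpufreq sysfs layout that the excerpt refers to.

```python
import glob
import os


def set_scaling_governor(governor="performance",
                         cpu_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpu<N>/cpufreq/scaling_governor file.

    Returns the sorted list of files that were updated. On a real system
    this needs root privileges; `cpu_root` is a parameter (hypothetical,
    for illustration) so the function can be tried on a scratch directory.
    """
    updated = []
    # cpu[0-9]* matches cpu0, cpu1, ... but not cpufreq/ or cpuidle/.
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(governor + "\n")
        updated.append(path)
    return updated
```

Running it with the default arguments is the equivalent of the `echo performance > ...` command above, applied to each CPU that exposes a cpufreq directory.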
56. Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for:
• Generating a standard or customized Mellanox firmware image
• Querying for firmware information
• Burning a firmware image to a single InfiniBand node
MFT includes the following tools:
• mlxburn provides the following functions:
  - Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format
  - Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device
  - Querying the firmware version loaded on an HCA board
  - Displaying the VPD (Vital Product Data) of an HCA board
• flint: This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file.
• spark: This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII® switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
• Debug utilities: A set of debug utilities (e.g., itrace
Footnote 1: OpenSM is disabled by default. See Chapter 8, OpenSM Subnet Manager, for details on enabling it.
57. cat /sys/class/net/eth_ipoib_interfaces
eth4 over IB port: ib0
eth5 over IB port: ib1
The example above shows two eIPoIB interfaces, where eth4 runs traffic over ib0 and eth5 runs traffic over ib1.
Figure 3: An Example of a Virtual Network (diagram not recoverable from this extraction)
The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual Machines. To display the services provided to the Virtual Machine interfaces:
cat /sys/class/net/eth0/eth/vifs
Example:
cat /sys/class/net/eth0/eth/vifs
SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
In the example above, the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no VLAN tag for that interface.
4.8.3 VLAN Configuration Over an eIPoIB Interface
The eIPoIB driver supports VLAN Switch Tagging (VST) mode, which enables the virtual machine interface to have no VLAN tag over it, thus allowing VLAN tagging to be handled by the Hypervisor.
To attach a Virtual Machine interface to a specific isolated tag:
Step 1. Verify that the VLAN tag to be used has the same pkey value that is already configured on that ib port:
cat /sys/class/infiniband/mlx4_0/ports
58. comes with a pre-installed version of MXM v2.x and OpenMPI compiled with MXM v2.x. To check the version of MXM installed on your host, run:
rpm -qi mxm
To upgrade MLNX_OFED v2.0 or later with a newer MXM:
Step 1. Remove MXM:
rpm -e mxm
Step 2. Remove the pre-compiled OpenMPI:
rpm -e mlnx-openmpi_gcc
Step 3. Install the new MXM and compile the OpenMPI with it.
To run OpenMPI without MXM, run:
% mpirun -mca mtl ^mxm <...>
When upgrading to MXM v2.1, OpenMPI compiled with previous versions of MXM should be recompiled with MXM v2.1.
5.3.2 Enabling MXM in OpenMPI
MXM v2.1 is automatically selected by OpenMPI up to v1.6 when the Number of Processes (NP) is higher than or equal to 128. To enable MXM for any NP, use the following OpenMPI parameter:
-mca mtl_mxm_np <number>
From OpenMPI v1.7, MXM is selected when the number of processes is higher than or equal to 0, i.e. by default.
To activate MXM for any NP, run:
% mpirun -mca mtl_mxm_np 0 <other mpirun parameters>
5.3.3 Tuning MXM Settings
The default MXM settings are already optimized. To check the available MXM parameters and their default values, run the /opt/mellanox/mxm/bin/mxm_dump_config utility, which is part of the MXM RPM. MXM parameters can be modified in one of the following methods:
• Modifying the default MXM parameters value as part of the mpirun mpi
59. conventions is to use the Target port GUID as the initiator_ext value for the relevant path. If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:
id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200
Notes:
1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRP_DAEMON_ENABLE to yes).
4.1.2.6 High Availability (HA)
Overview
High Availability works using the Device-Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices. Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.
Operation
When a path (from p
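In the example in this excerpt, the initiator_ext value (ed2b400002c90200) is the low 64 bits of the dgid (0002c90200402bed) with the byte order reversed. The following is a minimal sketch of that relationship, inferred from this single example rather than from a documented rule; the function names are hypothetical.

```python
def port_guid_from_dgid(dgid_hex):
    """Return the low 64 bits (the port GUID) of a 128-bit GID in hex."""
    if len(dgid_hex) != 32:
        raise ValueError("expected 32 hex digits, got %r" % dgid_hex)
    return dgid_hex[16:].lower()


def initiator_ext_from_port_guid(port_guid_hex):
    """Reverse the byte order of a 64-bit GUID given as 16 hex digits.

    Assumption: this mirrors the byte-swapped form seen in the manual's
    example; it is not taken from srp_daemon's source.
    """
    if len(port_guid_hex) != 16:
        raise ValueError("expected 16 hex digits, got %r" % port_guid_hex)
    raw = bytes.fromhex(port_guid_hex)
    return raw[::-1].hex()


dgid = "fe800000000000000002c90200402bed"
print(initiator_ext_from_port_guid(port_guid_from_dgid(dgid)))
```

With the dgid from the example, the sketch reproduces the initiator_ext string shown in the same example.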
60. e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
2. Run:
For all distributions except SLES 11:
> modprobe ib_srpt
For SLES 11:
> modprobe -i ib_srpt
For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:
ib_srpt: no symbol version for scst_unregister
ib_srpt: Unknown symbol scst_unregister
ib_srpt: no symbol version for scst_register
ib_srpt: Unknown symbol scst_register
ib_srpt: no symbol version for scst_unregister_target_template
ib_srpt: Unknown symbol scst_unregister_target_template
B. On Initiator Machines
On Initiator machines, manually perform the following steps:
1. Run:
modprobe ib_srp
2. Run:
ibsrpdm -c -d /dev/infiniband/umadX
to discover a new SRP target, where:
umad0: port 1 of the first HCA
umad1: port 2 of the first HCA
umad2: port 1 of the second HCA
3. echo <new target info> > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered SCSI disks)
Example: Assume that you use port 1 of the first HCA in the system, i.e. mthca0:
[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0
id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226c 5,pkey=ffff,service_id=0002c90200226cf4
[root@lab104 ~]# echo id_ext=0002c9020022
61. 4.20.1 Query Capabilities
4.20.2 Allocating Memory Window
4.20.3 Binding Memory Windows
4.20.4 Invalidating Memory Window
4.20.5 Deallocating Memory Window
Chapter 5 HPC Features
5.1 Shared Memory Access
5.1.1 Mellanox ScalableSHMEM
5.1.2 Running SHMEM with FCA
5.1.3 Running ScalableSHMEM with MXM
5.1.4 Running SHMEM with Contiguous Pages
5.1.5 Running ScalableSHMEM Application
5.2 Message Passing Interface
5.2.1 Overview
5.2.2 Prerequisites for Running MPI
5.2.3 MPI Selector - Which MPI Runs
5.2.4 Compiling MPI Applications
5.3 MellanoX Messaging
5.3.1 Compiling OpenMPI with MXM
5.3.2 Enabling MXM in OpenMPI
5.3.3 Tuning MXM Settings
5.3.4 Configuring Multi-Rail Support
5.3.5 Configuring MXM over t
62. en (Ethernet)
• Mid-layer core: Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs): IPoIB, RDS, SRP Initiator and SRP
NOTE: RDS was not tested by Mellanox Technologies.
• MPI:
  - Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces
  - OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces
  - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
• OpenSM: InfiniBand Subnet Manager
• Utilities:
  - Diagnostic tools
  - Performance tests
• Firmware tools (MFT)
• Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
• Documentation
1.2.5 Firmware
The ISO image includes the following firmware items:
• Firmware images (.mlx format) for ConnectX-3, ConnectX-3 Pro and Connect-IB network adapters
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX-3 HCA devices
1.2.4 Directory Structure
The ISO image of MLNX_OFED_LINUX contains the following files and directories:
• mlnxofedinstall: the MLNX_OFED_LINUX installation script
• ofed_uninstall.sh: the MLNX_OFED_LINUX un-installation script
• <RPMS folders>: directory of binary RPMs for a specific CPU architecture
• firmware/: directory of the Mellanox IB HCA firmware images (including Boot-over-IB)
• src/: directory of the OFED source tarball
• mlnx_add_kernel_support.sh
63. fast_io_fail_tmo
echo 20 > /sys/class/srp_remote_ports/port-xxx/reconnect_delay
4.1.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically.
• Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:
echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mlx[hca number]-[port number]/add_target
See Section 4.1.2.3 for instructions on how the parameters in this echo command may be obtained.
Notes:
• Execution of the above echo command may take some time.
• The SM must be running while the command executes.
• It is possible to include additional parameters in the echo command:
  - max_cmd_per_lun: Default: 62
  - max_sect (short for max_sectors): sets the request size of a command
  - io_class: Default: 0x100, as in rev 16A of the specification (in rev 10 the default was 0xff00)
  - tl_retry_count: a number in the range 2..7 specifying the IB RC retry count. Default: 2
  - comp_vector: a number in the range 0..n-1 specifying the MSI-X completion vector. Some HCA
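The add_target string in this excerpt is a comma-separated list of key=value pairs. The sketch below only assembles such a string; it does not write to sysfs. The function name `build_add_target` is hypothetical, and the sample values are taken from the srp_daemon example elsewhere in this manual purely for illustration.

```python
def build_add_target(id_ext, ioc_guid, dgid, service_id, pkey="ffff", **extra):
    """Assemble the key=value string that is echoed into add_target.

    Mandatory fields follow the order used in the manual; optional
    parameters (e.g. max_cmd_per_lun, tl_retry_count) are appended in
    sorted order so the result is deterministic.
    """
    fields = [
        ("id_ext", id_ext),
        ("ioc_guid", ioc_guid),
        ("dgid", dgid),
        ("pkey", pkey),
        ("service_id", service_id),
    ]
    fields.extend(sorted(extra.items()))
    return ",".join("%s=%s" % (key, value) for key, value in fields)


target = build_add_target(
    id_ext="200500A0B81146A1",
    ioc_guid="0002c90200402bec",
    dgid="fe800000000000000002c90200402bed",
    service_id="200500a0b81146a1",
    max_cmd_per_lun=62,
)
print(target)
```

The resulting string is what would be passed to `echo -n ... > .../add_target`; validating the individual values is left out of this sketch.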
64. <port>: Optional, but requires specifying a device name. Default: all ports of the specified device. Description: Print information for the specified port only (of the specified device).
Examples:
1. List the status of all available InfiniBand devices and their ports:
> ibstatus
Infiniband device 'mlx4_0' port 1 status:
  default gid: fe80:0000:0000:0000:0000:0000:0007:3896
  base lid: 0x3
  sm lid: 0x3
  state: 4: ACTIVE
  phys state: 5: LinkUp
  rate: 20 Gb/sec (4X DDR)
Infiniband device 'mlx4_0' port 2 status:
  default gid: fe80:0000:0000:0000:0000:0000:0007:3897
  base lid: 0x1
  sm lid: 0x1
  state: 4: ACTIVE
  phys state: 5: LinkUp
  rate: 20 Gb/sec (4X DDR)
Infiniband device 'mthca0' port 1 status:
  default gid: fe80:0000:0000:0000:0002:c900:0101:d151
  base lid: 0x0
  sm lid: 0x0
  state: 2: INIT
  phys state: 5: LinkUp
  rate: 10 Gb/sec (4X)
Infiniband device 'mthca0' port 2 status:
  default gid: fe80:0000:0000:0000:0002:c900:0101:d152
  base lid: 0x0
  sm lid: 0x0
  state: 2: INIT
  phys state: 5: LinkUp
  rate: 10 Gb/sec (4X)
2. List the status of specific ports of specific devices:
> ibstatus mthca0:1 mlx4_0:2
Infiniband device 'mthca0' port 1 status:
  default gid: fe80:0000:0000:0000:0002:c900:0101:d151
  base lid: 0x0
  sm lid: 0x0
  state: 2: INIT
  phys state: 5: LinkUp
  rate: 10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
  default gid
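An ibstatus listing like the one in this excerpt is easy to post-process. The following is a minimal sketch of a parser; it assumes the usual layout of one "Infiniband device '<name>' port <n> status:" header per port followed by indented "key: value" lines. The function name `parse_ibstatus` is hypothetical and not part of the OFED tools.

```python
import re

# Header line that starts the block for one port.
HEADER = re.compile(r"Infiniband device '([^']+)' port (\d+) status:")


def parse_ibstatus(text):
    """Map (device, port) to a dict of the fields printed for that port."""
    ports = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            key = (header.group(1), int(header.group(2)))
            current = ports.setdefault(key, {})
            continue
        if current is not None and ":" in line:
            # Split on the first colon only: values such as the GID or
            # "4: ACTIVE" contain colons themselves.
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip()
    return ports


sample = (
    "Infiniband device 'mlx4_0' port 1 status:\n"
    "\tdefault gid:\t fe80:0000:0000:0000:0000:0000:0007:3896\n"
    "\tstate:\t\t 4: ACTIVE\n"
    "\trate:\t\t 20 Gb/sec (4X DDR)\n"
)
print(parse_ibstatus(sample)[("mlx4_0", 1)]["state"])
```

A caller could, for instance, filter the result for ports whose state field is not "4: ACTIVE" before running further diagnostics.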
65. is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
3. Fat-tree Routing Algorithm: This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.
4. LASH Routing Algorithm: Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.
5. DOR Routing Algorithm: Based on the Min Hop algorithm, but avoids port equalization, except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh.
6. Torus-2QoS Routing Algorithm: Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Additionally, it can route around multiple failed fabric links or a single failed fabric switch without introducing deadloc
66. its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template.
--guid, -g <GUID in hex>
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
--lmc, -l <LMC>
This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e. multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.
--priority, -p <PRIORITY>
This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. Range goes from 0 (lowest priority) to 15 (highest).
--smkey, -k <SM_Key>
This option specifies the SM's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in a host byte order; it is fixed now, but you may need this option to interoperate
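As a worked example of the LMC rule in this excerpt (2^LMC LIDs per port, LMC in the range 0-7), the sketch below computes the LID count and the LIDs a port answers to. The function names are hypothetical; the requirement that the base LID has its low LMC bits cleared is stated here as an assumption based on general InfiniBand addressing, not on this manual.

```python
def lids_per_port(lmc):
    """Number of LIDs assigned to each port for a given LMC (0..7)."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7, got %r" % lmc)
    return 1 << lmc


def port_lids(base_lid, lmc):
    """List the LIDs a port responds to, starting at its base LID.

    Assumption: the base LID is aligned so that its low `lmc` bits are 0.
    """
    count = lids_per_port(lmc)
    if base_lid & (count - 1):
        raise ValueError("base LID 0x%x is not aligned for LMC %d" % (base_lid, lmc))
    return list(range(base_lid, base_lid + count))


for lmc in (0, 2, 7):
    print(lmc, lids_per_port(lmc))
print(port_lids(8, 2))
```

With LMC = 0 each port has exactly one LID, which corresponds to the single-path default described above.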
67. -l <[src-lid,]dst-lid>: Source and destination LIDs (source may be omitted => the local port is assumed to be the source)
-d <p1,p2,p3,...>: Directed route from the local node (which is the source) and the destination node
-c <count>: The minimal number of packets to be sent across each link (default = 100)
-v: Enable verbose mode
-t <topo-file>: Specifies the topology file name
-s <sys-name>: Specifies the local system name. Meaningful only if a topology file is specified
-i <dev-index>: Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
-p <port-num>: Specifies the local device's port number used to connect to the IB fabric
-o <out-dir>: Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>: Specifies the expected link width
-ls <2.5|5|10>: Specifies the expected link speed
-pm: Dump all the fabric links' pm Counters into ibdiagnet.pm
-pc: Reset all the fabric links' pmCounters
-P <PM=<Trash>>: If any of the provided pm is greater than its provided value, print it to screen
-h|--help: Prints the help page information
-V|--version: Prints the version of the tool
--vars: Prints the tool's environment variables and their values
Output Files
Table 28: ibdiagpath Output Files
Output File / Description
ibdiagpath.log: A dump of all the application reports generated accor
68. <ib port>/pkeys
Step 2. Create a VLAN interface in the Hypervisor, over the eIPoIB interface:
vconfig add <eIPoIB interface> <vlan tag>
Step 3. Attach the new VLAN interface to the same bridge that the virtual machine interface is already attached to:
brctl addif <br-name> <interface-name>
For example, to create the VLAN tag 3 with pkey 0x8003 over that port in the eIPoIB interface eth4, run:
vconfig add eth4 3
brctl addif br2 eth4.3
4.8.4 Setting Performance Tuning
• Use 4K MTU over OpenSM:
Default=0xffff, ipoib, mtu=5 : ALL=full;
For further information, please refer to Section 8.4.1, File Format, on page 151.
• Use MTU for 4K (4092 Bytes):
  - In UD mode, the maximum MTU value is 4092 Bytes.
  - Make sure that all interfaces (including the guest interface and its virtual bridge) have the same MTU value (MTU 4092 Bytes). For further information on MTU settings, please refer to the Hypervisor User Manual.
• Tune the TCP/IP stack using sysctl (dom0/domU):
/sbin/sysctl_perf_tuning
• Other performance tuning for KVM environments, such as vCPU pinning and NUMA tuning, may apply. For further information, please refer to the Hypervisor User Manual.
4.9 Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over physically contiguous pages. It enables a user application to ask low level
69. <mvapich-ver>/etc/mvapich.conf
Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps
MellanoX Messaging
MellanoX Messaging (MXM) provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
• Multiple transport support, including RC and UD
• Proper management of HCA resources and memory structures
• Efficient memory registration
• One-sided communication semantics
• Connection management
• Receive-side tag matching
• Intra-node shared memory communication
These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries. The latest MXM software can be downloaded from the Mellanox website. MLNX_OFED v2.0 or later comes with a pre-installed version of MXM v2.x and OpenMPI compiled with MXM v2.x.
Compiling OpenMPI with MXM
Step 1. Install MXM from RPM:
rpm -ihv mxm-x.y.z-1.x86_64.rpm
MXM will be installed automatically in the /opt/mellanox/mxm folder.
Step 2. Enter the OpenMPI source directory and run:
cd $OMPI_HOME
./configure --with-mxm=/opt/mellanox/mxm <other configure parameters>
make all && make install
MLNX_OFED v2.0 or later
70. (the statement preceding this check is not recoverable from this extraction)
if (ret > 0) {
    /* CQ returned a wc */
    if (wc_ex.wc_flags & IBV_WC_WITH_TIMESTAMP) {
        /* This wc contains a timestamp */
        timestamp = wc_ex.timestamp;
        /* Timestamp is given in raw hardware time */
    }
}
CQs that are opened with the ibv_create_cq_ex verb should always be polled with the ibv_poll_cq_ex verb.
4.6.2.4 Querying the Hardware Time
Querying the hardware for time is done via the ibv_query_values_ex verb. For example:
ret = ibv_query_values_ex(context, IBV_VALUES_HW_CLOCK, &queried_values);
if (!ret && queried_values.comp_mask & IBV_VALUES_HW_CLOCK)
    queried_time = queried_values.hwclock;
To get the queried time in nanoseconds resolution, use the IBV_VALUES_HW_CLOCK_NS flag along with the hwclock_ns field:
ret = ibv_query_values_ex(context, IBV_VALUES_HW_CLOCK_NS, &queried_values);
if (!ret && queried_values.comp_mask & IBV_VALUES_HW_CLOCK_NS)
    queried_time_ns = queried_values.hwclock_ns;
Querying the Hardware Time is available only on physical functions / native machines.
4.7 Atomic Operations
Atomic Operations are applicable to the mlx4 driver only.
4.7.1 Enhanced Atomic Operations
ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atom
71. not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.
UPDN Algorithm Usage
Activation through OpenSM:
• Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
• Use '-a <root_guid_file>' for adding an UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.
Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
8.5.4 Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling non-constant K, cases where not all leaves (CAs) are present, and any Constant Bisectional Bandwidth (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks. If the r
72. ofed/RPM-GPG-KEY-Mellanox
# wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
--2013-08-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Resolving www.mellanox.com... 72.3.194.0
Connecting to www.mellanox.com|72.3.194.0|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1354 (1.3K) [text/plain]
Saving to: 'RPM-GPG-KEY-Mellanox'
2013-08-20 13:52:30 (247 MB/s) - 'RPM-GPG-KEY-Mellanox' saved [1354/1354]
Step 4. Install the key:
# sudo rpm --import RPM-GPG-KEY-Mellanox
Step 5. Check that the key was successfully imported:
# rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
gpg-pubkey-a9e4b643-520791ba gpg(Mellanox Technologies <support@mellanox.com>)
Step 6. Create a YUM repository configuration file called /etc/yum.repos.d/mlnx_ofed.repo with the following content:
[mlnx_ofed]
name=MLNX_OFED Repository
baseurl=file:///<path to extracted MLNX_OFED package>
enabled=1
gpgkey=file:///<path to the downloaded key RPM-GPG-KEY-Mellanox>
gpgcheck=1
Step 7. Check that the repository was successfully added:
# yum repolist
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
repo id    repo name    status
mlnx_o
73. output file: alias for the '-w' option. Do not use; kept for backward compatibility.
-b, --max-burst=<log2 burst>: log2 of the maximal burst size that can be captured with no packets loss. Each entry takes ~MTU bytes of memory (default: 12, i.e. 4096 entries)
-s, --silent: do not print progress indication
--mem-mode <size>: when specified, packets are written to file only after the capture is stopped. It is faster than the default mode (less chance for packet loss) but takes more memory. In this mode, ibdump stops after <size> bytes are captured
--decap: Decapsulate port mirroring headers. Should be used when capturing RSPAN traffic
-h, --help: Display this help screen
-v, --version: Print version information
Examples:
1. Run ibdump:
> ibdump
IB device and IB port: (values not recoverable from this extraction)
Dump file: sniffer.pcap
Sniffer WQEs (max burst size): 4096
Initiating resources ...
searching for IB devices in host
Port active_mtu=2048
MR was registered with addr=0x60d850, lkey=0x28042601, rkey=0x28042601, flags=0x1
QP was created, QP number=0x4004a
Ready to capture (Press ^c to stop):
Appendix A: Mellanox FlexBoot
A.1 Overview
Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet. Using Mellanox Virtual Protocol Interconnect (VPI) technolo
74. parameter value in the attached rule.
Note: Since the VLAN ID in the Ethernet header is 12 bits long, the following parameter should be used:
flow_spec_eth.mask.vlan_tag = htons(0x0fff);
• All-zero mask: ignore the parameter value in the attached rule.
When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule specifications. The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the Ethernet link type, since InfiniBand link type packets always include a QP number. For further information, please refer to the relevant man pages.
• ibv_destroy_flow
int ibv_destroy_flow(struct ibv_flow *flow_id)
Input parameters: ibv_destroy_flow requires struct ibv_flow, which is the return value of ibv_create_flow in case of success.
Output parameters: Returns 0 on success, or the value of errno on failure.
For further information, please refer to the ibv_destroy_flow man page.
• Ethtool
The Ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool man page for all the ways to specify a flow.
Examples:
• ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2
All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).
• ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
All p
75. per a single VM. However, the number of VFs varies upon the working mode requirements.
The protocol types are:
- Port 1 = IB
- Port 2 = Ethernet
- port_type_array=2,2 (Ethernet, Ethernet)
- port_type_array=1,1 (IB, IB)
- port_type_array=1,2 (VPI: IB, Ethernet)
- NO port_type_array module parameter: ports are IB

Step 9. Reboot the server.
Note: If SR-IOV is not supported by the server, the machine might not come out of boot/load.
Step 10. Load the driver and verify that SR-IOV is supported. Run:
    lspci | grep Mellanox
    03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
    03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
Where:
- "03:00" represents the Physical Function
- "03:00.X" represents the Virtual Function connected to the Physical Function

4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup
To enable S
76. qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15, or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.
Note that the same VL may be listed multiple times in the High or Low priority arbitration tables, and, further, it can be listed in both tables.
The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes.
- A high_limit value of 255 indicates that the byte limit is unbounded. If the 255 value is used, the low priority VLs may be starved.
- A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.
Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to ach
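The credit arithmetic in excerpt 76 can be checked with a short worked example. This is a minimal Python sketch assuming only the figures stated in the text (64-byte credits, a high_limit unit of 4K bytes, and 255 meaning unbounded); the function names are hypothetical and are not part of OpenSM.

```python
CREDIT_BYTES = 64          # one VLArb weight unit (credit) is 64 bytes
HIGH_LIMIT_UNIT = 4096     # high_limit is counted in units of 4K bytes
HIGH_LIMIT_UNBOUNDED = 255 # special value: no byte limit


def credits_for_packet(packet_bytes: int) -> int:
    """Number of 64-byte credits needed to send one packet (rounded up)."""
    if packet_bytes <= 0:
        raise ValueError("packet_bytes must be positive")
    return (packet_bytes + CREDIT_BYTES - 1) // CREDIT_BYTES


def high_limit_bytes(high_limit: int):
    """Byte budget of the high-priority table, or None when unbounded.

    A result of 0 corresponds to the special case in the text: a single
    high-priority packet may still be sent before the low-priority table
    gets its opportunity.
    """
    if not 0 <= high_limit <= 255:
        raise ValueError("high_limit must be in the range 0-255")
    if high_limit == HIGH_LIMIT_UNBOUNDED:
        return None
    return high_limit * HIGH_LIMIT_UNIT


print(credits_for_packet(4096))   # a 4KB MTU packet needs 64 credits
print(high_limit_bytes(4))        # 4 * 4K bytes
```

A weight smaller than the credits needed for one MTU-sized packet therefore cannot cover a full packet in a single arbitration turn, which is why the text relates weights to the MTU.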
77. radix 4 dimensions. In the event the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and it will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.

8.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines. Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations. For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS. Since SL to VL map configuration mu
78. routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, consider that same 2D 6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:
[Figure: master spanning tree for the 2D 6x5 torus with the failed link; x axis 0 to 5, y axis 0 to 4]
Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing.
In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no extra turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e., the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case:
[Figure: master spanning tree for the 2D 6x5 torus with the switch at (3,2) failed]
79. s allocate multiple (n) MSI-X vectors per HCA port. If the IRQ affinity masks of these interrupts have been configured such that each MSI-X interrupt is handled by a different CPU, then the comp_vector parameter can be used to spread the SRP completion workload over multiple CPUs.
- cmd_sg_entries: a number in the range 1..255 that specifies the maximum number of data buffer descriptors stored in the SRP_CMD information unit itself. With allow_ext_sg=0, the parameter cmd_sg_entries defines the maximum S/G list length for a single SRP_CMD, and commands whose S/G list length exceeds this limit after S/G list collapsing will fail.
- initiator_ext: please refer to Section 9 (Multiple Connections
To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
- Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
- Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.

4.1.2.2.1 SRP sysfs Parameters
Interface for making ib_srp connect to a new target. One can request ib_srp to connect to a new target by writing a comma-separated list of login parameters to this sysfs attribute. The supported parameters are:
id_ext: A 16-digit hexadecimal number specifying the eight-byte identifier extension in the 16-byte SRP target port identifier. The target port identifier is sent by ib_srp to the
80. sent successfully
vport<i>_tx_broadcast_bytes    Broadcast packet bytes sent successfully
vport<i>_tx_errors             Packets dropped due to transmit errors

Table 12 - SW Statistics
Counter              Description
rx_lro_aggregated    Number of packets aggregated
rx_lro_flushed       Number of LRO flushes to the stack
rx_lro_no_desc       Number of times an LRO descriptor was not found
rx_alloc_failed      Number of times preparing a receive descriptor failed
rx_csum_good         Number of packets received with good checksum
rx_csum_none         Number of packets received with no checksum indication
tx_chksum_offload    Number of packets transmitted with checksum offload
tx_queue_stopped     Number of times the transmit queue was suspended
tx_wake_queue        Number of times the transmit queue was resumed
tx_timeout           Number of transmitter timeouts
tx_tso_packets       Number of packets that were aggregated

Table 13 - Per Ring SW Statistics (where <i> is the ring index, per configuration)
Counter              Description
rx<i>_packets        Total packets successfully received on ring i
rx<i>_bytes          Total bytes in successfully received packets on ring i
tx<i>_packets        Total packets successfully transmitted on ring i
tx<i>_bytes          Total bytes in successfully transmitted packets on ring i

4.20 Memory Window
Memory Window allows the application to have a more flexible cont
81. specific target port GUID
end-qos-ulps

Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query did not match any other rule. All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section.
Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.

8.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
    ipoib : <SL>
    ipoib, pkey 0x7fff : <SL>
    any, pkey 0x7fff : <SL>

8.6.6.2 SDP
An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:
    sdp : <SL>
    any, service-id 0x0000000000010000-0x000000000001ffff : <SL>

8.6.6.3 RDS
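The SDP Service ID layout described in excerpt 81 (0x000000000001PPPP, with PPPP holding the remote TCP/IP port) can be reproduced with a few lines of arithmetic. This is a minimal Python sketch; the function names are hypothetical and exist only to illustrate why the "sdp" rule and the "any, service-id" range rule cover the same queries.

```python
SDP_SERVICE_ID_BASE = 0x0000000000010000  # 0x000000000001PPPP with PPPP = 0
SDP_SERVICE_ID_LAST = 0x000000000001FFFF  # PPPP = 0xffff


def sdp_service_id(tcp_port: int) -> int:
    """Service ID an SDP PR query would carry for a given remote TCP/IP port."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("TCP port must fit in 16 bits")
    return SDP_SERVICE_ID_BASE | tcp_port


def is_sdp_service_id(service_id: int) -> bool:
    """True if the Service ID falls in the range matched by the 'sdp' rule."""
    return SDP_SERVICE_ID_BASE <= service_id <= SDP_SERVICE_ID_LAST


sid = sdp_service_id(30000)
print(format(sid, "#018x"))       # 16 hex digits after the 0x prefix
print(is_sdp_service_id(sid))
```

Port 30000 is 0x7530, so the Service ID is 0x0000000000017530, which lies inside the range quoted in the equivalent "any, service-id" rule.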
82. strict
    tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
    [per-priority skprio and tos listing]
    tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
    [per-priority skprio and tos listing]

4.5.8.2 tc and tc_wrap.py
The tc tool is used to set up the sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The tc_wrap.py tool will use either sysfs or the tc tool to configure the sk_prio to UP mapping.
Usage:
    tc_wrap.py -i <interface> [options]
Options:
    --version                              show program's version number and exit
    -h, --help                             show this help message and exit
    -u SKPRIO_UP, --skprio_up=SKPRIO_UP    maps sk_prio to UP. LIST is 16 comma separated UP. The index of an element is the sk_prio.
    -i INTF, --interface=INTF              Interface name
Example: set skprio 0-2 to UP0, and skprio 3-7 to UP1 on eth4
    [example output: UP to skprio listing]
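The -u option in excerpt 82 takes a comma-separated list in which the position of each element is the sk_prio and the value is the UP. The following minimal Python sketch builds such a list for the example in the text (skprio 0-2 to UP0, skprio 3-7 to UP1). The helper name is hypothetical, and mapping every unspecified sk_prio to UP0 is an assumption made for illustration; the excerpt does not say what tc_wrap.py does with unlisted entries.

```python
def build_skprio_up_list(assignments, size=16, default_up=0):
    """Build the comma-separated UP list; element index is the sk_prio.

    assignments maps (first_skprio, last_skprio) ranges to a UP value.
    Entries not covered by any range get default_up (an assumption).
    """
    ups = [default_up] * size
    for (first, last), up in assignments.items():
        if not 0 <= first <= last < size:
            raise ValueError(f"invalid sk_prio range {first}-{last}")
        if not 0 <= up <= 7:
            raise ValueError("UP must be in the range 0-7")
        for skprio in range(first, last + 1):
            ups[skprio] = up
    return ",".join(str(up) for up in ups)


up_list = build_skprio_up_list({(0, 2): 0, (3, 7): 1})
print(up_list)
```

The resulting string is the kind of value that would be passed to the -u option together with -i eth4.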
83. target in the SRP_LOGIN_REQ request.
ioc_guid           A 16-digit hexadecimal number specifying the eight-byte I/O controller GUID portion of the 16-byte target port identifier.
dgid               A 32-digit hexadecimal number specifying the destination GID.
pkey               A four-digit hexadecimal number specifying the InfiniBand partition key.
service_id         A 16-digit hexadecimal number specifying the InfiniBand service ID used to establish communication with the SRP target. How to find out the value of the service ID is specified in the documentation of the SRP target.
max_sect           A decimal number specifying the maximum number of 512-byte sectors to be transferred via a single SCSI command.
max_cmd_per_lun    A decimal number specifying the maximum number of outstanding commands for a single LUN.
io_class           A hexadecimal number specifying the SRP I/O class. Must be either 0xff00 (rev 10) or 0x0100 (rev 16a). The I/O class defines the format of the SRP initiator and target port identifiers.
initiator_ext      A 16-digit hexadecimal number specifying the identifier extension portion of the SRP initiator port identifier. This data is sent by the initiator to the target in the SRP_LOGIN_REQ request.
allow_ext_sg, sg_tablesize, comp_vector
cmd_sg_entries     A number in the range 1..255 that specifies the maximum number of data buffer descriptors stored in the SRP_CMD information unit i
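The login parameters listed in excerpt 83 are written to the add_target attribute as a single comma-separated string. The following minimal Python sketch assembles and length-checks such a string using the digit counts stated in the table. The helper name, the key=value form, and the sample values are illustrative assumptions; the sketch only builds the string and does not write to sysfs.

```python
import string

# Required hex-digit counts, as stated in the parameter table.
HEX_DIGITS = {"id_ext": 16, "ioc_guid": 16, "dgid": 32, "pkey": 4, "service_id": 16}


def srp_login_string(**params: str) -> str:
    """Join login parameters as 'name=value' pairs separated by commas.

    Hypothetical helper: validates only the fixed-length hex parameters.
    """
    parts = []
    for name, value in params.items():
        digits = HEX_DIGITS.get(name)
        if digits is not None:
            if len(value) != digits or any(c not in string.hexdigits for c in value):
                raise ValueError(f"{name} must be {digits} hexadecimal digits")
        parts.append(f"{name}={value}")
    return ",".join(parts)


# Sample values are illustrative only.
login = srp_login_string(
    id_ext="200400a0b81146a1",
    ioc_guid="0002c90200402bd4",
    dgid="fe800000000000000002c90200402bd5",
    pkey="ffff",
    service_id="200400a0b81146a1",
)
print(login)
```

Checking the digit counts before writing the string catches truncated GUIDs or GIDs, which are easy to introduce when copying values by hand.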
84. the IB RC retry count.

4.1.2.3 SRP Tools - ibsrpdm, srp_daemon and srpd Service Script
To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
- Detect targets on the fabric reachable by the Initiator (for Step 1)
- Output target attributes in a format suitable for use in the above "echo" command (Step 2)
- A service script, srpd, which may be started at stack startup
The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages. Below, several usage scenarios for these utilities are presented.

ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
a. To detect all targets reachable by the SRP initiator via the default umad device (/sys/class/infiniband_mad/umad0), execute the following command:
    ibsrpdm
This command will output information on each SRP Target detected, in human-readable form. Sample output:
    IO Unit Info:
    port LID: 0103
    port GID: e800000000000000002c90200402bd5
    change ID: 0002
    max controllers: 0x10
    controller[ 1]
    GUID: 0002c90200402bd4
    vendor ID: 0002c9
    device ID: 005a44
    IO class : 0100
    ID: LSI Storage Systems SRP Driver 200400a0b81146a1
    service entries: 1
    service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
b. To detect
85. the InfiniBand ports consists of a single entry at index 0 that maps to a unique index in the physical GID table. The vHCA of the PF maps to physical GID index 0. To obtain GIDs for other vHCAs, alias GUIDs are requested from the SM. These GIDs are mapped to vHCAs as follows: vHCA number x is assigned the GID/GUID at index x of the physical GID table.
Each vHCA port presents its own virtual PKey table. The virtual PKey table presented to a VF is a mapping of selected indexes of the physical PKey table. The host admin can control which PKey indexes are mapped to which virtual indexes using a sysfs interface (see Section on page 100). The physical PKey table may contain both full and partial memberships of the same PKey, to allow different membership types in different virtual tables.
Each vHCA port has its own virtual port state. A vHCA port is up if the following conditions apply:
- The physical port is up
- The virtual GID table contains the GIDs requested by the host admin
- The SM has acknowledged the requested GIDs since the last time that the physical port went up
Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask.
To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs "iov" sub-tree has been added under the PF InfiniBand device.

4.13.7.2.1 SR-IOV sysfs Administration Interfaces on the Hypervisor
Administration of GUIDs and PKeys is done via the sysfs interface in the Hy
86. the Virtual Machine goes to the virtual bridge in the Hypervisor, and from the bridge to the eIPoIB interface. The eIPoIB interface is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath.

4.8.1 Enabling the eIPoIB Driver
Once the mlnx_ofed driver installation is completed, perform the following:
Step 1. Open the /etc/infiniband/openib.conf file and include:
    E_IPOIB_LOAD=yes
Step 2. Restart the InfiniBand drivers:
    /etc/init.d/openibd restart

4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver
When eth_ipoib is loaded, a number of eIPoIB interfaces are created, with the following default naming scheme: ethX, where X represents the ETH port available on the system.
To check which eIPoIB interfaces were created:
    cat /sys/class/net/eth_ipoib_interfaces
For example, on a system with a dual port HCA, the following two interfaces might be created: eth4 and eth5.
    cat /sys/class/net/eth_ipoib_interfaces
    eth4 over IB port: ib0
    eth5 over IB port: ib1
These interfaces can be used to configure the network for the guest. For example, if the guest has a VIF that is connected to the Virtual Bridge br0, then enslave the eIPoIB interface to br0 by running:
    brctl addif br0 ethX
In a RHEL KVM environment, there are other methods to create/configure your virtual net w
87. the public key:
    host1$ cat id_rsa.pub
    ssh-rsa AAAAB3NzaClyc2EAAAABIWAAAQEA zVY8VBHQh90kZN70A1ibUQ74RXm4zHeczyVxpYHaDPyDmgezbYMKrCIVz d10bHt ZkCOrpLYviUOoUHd3fvNTfMs0gcGg08PysUft12FyYjira2Plxyg6mkHLGGqVutfEMmABZ3wNCUg6J2X 3G uiuSWXeubZmbXcMrP wAIWByfH8ajwo6A5WioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorX NCOZBe4kTnUqm63nQ2z1qVMdL9FrCmalxIOu94SQJAjwONevaMz FKEHe7YHg6YrNfXunfdbEurzB524TpPcrod ZlfCQ <username>@host1
Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine:
    host1$ cat id_rsa.pub | xargs ssh host2 "echo >> /home/<username>/.ssh/authorized_keys2"
    <username>@host2's password:    // Enter password
    host1$
For a local machine, simply add the key to authorized_keys2:
    host1$ cat id_rsa.pub >> authorized_keys2
Step 5. Test:
    host1$ ssh host2 uname
    Linux

5.2.3 MPI Selector - Which MPI Runs
Mellanox OFED contains a simple mechanism for system administrators and end users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi
88. to the configuration of Ethernet interfaces. In other words, you need to make sure that the IPoIB configuration files include the following line:
- For RedHat: BOOTPROTO=dhcp
- For SLES: BOOTPROTO=dhcp
If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine, or /etc/sysconfig/network/ on a SuSE machine.
A patch for DHCP is required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier; see Section A.4.2, "Configuring the DHCP Server", on page 225.

4.3.3.1.1 DHCP Server
In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server
89. usecs N] [rx-frames N]
    Sets the interrupt coalescing settings when the adaptive moderation is disabled. Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.
ethtool -a eth<x>
    Queries the pause frame settings.
ethtool -A eth<x> [rx on|off] [tx on|off]
    Sets the pause frame settings.
ethtool -g eth<x>
    Queries the ring size values.
ethtool -G eth<x> [rx <N>] [tx <N>]
    Modifies the ring sizes.
ethtool -S eth<x>
    Obtains additional device statistics.
ethtool -t eth<x>
    Performs a self-diagnostics test.
ethtool -s eth<x> msglvl [N]
    Changes the current driver message level.
ethtool -T eth<x>
    Shows time stamping capabilities.
ethtool -l eth<x>
    Shows the number of channels.
ethtool -L eth<x> [rx <N>] [tx <N>]
    Sets the number of channels.

4.16 Dynamically Connected Transport Service
Dynamically Connected transport (DCT) is currently at beta level. Please be aware that the content below is subject to change.
The Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having Reliable type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connecte
90. <ver>-<OS label>-<CPU arch>.iso
You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 3. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page:
    host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

2.3 Installing Mellanox OFED
The installation script, mlnxofedinstall, performs the following:
- Discovers the currently installed kernel
- Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
- Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

2.3.1 Pre-installation Notes
- The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.
Note: Pre-existing configuration files will be saved with the extension ".conf.rpmsave".
- If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on al
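The integrity check in Step 3 of excerpt 90 compares the MD5 digest of the ISO image with the value published on the download page. The following minimal Python sketch performs the same comparison with the standard hashlib module, for cases where the check is scripted. The function names are hypothetical, and the demonstration hashes an in-memory stream rather than a real ISO file.

```python
import hashlib
import io


def md5_hex(stream, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path: str) -> str:
    """MD5 digest of a file on disk, e.g. a downloaded ISO image."""
    with open(path, "rb") as handle:
        return md5_hex(handle)


def matches_published(actual_hex: str, published_hex: str) -> bool:
    """Compare digests, ignoring case and surrounding whitespace."""
    return actual_hex.strip().lower() == published_hex.strip().lower()


# Demonstration on an in-memory stream instead of a real ISO image.
demo_digest = md5_hex(io.BytesIO(b"hello"))
print(demo_digest)
```

Reading in chunks keeps memory use flat for multi-gigabyte images, which is the main reason not to load the file in one call.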
91. vm1 uses VF 0000:02:00.1 and vm2 uses VF 0000:02:00.2.
Step c. Configure the virtual-to-physical PKey mapping for the VMs:
    echo 0 > 0000:02:00.1/ports/1/pkey_idx/1
    echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
    echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
    echo 2 > 0000:02:00.2/ports/1/pkey_idx/0
vm1 pkey index 0 will be mapped to physical pkey index 1, and vm2 pkey index 0 will be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1 mapped to the default pkey.
Step d. On Host2, do the following:
    cd /sys/class/infiniband/mlx4_0/iov
    echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
    echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
    echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
    echo 2 > 0000:03:00.2/ports/1/pkey_idx/0
Step e. Once the VMs are running, you can check the VM's virtualized PKey table by running the following on the VM:
    cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]
Step 3. Start up the VMs (and bind VFs to them).
Step 4. Configure IP addresses for ib0 on the host and on the guests.

4.13.7.3 Ethernet Virtual Function Configuration when Running SR-IOV
4.13.7.3.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
When running ETH ports on VGT, the ports may be configured to simply pass through packets as is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/Qo
92. with a specific PKey in the PR/MPR query
- Any ULP/application with a specific target IB port GUID in the PR/MPR query
Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file. The shortest policy file in this case would be as follows:
    qos-ulps
        default : 0 # default SL
    end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all. Below is an example of a simple QoS policy with all the possible keywords:
    qos-ulps
        default                              : 0 # default SL
        sdp, port-num 30000                  : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
        sdp, port-num 10000-20000            : 0
        sdp                                  : 1 # default SL for any other application running on top of SDP
        rds                                  : 2 # SL for RDS traffic
        ipoib, pkey 0x0001                   : 0 # SL for IPoIB on partition with pkey 0x0001
        ipoib                                : 4 # default IPoIB partition, pkey=0x7FFF
        any, service-id 0x6234               : 6 # match any PR/MPR query with a specific Service ID
        any, pkey 0x0ABC                     : 6 # match any PR/MPR query with a specific PKey
        srp, target-port-guid 0x1234         : 5 # SRP when SRP Target is located on a specified IB port GUID
        any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with a
93. 00000239 00 00 02 0000200910539

A.5 Subnet Manager - OpenSM
This section applies to ports configured as InfiniBand only.
FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM - Subnet Manager" on page 139.
Note: To use OpenSM caching for large InfiniBand clusters (> 100 nodes), it is recommended to use the OpenSM options described in Section 8.2.1, "opensm Syntax", on page 139.

A.6 BIOS Configuration
The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add to the list of boot devices "MLNX FlexBoot <ver>" for a ConnectX device. The priority of this list can be modified through BIOS setup.

A.7 Operation
A.7.1 Prerequisites
- Make sure that your client is connected to the server(s)
- The FlexBoot image is already programmed on the adapter card (see Section A.3)
- For InfiniBand ports only: Start the Subnet Manager as described in Section A.5
- The DHCP server should be configured and started (see Section 4.3.3.1, "IPoIB Configuration Based on DHCP", on page 58)
- Configure and start at least one of the services iSCSI Target (see Section A.9) and/or TFTP (see Section A.6)
94. 006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> eth5 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> ib0 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 2 (DOWN) ==> ib1 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 1 (DOWN) ==> eth2 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN) ==> eth3 (Down)
    sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev
    mlx4_0 port 1 ==> eth5 (Down)
    mlx4_0 port 1 ==> ib0 (Down)
    mlx4_0 port 2 ==> ib1 (Down)
    mlx4_1 port 1 ==> eth2 (Down)
    mlx4_1 port 2 ==> eth3 (Down)

9.9 ibstatus
Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.
Synopsis
    ibstatus [-h] [<device name>[:<port>]]
Output Files
Table 30 lists the various flags of the command.

Table 30 - ibstatus Flags and Options
Flag        Optional / Mandatory    Default (If Not Specified)    Description
-h          Optional                                              Print the help menu
<device>    Optional                All devices                   Print information for the specified device. May specify more than one device
<port
95. 2.3.2.1 mlnxofedinstall Return Codes
Table 2 lists the mlnxofedinstall script return codes and their meanings.

Table 2 - mlnxofedinstall Return Codes
Return Code    Meaning
0              The installation ended successfully
1              The installation failed
2              No firmware was found for the adapter device
22             Invalid parameter
28             Not enough free space
171            Not applicable to this system configuration. This can occur when the required hardware is not present on the system.
172            Prerequisites are not met. For example, the required software is not installed or the hardware is not configured correctly.
173            Failed to start the mst driver

2.3.3 Installation Procedure
Step 1. Log in to the installation machine as root.
Step 2. Mount the ISO image on your machine:
    host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt
Step 3. Run the installation script:
    /mnt/mlnxofedinstall
    Logs dir: /tmp/MLNX_OFED_LINUX-2.1-0.0.9.10740.logs
    This program will install the MLNX_OFED_LINUX package on your machine.
    Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to continue?[y/N]:y
    Uninstalling the previous version of MLNX_OFED_LINUX
    /bin/rpm --nosignature -e --allmatches --nodeps libmverbs libmverbs i686 libmver
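When the installer is driven from a script, the return codes in Table 2 of excerpt 95 can be turned into readable messages with a simple lookup. The following minimal Python sketch encodes only the codes listed in the table; the names are hypothetical, and the fallback message for unlisted codes is an assumption.

```python
# Return codes and meanings as listed in Table 2.
MLNXOFEDINSTALL_RETURN_CODES = {
    0: "The installation ended successfully",
    1: "The installation failed",
    2: "No firmware was found for the adapter device",
    22: "Invalid parameter",
    28: "Not enough free space",
    171: "Not applicable to this system configuration",
    172: "Prerequisites are not met",
    173: "Failed to start the mst driver",
}


def describe_return_code(code: int) -> str:
    """Translate an installer exit status into the meaning from Table 2."""
    # Codes that are not in the table get a generic message (an assumption).
    return MLNXOFEDINSTALL_RETURN_CODES.get(code, f"Unknown return code {code}")


def installation_succeeded(code: int) -> bool:
    """Only return code 0 is listed as a successful installation."""
    return code == 0


for status in (0, 28, 172):
    print(status, "-", describe_return_code(status))
```

In a wrapper script, the status would typically come from the exit code of the process that ran the installer.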
96. 1 9 12 25 29 26 29 27 0

8.6 Quality of Service Management in OpenSM
8.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the -Q or --qos flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows:
- The request is matched against the defined matching rules such that the QoS Level definition is found
- Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level

Figure 4: QoS Manager
[Figure: the Administrator provides a QoS Policy Config File (qos-ulps ... end-qos-ulps) to the QoS Manager in OSM, which manages an InfiniBand Subnet with OFED-based Nodes]

There are two ways to define QoS policy:
- Advanced: the advanced policy file syntax provides the administrator with various ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to enforce various QoS constraints on the requested PR/MPR
- Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs

8.6.2 Advanced Q
97. A.3 Burning the Expansion ROM Image
A.4 Preparing the DHCP Server in Linux Environment
A.5 Subnet Manager - OpenSM
A.6 BIOS Configuration
A.7 Operation
A.8 Diskless Machines
A.9 iSCSI Boot
Appendix B: SRP Target Driver
B.1 Prerequisites and Installation
B.2 How to Run
B.3 How to Unload/Shutdown
Appendix C: mlx4 Module Parameters
C.1 mlx4_ib Parameters
C.2 mlx4_core Parameters
C.3 mlx4_en Parameters
Appendix D: mlx5 Module Parameters
Appendix E: Lustre Compilation over MLNX_OFED

List of Figures
Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards
Figure 2: I/O Consolidation Over InfiniBand
Figure 3: An Example of a Virtual Network
Figure 4: QoS Manager
Figure 5: Example QoS Deployment on InfiniBa
98. 6cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226c 5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
OR
- You can edit /etc/infiniband/openib.conf to load the SRP driver and the SRP High Availability (HA) daemon automatically, that is, set "SRP_LOAD=yes" and "SRPHA_ENABLE=yes"
- To set up and use the HA feature, you need the dm-multipath driver and the multipath tool
- Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable/use the HA feature
The following is an example of an SRP Target setup file:

    *********************** srpt.sh ***********************
    #!/bin/sh
    modprobe scst scst_threads=1
    modprobe scst_vdisk scst_vdisk_ID=100
    echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
    modprobe ib_srpt
    echo "add mgmt" > /proc/scsi_tgt/trace
99. CA manager creates a topology-based collective tree and orchestrates an efficient collective operation using the CPUs in the servers that are part of the collective operation. FCA accelerates MPI collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime. MLNX_OFED v2.0 or later comes with a pre-installed version of FCA v2.x.
FCA is built on the following main principles:
• Topology-aware Orchestration: The MPI collective logical tree is matched to the physical topology. The collective logical tree is constructed to assure maximum utilization of fast inter-core communication and distribution of the results.
• Communication Isolation: Collective communications are isolated from the rest of the traffic in the fabric using a private virtual network (VLane), eliminating contention with other types of traffic.
After MLNX_OFED installation, FCA can be found in the /opt/mellanox/fca folder. For further information on configuration instructions, please refer to the FCA User Manual.
5.5 ScalableUPC
Unified Parallel C (UPC) is an extension of the C programming language designed for high-performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared pa
100. Flag | Optional/Mandatory | Default (If Not Specified) | Description
-C <ca_name> | Optional | | Use the specified channel adapter or router
-P <ca_port> | Optional | | Use the specified port
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
<lid|guid> | Mandatory | | Use the specified port's or node's LID/GUID (with -G option)
<port> | Mandatory without -G flag | | Use the specified port
Examples
1. Check aggregated node counter for LID 0x2:
> ibcheckerrs 2
#warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
#warn: counter LinkRecovers = 255 (threshold 10) lid 2 port 255
#warn: counter LinkDowned = 12 (threshold 10) lid 2 port 255
#warn: counter RcvErrors = 565 (threshold 10) lid 2 port 255
#warn: counter XmtDiscards = 441 (threshold 100) lid 2 port 255
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
2. Check port counters for LID 2 Port 1:
> ibcheckerrs -v 2 1
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK
3. Check the LID 2 Port 1 using the specified threshold file:
> cat thresh1
SymbolErrors 10 LinkRecovers 10 LinkDowned 10 RcvErrors 10 RcvRemotePhysErrors 100 RcvSwRelayErrors 100 XmtDiscards 100 XmtConstraintErrors 100 RcvConstraintErrors 100 LinkIntegrityErrors 10 ExcBufOverrunErrors 10 VL15Dropped 100
> ibcheckerrs -v
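The warnings in Example 1 of the excerpt above can be read as a per-counter comparison against a threshold. The following is a minimal Python sketch of that comparison, not the ibcheckerrs implementation. It assumes a warning is raised when a counter is strictly greater than its threshold (the excerpt does not state the exact comparison), and the function name counters_over_threshold is hypothetical. The counter and threshold values are the ones shown in Example 1.

```python
def counters_over_threshold(counters, thresholds):
    """Return {name: (value, threshold)} for counters above their threshold.

    Assumption: "above" means strictly greater than; counters without a
    configured threshold are ignored.
    """
    return {
        name: (value, thresholds[name])
        for name, value in counters.items()
        if name in thresholds and value > thresholds[name]
    }


# Values taken from Example 1 (LID 2, aggregated port 255).
thresholds = {
    "SymbolErrors": 10,
    "LinkRecovers": 10,
    "LinkDowned": 10,
    "RcvErrors": 10,
    "XmtDiscards": 100,
}
counters = {
    "SymbolErrors": 65535,
    "LinkRecovers": 255,
    "LinkDowned": 12,
    "RcvErrors": 565,
    "XmtDiscards": 441,
}

over = counters_over_threshold(counters, thresholds)
for name, (value, limit) in over.items():
    print(f"counter {name} = {value} (threshold {limit})")
```

With these inputs every listed counter is reported, which matches the FAILED result in Example 1.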
101. E Mellanox TECHNOLOGIES Mellanox Technologies Mellanox Technologies Ltd 350 Oakmead Parkway Suite 100 Beit Mellanox Sunnyvale CA 94085 PO Box 586 Y oknearn 20692 U S A Israel www mellanox com www mellanox com Tel 408 970 3400 Tel 972 0 74 723 7200 Fax 408 970 3403 Fax 972 04 959 3245 Copyright 2014 Mellanox Technologies All Rights Reserved Mellanox Mellanox logo BidgeX ConnectX Connect IB CORE Direct Infini Bridge InfiniHost InfiniScale MetroX MLNX OS PhyX ScalableHPC SwitchX UFM Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies Ltd ExtendX FabricIT Mellanox Open Ethernet Mellanox Virtual Modular Switch MetroDX Unbreakable Link are trademarks of Mellanox Technologies Ltd All other trademarks are property of their respective owners 2 Mellanox Technologies Document Number 2877 m 2 1 1 0 0 Table of Contents Table of Contents 4 24 ir e e o ERE e as Eist of FIGURES oso eb ESSA A ad Lastof Tables 7 23 2 E ce RR ER ti DU Chapter 1 Mellanox OFED Overview oooooocoocmoroocomomomomomo 19 1 1 Introduction to Mellanox OFED 00 ccc ees 19 1 2 Mellanox OFED Package o0oooooooooorrororar s 19 ELL ISO MA owes Sls ee EBORE EAR ERR SOUS NIS 19 1 2 2 Software Components iss os Ree ERU ERE UHR AURA RR 19 1 2 3 A endi desea RNSOARNR R A Aaa 20
102. FED release of the Open Fabrics (www.openfabrics.org) software stack. For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org
1.3.5 MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:
• Open MPI: an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University
Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.
1.3.6 InfiniBand Subnet Manager
All InfiniBand-compliant ULPs require a proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 8, OpenSM Subnet Manager.
1.3.7 Diagnostic Utilities
Mellanox OFED includes the following two diagnostic packages for use by network and data center managers:
• ibutils: Mellanox Technologies diagnostic utilities
• infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools
1.3.8 Mellanox Firmware Tools
The Mellanox Firmware
103. FILTER_PTP_V2_L2_SYNC
802.AS1, Ethernet, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ
PTP v2/802.AS1, any layer, any kind of event packet: HWTSTAMP_FILTER_PTP_V2_EVENT
PTP v2/802.AS1, any layer, Sync packet: HWTSTAMP_FILTER_PTP_V2_SYNC
PTP v2/802.AS1, any layer, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_DELAY_REQ
Note: for receive side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.
4.6.1.2 Getting Time Stamping
Once time stamping is enabled, the time stamp is placed in the socket Ancillary data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached. It can be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the original outgoing packet data, including all headers prepended down to and including the link layer, the scm_timestamping control message, and a sock_extended_err control message with ee_errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket. When time stamping is enabled VLAN
104. ID in hex gt vlr lt file gt r routing u fat_tree o output_path lt directory gt skip lt stage gt skip plugin library name gt pc P counter lt lt PM gt lt value gt gt pm pause time lt seconds gt ber test ber use data ber thresh lt value gt extended speeds lt dev type gt pm per lane 1s 2 5 5 10 14 25 FDR105 lw 1x 4x 8x 12x w write topo file file name gt t topo file file out ibnl dir lt directory gt screen num errs lt num gt smp window lt num gt gmp window lt num gt max hops lt max hops gt V version h help H deep help Mellanox Technologies 191 Rev 2 1 1 0 0 Options 1 device lt dev name gt p port lt port num gt g guid lt GUID in hex vlr file r routing i ass o output path directory skip stage skip plugin library name gt pc P counter lt lt PM gt lt value gt gt pm pause time seconds Eu Specifies the name of the device of the port used to connect to the IB fabric in case of multiple devices on he local system Specifies the local device s port number used to connect to the IB fabric Specifies the local port GUID value of the port used to connect to the IB fabric If GUID given is 0 than ibdiagnet displays a list of possible port GUIDs and waits for user input
105. IMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp as generated by the hardware
SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or fails, then do it in software
SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base
SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in software
SOF_TIMESTAMPING_TX/RX determine how time stamps are generated. SOF_TIMESTAMPING_RAW/SYS determine how they are reported.
To enable time stamping for a net device:
An admin-privileged user can enable or disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values.
Send side time stamping: enabled by ifreq.hwtstamp_config.tx_type, where the possible values for hwtstamp_config->tx_type are:
enum hwtstamp_tx_types {
/* No outgoing packet will need hardware time stamping; should a packet arrive which asks for it, no hardware time stamping will be done. */
HWTSTAMP_TX_OFF,
/* Enables hardware time stamping for outgoing packets; the sender of the packet decides which are to be time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE before sending the packet. */
HWTSTAMP_TX_ON,
/* Enables time stamping fo
106. ING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
--- 11.4.3.176 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2
4.3.6 Bonding IPoIB
To create an interface configuration script for the ibX and bondX interfaces, use the standard syntax for your OS. Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver.
• Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.g. ifcfg-ib0).
• The only meaningful bonding policy in IPoIB is High Availability (bonding mode number 1, or active-backup).
• The bonding parameter fail_over_mac is meaningless in IPoIB interfaces, hence the only supported value is the default: 0 (or "none" in SLES11).
For a persistent bonding IPoIB network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:
• In the bonding master configuration file (e.g. ifcfg-bond0), in addition to Linux bonding semantics, use the follo
107. M is not supported. Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with the relevant values before establishing a connection. Hence, it is recommended to work with RDMA-CM to establish a connection, as it takes care of filling the path record structure. The GID table for each port is populated with N+1 entries, where N is the number of IP addresses that are assigned to all network devices associated with the port, including VLAN devices, alias devices, and bonding masters. The only exception to this rule is a bonding master of a slave in a DOWN state. In that case, a matching GID to the IP address of the master will not be present in the GID table of the slave's port. The first entry in the GID table (at index 0) for each port is always present and equal to the link-local IPv6 address of the net device that is associated with the port. Note that even if the link-local IPv6 address is not set, index 0 is still populated. The GID format can be of 2 types, IPv4 and IPv6. An IPv4 GID is an IPv4-mapped IPv6 address (1), while an IPv6 GID is the IPv6 address itself.
1. For the IPv4 address A.B.C.D, the corresponding IPv4-mapped IPv6 address is ::ffff:A.B.C.D
2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single host ma
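The IPv4-mapped form mentioned in the footnote of the excerpt above can be illustrated with Python's standard ipaddress module. This is only a sketch of the address arithmetic, not of how the driver fills the GID table; the helper name ipv4_to_mapped_gid is hypothetical and the sample address is reused from the ping example elsewhere in this listing.

```python
import ipaddress


def ipv4_to_mapped_gid(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 address ::ffff:A.B.C.D for A.B.C.D."""
    v4 = ipaddress.IPv4Address(ipv4)
    # Low 32 bits hold the IPv4 address; the 16 bits above them are all ones.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))


gid = ipv4_to_mapped_gid("11.4.3.176")
print(gid.exploded)      # full 128-bit form
print(gid.ipv4_mapped)   # recovers the embedded IPv4 address
```

The ipv4_mapped attribute returns None for addresses that are not IPv4-mapped, which gives a simple way to tell the two GID types apart in a script.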
108. Mellanox TECHNOLOGIES Mellanox OFED for Linux User Manual Rev 2 1 1 0 0 Last Updated 18 February 2014 www mellanox com Rev 2 1 1 0 0 NOTE THIS HARDWARE SOFTWARE OR TEST SUITE PRODUCT CPRODUCT S AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES AS IS WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS THE CUSTOMER S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCTO S AND OR THE SYSTEM USING IT THEREFORE MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY ANY EXPRESS OR IMPLIED WARRANTIES INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT INDIRECT SPECIAL EXEMPLARY OR CONSEQUENTIAL DAMAGES OF ANY KIND INCLUDING BUT NOT LIMITED TO PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE DATA OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY WHETHER IN CONTRACT STRICT LIABILITY OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY FROM THE USE OF THE PRODUCT S AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAG
109. N nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them. Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.
8.5.4.2 Activation through OpenSM
• Use the '-R ftree' option to activate the fat-tree algorithm.
LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.
8.5.5 LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology-agnostic, deadlock-free routing within communication networks. When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual l
110. OS/sk_prio to UP mapping is lost. Performing the Raw Ethernet QP mapping forces the QP to transmit using the given UP. If packets with a VLAN tag are transmitted, the UP in the VLAN tag will be overwritten with the given UP.
4.5.6 Map Priorities with tc_wrap.py/mlnx_qos
A network flow that can be managed by QoS attributes is described by a User Priority (UP). A user's sk_prio is mapped to a UP, which in turn is mapped into a TC.
• Indicating the UP
• When the user uses sk_prio, it is mapped into a UP by the 'tc' tool. This is done by the tc_wrap.py tool, which gets a list of up to 16 comma-separated UPs and maps the sk_prio to the specified UP. For example, tc_wrap.py -i eth0 -u 1,5 maps sk_prio 0 of the eth0 device to UP 1 and sk_prio 1 to UP 5.
• Setting set_egress_map in VLAN maps the skb_priority of the VLAN to a vlan_qos. The vlan_qos represents a UP for the VLAN device.
• In RoCE, rdma_set_option with RDMA_OPTION_ID_TOS could be used to set the UP.
• When creating QPs, the sl field in the ibv_modify_qp command represents the UP.
• Indicating the TC
• After mapping the skb_priority to a UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that, mlnx_qos should be used. mlnx_qos gets a list of a mapping between UPs and TCs. For example, mlnx_qos -i
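The positional mapping in the tc_wrap.py example above (the n-th value in the list becomes the UP for sk_prio n) can be sketched in a few lines of Python. This is an illustration of the mapping rule only and does not call tc or tc_wrap.py; the function name sk_prio_to_up is hypothetical, and the 0 to 7 range check on UP values is an assumption based on the 3-bit 802.1p priority field.

```python
from typing import Dict


def sk_prio_to_up(up_list: str) -> Dict[int, int]:
    """Map sk_prio index -> UP from a comma-separated list such as "1,5"."""
    ups = [int(item) for item in up_list.split(",")]
    if len(ups) > 16:
        raise ValueError("at most 16 UP values are accepted")
    for up in ups:
        # Assumption: a UP is a 3-bit 802.1p priority, so 0..7.
        if not 0 <= up <= 7:
            raise ValueError(f"UP out of range: {up}")
    # Position in the list is the sk_prio; the value is the UP.
    return dict(enumerate(ups))


mapping = sk_prio_to_up("1,5")
print(mapping)
```

For the list "1,5" this yields sk_prio 0 mapped to UP 1 and sk_prio 1 mapped to UP 5, as in the example in the text.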
111. Optional Apply query to all ports l Optional Loop ports r Optional Reset the counters after reading them C Optional Use the specified channel adapter or router lt ca_name gt P lt ca_port gt Optional Use the specified port R Optional Reset the counters t Optional Override the default timeout for the solicited lt timeout_ms MADs msec gt V ersion Optional Show version info lt lid guid gt Optional LID or GUID port reset mask Examples perfquery r 32 1 read performance counters and reset perfquery e r 32 1 read extended performance counters and reset perfquery R 0x20 1 reset performance counters of port 1 only perfquery e R 0x20 1 reset extended performance counters of port 1 only perfquery R a 32 reset performance counters of all ports perfquery R 32 2 OxOfff reset only error counters of port 2 perfquery R 32 2 0xf000 reset only non error counters of port 2 1 Read local port s performance counters gt perfquery Port counters Lid 6 port 1 Portoelet e a Re e e SET 1 Connterselec e e meret eee 0x1000 SM id 0 LINKRECOVELS Sega menna e a se e E 0 Sn KDOWIIdsseseee e TUN Uds 0 ROXEPEOES Ue e oet m PUN EIUS 0 REVREMOESPAY SEE M 0 214 Mellanox Technologies m 2 1 1 0 0 REVERSO 0 AME DAS Cac eT 0 MMIC CONUS ranm tiero nS Pris on c o 5060256 0 RCM craneo coo cog ao ame 0 ES EM Ee TAGS Db one oooh e 0 Ix onem D MEME 0 Vo erre a ra 0 Mt e 55178210 ao meee nS
112. Processors BIOS Option Values General Operating Mode Power pro Maximum Performance file Processor C States Disabled Turbo mode Disabled Hyper Threading Disabled Recommended for latency and message rate sensitive applications CPU frequency select Max performance Memory Memory speed Max performance Memory channel mode Independent Node Interleaving Disabled NUMA Channel Interleaving Enabled Thermal Mode Performance 1 Hyper Threading can increase message rate for multi process applications by having more logical cores It might increase the latency of a single process due to lower frequency of a single logical core when hyper threading is enabled 7 1 3 4 AMD Processors The following table displays the recommended BIOS settings in machines with AMD based pro CeSsors Table 19 Recommended BIOS Settings for AMD Processors BIOS Option Values General Operating Mode Power pro Maximum Performance file Processor C States Disabled Turbo mode Disabled HPC Optimizations Enabled CPU frequency select Max performance 130 Mellanox Technologies m 2 1 1 0 0 Table 19 Recommended BIOS Settings for AMD Processors BIOS Option Values Memory Memory speed Max performance Memory channel mode Independent Node Interleaving Disabled NUMA Channel Interleaving Enabled Thermal Mode Performance 7 2 Performance Tuning for Linux You can u
113. R IOV and Para Virtualization on the same setup Step 1 Create a bridge vim etc sysconfig network scripts ifcfg bridge0 DEVICE bridgeO TYPE Bridge IPADDR 12 195 15 1 ETMASK 255 255 0 0 BOOTPROTO static ONBOOT yes M CONTROLLED no DELAY 0 96 Mellanox Technologies m 2 1 1 0 0 Step 2 Change the related interface in the example below bridge is created over eth5 DEVICE eth5 BOOTPROTO none STARTMODE on HWADDR 00 02 c9 2e 66 52 TYPE Ethernet NM_CONTROLLED no ONBOOT yes BRIDGE bridge0 Step 3 Restart the service network Step 4 Attach a virtual NIC to VM ifconfig a eth6 Link encap Ethernet HWaddr 52 54 00 E7 77 99 imer sob II 195 15 5 Becat 1952255925599 Mask255 259 0 0 inet6 addr fe80 5054 ff fee7 7799 64 Scope Link UP BROADCAST RUNNING MULTICAST MTU 1500 Metric 1 RX packets 481 errors 0 dropped 0 overruns 0 frame 0 TX packets 450 errors 0 dropped 0 overruns 0 carrier 0 collisions 0 txqueuelen 1000 RX bytes 22440 21 9 KiB TX bytes 19232 18 7 KiB Interrupt 10 Base address 0xa000 4 13 4 Assigning a Virtual Function to a Virtual Machine This section will describe a mechanism for adding a SR IOV VF to a Virtual Machine 4 13 4 1 Assigning the SR IOV Virtual Function to the Red Hat KVM VM Server Step 1 Run the virt manager Step 2 Double click on the virtual machine and open its Properties Step 3 Go to Details gt Add hardware gt PCI host device wc He Virtual Machine View Send K
114. RP partitions that were mounted.
b. Stop service srpd (kill the SRP daemon instances).
c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
d. Run: multipath -F
3. After Automatic Activation of High Availability
If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
iSCSI Extensions for RDMA (iSER)
iSCSI Extensions for RDMA (iSER) is currently at beta level. Please be aware that the content below is subject to change.
Overview
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies.
iSER Initiator
The iSER initiator is controlled through the iSCSI interface available from the iscsi-initiator-utils package. Make sure iSCSI is enabled and properly configured on your system before proceeding with iSER. Target settings such as timeouts and retries are set the same as for any other iSCSI targets. If targets are set to auto-connect on boot and the targets are unreachable, it may take a long time to continue the boot process if timeouts and max retries are set too high.
Example for discovering and connecting targets over i
115. SER: iscsiadm -m discovery -o new -o old -t st -I iser -p <ip:port> -l
iSER also supports RoCE without any additional configuration required. To bond the RoCE interfaces, set the fail_over_mac option in the bonding driver.
4.3 IP over InfiniBand
4.3.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following capabilities:
• VLAN simulation over an InfiniBand network via child interfaces
• High Availability via Bonding
• Varies MTU values: up to 4k in Datagram mode, up to 64k in Connected mode
• Uses any ConnectX IB ports (one or two)
• Inserts IP/UDP/TCP checksum on outgoing packets
• Calculates checksum on received packets
• Supports net device TSO through ConnectX LSO capability to defragment large datagrams to MTU quanta
• Dual operation mode: datagram and connected
• Large MTU support through connected mode
IPoIB also supports the following software-based enhancements:
• Giant Receive Offload
• NAPI
• Ethtool support
4.3.2 IPoIB Mode Setting
IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Datagram mode, except for the Connect-IB adapter card, which uses IPoIB with Connected mode as default. For better scalability and per
116. SM in testability mode. Without -d, no debug options are enabled. Selo lt li c Display this usage info then exit.
8.2.2 Environment Variables
The following environment variables control opensm behavior:
• OSM_TMP_DIR: Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
• OSM_CACHE_DIR: opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
• guid2lid: stores the LID range assigned to each GUID
8.2.3 Signaling
When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found. Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.
8.2.4 Running opensm
The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes. To run opensm in the default mode, simply enter:
host1# opensm
Note that opensm needs to be run on at least one machine in an IB subnet. By default, an opensm run is logged to two files: /var/log/messages and /var/l
117. SS_ALLOCATE_MR bit as part of the sharing bits.
Usage:
• Turns on, via ibv_reg_mr, one or more of the sharing access bits. The sharing bits are part of the ibv_reg_mr man page.
• Turns on the IBV_ACCESS_ALLOCATE_MR bit.
Step 2: Request to register to a shared MR
A new verb called ibv_reg_shared_mr is added to enable sharing an MR. To use this verb, the application supplies the MR ID that it wants to register for and the desired access mode to that MR. The desired access is validated against its given permissions, and upon successful creation, the physical pages of the original MR are shared by the new MR. Once the MR is shared, it can be used even if the original MR was destroyed. The request to share the MR can be repeated multiple times, and an arbitrary number of Memory Regions can potentially share the same physical memory locations.
Usage:
• Uses the "handle" field that was returned from ibv_reg_mr as the mr_handle.
• Supplies the desired access mode for that MR.
• Supplies the address field, which can be either NULL or any hint as the required output. The address and its length are returned as part of the ibv_mr struct. To achieve high performance, it is highly recommended to supply an address that is aligned the same way as the original memory region address. Generally, it may be aligned to a 4M address.
For further information on how to use the ibv_reg_shared_mr verb, please refer to the ibv_reg_shared_mr man page an
118. Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes the default Service ID 0x00000000010648CA. The following two match rules are equivalent:
rds : <SL>
any, service-id 0x00000000010648CA : <SL>
8.6.6.4 SRP
The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent:
srp, target-port-guid 0x1234 : <SL>
any, target-port-guid 0x1234 : <SL>
Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port GUID only) should be placed at the end of the qos-ulps match rules.
8.6.6.5 MPI
The SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on the MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section.
8.6.7 SL2VL Mapping and VL Arbitration
The OpenSM cached options file has a set of QoS-related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
• Max VLs: the maximum number of VLs that will be on the subnet
• H
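The RDS Service ID layout described above (a fixed 0x000000000106 prefix followed by the 16-bit port number) can be checked with a short Python sketch. The helper name rds_service_id is hypothetical; the prefix and the default port 0x48CA are taken from the text.

```python
# Upper bits of the RDS Service ID; the low 16 bits carry the port (PPPP).
RDS_SERVICE_ID_PREFIX = 0x0106 << 16


def rds_service_id(port: int) -> int:
    """Build the RDS Service ID 0x000000000106PPPP for a TCP/IP port."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return RDS_SERVICE_ID_PREFIX | port


default_id = rds_service_id(0x48CA)
# "#018x" pads to 16 hex digits after the 0x prefix.
print(format(default_id, "#018x"))
```

For the default port this reproduces the Service ID used in the second match rule of the excerpt.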
119. -T thresh1 2 1
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK
9.15 mstflint
Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.
If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.
To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.
Synopsis
mstflint [switches...] <command> [parameters...]
Output Files
Table 36 lists the various switches of the utility, and Table 37 lists its commands.
Table 36 mstflint Switches (Sheet 1 of 3)
Switch | Affected/Relevant Commands | Description
-h | | Print the help menu
-hh | | Print an extended help menu
-d[evice] <device> | All | Specify the device to which the Flash is connected
-guid <GUID> | burn, sg | GUID base value. 4 GUIDs are automatically assigned to the following values: guid -> node GUID; guid+1 -> port1; guid+2 -> port2; guid+3 -> system image GUID. Note: the port2 GUID will be assigned even for a single-port HCA; the HCA ignores this value.
-guids | burn, sg | 4 GUIDs must be speci
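The GUID assignment rule in the Table 36 entry for the guid switch is plain offset arithmetic on the base value. The Python sketch below illustrates that rule only; it does not invoke mstflint, the function name guids_from_base is hypothetical, and the base value is an arbitrary illustrative GUID.

```python
def guids_from_base(base_guid: int) -> dict:
    """Derive the four GUIDs that follow from one base GUID value."""
    return {
        "node": base_guid,              # guid
        "port1": base_guid + 1,         # guid+1
        "port2": base_guid + 2,         # guid+2 (assigned even on single-port HCAs)
        "system_image": base_guid + 3,  # guid+3
    }


# Illustrative base value, not taken from a real device.
guids = guids_from_base(0x0002C90300001038)
for name, value in guids.items():
    print(f"{name:13s} {value:#018x}")
```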
120. TCP timestamps option for better CPU utilization:
sysctl -w net.ipv4.tcp_timestamps=0
Enable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=1
7.2.3 Preserving Your Performance Settings after a Reboot
To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:
<sysctl name1>=<value1>
<sysctl name2>=<value2>
<sysctl name3>=<value3>
<sysctl name4>=<value4>
For example, "Tuning the Network Adapter for Improved IPv4 Traffic Performance" on page 131 lists the following setting to disable the TCP timestamps option:
sysctl -w net.ipv4.tcp_timestamps=0
In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:
net.ipv4.tcp_timestamps=0
7.2.4 Tuning Power Management
Check that the output CPU frequency for each core is equal to the maximum supported and that all core frequencies are consistent.
• Check the maximum supported CPU frequency:
cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
• Check that core frequencies are consistent:
cat /proc/cpuinfo | grep "cpu MHz"
Check that the output frequencies are the same as the maximum supported. If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in the section Recommended BIOS
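The frequency check described above can be automated. The Python sketch below parses the "cpu MHz" lines of /proc/cpuinfo text and compares them with a maximum taken from cpuinfo_max_freq, which the kernel reports in kHz. The function names and the 1 MHz tolerance are assumptions made for illustration, and the sample text is made up so that the block runs without reading the host's files.

```python
def core_frequencies_mhz(cpuinfo_text: str) -> list:
    """Extract the per-core "cpu MHz" values from /proc/cpuinfo text."""
    freqs = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            freqs.append(float(value))
    return freqs


def all_cores_at_max(freqs_mhz, max_freq_khz, tolerance_mhz=1.0):
    """True if every core runs within tolerance of the maximum frequency."""
    max_mhz = max_freq_khz / 1000.0  # cpuinfo_max_freq is in kHz
    return bool(freqs_mhz) and all(
        abs(f - max_mhz) <= tolerance_mhz for f in freqs_mhz
    )


# Made-up sample: core 1 is clocked down, so the check should fail.
sample = (
    "processor\t: 0\ncpu MHz\t\t: 2600.000\n"
    "processor\t: 1\ncpu MHz\t\t: 1200.000\n"
)
freqs = core_frequencies_mhz(sample)
print(freqs, all_cores_at_max(freqs, 2600000))
```

On a real host, the text would come from reading /proc/cpuinfo and the maximum from one of the cpuinfo_max_freq files listed in the text.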
121. The param eters lt startlid gt and lt endlid gt specify the MLID range s lt smlid gt Optional Use lt smlid gt as the target LID for SM SA queries C Optional Use the specified channel adapter or router lt ca_name gt P ca port Optional Use the specified port Mellanox Technologies 207 Rev 2 1 1 0 0 InfiniBand Fabric Diagnostic Utilities Table 32 ibportstate Flags and Options Optional Da Flag Mandato If Not Description ay Specified t Optional Override the default timeout for the solicited lt timeout_ms MADs msec gt lt destdr path Optional Destination s directed path LID or GUID lid guid gt lt startlid gt Optional Starting LID in an MLID range lt endlid gt Optional Ending LID in an MLID range Examples 1 Dump all Lids with valid out ports of the switch with Lid 2 gt ibroute 2 Unicast lids 0x0 0x8 of switch Lid 2 guid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies Lid Out Destination POC Info 0x0002 000 Switch portguid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies 0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies 0x0006 007 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1 0x0007 021 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1 0x0008 008 Channel Adapter portguid 0x0002c902002582cd sw136 HCA 1
122. U. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this.
5.1.1 Mellanox ScalableSHMEM
The ScalableSHMEM programming library is a one-sided communications library that supports a unique set of parallel programming features, including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application. Mellanox ScalableSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED), and also has the ability to utilize Mellanox Messaging libraries (MXM) as well as Mellanox Fabric Collective Accelerations (FCA), providing an unprecedented level of scalability for SHMEM programs running over InfiniBand. The latest ScalableSHMEM software can be downloaded from the Mellanox website.
5.1.2 Running SHMEM with FCA
The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager, and MPI nodes without requiring additional hardware. The FCA manager creates a topology based colle
123.
    BaseVers:........................1
    ClassVers:.......................1
    NodeType:........................Channel Adapter
    NumPorts:........................2
    SystemGuid:......................0x0002c9030000103b
    Guid:............................0x0002c90300001038
    PortGuid:........................0x0002c90300001039
    PartCap:.........................128
    DevId:...........................0x634a
    Revision:........................0x000000a0
    LocalPort:.......................1
    VendorId:........................0x0002c9

9.13 perfquery

Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them or simply reset them.

Synopsis

    perfquery [-h] [-d] [-G] ... [-P <ca_port>] ... [-t <timeout_ms>] [-V] [<lid|guid> [[port] [reset_mask]]]

(Part of the synopsis is unreadable in the source and is shown as "...".)

Output Files

Table 34 lists the various flags of the command.

Table 34 - perfquery Flags and Options

    Flag         Optional/Mandatory   Default If Not Specified   Description
    -h (help)    Optional                                        Print the help menu
    -d(ebug)     Optional                                        Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
    -G(uid)      Optional                                        Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023 a
124. ackets that contain the above destination IP address and source port are to be steered into rx-ring 2. When the destination MAC is not given, the user's destination MAC is filled in automatically.

• ethtool -u eth5
  Shows all of ethtool's steering rules.

When configuring two rules with the same priority, the second rule overwrites the first one, so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in kernel (v2.6.28).

MLX4 Driver Support

The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure. The following are the flow specific parameters:

Table 5 - Flow Specific Parameters

    Columns:    ether; tcp4/udp4; ip4
    Mandatory:  dst; src-ip/dst-ip
    Optional:   vlan; src-ip, dst-ip, src-port, dst-port, vlan; src-ip, dst-ip, vlan

• RFS
  RFS is in-kernel logic responsible for load balancing between CPUs by attaching flows to the CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic by implementing ndo_rx_flow_steer, which, in turn, calls the underlying flow steering mechanism with the RFS domain. Enabling RFS requires enabling the ntuple flag via ethtool. For example, to enable ntuple for eth0, run: ethtoo
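The "effectively a table" behaviour of the steering rules described in this excerpt can be pictured with a small model. This is only an illustrative Python sketch, not driver or ethtool code: the class name `SteeringTable`, its methods, and the sample rule fields (addresses, port, ring number) are all hypothetical.

```python
class SteeringTable:
    """Toy model of a rule table with at most one rule per priority slot."""

    def __init__(self):
        self._rules = {}  # priority -> rule (hypothetical dict of match fields)

    def insert(self, priority, rule):
        """Store `rule` at `priority`; return the rule it replaced, if any."""
        replaced = self._rules.get(priority)
        self._rules[priority] = rule
        return replaced

    def rules(self):
        """Return the installed rules ordered by priority."""
        return [self._rules[p] for p in sorted(self._rules)]


if __name__ == "__main__":
    table = SteeringTable()
    first = {"dst_ip": "192.0.2.1", "src_port": 5001, "rx_ring": 2}
    second = {"dst_ip": "192.0.2.9", "src_port": 6001, "rx_ring": 3}
    table.insert(1, first)
    overwritten = table.insert(1, second)  # same priority: replaces `first`
    print(overwritten == first, table.rules())
```

Modelling the rules as a mapping keyed by priority makes the overwrite semantics explicit: inserting at an occupied slot replaces the earlier rule rather than adding a second one.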
125. ags and Options
Table 37: smpquery Flags and Options
Table 38: perfquery Flags and Options
Table 39: ibcheckerrs Flags and Options
Table 40: mstflint Switches
Table 41: mstflint Commands

Document Revision History

Table 1 - Document Revision History

Release 2.1-1.0.0, February 18, 2014
Updated the following sections:
    • Section 2.3.3, "Installation Procedure," on page 32
    • Section 4.13.2, "Setting Up SR-IOV," on page 92
    • Section 8.6.1, "Overview," on page 168

Release 2.1-1.0.0, December 2013
Added the following sections:
    • Section 2.3.6, "Installation Logging," on page 41
    • Section 4.6.2, "RoCE Time Stamping," on page 79, and its subsections
    • Section 4.17, "PeerDirect," on page 107
    • Section 4.18, "Inline-Receive," on page 108
    • Section 4.19, "Ethernet Performance Counters," on page 109
    • Section 4.20, "Memory Window," on page 113
    • Section 4.1.2.1.1, "SRP Module Parameters," on page 42
    • Section 4.1.2.1.2, "SRP Remote Ports Parameters," on page 42
    • Section 4.1.2.2.1, "SRP sysfs Parameters," on page 43
    • Section "srpd," on page 46
    • Section 4.6.1.3, "Querying Time Stamping Capabilities via ethtool," on page 79
Updated the following sections:
126. all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:

    ibsrpdm -d <umad device>

2. Assistance in creating an SRP connection

To generate output suitable for utilization in the "echo" command of Section 4.1.2.2, add the "-c" option to ibsrpdm:

    ibsrpdm -c

Sample output:

    id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1

To establish a connection with an SRP Target using the output from the "ibsrpdm -c" example above, execute the following command:

    echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target

The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the "fdisk -l" command.

3. Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than by just running /sys/class/infiniband_mad/umad<N>, where <N> is a digit

srpd

The srpd service script allows automatic activation and termination of the srp_daemon utility on all system live InfiniBand ports.

srp_daemon

The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemo
127. an ager to allow it to configure the AR on this switch. This option can be changed on-the-fly.

    Option: AR_MODE <bounded|free>
    Description: Adaptive Routing Mode:
        free - no constraints on output port selection.
        bounded - the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
        This option can be changed on-the-fly.
    Values: Default: bounded

    Option: AGEING_TIME <usec>
    Description: Applicable to bounded AR mode only. Specifies how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value). This option can be changed on-the-fly.
    Values: Default: 30

    Option: MAX_ERRORS <N>, ERROR_WINDOW <N>
    Description: When the number of errors exceeds MAX_ERRORS of send/receive errors or timeouts in less than ERROR_WINDOW seconds, the AR Manager will abort, returning control back to the Subnet Manager. This option can be changed on-the-fly.
    Values: Values for both options: [0-0xffff]
        MAX_ERRORS = 0: zero tolerance, abort configuration on first error
        ERROR_WINDOW = 0: mechanism disabled, no error checking
        Default: 5

    Option: LOG_FILE <full path>
    Description: AR Manager log file. This option can be changed on-the-fly.
    Values: Default: /var/log/armgr.log

    Option: LOG_SIZE <size in MB>
    Description: This option defines maximal AR Manager log file size in MB. The logfile will be truncated and
    Values: 0: unlimited log file size
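The interaction of MAX_ERRORS and ERROR_WINDOW described above can be sketched as a sliding-window counter. The Python below is a minimal illustration under stated assumptions, not the AR Manager's actual implementation: the class name `ErrorWindow`, the method `record_error`, and the reading that the single "Default 5" applies to both options are assumptions.

```python
from collections import deque


class ErrorWindow:
    """Sliding-window error counter modelled on MAX_ERRORS / ERROR_WINDOW."""

    def __init__(self, max_errors=5, error_window=5):
        self.max_errors = max_errors      # assumed default: 5
        self.error_window = error_window  # seconds; assumed default: 5
        self._stamps = deque()            # timestamps of recent errors

    def record_error(self, now):
        """Register one error at time `now` (seconds).

        Returns True when the manager should abort and hand control back.
        """
        if self.error_window == 0:
            return False  # ERROR_WINDOW = 0: mechanism disabled
        self._stamps.append(now)
        # Forget errors that are no longer inside the window.
        while now - self._stamps[0] >= self.error_window:
            self._stamps.popleft()
        # MAX_ERRORS = 0 therefore aborts on the very first error.
        return len(self._stamps) > self.max_errors


if __name__ == "__main__":
    window = ErrorWindow(max_errors=2, error_window=10)
    print([window.record_error(t) for t in (0, 1, 2)])
```

With `max_errors=2` and a 10-second window, the third error inside the window is the one that exceeds the limit; errors spaced further apart than the window never accumulate.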
128. and device 'mlx4_0' port 1 status:
        default gid:    fe80:0000:0000:0000:0000:0000:9289:3895
        base lid:       0x3
        sm lid:         0x3
        state:          2: INIT
        phys state:     5: LinkUp
        rate:           20 Gb/sec (4X DDR)

    > ibportstate -C mlx4_0 3 1 query
    PortInfo:
    # Port info: Lid 3 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
    LinkSpeedActive:.................5.0 Gbps

2. Query the status of two channel adapters using directed paths:

    > ibportstate -C mlx4_0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
    LinkSpeedActive:.................5.0 Gbps

    > ibportstate -C mthca0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:.......................Down
    PhysLinkState:...................Polling
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps
    LinkSpeedEnabled:................2.5 Gbps
129. Appendix E: Lustre Compilation over MLNX_OFED

This procedure applies to RHEL/SLES OSs only.

To compile Lustre version 2.3.65 and higher:

    $ ./configure --with-o2ib=/usr/src/ofa_kernel/default/
    $ make rpms

To compile older Lustre versions:

    $ EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" ./configure --with-o2ib=/usr/src/ofa_kernel/default/
    $ EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" make rpms

For Lustre 2.1.3, due to a duplicate definition of the INVALID_UID macro, the following patch must be applied:

    --- lustre-2.1.3/lustre/include/lustre_cfg.h 2012-09-17 14:26:46.000000000 +0200
    +++ lustre-2.1.3/lustre/include/lustre_cfg.h.new 2013-09-07 10:45:07.121772824 +0200
    @@ -288,7 +288,9 @@
     #include <lustre/lustre_user.h>
    +#ifndef INVALID_UID
     #define INVALID_UID (-1)
    +#endif
130. at path. Therefore, the path cannot be predicted as it may change.

ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in the "-p" option. Otherwise, an error is reported.

When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error.

Moreover, the tool allows omitting the source node in LID-route addressing, in which case the local port on the machine running the tool is assumed to be the source.

Synopsis

    ibdiagpath {-n <[src-name,]dst-name> | -l <[src-lid,]dst-lid> | -d <...>} ... [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-o <out-dir>] ... [-P <<PM counter>=<Trash Limit>>]

(Parts of the synopsis are unreadable in the source and are shown as "...".)

Options

    -n <[src-name,]dst-name>   Names of the source and destination ports (as defined in the topology file; source may be omitted -> local port is assumed to be the source)
    l lt isiee
131. ayers in such a way as to avoid deadlock.

Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link from HCA between and switch does not need virtual layers as deadlock will not arise between switch and HCA.

In more detail, the algorithm works as follows:

1. LASH determines the shortest path between all pairs of source/destination switches. Note that LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.

2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.

3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.

Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available.

In general, LASH is a very flexible algorithm. It ca
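Step 2 of the description above (assign each route to the first layer whose channel dependency graph stays acyclic, otherwise open a new layer) can be sketched in a few lines of Python. This is a simplified illustration, not the OpenSM implementation: the function names, the representation of a route as a list of switch names, and the omission of the balancing pass from step 3 are all assumptions made for the example.

```python
def route_dependencies(path):
    """Return channel-to-channel dependencies created by one route.

    A channel is a directed link (u, v); a route that uses channel c1 and
    then channel c2 makes c2 depend on c1.
    """
    channels = list(zip(path, path[1:]))
    return set(zip(channels, channels[1:]))


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    unvisited, in_progress, done = 0, 1, 2
    state = dict.fromkeys(graph, unvisited)

    def visit(node):
        state[node] = in_progress
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                return True  # back edge: cycle found
            if state[nxt] == unvisited and visit(nxt):
                return True
        state[node] = done
        return False

    return any(state[n] == unvisited and visit(n) for n in graph)


def assign_layers(routes):
    """Assign each route to the first layer that stays deadlock free."""
    layers = []      # one set of dependency edges per layer
    assignment = []  # layer index chosen for each route, in input order
    for path in routes:
        deps = route_dependencies(path)
        for index, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps
                assignment.append(index)
                break
        else:
            layers.append(set(deps))  # open a new layer
            assignment.append(len(layers) - 1)
    return assignment


if __name__ == "__main__":
    ring = [["A", "B", "C"], ["B", "C", "D"], ["C", "D", "A"], ["D", "A", "B"]]
    print(assign_layers(ring))
```

On a four-switch ring with every route travelling two hops clockwise, the first three routes share a layer and the fourth would close the dependency cycle, so it is placed in a second layer.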
132. band/iov/0000:02:00.3/port/<port_num>/gid_idx/0

The value returned will present which guid index to modify on Dom0.

Step 2. Modify the physical GUID table via the admin_guids sysfs interface. To configure the GUID at index <n> on port <port_num>:

    cd /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids
    echo <your desired guid> > n

Example:

    cd /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids
    echo "0x002fffff8118" > 3

Note:
    • echo "0x0" means let the SM assign a value to that GUID
    • echo "0xffffffffffffffff" means delete that GUID
    • echo <any other value> means request the SM to assign this GUID to this index

Step 3. Read the administrative status of the GUID index. To read the administrative status of GUID index m on port n:

    cat /sys/class/infiniband/mlx4_0/iov/ports/<n>/admin_guids/<m>

Step 4. Check the operational state of a GUID:

    /sys/class/infiniband/mlx4_0/iov/ports/<n>/gids (where n = 1 or 2)

The values indicate what gids are actually configured on the firmware/hardware, and all the entries are R/O.

Step 5. Compare the value you read under the "admin_guids" directory at that index with the value under the "gids" directory, to verify the change requested in Step 3 has been accepted by the SM and programmed into the hardware port GID table.

If the value under admin_guids/<m> is different than the value under gids/<m>, the request is st
133. bs devel libmverbs devel i686 libmqe libmge i686 libmqe devel libmqge devel i686 Sta Ins ns GH sg ns Sl leah rasp ns GH sg ns Sh dial rar ns olo Tea ray ns iS ral rar ns il teal den k Ins D mpi Preparing lnx ofa kernel reparing mod mlnx ofa kernel reparing Inx ofa kernel devel reparing mod kernel mft mlnx reparing nem mlnx talling kmod knem mlnx 1 1 90mlnx2 RPM reparing mod knem mlnx reparing mmunotify mlnx talling kmod ummunotify mlnx 1 reparing mod ummunotify mlnx reparing selector talling mlnx ofa kernel RPM talling kmod minx ofa kernel talling mlnx ofa kernel devel talling kmod kernel mft mlnx talling knem mlnx RPM talling ummunotify mlnx RPM talling mpi selector RPM rting MLNX OFED LINUX 2 1 0 0 9 installa COM gon 32 Mellanox Technologies Insta Prepa fed repa libib Prepa ibib repa ibib repa ibib repa ibib repa ibib repa ibib repa ibm rep ibm rep ibm rep ibm rep ibm rep ibm rep ibm rep ibm repa ibcx repa ibcx repa ibcx repa ibcx repa libex Oo D
134. can share the device without interfering with each other.

mlx4_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

mlx4_en
A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer.

1.3.2 mlx5 Driver

mlx5 is the low-level driver implementation for the Connect-IB adapters designed by Mellanox Technologies. Connect-IB operates as an InfiniBand adapter. The mlx5 driver comprises the following kernel modules:

mlx5_core
Acts as a library of common functions (e.g. initializing the device after reset) required by the Connect-IB adapter card.

mlx5_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

libmlx5
libmlx5 is the provider library that implements hardware-specific user-space functionality. If the firmware and the driver are not compatible, the driver will not load and a message will be printed in the dmesg.

The following are the libmlx5 environment variables:

• MLX5_FREEZE_ON_ERROR_CQE
  Causes the process to hang in a loop when a completion with error, which is not flushed with error or retry exceeded, occurs. Otherwise disabled.

• MLX5_POST_SEND_PREFER_BF
  Configures every work request that can use blue flame to use blue flame. Otherwise, blue flame depends on the
135. carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.

3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.

4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.

5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE and Lifetime.

6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection provide the CMA with the target IP and port number. ULPs might also provide a QoS-Class. The CMA then creates a Service-ID for the ULP and passes this ID and the optional QoS-Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.

PathRecord and MultiPathRecord Enhancement for QoS

As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service-ID, which is a 64-bit value. A new field, QoS-Class, is also provided. A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for existing access layer and ULPs by not introducing a new set of PR/MPR attributes.

4.4.3 Supported Policy

The QoS poli
136. check the release notes
• ib_ipoib.ko

A.8.1.1 Example: Adding an IB Driver to initrd (Linux)

Prerequisites

1. The FlexBoot image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine.
3. An initrd file.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.

Adding the IB Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.

Step 2. Make a new working directory and change to it:

    host1$ mkdir /tmp/initrd_ib
    host1$ cd /tmp/initrd_ib

Step 3. Normally, the initrd image is zipped. Extract it using the following command:

    host1$ gzip -dc <initrd image> | cpio -id

The initrd files should now be found under /tmp/initrd_ib.

Step 4. Create a directory for the InfiniBand modules and copy them:

    host1$ mkdir -p /tmp/initrd_ib/lib/modules/ib
    host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
137. chine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.

2.1 Hardware and Software Requirements

Table 1 - Software and Hardware Requirements

    Requirement            Description
    Platforms              A server platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
                             • MT27508 ConnectX-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
                             • MT4113 Connect-IB (IB) (firmware: fw-Connect-IB)
                           For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file.
    Required Disk Space    1GB for Installation
    Device ID              For the latest list of device IDs, please visit the Mellanox website.
    Operating System       Linux operating system. For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
    Installer Privileges   The installation requires administrator privileges on the target machine.

2.2 Downloading Mellanox OFED

Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display. The following example shows a system with an installed Mellanox HCA:

    # lspci -v | grep Mellanox
    06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
            Subsystem: Mellanox Technologies Device 0024

Step 2. Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<
138. Table 10 - Port Pause (where <i> is in the range 0...7)

    Counter                               Description
    tx_pause_transition_prio_<i>          The number of transmitter transitions from XON state (paused) to XOFF state (non-paused)

Table 11 - VPort Statistics (where <i> = <empty_string> is the PF, and ranges 1...NumOfVf per VF)

    Counter                               Description
    vport<i>_rx_unicast_packets           Unicast packets received successfully
    vport<i>_rx_unicast_bytes             Unicast packet bytes received successfully
    vport<i>_rx_multicast_packets         Multicast packets received successfully
    vport<i>_rx_multicast_bytes           Multicast packet bytes received successfully
    vport<i>_rx_broadcast_packets         Broadcast packets received successfully
    vport<i>_rx_broadcast_bytes           Broadcast packet bytes received successfully
    vport<i>_rx_dropped                   Received packets discarded due to out-of-buffer condition
    vport<i>_rx_errors                    Received packets discarded due to receive error condition
    vport<i>_tx_unicast_packets           Unicast packets sent successfully
    vport<i>_tx_unicast_bytes             Unicast packet bytes sent successfully
    vport<i>_tx_multicast_packets         Multicast packets sent successfully
    vport<i>_tx_multicast_bytes           Multicast packet bytes sent successfully
    vport<i>_tx_broadcast_packets         Broadcast packets
139. 8.5.6 DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.

Use the '-R dor' option to activate the DOR algorithm.

8.5.7 Torus-2QoS Routing Algorithm

Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:

    • Free of credit loops routing
    • Two levels of QoS
140. ctX3-rel.mlx -dev /dev/mst/mt4099_pci_cr0 -conf MCX341A-XCG_Ax.ini

Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist, otherwise delete its contents.

Insert an "option" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical function driver (probe_vf). For example:

    options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1

Note: If SR-IOV is supported, to enable SR-IOV (if it is not enabled), it is sufficient to set "sriov_en = true" in the INI. If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.com

    Parameter: num_vfs
    Recommended Value:
        • If absent, or zero: no VFs will be available.
        • If its value is a single number in the range of 0-63: the driver will enable the num_vfs VFs on the HCA and this will be applied to all ConnectX HCAs on the host.
        • If its format is a string: the string specifies the num_vfs parameter separately per installed HCA. The string format is "bb:dd.f-v,bb:dd.f-v,...", where:
            bb:dd.f = bus:device.function of the PF of the HCA
            v = number of VFs to enable for that HCA
        For example:
        • num_vfs=5: the driver will enable 5 VFs on the HCA and this will be applied to all ConnectX HCAs on the host.
        • num_vfs=00:04.0-5,00:07.0-8: the driver will enable 5 VFs on the HCA positioned in BDF 00:04.0 and 8 on
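The two accepted forms of num_vfs (a single number, or a per-HCA string) can be illustrated with a small parser. This Python sketch is only a reading aid, not the driver's parsing code: the function name `parse_num_vfs`, the use of "*" to mean "all HCAs", and the exact separators (":", ".", "-", ","), which are reconstructed from the flattened source text, are assumptions.

```python
def parse_num_vfs(value):
    """Parse a num_vfs module-parameter value into {PF address: VF count}.

    A plain number applies to every HCA and is returned under the
    hypothetical key "*". A string such as "00:04.0-5,00:07.0-8" gives one
    "bus:device.function-count" entry per installed HCA.
    """
    value = value.strip()
    if not value:
        return {}  # parameter absent: no VFs
    if value.isdigit():
        return {"*": int(value)}
    result = {}
    for item in value.split(","):
        # Split on the last "-" so the BDF part is kept intact.
        bdf, _, count = item.strip().rpartition("-")
        if not bdf or not count.isdigit():
            raise ValueError(f"malformed num_vfs entry: {item!r}")
        result[bdf] = int(count)
    return result


if __name__ == "__main__":
    print(parse_num_vfs("5"))
    print(parse_num_vfs("00:04.0-5,00:07.0-8"))
```

Keeping the two forms in one return type makes it easy to check, for example, that no per-HCA count exceeds the 0-63 range quoted for the single-number form.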
141. ctive tree and orchestrates an efficient collective operation using the switch-based CPUs on the MPI/ScalableSHMEM nodes. FCA accelerates MPI/ScalableSHMEM collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.

Note: FCA is disabled by default and must be configured prior to using it from the ScalableSHMEM.

To enable FCA by default in the ScalableSHMEM:

1. Edit the /opt/mellanox/openshmem/2.2/etc/openmpi-mca-params.conf file.
2. Set the scoll_fca_enable parameter to 1:
    scoll_fca_enable=1
3. Set the scoll_fca_np parameter to 0:
    scoll_fca_np=0

To enable FCA in the shmemrun command line, add the following:

    -mca scoll_fca_enable 1 -mca scoll_fca_enable_np 0

To disable FCA:

    -mca scoll_fca_enable 0 -mca coll_fca_enable 0

For more details on FCA installation and configuration, please refer to the FCA User Manual found on the Mellanox website.

5.1.3 Running ScalableSHMEM with MXM

The MellanoX Messaging (MXM) library provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:

    • Multiple transport support including RC, XRC and UD
    • Proper management of HCA resources and memory structures
    • Efficient memory reg
142. cy, which is specified in a stand-alone file, is divided into the following four sub-sections:

I. Port Group
A set of CAs, Routers or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.

II. Fabric Setup
Defines how the SL2VL and VLArb tables should be set up.
Note: In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).

III. QoS-Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and, optionally, Max MTU, Max Rate, Packet Lifetime and Path Bits.
Note: Path Bits are not implemented in OFED.

IV. Matching Rules
A list of rules that match an incoming PR/MPR request to a QoS-Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions, which should all match for the rule to apply. The matching expressions are defined for the following fields:
    • SRC and DST to lists of port groups
    • Service-ID to a list of Service-ID values or ranges
    • QoS-Class to a list of QoS-Class values or ranges

4.4.4 CMA Features

The CMA interface supports Service-ID through the notion of port space as a prefix to the port number, which is
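The matching-rule semantics in sub-section IV above (rules evaluated in order, first match wins, all expressions in a rule must match) can be sketched as follows. This Python is illustrative only and is not OpenSM's policy parser: the `Rule` fields, the request dictionary keys, and the choice that an omitted expression matches anything are assumptions made for the example.

```python
from dataclasses import dataclass


def in_ranges(value, ranges):
    """True if `value` falls inside any inclusive (low, high) range."""
    return any(low <= value <= high for low, high in ranges)


@dataclass
class Rule:
    qos_level: str
    source_groups: frozenset = frozenset()       # empty: expression not used
    destination_groups: frozenset = frozenset()
    service_ids: tuple = ()                      # inclusive (low, high) ranges
    qos_classes: tuple = ()                      # inclusive (low, high) ranges

    def matches(self, request):
        """All expressions present in the rule must match the request."""
        checks = []
        if self.source_groups:
            checks.append(request["src_group"] in self.source_groups)
        if self.destination_groups:
            checks.append(request["dst_group"] in self.destination_groups)
        if self.service_ids:
            checks.append(in_ranges(request["service_id"], self.service_ids))
        if self.qos_classes:
            checks.append(in_ranges(request["qos_class"], self.qos_classes))
        return all(checks)


def first_match(rules, request):
    """Return the QoS level of the first matching rule, or None."""
    for rule in rules:
        if rule.matches(request):
            return rule.qos_level
    return None


if __name__ == "__main__":
    rules = [
        Rule("gold", service_ids=((0x100, 0x1FF),)),
        Rule("silver", source_groups=frozenset({"storage"})),
        Rule("default"),
    ]
    request = {"src_group": "storage", "dst_group": "compute",
               "service_id": 0x150, "qos_class": 0}
    print(first_match(rules, request))
```

Because evaluation stops at the first hit, rule order matters: placing the catch-all rule first would shadow every more specific rule after it.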
143. cy issue, please follow the instructions below:

1. Edit the /boot/grub/grub.conf file or any other bootloader configuration file.
2. Add the following kernel parameters to the bootloader command:
    intel_idle.max_cstate=0 processor.max_cstate=1
3. Reboot the system.

Example:

    title RH6.2x64
    root (hd0,0)
    kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1

7.2.5 Interrupt Moderation

Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting:

    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: on  TX: off
    ...
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 16
    rx-frames: 88
    rx-usecs-irq: 0
    rx-frames-irq: 0
    ...
    > ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0
    > ethtool -c eth1
    Coalesce paramet
144. d while they are active. This results in a smaller memory footprint, less overhead to set connections, and higher on-chip cache utilization, and hence increased performance. DCT is supported only in mlx5 and is at beta level.

4.17 PeerDirect

PeerDirect uses an API between IB CORE and peer memory clients (e.g. GPU cards) to provide an HCA with access to read/write peer memory for data buffers. As a result, it allows RDMA-based (over InfiniBand/RoCE) applications to use the peer device's computing power and the RDMA interconnect at the same time, without copying the data between the P2P devices. For example, PeerDirect is being used for GPUDirect RDMA. A detailed description of that API exists under the MLNX_OFED installation; please see docs/readme_and_user_manual/PEER_MEMORY_API.txt.

4.18 Inline-Receive

When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE. Using Inline-Receive saves a PCIe read transaction, since the HCA does not need to read the scatter list; therefore it improves performance in the case of short receive messages. On poll CQ, the driver copies the received data from the WQE/CQE to the user's buffers. Therefore, apart from querying the Inline-Receive capability and Inline-Receive activation, the feature is transparent to the user application.

Note: When Inline-Receive is active, the user application must provide a valid virtual address for the
145. d or to the ibv_shared_mr sample program, which demonstrates a basic usage of this verb. Further information on the ibv_shared_mr sample program can be found in the ibv_shared_mr man page.

4.11 XRC - eXtended Reliable Connected Transport Service for InfiniBand

XRC allows significant savings in the number of QPs and the associated memory resources required to establish all-to-all process connectivity in large clusters. It significantly improves the scalability of the solution for large clusters of multicore end-nodes by reducing the required resources. For further details, please refer to the "Annex A14 Supplement to InfiniBand Architecture Specification Volume 1.2.1".

A new API can be used by user-space applications to work with the XRC transport. The legacy API is currently supported in both binary and source modes; however, it is deprecated. Thus we recommend using the new API. The new verbs to be used are:

    • ibv_open_xrcd / ibv_close_xrcd
    • ibv_create_srq_ex
    • ibv_get_srq_num
    • ibv_create_qp_ex
    • ibv_open_qp

Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their appropriate man pages.

4.12 Flow Steering

Note: Flow Steering is applicable to the mlx4 driver only.

Flow steering is a new model which steers network flows, based on flow specifications, to specific QPs. Those flows can b
146. ded even though no errors had been detected to prevent their being deliverable to a higher-layer protocol

Table 7 - Port IN Counters

    Counter                      Description
    rx_length_errors             Number of received frames that were dropped due to an error in frame length
    rx_over_errors               Number of received frames that were dropped due to overflow
    rx_crc_errors                Number of received frames with a bad CRC that are not runts, jabbers, or alignment errors
    rx_jabbers                   Number of received frames with a length greater than MTU octets and a bad CRC
    rx_in_range_length_error     Number of received frames with a length/type field value in the (decimal) range [1500:46] (42 is also counted for VLAN-tagged frames)
    rx_out_range_length_error    Number of received frames with a length/type field value in the (decimal) range [1535:1501]
    rx_lt_64_bytes_packets       Number of received 64-or-less-octet frames
    rx_127_bytes_packets         Number of received 65-to-127-octet frames
    rx_255_bytes_packets         Number of received 128-to-255-octet frames
    rx_511_bytes_packets         Number of received 256-to-511-octet frames
    rx_1023_bytes_packets        Number of received 512-to-1023-octet frames
    rx_1518_bytes_packets        Number of received 1024-to-1518-octet frames
    rx_1522_bytes_packets        Number of received 1519-to-1522-octet frames
    rx_1548_bytes_packets        Number
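The size-bucket counters in the table above partition frames by length. As a reading aid, the following Python sketch maps a frame length to the counter that would be incremented. It is not driver code: the counter names are reconstructed from the garbled source text, the function name `size_counter` is hypothetical, and only the buckets whose ranges are fully legible in this excerpt are included.

```python
import bisect

# Inclusive upper bound of each bucket, in octets, as listed in the table.
_UPPER_BOUNDS = [64, 127, 255, 511, 1023, 1518, 1522]

# Counter names as reconstructed from the source text (assumed spelling).
_COUNTERS = [
    "rx_lt_64_bytes_packets",
    "rx_127_bytes_packets",
    "rx_255_bytes_packets",
    "rx_511_bytes_packets",
    "rx_1023_bytes_packets",
    "rx_1518_bytes_packets",
    "rx_1522_bytes_packets",
]


def size_counter(frame_len):
    """Return the size-bucket counter for a frame of `frame_len` octets.

    Returns None for lengths beyond the buckets covered by this sketch.
    """
    index = bisect.bisect_left(_UPPER_BOUNDS, frame_len)
    return _COUNTERS[index] if index < len(_COUNTERS) else None


if __name__ == "__main__":
    for length in (60, 64, 65, 1500, 1522, 1600):
        print(length, size_counter(length))
```

`bisect_left` picks the first bucket whose upper bound is greater than or equal to the frame length, which matches the inclusive ranges in the table (for example, 64 falls in the first bucket and 65 in the second).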
147. ding to the provided flags

ibdiagnet.pm - A dump of the Performance Counters values of the fabric links

Error Codes

    1 - The path traced is un-healthy
    2 - Failed to parse command line options
    3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
    4 - Unable to traverse the LFT data from source to destination
    5 - Failed to use Topology File
    6 - Failed to load required Package

9.6 ibv_devices

Lists InfiniBand devices available for use from userspace, including node GUIDs.

Synopsis

    ibv_devices

Examples

1. List the names of all available InfiniBand devices:

    > ibv_devices
        device                 node GUID
        ------              ----------------
        mthca0              0002c9000101d150
        mlx4_0              0000000000073895

9.7 ibv_devinfo

Queries InfiniBand devices and prints the information about them that is available for use from userspace.

Synopsis

    ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]

Output Files

Table 29 lists the various flags of the command.

Table 29 - ibv_devinfo Flags and Options

    Flag                               Optional/Mandatory   Default If Not Specified   Description
    -d <device> (--ib-dev=<device>)    Optional             First found device         Run the command for the provided IB device 'device'
    -i <port> (--ib-port=<port>)       Optional             All device ports           Query the specified device port <port>
148. dresses via ip link as above.
Spoof checking
Spoof checking is currently available only on upstream kernels newer than 3.1.
ip link set dev <PF device> vf <NUM> spoofchk [on | off]

4.13.7.3.3 RoCE Support
RoCE is supported on Virtual Functions, and VLANs may be used with it. For RoCE, the hypervisor GID table has 16 entries, while the VFs share the remaining 112 entries. When the number of VFs is larger than 56, some of them will have a GID table with only a single entry, which is inadequate if the VF's Ethernet device is assigned an IP address. When setting num_vfs in the mlx4_core module parameter, it is important to check that the number of IP addresses assigned per VF does not exceed the limit for the GID table size.

4.14 CORE-Direct
4.14.4 CORE-Direct Overview
CORE-Direct provides a solution for offloading the MPI collectives operations from the software library to the network. CORE-Direct accelerates MPI applications and solves the scalability issues in large-scale systems by eliminating the issues of operating system noise and jitter. It addresses the collectives communication scalability problem by offloading a sequence of data-dependent communications to the Host Channel Adapter (HCA). This solution provides the hooks needed to support computation and communication overlap. Additionally, it provides a means to reduce the effects of system noise and applicatio
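The GID table limit in the RoCE Support paragraph of this excerpt can be checked with a little arithmetic before choosing num_vfs. The sketch below assumes the 112 shared entries are split evenly among the VFs; the text does not say how the driver distributes them, so the even split and the helper name min_gid_entries_per_vf are illustrative assumptions, not driver behavior.

```shell
#!/bin/sh
# Entries shared by all VFs, per the text (the hypervisor keeps 16 of its own).
VF_GID_POOL=112

# Hypothetical helper: smallest GID table any VF would get,
# ASSUMING the shared pool is divided evenly among the VFs.
min_gid_entries_per_vf() {
    num_vfs=$1
    if [ "$num_vfs" -lt 1 ]; then
        echo "num_vfs must be at least 1" >&2
        return 1
    fi
    echo $(( VF_GID_POOL / num_vfs ))
}

# Compare a planned VF count against the IP addresses planned per VF.
for n in 8 56 63; do
    echo "num_vfs=$n -> at least $(min_gid_entries_per_vf "$n") GID entries per VF"
done
```

Under this assumption, 56 VFs still leave two entries each, while 63 VFs leave some VFs with a single entry, which is the case the text warns about.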
149. drivers to allocate contiguous memory for it as part of ibv_reg_mr.
Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers to the Contiguous Pages. To activate, set the environment variables below to PREFER_CONTIG or CONTIG:
- For QP: MLX_QP_ALLOC_TYPE
- For CQ: MLX_CQ_ALLOC_TYPE
The following are all the possible values that can be allocated to the buffer:

Table 3 - Buffer Values
Possible Value | Description
ANON | Use current pages (ANON small ones). Default value.
HUGE | Force huge pages
CONTIG | Force contiguous pages
PREFER_CONTIG | Try contiguous, fall back to ANON small pages
PREFER_HUGE | Try huge, fall back to ANON small pages
ALL | Try huge, fall back to contiguous if failed, fall back to ANON small pages
Values are NOT case sensitive.

Usage
The application calls the ibv_reg_mr API with the IBV_ACCESS_ALLOCATE_MR bit turned on and the input address set to NULL. Upon success, the address field of the struct ibv_mr will hold the address of the allocated memory block. This block will be freed implicitly when ibv_dereg_mr is called.
The following are environment variables that can be used to control error cases / contiguity:
Table 4 - Parameters Used to Control Err
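Setting the two allocation variables might look like the sketch below. The validation helper normalize_alloc_type is hypothetical (it is not part of MLNX_OFED); it only checks a value against the Table 3 list and upper-cases it, since the text says values are not case sensitive.

```shell
#!/bin/sh
# Hypothetical helper: validate a buffer allocation type against the
# values listed in Table 3 and print it in upper case.
normalize_alloc_type() {
    value=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$value" in
        ANON|HUGE|CONTIG|PREFER_CONTIG|PREFER_HUGE|ALL)
            printf '%s\n' "$value" ;;
        *)
            echo "unknown allocation type: $1" >&2
            return 1 ;;
    esac
}

# Ask for contiguous pages for QP and CQ buffers, falling back to small pages.
MLX_QP_ALLOC_TYPE=$(normalize_alloc_type prefer_contig) && export MLX_QP_ALLOC_TYPE
MLX_CQ_ALLOC_TYPE=$(normalize_alloc_type prefer_contig) && export MLX_CQ_ALLOC_TYPE
```

The variables must be exported in the environment of the verbs application before it starts; exporting them in an unrelated shell has no effect.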
150. dule parameters (e.g. num_vfs and probe_vf). For example, to map mlx4_15 to device function number 04:00.0, in the current version we use "options mlx4_ib dev_assign_str=04:00.0-15", as opposed to the previous version, in which we used "options mlx4_ib dev_assign_str=04:00.0-f".

C.2 mlx4_core Parameters
set_4k_mtu: (Obsolete) attempt to set 4K MTU to all ConnectX ports (int)
debug_level: Enable debug tracing if > 0 (int)
msi_x: 0 - don't use MSI-X, 1 - use MSI-X, >1 - limit number of MSI-X irqs to msi_x (non-SRIOV only) (int)
enable_sys_tune: Tune the cpu's for better performance (default 0) (int)
block_loopback: Block multicast loopback packets if > 0 (default: 1) (int)
num_vfs: Either a single value (e.g. '5') to define a uniform num_vfs value for all devices functions, or a string to map device function numbers to their num_vfs values (e.g. '0000:04:00.0-5,002b:1c:0b.a-15'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the num_vfs value (e.g. 15). (string)
probe_vf: Either a single value (e.g. '3') to indicate that the Hypervisor driver itself should activate this number of VFs for each HCA on the host, or a string to map device function numbers to their probe_vf values (e.g. '0000:04:00.0-3,002b:1c:0b.a-13'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the probe_vf value (e.g. 13). (string)
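A per-device num_vfs or probe_vf string can be assembled mechanically. The sketch below assumes the map format is comma-separated "<device function>-<count>" pairs, as read from the examples in the parameter descriptions; vf_map and mlx4_core_options are hypothetical helper names, and the device functions are the sample ones from the text.

```shell
#!/bin/sh
# Hypothetical helper: join "<device function>-<count>" pairs with commas.
vf_map() {
    (IFS=,; printf '%s\n' "$*")
}

# Hypothetical helper: print a modprobe options line for mlx4_core.
# $1 = num_vfs value, $2 = probe_vf value (each a single number or a map).
mlx4_core_options() {
    printf 'options mlx4_core num_vfs=%s probe_vf=%s\n' "$1" "$2"
}

mlx4_core_options \
    "$(vf_map 0000:04:00.0-5 002b:1c:0b.a-15)" \
    "$(vf_map 0000:04:00.0-3 002b:1c:0b.a-13)"
```

The printed line is meant to be reviewed and then placed in a modprobe configuration file by hand; the sketch deliberately does not write to any system file.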
151. e pure deadlock free algorithm, so use it carefully.
--ucast_cache, -A
This option enables unicast routing cache to prevent routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. in case of host reboot. This option becomes very handy when the cluster size is thousands of nodes.
--lid_matrix_file, -M <file name>
This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.
--lfts_file, -U <file name>
This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.
--sadb_file, -S <file name>
This option specifies the name of the SA DB dump file from where SA database will be loaded.
--root_guid_file, -a <path to file>
Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
--cn_guid_file, -u <path to file>
Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
--io_guid_file, -G <path to file>
Set the I/O nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
--port-shifting
Attempt to shift port routes around to remove alignmen
152. e
4.4.3 Supported Policy
4.4.4 CMA Features
4.4.5 OpenSM Features
4.5 Quality of Service Ethernet
4.5.1 Quality of Service Overview
4.5.2 Mapping Traffic to Traffic Classes
4.5.3 Plain Ethernet Quality of Service Mapping
4.5.4 RoCE Quality of Service Mapping
4.5.5 Raw Ethernet QP Quality of Service Mapping
4.5.6 Map Priorities with tc_wrap.py/mlnx_qos
4.5.7 Quality of Service Properties
4.5.8 Quality of Service Tools
4.6 Ethernet Time-Stamping
4.6.1 Ethernet Time-Stamping Service
4.6.2 RoCE Time Stamping
4.7 Atomic Operations
4.7.1 Enhanced Atomic Operations
4.8 Ethernet Tunneling Over IPoIB Driver (eIPoIB)
4.8.1 Enabling the eIPoIB Driver
4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver
4.8.3 VLAN Configuration Over an eIPoIB Interface
4.8.4 Setting Performance Tuning
4.9 Contiguous Pages
153. e DHCP configuration file
- dhcp.patch: patch file for DHCP v3.1.3

Burning the Expansion ROM Image
Burning the Image on ConnectX-2 / ConnectX-3
This section is valid for ConnectX-2 devices with firmware versions 2.9.1000 or later and ConnectX-3 firmware versions 2.30.3000 or later.

Prerequisites
1. Expansion ROM Image
The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file: FlexBoot-<flexboot version>_release_notes.txt.
2. Firmware Burning Tools
You need to install the Mellanox Firmware Tools (MFT) package (version 3.0.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Products > InfiniBand/VPI Drivers > Firmware Tools.

Image Burning Procedure
To burn the composite image, perform the following steps:
1. Obtain the MST device name. Run:
mst start
mst status
The device name will be of the form: mt<dev_id>_pci{_cr0|conf0}.
2. Create and burn the composite image. Run:
flint -dev <mst device name> brom <expansion ROM image>
Example on Linux:
flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X.X.XXX.mrom
Example on Windows:
flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X.X.XXX.mrom

Removing the Expansion ROM Image
Remove the expansion ROM image. Run:
flint -dev <mst device name> dr
154. e IB ports
- Using a Directed Route to the destination (tool option '-d')
This option defines a directed route of output port numbers from the local port to the destination.
- Using port LIDs (tool option '-l')
In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
- Using port names defined in the topology file (tool option '-n')
This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the '-l' option.

9.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
This version of ibdiagnet is included in the ibutils2 package, and it is run by default after installing Mellanox OFED. To use this ibdiagnet version, run: ibdiagnet
Please see ibutils2_release_notes.txt for additional information and known issues.
ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
Synopsis
[-i|--device <dev-name>] [-p|--port <port-num>] [-g|--guid <GU
155. e
7.2.7 IRQ Affinity
7.2.8 Tuning Multi-Threaded IP Forwarding
Chapter 8 OpenSM - Subnet Manager
8.1 Overview
8.2 opensm Description
8.2.1 opensm Syntax
8.2.2 Environment Variables
8.2.3 Signaling
8.2.4 Running opensm
8.3 osmtest Description
8.3.1 Syntax
8.3.2 Running osmtest
8.4 Partitions
8.4.1 File Format
8.5 Routing Algorithms
8.5.1 Effect of Topology Changes
8.5.2 Min Hop Algorithm
8.5.3 UPDN Algorithm
8.5.4 Fat-tree Routing Algorithm
8.5.5 LASH Routing Algorithm
8.5.6 DOR Routing Algorithm
8.5.7 Torus-2QoS Routing Algorithm
8.6 Quality of Service Management in OpenSM
156. e either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules may be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses a different terminology from the flow attribute (ibv_flow_attr), defined by a combination of specifications (struct ibv_flow_spec).

4.12.1 Enable/Disable Flow Steering
Flow Steering is disabled by default and regular L2 steering is performed instead (B0 Steering). When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the flow steering table for the guest/master.
To enable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Set the parameter log_num_mgm_entry_size to -1 by writing the option mlx4_core log_num_mgm_entry_size=-1.
Step 3. Restart the driver.
To disable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Remove the options mlx4_core log_num_mgm_entry_size=-1.
Step 3. Restart the driver.

4.12.2 Flow Domains and Priorities
Flow steering defines the concept of domain and priority. Each domain represents a user agent that can attach a flow. The domains are prioritized. A higher priority domain will always supersede a lower priority domain when their flow specifications
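The enable and disable steps for Flow Steering could be scripted roughly as follows. This is a sketch under stated assumptions: the option value is taken to be -1, the configuration file path is passed as an argument so the functions can be tried on a scratch file first, and the function names are hypothetical. The driver still has to be restarted afterwards, as Step 3 says.

```shell
#!/bin/sh
# Option line from Step 2 (value assumed to be -1; adjust if your
# release documents a different one).
FS_OPTION='options mlx4_core log_num_mgm_entry_size=-1'

# Hypothetical helper: add the option line once (idempotent).
enable_flow_steering() {
    conf=${1:-/etc/modprobe.d/mlnx.conf}
    touch "$conf" || return 1
    grep -qxF -- "$FS_OPTION" "$conf" || printf '%s\n' "$FS_OPTION" >> "$conf"
}

# Hypothetical helper: remove the option line, keeping all other lines.
disable_flow_steering() {
    conf=${1:-/etc/modprobe.d/mlnx.conf}
    [ -f "$conf" ] || return 0
    tmp=$(mktemp) || return 1
    grep -vxF -- "$FS_OPTION" "$conf" > "$tmp" || true
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example (on a scratch copy, not the live file):
#   enable_flow_steering ./mlnx.conf.test
#   disable_flow_steering ./mlnx.conf.test
```

Matching the whole line with grep -x keeps unrelated mlx4_core options in the same file untouched.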
157. e entire driver's file, but does not unload the driver.
[root@swl022 ~]# /usr/sbin/ofed_uninstall.sh
This program will uninstall all OFED packages on your machine.
Do you want to continue?[y/N]:y
Running /usr/sbin/vendor_pre_uninstall.sh
Removing OFED Software installations
Running /bin/rpm -e --allmatches kernel-ib kernel-ib-devel libibverbs libibverbs-devel libibverbs-devel-static libibverbs-utils libmlx4 libmlx4-devel libibcm libibcm-devel libibumad libibumad-devel libibumad-static libibmad libibmad-devel libibmad-static librdmacm librdmacm-utils librdmacm-devel ibacm opensm-libs opensm-devel perftest compat-dapl compat-dapl-devel dapl dapl-devel dapl-devel-static dapl-utils srptools infiniband-diags-guest ofed-scripts opensm-devel
warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
Running /tmp/2818-ofed_vendor_post_uninstall.sh
Step 3. Restart the server.

4.13.6 Burning Firmware with SR-IOV
The following procedure explains how to create a binary image with SR-IOV enabled that has 63 VFs. However, the number of VFs varies according to the working mode requirements.
To burn the firmware:
Step 1. Verify you have MFT installed in your machine.
Step 2. Enter the firmware directory according to the HCA type (e.g. ConnectX-3). The path is: mlnx_ofed/firmware/<device>/<FW version>
Step 3. Find the .ini file that contains the HCA's PSID. Run:
# ibv_devinfo | grep board_id
board_id: MT
158. e same resources with the Physical Function, and its number of ports equals those of the Physical Function.
SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual machines direct hardware access to network resources, hence increasing its performance.
In this chapter we will demonstrate setup and configuration of SR-IOV in a Red Hat Linux environment using the Mellanox ConnectX VPI adapter cards family.

4.13.1 System Requirements
To set up an SR-IOV environment, the following is required:
- MLNX_OFED Driver
- A server/blade with an SR-IOV-capable motherboard BIOS
- Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
- Mellanox ConnectX VPI Adapter Card family with SR-IOV capability

4.13.2 Setting Up SR-IOV
Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.
Step 1. Enable "SR-IOV" in the system BIOS.
[Figure: BIOS SETUP UTILITY screen, SR-IOV Supported: Enabled]
Step 2. Enable "Intel Virtualization Technology".
[Figure: BIOS SETUP UTILITY screen, Intel(R) Virtualization Tech: Enabled]
Step 3. Install a hypervisor that supports SR-IOV.
Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux ker
159. e user or encounters connection errors (such as no SM in the fabric).
To execute SRP daemon as a daemon on all the ports:
- srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log.
- Start the srpd service script. Run: service srpd start
- It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRP_DAEMON_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 4.1.2.6).
For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart

4.1.2.5 Multiple Connections from Initiator InfiniBand Port to the Target
Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA.
In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command.
Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, there is need for a different initiator_ext value on each path. The
160. ease contact your hardware vendor for help on firmware updates.
Error message:
Device #1:
Device: 0000:05:00.0
Part Number:
Description:
PSID: MT_0DB0110010
Versions: Current Available
FW 2.9.1000 N/A
Status: No matching image found

Step 4. Reboot the machine if the installation script performed firmware updates to your network adapter hardware. Otherwise, restart the driver by running: /etc/init.d/openibd restart
Note: The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI:
* soft memlock unlimited
* hard memlock unlimited
These settings set the amount of memory that can be pinned by a user space application to unlimited. If desired, tune the value unlimited to a specific amount of RAM.
Note: For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 8, "OpenSM - Subnet Manager".
Step 5. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as:
- HCA firmware version
- Kernel architecture
- Driver version
- Number of active HCA ports along with their states
- Node GUID
161. edora Distributions
- /lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat-like Distributions
- /lib/modules/`uname -r`/updates/dkms/ on Ubuntu

Firmware
The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
a. You run the installation script in default mode; that is, without the option '--without-fw-update'.
b. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.
Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.
In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
-I- Querying device ...
-E- Can't auto detect fw configuration file: ...

2.3.5 Post-installation Notes
Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.
The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.

2.3.6 Installation Logging
While installing MLNX_OFED, the install log for eac
162. en ports and dimension (Order) for controlling Dimension Order Routing (DOR). Moreover, this option provides the means to define non-default routing port order.
--dimn_ports_file, -O <path to file>
DEPRECATED. This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR).
--honor_guid2lid, -x
This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE.
--const_multicast
This option forces OpenSM to conserve previously built multicast trees.
--log_file, -f <log file name>
This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output, use -f stdout.
--log_limit, -L <size in MB>
This option defines maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.
--erase_log_file, -e
This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.
--Pconfig, -P <partition config file>
This option defines the optional partition configuration file. The default name is /etc/opensm/partitions.conf.
--no_part_enforce, -N
DEPRECATED. This option disables partition enforcement on switch external ports.
sexet emiomos 4 gado ie
163. ent identifier ERIZOS CONOS
Step 10. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded.
Note: The order of the following commands (for loading modules) is critical.
echo "loading ipv6"
/sbin/insmod /lib/modules/ipv6.ko
echo "loading IB driver"
/sbin/insmod /lib/modules/ib/ib_addr.ko
/sbin/insmod /lib/modules/ib/ib_core.ko
/sbin/insmod /lib/modules/ib/ib_mad.ko
/sbin/insmod /lib/modules/ib/ib_sa.ko
/sbin/insmod /lib/modules/ib/ib_cm.ko
/sbin/insmod /lib/modules/ib/ib_uverbs.ko
/sbin/insmod /lib/modules/ib/ib_ucm.ko
/sbin/insmod /lib/modules/ib/ib_umad.ko
/sbin/insmod /lib/modules/ib/iw_cm.ko
/sbin/insmod /lib/modules/ib/rdma_cm.ko
/sbin/insmod /lib/modules/ib/rdma_ucm.ko
/sbin/insmod /lib/modules/ib/mlx4_core.ko
/sbin/insmod /lib/modules/ib/mlx4_ib.ko
/sbin/insmod /lib/modules/ib/ib_mthca.ko
Note: The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release notes.
/sbin/insmod /lib/modules/ib/ipoib_helper.ko
/sbin/insmod /lib/modules/ib/ib_ipoib.ko
In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
/sbin/insmod /lib/modules/ib/ib_ipoib
164. er servicing all strict priority TCs will be split according to this ratio.
Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.

Rate Limit
Rate limit defines a maximum bandwidth allowed for a TC. Please note that 10% deviation from the requested values is considered acceptable.

Quality of Service Tools
mlnx_qos
mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates directly with the driver and thus does not require setting up a DCBX daemon on the system.
The mlnx_qos tool enables the administrator of the system to:
- Inspect the current QoS mappings and configuration. The tool will also display maps configured by the tc and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
- Set UP to TC mapping
- Assign a transmission algorithm to each TC (strict or ETS)
- Set minimal BW guarantee to ETS TCs
- Set rate limit to TCs
Note: For unlimited ratelimit, set the ratelimit to 0.
Usage
mlnx_qos -i <interface> [options]
Options
--version: show program's version number and exit
-h, --help: show this help message and exit
-p LIST, --prio_tc=LIST: maps UPs to TCs. LIST is 8 comma-separated TC numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1
-s LIST, --tsa=LIST: Transmi
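The LIST argument of the -p option is positional: the n-th number is the TC for UP n-1. A small sketch that expands such a list makes this easier to check before passing it to mlnx_qos; up_to_tc is a hypothetical helper, not part of the tool.

```shell
#!/bin/sh
# Hypothetical helper: expand an 8-entry, comma-separated -p LIST into
# one "UP n -> TC m" line per user priority.
up_to_tc() {
    list=$1
    up=0
    old_ifs=$IFS
    IFS=,
    for tc in $list; do
        echo "UP $up -> TC $tc"
        up=$((up + 1))
    done
    IFS=$old_ifs
    if [ "$up" -ne 8 ]; then
        echo "expected 8 entries, got $up" >&2
        return 1
    fi
}

# The example list from the option description.
up_to_tc 0,0,0,0,1,1,1,1
```

A list with fewer or more than 8 entries is rejected, since the option description requires exactly 8 TC numbers.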
165. ers for eth1:
Adaptive RX: off TX: off
pkt-rate-low: 400000
pkt-rate-high: 450000
rx-usecs: 0
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

7.2.6 Tuning for NUMA Architecture
7.2.6.1 Tuning for Intel Sandy Bridge Platform
The Intel Sandy Bridge processor has an integrated PCI express controller. Thus every PCIe adapter is connected directly to a NUMA node. On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected.
In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.
To see if your system supports PCIe adapter's NUMA node detection:
# cat /sys/class/net/[interface]/device/numa_node
# cat /sys/devices/[PCI root]/[PCIe function]/numa_node
Example for supported system:
# cat /sys/class/net/eth3/device/numa_node
0
Example for unsupported system:
# cat /sys/class/net/ib0/device/numa_node
-1

7.2.6.1.1 Improving Application Performance on Remote NUMA Node
Verbs API applications that mostly use polling will have an impact when using the remote NUMA node. libmlx4 has a built-in enhancement that recognizes an application that is pinned to a remote NUMA node and activates a flow that improves the out-of-the-box latency and throughput. However, the NUMA node recognition must be enabled as described in section "Tuning for Intel
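The numa_node check can be wrapped so that a script can branch on the result. In the sketch below the helper names are hypothetical, and the sysfs root is a parameter (defaulting to /sys) only so the functions can be exercised against a scratch directory; a value of -1 is treated as "detection not supported", following the unsupported-system example.

```shell
#!/bin/sh
# Hypothetical helper: print the NUMA node recorded for a network interface.
# $1 = interface name, $2 = sysfs root (optional, defaults to /sys).
adapter_numa_node() {
    iface=$1
    sysfs=${2:-/sys}
    file="$sysfs/class/net/$iface/device/numa_node"
    if [ ! -r "$file" ]; then
        echo "no numa_node entry for $iface" >&2
        return 1
    fi
    cat "$file"
}

# Hypothetical helper: succeed only if a real node number (>= 0) is reported.
numa_detection_supported() {
    node=$(adapter_numa_node "$@") || return 1
    [ "$node" -ge 0 ]
}

# Example:
#   if numa_detection_supported eth3; then
#       echo "eth3 is local to NUMA node $(adapter_numa_node eth3)"
#   fi
```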
166. erver side. You can download and install an iSCSI Target from the following location: http://sourceforge.net/projects/iscsitarget/files/iscsitarget/
Step 2. Dedicate a partition on your iSCSI Target on which you will later install the operating system.
Step 3. Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:
Lun 0 Path=/dev/sda5,Type=fileio
Example of an iSCSI Target iqn line:
Target iqn.2007-08.7.3.4.10:iscsiboot
Step 4. Start your iSCSI Target. Example:
host1# /etc/init.d/iscsitarget start

Configuring the DHCP Server to Boot From an iSCSI Target
Configure DHCP as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP". Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target:
Filename grign KOE patas ess Sita Sais CSI target icmp
The following is an example for configuring an IB/ETH device to boot from an iSCSI target:
host host1 filename
# For a ConnectX device with ports configured as InfiniBand, comment out the following line
option dhcp-client-identifier = 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
# For a ConnectX device with ports configured as Ethernet
167. es provide an excellent low-level interface for PGAS applications.
A SHMEM program is of a single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem, and periodically communicate with other PEs to exchange information on which the next communication phase depends.
The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).
SHMEM routines support remote data transfer through:
- "put" operations: data transfer to a different PE
- "get" operations: data transfer from a different PE
- remote pointers: allowing direct references to data objects owned by another PE
Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object.
SHMEM libraries implement active messaging. The sending of data involves only one CPU, where the source processor puts the data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CP
168. eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1.

Quality of Service Properties
The different QoS properties that can be assigned to a TC are:
- Strict Priority (see "Strict Priority")
- Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
- Rate Limit (see "Rate Limit")

Strict Priority
When setting a TC's transmission algorithm to be 'strict', then this TC has absolute (strict) priority over other TC strict priorities coming before it (as determined by the TC number: TC 7 is highest priority, TC 0 is lowest). It also has an absolute priority over non-strict TCs (ETS).
This property needs to be used with care, as it may easily cause starvation of other TCs.
A higher strict priority TC is always given the first chance to transmit. Only if the highest strict priority TC has nothing more to transmit will the next highest TC be considered.
Non-strict priority TCs will be considered last to transmit.
This property is extremely useful for low-latency, low-bandwidth traffic: traffic that needs to get immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.

Minimal Bandwidth Guarantee (ETS)
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy.
If, for instance, TC0 is set to 80% guarantee and TC1 to 20% (the TCs sum must be 100), then the BW left aft
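The ETS split in the 80%/20% example can be worked through with concrete numbers. The sketch below is plain arithmetic, not a model of the hardware scheduler: the link rate, the strict-priority load, and the helper name ets_guaranteed_mbps are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: minimal bandwidth guaranteed to one ETS TC, in Mb/s.
# $1 = link rate, $2 = bandwidth currently used by strict-priority TCs,
# $3 = ETS percentage configured for the TC.
ets_guaranteed_mbps() {
    link=$1
    strict=$2
    pct=$3
    echo $(( (link - strict) * pct / 100 ))
}

# Assumed figures: 10000 Mb/s link, 2000 Mb/s consumed by strict TCs.
echo "TC0 (80%): $(ets_guaranteed_mbps 10000 2000 80) Mb/s"   # 6400
echo "TC1 (20%): $(ets_guaranteed_mbps 10000 2000 20) Mb/s"   # 1600
```

These are floors rather than caps: a TC may exceed its figure whenever another ETS TC leaves part of its share unused.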
169. ey
[Figure: "Add new virtual hardware" dialog (Adding Virtual Hardware). Hardware types shown: Storage, Network, Parallel, Physical Host Device, Video, Watchdog]
Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).
Step 5. If the Virtual Machine is up, reboot it; otherwise, start it.
Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
lspci | grep Mellanox
00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly, therefore it is not necessary to add it.

4.13.5 Uninstalling SR-IOV Driver
To uninstall the SR-IOV driver, perform the following:
Step 1. For Hypervisors, detach all the Virtual Functions (VF) from all the Virtual Machines (VM), or stop the Virtual Machines that use the Virtual Functions.
Please be aware, stopping the driver when there are VMs that use the VFs will cause the machine to hang.
Step 2. Run the script below. Please be aware, uninstalling the driver deletes th
170. fed MLNX_OFED Repository 108
rpmforge RHEL 6Server - RPMforge.net - dag 4,597
repolist: 8,351

2.5.2 Installing MLNX_OFED using the YUM Tool
After setting up the YUM repository for the MLNX_OFED package, perform the following:
Step 1. View the available package groups by invoking:
# yum grouplist | grep MLNX_OFED
MLNX_OFED ALL
MLNX_OFED BASIC
MLNX_OFED GUEST
MLNX_OFED HPC
MLNX_OFED HYPERVISOR
MLNX_OFE
MLNX_OFE
MLNX_OFE
Step 2. Install the desired group.
# yum groupinstall 'MLNX_OFED ALL'
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Setting up Group Process
Resolving Dependencies
--> Running transaction check
---> Package ar_mgr.x86_64 0:1.0-0.11.g22fff4a will be installed
rds-devel.x86_64 6mlnx-1
rds-tools.x86_64 6mlnx-1
srptools.x86_64 0.0.4mlnx3-OFED.2.0.2.6.7.11.ge863cb7
Complete!

2.5.3 Updating Firmware After Installation
Installing MLNX_OFED using the YUM tool does not automatically update the firmware. To update the firmware to the version included in the MLNX_OFED package, you can either:
- Run the mlnxofedinstall script with the --fw-update-only flag, or
- Update the firmware to the latest version available on the Mellanox Technologies Web site, as described in Section 2.4, "Updating Firmware After Installation" o
171. ffic class: IPoIB, Service Level: 3, Policy: min 10% BW (App A Server)

8.7 QoS Configuration Examples
The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.

8.7.1 Typical HPC Example: MPI and Lustre
Assignment of QoS Levels
- MPI: Separate from I/O load; Min BW of 70%
- Storage Control (Lustre MDS): Low latency
- Storage Data (Lustre OST): Min BW 30%
Administration
- MPI is assigned an SL via the command line:
host1# mpirun -sl 0
- OpenSM QoS policy file. In the following policy file example, replace OST* and MDS* with the real port GUIDs.
qos-ulps
    default                                   :0 # default SL (for MPI)
    any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
    any, target-port-guid MDS1,MDS2           :2 # SL for Lustre MDS
end-qos-ulps
- OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 2:1
qos_vlarb_low 0:96,1:224
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

8.7.2 EDC SOA (2-tier): IPoIB and SRP
The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.
QoS Levels
- Application traffic: IPoIB (UD and CM) and SDP; Isolated from storage; Min BW of 50%
- SRP: Mi
172. fied here The specified GUIDs are lt GUIDs gt assigned the following values repectively node portl port2 and system image GUID Note Port2 guid must be specified even for a single port HCA the HCA ignores this value It can be set to 0x0 218 Mellanox Technologies m 2 1 1 0 0 Table 36 mstflint Switches Sheet 2 of 3 Affected Switch Relevant Description Commands mac burn sg MAC address base value Two MACS are automatically lt MAC gt assigned to the following values mac gt portl mac 1 gt port2 Note This switch is applicable only for Mellanox Technolo gies Ethernet products macs burn sg Two MACs must be specified here The specified MACs are lt MACs gt assigned to port and port2 repectively Note This switch is applicable only for Mellanox Technolo gies Ethernet products blank_guids burn Burn the image with blank GUIDs and MACs where applica ble These values can be set later using the sg command see Table 37 below No com Force clear the Flash semaphore on the device No command is clear_semap mands allowed when this switch is used hore allowed Warning May result in system instability or Flash corruption if the device or another application is currently using the Flash i mage burn verify Binary image file lt image gt qq burn query Run a quick query When specified mstflint will not perform full image integrity checks dur
173. for collective operations if the number of processes in the job is greater than the ca np value default 64 fca verbose level Sets verbosity level for the FCA modules fca ops op list op list comma separated list of collective operations e fca ops op list Enables disables only the specified operations e fca ops Enables disables all operations By default all operations are enabled Allowed operation names are barrier br bcast bt reduce rc allgather ag Each operation can be also enabled disabled via environment variable GASNET FCA ENABLE BARRIER GASNET FCA ENABLE BCAST GASNET FCA ENABLE REDUCE Note All the operations are enabled by default 5 5 2 1 Enabling FCA Operations through Environment Variables in ScalableUPC This method can be used to control UPC FCA offload from environment using job scheduler srun utility The valid values are 1 enable 0 disable To enable a specific operation with shell environment variables in ScalableUPC export GASNET FCA ENABLE BARRIER 1 export GASNET FCA ENABLE BCAST 1 export GASNET FCA ENABLE REDUCE 1 124 Mellanox Technologies oo m 2 1 1 0 0 5 5 2 2 Controlling FCA Offload in ScalableUPC using Environment Variables gt To enable FCA module under Scalable UPC export GASNET FCA ENABLE CMD LINE 1 gt To set FCA verbose level export GASNET FCA VERBOSE CMD LINE 10 gt To
174. formance

7.2.8 Tuning Multi-Threaded IP Forwarding

To optimize NIC usage as IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
   For MLNX_OFED-2.0.x:
     options mlx4_en inline_thold=0
     options mlx4_core high_rate_steer=1
   For MLNX_EN-1.5.10:
     options mlx4_en num_lro=0 inline_thold=0
     options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
     set_irq_affinity_bynode.sh <numa node> <interface>
4. Forwarding from one interface to another:
     set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
5. Disable adaptive interrupt moderation and set status values using:
     ethtool -C adaptive-rx off

8 OpenSM - Subnet Manager

8.1 Overview

OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).

8.2 opensm Description

opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet
175. formance, we recommend using the Datagram mode. However, the mode can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting SET_IPOIB_CM=yes. The SET_IPOIB_CM parameter is set to "auto" by default, which enables the Connected mode for Connect-IB cards and Datagram mode for all other ConnectX cards. After changing the mode, you need to restart the driver by running:
  /etc/init.d/openibd restart
To check the current mode used for outgoing connections, enter:
  cat /sys/class/net/ib<n>/mode

4.3.3 IPoIB Configuration

Unless you have run the installation script mlnxofedinstall with the flag '-n', IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on. An IPoIB configuration can be based on DHCP (Section 4.3.3.1) or on a static configuration (Section 4.3.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3.3).

4.3.3.1 IPoIB Configuration Based on DHCP

Setting an IPoIB interface configuration based on DHCP is performed similarly
176. gTable or MFT for the specified switch LID and the optional lid mlid range The default range is all valid entries in the range 1 to FDBTop 206 Mellanox Technologies m 2 1 1 0 0 Synopsis ibrovte Sa Ed Iw vi E dese E ae Su Es lt smluck MEC em mansa P lt ca_port gt I t timeout ms gt lt dest dr path lid guid gt lt star tlid gt lt endlid gt Output Files Table 32 lists the various flags of the command Table 32 ibportstate Flags and Options a Default Flag UR i If Not Description m Specified h help Optional Print the help menu d ebug Optional Raise the IB debug level May be used several times for higher debug levels ddd or d d d a 11 Optional Show all LIDs in range including invalid entries v erbose Optional Increase verbosity level May be used several times for additional verbosity vvv or v v v V ersion Optional Show version info a 11 Optional Show all LIDs in range including invalid entries n o dests Optional Do not try to resolve destinations D irect Optional Use directed path address arguments The path is acomma separated list of out ports Examples 0 self port 0 1 2 1 4 out via port 1 then 2 G uid Optional Use GUID address argument In most cases it is the Port GUID Example 0x08f1040023 M ulticast Optional Show multicast forwarding tables
177. gies available in ConnectX adapters. FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products. FlexBoot is based on the open source project iPXE, available at http://www.ipxe.org. FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other service. For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com > Products > InfiniBand/VPI SW/Drivers). The binary code is exported by the device as an expansion ROM image.

FlexBoot Package

The FlexBoot package is provided as a tarball (.tgz extension). Uncompress it using the command:
  tar zxf <package file name>
The tarball contains PXE binary files (with the .mrom extension) for the supported adapter devices. See the release notes file FlexBoot-<flexboot_version>_release_notes.txt for details. The package includes the following files:
  dhcpd.conf sampl
178. gnet vlr

8.5.7.6 Torus-2QoS Configuration File Syntax

The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.

  [torus|mesh] x_radix[m|M|t|T] y_radix[m|M|t|T] z_radix[m|M|t|T]

Either torus or mesh must be the first keyword in the configuration, and sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology.

Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).

  xp_link sw0_GUID sw1_GUID
  yp_link sw0_GUID sw1_GUID
  zp_link sw0_GUID sw1_GUID
  xm_link sw0_GUID sw1_GUID
  ym_link sw0_GUID sw1_GUID
  zm_link sw0_GUID sw1_GUID

These keywords are used to seed the torus/mesh topology. For example
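The topology line syntax described above can be illustrated with a small parser. This is only a sketch for reasoning about the syntax, not part of OpenSM; the function name parse_topology and the (radix, looped) tuple layout are hypothetical choices made for this example.

```python
def parse_topology(line):
    """Parse a 'torus|mesh x_radix y_radix z_radix' line.

    Returns one (radix, looped) pair per dimension, where looped is True
    for a torus (looped) dimension and False for a mesh (open) dimension.
    """
    tokens = line.split()
    keyword, radices = tokens[0], tokens[1:4]  # trailing tokens are ignored
    if keyword not in ("torus", "mesh") or len(radices) != 3:
        raise ValueError("expected: torus|mesh x_radix y_radix z_radix")

    default_looped = keyword == "torus"
    dims = []
    for tok in radices:
        looped = default_looped
        if tok[-1] in "mMtT":          # per-dimension override suffix
            looped = tok[-1] in "tT"
            tok = tok[:-1]
        dims.append((int(tok), looped))
    return dims


# The two spellings from the text resolve to the same topology.
print(parse_topology("mesh 3T 4 5"))
print(parse_topology("torus 3 4M 5M"))
```

Both calls yield the same per-dimension list, which matches the statement that the two configuration lines specify the same topology.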
179. guid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies Lid Out Destination Port Info 0x0002 023 Switch portguid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies 0x0003 000 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies 0x0006 023 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1 0x0007 020 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1 0x0008 024 Channel Adapter portguid 0x0002c902002582cd sw136 HCA 1 5 valid lids dumped 5 Dump all non empty mlids of switch with Lid 3 ibroute M 3 Multicast mlids 0xc000 0xc3ff of switch Lid 3 guid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies 0 il 2 Poss 012343678 901234506 78901234 MLid 0xc000 0xc001 0xc002 0xc003 0xc020 0xc021 0xc022 0xc023 0xc024 0xc040 0xc041 0xc042 12 valid mlids dumped x X OK X x XX XX XxX Xx Xx xXx X Mellanox Technologies 209 Rev 2 1 1 0 0 InfiniBand Fabric Diagnostic Utilities 9 12 smpquery Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info node description switch info and port info Synopsis smpquery h d e v D G s lt smlid gt V C lt ca_name gt P lt ca_port gt t lt timeout_ms gt node name map lt node name map gt lt op gt dest dr_path lid guid gt op params
180. h selected package will be saved in a sepa rate log file The path to the directory containing the log files will be displayed after running the installation script in the following format Logs dir tmp MLNX_OFED_LINUX lt version gt lt PID gt logs Example Logs dir tmp MLNX OFED LINUX 2 1 0 0 2 31701 logs 2 4 Updating Firmware After Installation In case you ran the mlnxofedinstall script with the without fw update option and now you wish to manually update firmware on your adapter card s you need to perform the following steps P If you need to burn an Expansion ROM image please refer to Burning the Expan sion ROM Image on page 224 The following steps are also appropriate in case you wish to burn newer firmware that you have downloaded from Mellanox Technologies Web site http www mella Adi nox com gt Downloads gt Firmware Mellanox Technologies 41 Rev 2 1 1 0 0 Installation Step 1 Start mst hostl mst start Step 2 Identify your target InfiniBand device for firmware update 1 Get the list of InfiniBand device names on your machine hostl mst status MST modules MST PCI module loaded MST PCI configuration module loaded MST Calibre 12C module is not loaded MST devices dev mst mt25418 pciconf0 PCI configuration cycles access bus dev fn 02 00 0 addr reg 88 data reg 92 Chip revision is A0 dev mst mt25418 pci cr0 PCI Chace ACCOSS bus dev
181. he ager role of a Master Subnet Manager by agency of the master SM See Subnet Manager Subnet Administra An application normally part of the Subnet Manager that tor SA implements the interface for querying and manipulating subnet management data Subnet Manager One of several entities involved in the configuration and con SM trol of the an IB fabric Unicast Linear For warding Tables LFT A table that exists in every switch providing the port through which packets should be sent to each LID Virtual Protocol Interconnet VPI A Mellanox Technologies technology that allows Mellanox channel adapter devices ConnectX to simultaneously con nect to an InfiniBand subnet and a 10GigE subnet each subnet connects to one of the adpater ports Related Documentation Table 4 Reference Documents Document Name Description InfiniBand Architecture Specification Vol 1 Release 1 2 1 is provided by IBTA TEEE Std 802 3aeTM 2002 Amendment to IEEE Std 802 3 2002 Document PDF 8894996 Physical Layer Specifications Amendment Media Access Control MAC Parameters for 10 Gb s Operation 16 Mellanox Technologies The InfiniBand Architecture Specification that Part 3 Carrier Sense Multiple Access with Colli sion Detection CSMA CD Access Method and Parameters Physical Layers and Management m 2 1 1 0 0 Table 4 Reference Documents Document
182. he Ethernet Fabric o ooo oooooooomoo 122 5 4 Fabric Collective Accelerator 0 0 0 cc ccc eet era 122 23 25 rScalableUPG ccc ghee Gees SES hee a ws Se ee NUUS RES NW SS 123 5 5 1 Installing ScalableUPC 2 0 cect teenies 124 5 5 2 FCA Runtime Parameters 00 0 cee cc re 124 5 5 3 Various Executable Examples 0000 c cece cece eee sees 125 Mellanox Technologies 5 Rev 2 1 1 0 0 Chapter 6 Working With VPI ccc ccc ccc cece cee cece cece cece ee 126 6 1 Port Type Management 20 eee 126 6 2 Alto Sensing A oue dee Lb RR e cad e a 127 6 2 1 Enabling Auto Sensing 0 0 00 eee 127 Chapter 7 Perforniance RC ERR 128 7 1 General System Configurations 0 0 0 0 000 cece eee 128 7 1 1 PCI Express PCIe Capabilities llle 128 7 1 2 Memory Configuration lisse 128 7 1 3 Recommended BIOS Settings ssssememsererrererrerrrrrrr rer rer rr rea 128 7 2 Performance Tuning for Linux 00 0 131 7 2 1 Tuning the Network Adapter for Improved IPv4 Traffic Performance 131 7 2 2 Tuning the Network Adapter for Improved IPv6 Traffic Performance 131 7 2 3 Preserving Your Performance Settings after a Rebo0t 132 7 2 4 Tuning Power Management 0 cece cect rr rr rea 132 7 2 5 Interrupt Moderation 0 00 c cect teen rr rea 134 7 2 6 Tuning for NUMA Architecture 0 2 0 0 cec
183. he given image as is without running any checks Sg Set GUIDs ri lt out file gt dc lt out file gt Read the firmware image on the Flash into the specified file Dump Configuration Print a firmware configuration file for the given image to the specified output file e rase lt addr gt Erase sector rw lt addr gt Read one DWORD from Flash ww lt addr gt data Write one DWORD to Flash wwne lt addr gt Write one DWORD to Flash without sector erase wbne lt addr gt size data Write a data block to Flash without sector erase rb lt addr gt size out file swreset Read a data block from Flash SW reset the target InfniScale IV device This command is supported only in the In Band access method 220 Mellanox Technologies InfiniBand Fabric Diagnostic Utilities m 2 1 1 0 0 Possible command return values are 0 successful completion 1 error has occurred 7 the burn command was aborted because firmware is current Examples 1 Find Mellanox Technologies s ConnectX amp VPI cards with PCI Express running at 2 5GT s and InfiniBand ports at DDR or Ethernet ports at 10GigE gt sbin lspci d 15b3 634a 04 00 0 InfiniBand Mellanox Technologies MT25418 ConnectX IB DDR PCIe 2 0 2 5GT s rev a0 In the example above 15b3 is Mellanox Technologies s vendor number in hexadecimal and 634a is the device s PCI Device ID in
184. her increase the verbosity level. See the -D option for more information about log verbosity.

This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.

-D <a>
This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:

  BIT    LOG LEVEL ENABLED
  0x01   ERROR (error messages)
  0x02   INFO (basic messages, low volume)
  0x04   VERBOSE (interesting stuff, moderate volume)
  0x08   DEBUG (diagnostic, high volume)
  0x10   FUNCS (function entry/exit, very high volume)
  0x20   FRAMES (dumps all SMP and GMP frames)
  0x40   ROUTING (dump FDB routing information)
  0x80   currently unused

Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

--debug, -d <number>
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:

  OPT    Description
  -d0    Ignore other SM nodes
  -d1    Force single threaded dispatching
  -d2    Force log flushing after each log message
  -d3    Disable multicast support
  -d10   Put Open
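To make the flags field concrete, the following sketch decodes a verbosity value into the log levels listed in the table above. It is an illustration only; the names LOG_LEVELS and decode_log_flags are hypothetical and do not exist in OpenSM.

```python
# Bit-to-name table copied from the log level list above (0x80 is unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags):
    """Return the names of the log levels enabled by a flags value."""
    return [name for bit, name in LOG_LEVELS.items() if flags & bit]


print(decode_log_flags(0x03))  # the default when no flags are given
print(decode_log_flags(0x00))  # all messages disabled
print(decode_log_flags(0xFF))  # every defined level enabled
```

Reading a value this way makes it easy to check, for instance, that 0x0B enables ERROR, INFO and DEBUG but leaves VERBOSE off.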
185. hexadecimal The number string 04 00 0 identifies the device in the form bus dev fn The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http pci ids ucw cz read PC 15b3 ade 2 Verify the ConnectX firmware using its ID using the results of the example above gt mstflint d 04 00 0 v ConnectX failsafe image Start address 80000 Chunk size 80000 NOTE The addresses below are contiguous logical addresses Physical addresses on flash may be different based on the image start address and chunk size 0x00000038 0x000010db 0x0010a4 BOOT2 OK 0x000010dc 0x00004947 0x00386c BOOT2 OK 0x00004948 0x000052c7 0x000980 Configuration OK 0x000052c8 0x0000530b 0x000044 GUID OK 0x0000530c 0x0000542f 0x000124 Image Info OK 0x00005430 0x0000634f 0x000 20 DDR OK 0x00006350 0x0000 29b 0x008f4c DDR OK 0x0000 29c 0x0004749b 0x038200 DDR OK 0x0004749c 0x0005913f 0x011ca4 DDR OK 0x00059140 0x0007a123 0x020fe4 DDR OK 0x0007a124 0x0007bdff 0x001cdc DDR OK 0x0007be00 0x0007eb97 0x002d98 DDR OK 0x0007eb98 0x0007f0af 0x000518 Configuration OK 0x0007f0b0 0x0007f0fb 0x00004c Jump addresses OK 0x0007 0fc 0x0007 2a7 0x0001ac FW Configuration OK FW image verification succeeded Image is bootable Mellanox Technologies 221 9 16 9 17 Rev 2 1
186. ic Manager UFM software Powerful platform for managing demanding scale out computing fabric environments built on top of the OpenSM industry standard routing engine Fabric Collective Accelerator FCA FCA is a Mellanox MPl integrated software package that utilizes CORE Direct technology for implementing the MPI collectives communications 1 2 Mellanox OFED Package 1 2 1 ISO Image Mellanox OFED for Linux MLNX_OFED_LINUX is provided as ISO images or as a tarball one per supported Linux distribution and CPU architecture that includes source code and binary RPMs firmware utilities and documentation The ISO image contains an installation script called mlnxofedinstal1 that performs the necessary steps to accomplish the following Discover the currently installed kernel Uninstall any InfiniBand stacks that are part of the standard operating system distribu tion or another vendor s commercial stack Install the MLNX OFED LINUX binary RPMs if they are available for the current kernel Identify the currently installed InfiniBand HCAs and perform the required firmware updates 1 2 2 Software Components MLNX OFED LINUX contains the following software components Mellanox Host Channel Adapter Drivers mlx5 mlx4 VPI which is split into multiple modules e mlx4 core low level helper mlx4 ib IB mlx5 ib e mlx5 core Mellanox Technologies 19 J Rev 2 1 1 0 0 Mellanox OFED Overview mlx4
187. ic operations is the same as that for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).

4.7.1.1 Masked Compare and Swap (MskCmpSwap)

The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64 bit target data for the "compare" check, as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:

  atomic_response = *va
  if (!((compare_add ^ *va) & compare_add_mask)) then
      *va = (*va & ~(swap_mask)) | (swap & swap_mask)
  return atomic_response

The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.

4.7.1.2 Masked Fetch and Add (MFetchAdd)

The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:

  bit_adder(ci, b1, b2, *co)
  {
      value = ci + b1 + b2
      *co = !!(value & 2)
      return value & 1
  }

  #define MASK_IS_SET(mask, attr) (!!((mask) & (a
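The masked compare-and-swap semantics can be modeled in a few lines. This is a behavioral sketch only: it operates on a Python dict standing in for memory and is not atomic, whereas the real operation is executed atomically by the HCA. The function name masked_cmp_swap and its argument names are assumptions made for this illustration.

```python
MASK64 = (1 << 64) - 1  # the target is a 64-bit value


def masked_cmp_swap(memory, addr, compare, compare_mask, swap, swap_mask):
    """Model of MskCmpSwap on memory[addr]; returns the original value.

    Only the bits selected by compare_mask take part in the comparison,
    and only the bits selected by swap_mask are replaced on a match.
    """
    old = memory[addr] & MASK64
    if ((compare ^ old) & compare_mask) == 0:
        memory[addr] = ((old & ~swap_mask) | (swap & swap_mask)) & MASK64
    return old


mem = {0: 0x1122334455667788}

# Compare only the low byte (0x88) and, on a match, replace only byte 1.
previous = masked_cmp_swap(mem, 0, compare=0x88, compare_mask=0xFF,
                           swap=0xAB00, swap_mask=0xFF00)
print(hex(previous), hex(mem[0]))
```

With a full compare mask and a full swap mask the function reduces to the standard CmpSwap behavior, which is a useful sanity check on the model.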
188. ie 2048 o a a O rte 0 VIC ao nara Odd VL0 7 O MEAT TAE 0x00 cdo o en dade saec on 4 WHINAOELGINCEIOS 0 005306000 0004000000 8 MINDS LONC AD a ert eI ET 8 lin WERDE cto cen HGNC 0x00 NL GATOS to AS 2048 VES ERE GUEST serie reer ests 0 HOG Wah Gi M dr ato dun anotar 31 DMI AA e VL0 3 PARC e el AR 0 Part MMEOCSOMICDS sods noone onda od 0 Aud ee Pr S 0 ER QWOTI E DIES M NU NM E 0 MES ONSE ECCE ks ca 0 PRI OMS ososdoedoscosodt os 0 sio tunes ooo A 0 Co a neers 128 lhiembREerequ ao 0 SUDNEEIIMCOUES 0 09 5 0 od E CREE 18 BespTume Vaio tr Ier eter neta t A 16 WOO QUI SEINEN atada 8 a RAT M RT e 8 Maxon e e E M 0 oooO daba 0 2 Query SwitchInfo by GUID gt smpquery G switchinfo 0x000b8cffff004016 Switch info Lid 3 A O O B ease C DR cat OROCERO DEO 49152 Randombidh 298 os co oo eto cog oo 0 Meds tr etes atus e da ECTS 1024 Mame a NIIT DESC RS M 8 DERE OME AG Bros e Uds 0 DAMA oon 5 soe oes oo on 0 DEMOS NOTE 5 T 0 InfiniBand Fabric Diagnostic Utilities 212 Mellanox Technologies m 2 1 1 0 0 EMO eene EE 18 Sate Changes ente d ies ss cveue eve 0 A tutte eI E E 0 Pant Ent once Cape s onoonconandanoano 32 TDC o poo ooo eee dla ca i Oxoicloxxmaeliersue CINE 8 5o 5o ne copseanosos 1 milite rRaw inbound IB 5 5 aoo demo oe 1 PerPo LH c cec ere eroe oo meses 1 EnnancedPo nt Or rr IM 0 3 Query Nodelnfo by direct route smpquery D nodeinfo 0 Node info DR path slid 65535 dlid 65535 0 B
189. ies 227 Rev 2 1 1 0 0 A 7 2 Starting Boot Boot the client machine and enter BIOS setup to configure MLNX FlexBoot to be the first on the boot device priority list see Section A 6 On dual port network adapters the client first attempts to boot from Port 1 If this fails it switches to boot from Port 2 Note also that the driver waits up to 90 seconds for ad each port to come up If MLNX FlexBoot iPXE was selected through BIOS setup the client will boot from FlexBoot The client will display FlexBoot attributes sense the port protocol Ethernet or InfiniBand In case of an InfiniBand port the client will also wait for port configuration by the Subnet Manager In case sensing the port protocol fails the port will be configured as an InfiniBand y port For ConnectX Mellanox ConnectX FlexBoot v3 3 400 iPXE 1 0 0 Open Source Network Boot Firmware netO 00 02 c9 03 00 0c 78 11 on PCIOZ 00 0 open Link down TX 0 TXE O RX 0 RXE 0 Link status The socket is not connected Waiting for link up on net0 ok After configuring the IB ETH port the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel OS to boot from For ConnectX InfiniBand Mellanox ConnectX FlexBoot v3 3 400 iPXE 1 0 0 Open Source Network Boot Firmware netO 00 02 c9 03 00 0c 78 11 on PCIO2 00 0 open Link down TX O TXE O RX 0 RXE 0 Link status The socke
190. ieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64. Below is an example of SL2VL and VL Arbitration configuration on subnet:

  qos_ca_max_vls 15
  qos_ca_high_limit 6
  qos_ca_vlarb_high 0:4
  qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
  qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
  qos_swe_max_vls 15
  qos_swe_high_limit 6
  qos_swe_vlarb_high 0:4
  qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
  qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.

8.6.8 Deployment Example

Figure 5 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

Figure 5: Example QoS Deployment on InfiniBand Subnet
  Traffic class: SDP, Service level: 2, Policy: min 20% BW
  Traffic class: Partition A, Service level: 0, Policy: min 40%
  Traffic class: SRP, Service Level: 1, Policy: min 30% BW
  Tra
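The arithmetic behind the example can be checked with a short sketch. It assumes the low-priority table is written as comma-separated VL:weight pairs with each VL listed once, and it treats the weights as simple proportions, which is only an approximation of how weighted arbitration behaves on real hardware. The helper names parse_vlarb and bandwidth_shares are hypothetical.

```python
def parse_vlarb(spec):
    """Parse a 'VL:weight,VL:weight,...' string into {vl: weight}."""
    table = {}
    for entry in spec.split(","):
        vl, weight = entry.split(":")
        table[int(vl)] = int(weight)
    return table


def bandwidth_shares(table):
    """Approximate share of low-priority bandwidth per VL with weight > 0."""
    total = sum(table.values())
    return {vl: weight / total for vl, weight in table.items() if weight}


low = parse_vlarb("0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64")
shares = bandwidth_shares(low)

# A high limit of 6 with 4KB units gives the 24KB burst quoted in the text.
high_limit_bytes = 6 * 4096

for vl in sorted(shares):
    print(f"VL{vl}: {shares[vl]:.1%}")
print("high-priority burst limit:", high_limit_bytes, "bytes")
```

VL4 does not appear in the result because its weight is 0, which corresponds to the statement that VL4 is effectively turned off.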
191. igh limit the limit of High Priority component of VL Arbitration table IBA 7 6 9 VLArb low table Low priority VL Arbitration table IBA 7 6 9 template e VLArb high table High priority VL Arbitration table IBA 7 6 9 template SL2VL SL2VL Mapping table IBA 7 6 6 template It is a list of VLs corresponding to SLs 0 15 Note that VL15 used here means drop this SL There are separate QoS configuration parameters sets for various target types CAs routers switch external ports and switch s enhanced port 0 The names of such parameters are prefixed by qos type string Here is a full list of the currently supported sets qos ca QoS configuration parameters set for CAs qos rtr parameters set for routers qos sw parameters set for switches port 0 qos swe parameters set for switches external ports Here s the example of typical default values for CAs and switches external ports hard coded in OpenSM initialization qos ca max vls 15 qos ca high limit 0 GOS Cel vilei miea 084 80 230 330 480 3810 98 0 780 180 80 1080 1uLg0 1280 1580 1480 gos ca viley low 030 134 234 95d AS a 14 Sa ona 04 MA Re 34 d pe Ca SLA Wily 2 3 49 07 19879 MO LL 12 Su Y qos swe max vls 15 qos swe high limit 0 qos swe vlarb high 0 4 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0 11 0 12 0 13 0 14 0 gos swe vilarby low 06071342547 dried 4 Osa 14 824 954 NOTA IA TA SSA aA 176 Mellanox Technologies m 2 1 1 0 0
192. iguration for all ConnectX devices. Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart". Possible port types are:
  eth - Ethernet
  ib - InfiniBand
  auto - Link sensing mode. Detects the port type based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.

The port link type can be configured for each device in the system at run time using the "/sbin/connectx_port_config" script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device. This utility also has a non-interactive mode:
  /sbin/connectx_port_config -d|--device <PCI device ID> -c|--conf <port1,port2>

6.2 Auto Sensing

Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet). For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically configure the first port as InfiniBand and the second as Ethernet.
193. ill in progress.

4.13.7.2.3 Partitioning IPoIB Communication using PKeys

PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a virtual PKey index other than zero. The following describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2. In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2 will not be able to communicate with each other, nor with Dom0.

This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm1s will have the same PKey, both vm2s will have the same PKey (different from the vm1s'), and the Dom0s will have the default PKey (different from the VMs' PKeys at index 0). OpenSM must be used to configure the physical PKey tables on both hosts.

The physical PKey table on both hosts (Dom0) will be configured by OpenSM to be:
  index 0 = 0xffff
  index 1 = 0xb000
  index 2 = 0xb030
The vm1s' virtual-to-physical PKey mapping will be:
  pkey_idx 0 = 1
  pkey_idx 1 = 0
The vm2s' virtual-to-physical PKey mapping will be:
  pkey_idx 0 = 2
  pkey_idx 1 = 0
so that the default PKey will reside on the VMs at index 1 instead of at index 0. The IPoIB QPs are created to use the PKey at index 0. As a resul
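The effect of these mappings can be traced with a small model. This is a sketch under stated assumptions: Dom0 is modeled with an identity mapping (it uses the physical table directly), two endpoints are considered able to communicate when the PKey at their virtual index 0 is identical, and membership-bit rules are ignored. The names PHYSICAL_PKEYS, VIRT_TO_PHYS, ipoib_pkey and can_communicate are hypothetical.

```python
# Physical PKey table configured by OpenSM on both hosts.
PHYSICAL_PKEYS = [0xFFFF, 0xB000, 0xB030]

# Virtual PKey index -> physical PKey index, per guest type.
VIRT_TO_PHYS = {
    "dom0": {0: 0, 1: 1, 2: 2},  # assumption: Dom0 sees the physical table
    "vm1": {0: 1, 1: 0},
    "vm2": {0: 2, 1: 0},
}


def ipoib_pkey(guest):
    """PKey used by IPoIB QPs, which are created on virtual index 0."""
    return PHYSICAL_PKEYS[VIRT_TO_PHYS[guest][0]]


def can_communicate(guest_a, guest_b):
    """Simplified check: both ends must use the same PKey for IPoIB."""
    return ipoib_pkey(guest_a) == ipoib_pkey(guest_b)


for guest in ("dom0", "vm1", "vm2"):
    print(guest, hex(ipoib_pkey(guest)))

print(can_communicate("vm1", "vm1"))   # Host1/vm1 with Host2/vm1
print(can_communicate("vm1", "vm2"))   # vm1 with vm2
print(can_communicate("vm1", "dom0"))  # vm1 with Dom0
```

Because both hosts carry the same physical table and the same mappings, comparing guest types is enough to reproduce the isolation described in the text.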
194. ime Parameters 0 0 0 I teen eens 124 Table 20 Recommended PCIe Configuration 00 cc cece ere rr ere rena 128 Table 21 Recommended BIOS Settings for Intel Sandy Bridge Processors 129 Table 22 Recommended BIOS Settings for Intel Nehalem Westmere Processors 130 Table 23 Recommended BIOS Settings for AMD Processors 0000 cece eee eens 130 Table 24 Adaptive Routing Manager Options File 2 0 c cece eee eee 184 Table 25 Adaptive Routing Manager Pre Switch Options File esses 185 Table 26 Congestion Control Manager General Options File o oooooooooo o 188 Table 27 Congestion Control Manager Switch Options File o o ooooooomomom 188 Table 28 Congestion Control Manager CA Options File o o oo ooooooomomomom 188 Table 29 Congestion Control Manager CC MGR Options File oooooooooo o 189 Table 30 ibdiagnet of ibutils2 Output Files 2 0 0 0 cece eee 193 Table 31 ibdiagnet of ibutils Output Files 2 0 ec II 195 Table 32 ibdiagpath Output Files 20 0 0 cette nen en enna 198 Table 33 ibv_devinfo Flags and Options 0 0 cece ccc eee eh 199 Table 34 ibstatus Flags and Options 0 cc cece eee rer nen tree eas 201 Table 35 ibportstate Flags and Options 00 ce cece teen en tree eas 203 10 Mellanox Technologies m 2 1 1 0 0 Table 36 ibportstate Fl
195. in a round robin fashion across such links and so changing the order that CA ports are visited changes the distribution of routes across such links This may be advantageous for some specific traffic patterns The default is to visit CA ports in increasing port order on destination switches Duplicate values in the list will be ignored EXAMPLE Look for a 2D since x radix is one 4x5 torus torus 14 5 y is radix 4 torus dimension need both ym link and yp link configuration yp link 0x200000 0x200005 sw y 0 z 0 gt sw 8 y 1 z 0 ym link 0x200000 0x20000f sw y 0 z 0 gt sw 8 y 3 2 0 z is not radix 4 torus dimension only need one of zm link or zp link configuration zp link 0x200000 0x200001 sw 8 y 0 z 0 gt sw y 0 z 1 next seed yp link 0x20000b 0x200010 sw y 2 z 1 gt sw 8 y 3 z 1 ym link 0x20000b 0x200006 sw y 2 z 1 gt sw 8 y 1 z 1 zp link 0x20000b 0x20000c sw 8 y 2 z 1 gt sw y 2 2 2 y dateline 2 Move the dateline for this seed z dateline 1 back to its original position If OpenSM failover is configured for maximum resiliency one instance should run on a host attached to a switch from the first seed and another instance should run on a host attached to a switch from the second seed Both instances should use this torus 2005 conf to ensure path SL values do not change in the event of SM failover port order defines the order on which the ports would be chosen for routing pex gue 7 10 1
196. in options: armgr --conf_file <ar-mgr-options-file-name>
2. Run Subnet Manager with the new options file: opensm -F <options-file-name>

The AR Manager options file contains two types of parameters:
1. General options: options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric.
2. Per-switch options: options which describe specific switch behavior.

Note the following:
- The Adaptive Routing configuration file is case sensitive.
- You can specify options for a nonexistent switch GUID. These options will be ignored until a switch with a matching GUID is added to the fabric.
- The Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is executed at every heavy sweep of the Subnet Manager.
- If the AR Manager fails to parse the options file, default settings for all the options will be used.

8.8.5.1 General AR Manager Options

Table 20: Adaptive Routing Manager Options File
Option: ENABLE: <true|false>
Values: Default: true
Description: Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as a device that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was updated to support AR, the AR Manager will need to be restarted by restarting Subnet M
197. ing, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).

8.5.2 Min Hop Algorithm
The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'. The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
-i <equalize-ignore-guids-file>
--ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by GUIDs) that will be ignored by the link load equalization algorithm. LMC awareness routes based on a (remote) system or switch basis.

8.5.3 UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree, and one
198. ing, where <i> is in the range 0 to 7

Counter: Description
rx_prio_<i>_packets: Total packets successfully received with priority i
rx_prio_<i>_bytes: Total bytes in successfully received packets with priority i
rx_novlan_packets: Total packets successfully received with no VLAN priority
rx_novlan_bytes: Total bytes in successfully received packets with no VLAN priority
tx_prio_<i>_packets: Total packets successfully transmitted with priority i
tx_prio_<i>_bytes: Total bytes in successfully transmitted packets with priority i
tx_novlan_packets: Total packets successfully transmitted with no VLAN priority
tx_novlan_bytes: Total bytes in successfully transmitted packets with no VLAN priority

Table 10: Port Pause, where <i> is in the range 0 to 7

Counter: Description
rx_pause_prio_<i>: The total number of PAUSE frames received from the far-end port
rx_pause_duration_prio_<i>: The total time in microseconds that the far-end port was requested to pause transmission of packets
rx_pause_transition_prio_<i>: The number of receiver transitions from XON state (paused) to XOFF state (non-paused)
tx_pause_prio_<i>: The total number of PAUSE frames sent to the far-end port
tx_pause_duration_prio_<i>: The total time in microseconds that transmission of packets has been paused
199. ing the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).

Switch (affected/relevant commands): Description
-nofs (burn): Burn image in a non-failsafe manner.
-skip_is (burn): Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant sector difference is detected.
-byte_mode (burn, write): Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
-s[ilent] (burn): Do not print burn progress messages.
-y[es] (all): Non-interactive mode. Assume the answer is "yes" to all questions.
-no (all): Non-interactive mode. Assume the answer is "no" to all questions.

Table 36: mstflint Switches (Sheet 3 of 3)

Switch (affected/relevant commands): Description
-vsd <string> (burn): Write this string, of up to 208 characters, to VSD upon a burn command.
-use_image_ps (burn): Burn vsd as it appears in the given image; do not keep existing VSD on Flash.
-dual_image (burn): Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
-v: Print version info.

Table 37: mstflint Commands

Command: Description
b[urn]: Burn Flash
q[uery]: Query miscellaneous Flash/firmware characteristics
v[erify]: Verify the entire Flash
bb: Burn Block. Burn t
200. (int)
- log number of RDMARC buffers per QP (default: 4) (int)
- log maximum number of CQs per HCA (default: 16) (int)
- log maximum number of multicast groups per HCA (default: 13) (int)
- log maximum number of memory protection table entries per HCA (default: 19) (int)
- log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs for register all of the host memory limited to 30)) (int)
- Enable Quality of Service support in the HCA (default: off) (bool)
- Reset device on internal errors if non-zero (default: 1, in SRIOV mode default is 0) (int)
- Threshold for using inline data (int). Default and max value is 104 bytes. Saves a PCI read operation transaction: packets smaller than the threshold size will be copied to the hw buffer directly.
- Enable RSS for incoming UDP traffic (uint). On by default. Once disabled, no RSS for incoming UDP traffic will be done.
- Priority-based Flow Control policy on TX[7:0]. Per-priority bit mask (uint)
- Priority-based Flow Control policy on RX[7:0]. Per-priority bit mask (uint)

Appendix D: mlx5 Module Parameters

The mlx5_ib module supports a single parameter used to select the profile which defines the number of resources supported. The parameter name for selecting the profile is prof_sel. The supported values for profiles are:
- 0 for medium resources, medium performance
- 1 for low resources
- 2 for high performance (int) (default)
201. is partition.
defmember=full|limited: specifies default membership for the port GUID list. Default is limited.

Currently recognized flags are:
ipoib: indicates that this partition may be used for IPoIB; as a result, an IPoIB-capable MC group will be created.
rate=<val>: specifies rate for this IPoIB MC group (default is 3 (10GBps))
mtu=<val>: specifies MTU for this IPoIB MC group (default is 4 (2048))
sl=<val>: specifies SL for this IPoIB MC group (default is 0)
scope=<val>: specifies scope for this IPoIB MC group (default is 2 (link local))

Note that values for rate, MTU, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048). To use 4K MTU, edit that entry to "mtu=5" (5 indicates 4K MTU to that specific partition).

PortGUIDs list:
PortGUID: GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.

There are two useful keywords for PortGUID definition:
- 'ALL' means all end ports in this subnet
- 'SELF' means subnet manager's port
An empty list means that there are no ports in this partition.

Notes:
- White space is permitted between delimiters.
- The line can be wrapped after ':' after a Partition Definition and between
- A PartitionName does not need to be unique, but PKey doe
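The mtu flag takes an IBTA-encoded value rather than a byte count (4 for 2048 and 5 for 4K, as noted above). A minimal sketch of that encoding follows; the helper name ibta_mtu_bytes is hypothetical, and the range of codes 1 to 5 is the standard IBTA MTU enumeration rather than something stated in this section.

```python
def ibta_mtu_bytes(code):
    """Translate an IBTA MTU code (1..5) into the MTU size in bytes.

    The codes map to 256, 512, 1024, 2048 and 4096 bytes respectively,
    so each step doubles the size.
    """
    if code not in (1, 2, 3, 4, 5):
        raise ValueError("IBTA MTU code must be in the range 1..5")
    return 128 << code


for code in range(1, 6):
    print("mtu=%d -> %d bytes" % (code, ibta_mtu_bytes(code)))
```

Used this way, a partition entry with mtu=5 can be sanity-checked against the MTU that the IPoIB interfaces are expected to report.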
202. istration
- One-sided communication semantics
- Connection management
- Receive side tag matching
- Intra-node shared memory communication

These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.

5.1.4 Running SHMEM with Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

To activate MLNX_OFED 2.0 and the contiguous pages allocator with SHMEM:
Run the following argument to enable compound pages with SHMEM:
/opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5
If using compound pages is not possible, then the user will fall back to the regular hugepages mechanism.

To force use of the compound pages allocator:
Run the following command:
/opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5 -x MR_FORCE_CONTIG_PAGES=1

For further information on Contiguous Pages, please refer to Section 4.9, "Contiguous Pages", on page 86.

5.1.5 Running ScalableSHMEM Application
The ScalableSHMEM framework contains the shmemrun utility, which launches the executable from a service node to compute nodes. This utility accepts the same command line parameters as
203. it loop that leads to deadlock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.

Extending this argument shows that in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:

[Figure: 6x6 2D torus switch grid, x = 0 to 5 and y = 0 to 5]

Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.

As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent in a dimensi
204. It is also possible that the SRP LUNs will not appear under /dev/mapper/. In that case, check the blacklist section in /etc/multipath.conf and make sure the SRP LUNs are not blacklisted.

Automatic Activation of High Availability
- Set the value of SRP_DAEMON_ENABLE in /etc/infiniband/openib.conf to "yes". For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart
- Start the srpd service: service srpd start
- From the next loading of the driver it will be possible to access the SRP LUNs on /dev/mapper/. It is possible that regular (not SRP) LUNs may also be present; the SRP LUNs may be identified by their name.
- It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log

4.1.2.7 Shutting Down SRP
SRP can be shut down by using "rmmod ib_srp", by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases:

1. Without High Availability
When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.

2. After Manual Activation of High Availability
If you manually activated SRP High Availability, perform the following steps:
a. Unmount all S
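Before restarting openibd it can be handy to confirm that the setting described under "Automatic Activation of High Availability" is really present. The sketch below assumes openib.conf uses shell-style KEY=value lines with '#' comments; the function name srp_daemon_enabled is hypothetical and not part of MLNX_OFED.

```python
def srp_daemon_enabled(conf_text):
    """Return True if the last SRP_DAEMON_ENABLE assignment in the given
    openib.conf text is set to yes (assumes KEY=value lines)."""
    enabled = False
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "SRP_DAEMON_ENABLE":
            enabled = value.strip().strip("\"'").lower() == "yes"
    return enabled


sample = "# Load SRP daemon\nSRP_DAEMON_ENABLE=yes\n"
print(srp_daemon_enabled(sample))
```

In practice the text would be read from /etc/infiniband/openib.conf; the function takes a string so that it can be tried without touching the system.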
205. ko lro=0

Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example:
/sbin/dhclient -cf /sbin/dhclient.conf ib1

Save the init file.

Close initrd:
host1$ cd /tmp/initrd_ib
host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
host1$ gzip /tmp/new_init_ib.img

At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_init_ib.img.gz. Copy it to the original initrd location and rename it properly.

A.8.2 Case II: Ethernet Ports
The Ethernet driver requires loading the following modules in the specified order (see the example below):
- mlx4_core.ko
- mlx4_en.ko

A.8.2.1 Example: Adding an Ethernet Driver to initrd (Linux)
Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1 on page 58, and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.

Adding the Ethernet D
206. ks and without changing path SL values granted before the failure.

OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches are going down, or one or more of these nodes are coming back after being down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.

OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below).

The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that will perform up after a down step was used.
2. Once MinHop matrices exist, each switch is visited and for each target LID a decision is
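The first stage, counting hops from every switch to a target, can be illustrated with an ordinary breadth-first search. The sketch below is a simplified illustration only: it uses a hypothetical adjacency dictionary, ignores ports and LIDs, and does not implement the up/down direction tracking that the Up/Down variant adds. It is not OpenSM code.

```python
from collections import deque


def min_hops_to_target(adjacency, target):
    """Breadth-first search from `target`; returns {node: hop count}.

    `adjacency` maps each node name to an iterable of neighbour names.
    Nodes that cannot reach the target are absent from the result.
    """
    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical four-switch fabric: sw1 - sw2 - sw3, plus sw1 - sw4 - sw3.
fabric = {
    "sw1": ["sw2", "sw4"],
    "sw2": ["sw1", "sw3"],
    "sw3": ["sw2", "sw4"],
    "sw4": ["sw1", "sw3"],
}
print(min_hops_to_target(fabric, "sw3"))
```

Running the search once per target yields the full min-hop matrix that the second stage consumes.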
207. l -K eth0 ntuple on

RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steering support.
RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.

- All of the rest
The lowest priority domain serves the following users:
- The mlx4 Ethernet driver, which attaches its unicast and multicast MAC addresses to its QP using L2 flow specifications.
- The mlx4 ipoib driver, when it attaches its QP to its configured GIDs.

Fragmented UDP traffic cannot be steered. It is treated as 'other' protocol by hardware (from the first packet) and not considered as UDP traffic.

We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes.

4.13 Single Root IO Virtualization (SR-IOV)
Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. Mellanox ConnectX-3 adapter cards are capable of exposing 63 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares th
208. l, a path(s) search is performed with the given restrictions imposed by that level.

4.5 Quality of Service Ethernet

4.5.1 Quality of Service Overview
Quality of Service (QoS) is a mechanism of assigning a priority to a network flow (socket, rdma_cm connection) and managing its guarantees, limitations, and its priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a two- or three-stage process. The TC is assigned the QoS attributes, and the different flows behave accordingly.

4.5.2 Mapping Traffic to Traffic Classes
Mapping traffic to TCs consists of several actions which are user controllable, some controlled by the application itself and others by the system/network administrators. The following is the general mapping traffic to Traffic Classes flow:
1. The application sets the required Type of Service (ToS).
2. The ToS is translated into a Socket Priority (sk_prio).
3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).
4. The UP is mapped to TC by the network/system administrator.
5. TCs hold the actual QoS parameters.

QoS can be applied on the following types of traffic. However, the general QoS flow may vary among them:
- Plain Ethernet: applications use regular inet sockets and the traffic passes via the kernel Ethernet driver.
- RoCE: applications use the RDMA API to transmit using QPs.
- Raw Ethernet QP
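The general flow above is a chain of table lookups, which can be sketched as follows. This is an illustration only: the ToS to sk_prio table uses the fixed translation listed for RoCE in this manual (ToS 0, 8, 24, 16 to sk_prio 0, 2, 4, 6), while the sk_prio to UP and UP to TC tables are made-up examples standing in for whatever the administrator configures.

```python
# Fixed ToS -> sk_prio translation as listed for RoCE in this manual.
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def tos_to_traffic_class(tos, sk_prio_to_up, up_to_tc):
    """Follow ToS -> sk_prio -> UP -> TC and return the traffic class."""
    sk_prio = TOS_TO_SK_PRIO[tos]   # stage 1: fixed translation
    up = sk_prio_to_up[sk_prio]     # stage 2: administrator-defined map
    return up_to_tc[up]             # stage 3: administrator-defined map


# Hypothetical administrator configuration (assumptions for this sketch):
example_sk_prio_to_up = {prio: prio % 8 for prio in range(16)}
example_up_to_tc = {up: up // 2 for up in range(8)}

print(tos_to_traffic_class(16, example_sk_prio_to_up, example_up_to_tc))
```

Keeping the three stages as separate tables mirrors the division of responsibility in the list above: the application picks the ToS, and the administrator owns the two later maps.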
209. l create MLNX_OFED_LINUX TGZ for rhel6.2 under /tmp directory.
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
See log file /tmp/mlnx_ofed_iso.21642.log
Building OFED RPMs. Please wait...
Removing OFED RPMs...
Created /tmp/MLNX_OFED_LINUX-2.1-1.0.0-rhel6.2-x86_64.tgz

The mlnx_add_kernel_support.sh script can be executed directly from the mlnxofedinstall script. For further information, please see the '--add-kernel-support' option below.

2.3.2 Installation Script
Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure", on page 32.

Usage:
/mnt/mlnxofedinstall [OPTIONS]

Options:
-c|--config <packages config_file>: Example of the configuration file can be found under docs
-n|--net <network config_file>: Example of the network configuration file can be found under docs
-k|--kernel-version <kernel version>: Use provided kernel version instead of 'uname -r'
-p|--print-available: Print available packages for current platform and create corresponding ofed.conf file
--without-32bit: Skip 32-bit libraries installation
--without-depcheck: Skip Distro's libraries check
--without-fw-update: Skip firmware update
--fw-update-only: Update firmware. Skip driver installation
210. l the cluster nodes, use cluster-aware tools (such as pdsh).

If your kernel version does not match with any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory.

On Redhat and SLES distributions with an errata kernel installed, there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.

Usage:
mlnx_add_kernel_support.sh -m|--mlnx_ofed <path to MLNX_OFED directory> [--make-iso|--make-tgz]
[--make-iso]: Create MLNX_OFED ISO image.
[--make-tgz]: Create MLNX_OFED tarball. (Default)
[-t|--tmpdir <local work dir>]
[--kmp]: Enable KMP format if supported.
[-k|--kernel <kernel version>]: Kernel version to use.
[-s|--kernel-sources <path to the kernel sources>]: Path to kernel headers.
[-v|--verbose]
[-n|--name]: Name of the package to be created.
[-y|--yes]: Answer "yes" to all questions

1. The firmware will not be updated if you run the install script with the '--without-fw-update' option.

Example
The following command will create a MLNX_OFED_LINUX tarball for RedHat 6.3 under the /tmp directory:
MLNX_OFED_LINUX-2.1-1.0.0-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-2.1-1.0.0-rhel6.3-x86_64/ --make-tgz
Note: This program wil
211. level
echo "add mgmt_dbg" > /proc/scsi_tgt/trace_level
echo "add out_of_mem" > /proc/scsi_tgt/trace_level
# End srpt.sh

B.3 How to Unload/Shutdown
1. Unload ib_srpt:
modprobe -r ib_srpt
2. Unload scst and its dev_handlers first:
modprobe -r scst_vdisk scst
3. Unload ofed:
/etc/rc.d/openibd stop

Appendix C: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
options mlx4_core parameter=<value>
and/or
options mlx4_ib parameter=<value>
and/or
options mlx4_en parameter=<value>

The following sections list the available mlx4 parameters.

C.1 mlx4_ib Parameters
sm_guid_assign: Enable SM alias_GUID assignment if sm_guid_assign > 0 (Default: 1) (int)
dev_assign_str (1): Map device function numbers to IB device numbers (e.g. '0000:04:00.0-0,002b:1c:0b.a-1'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for IB device numbers (e.g. 1). Max supported devices: 32 (string)

1. In the current version, this parameter uses a decimal number to describe the InfiniBand device, and not a hexadecimal number as in previous versions, in order to unify the mapping of device function numbers to InfiniBand device numbers as defined for other mo
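The device-assignment string pairs a PCI device function with a decimal IB device number. A minimal parsing sketch follows, assuming entries are comma-separated and each entry has the form <device function>-<IB device number>; that layout, the sample value, and the helper name parse_dev_assign_str are assumptions made for illustration, not the driver's own parser.

```python
def parse_dev_assign_str(value, max_devices=32):
    """Parse 'bdf-num,bdf-num,...' into {device_function: ib_device_number}.

    Assumed format: comma-separated entries, each ending in '-<decimal>'.
    """
    mapping = {}
    for entry in value.split(","):
        device_function, _, ib_number = entry.strip().rpartition("-")
        if not device_function or not ib_number.isdigit():
            raise ValueError("malformed entry: %r" % entry)
        mapping[device_function] = int(ib_number)
    if len(mapping) > max_devices:
        raise ValueError("at most %d devices are supported" % max_devices)
    return mapping


# Illustrative value only.
print(parse_dev_assign_str("0000:04:00.0-0,002b:1c:0b.a-1"))
```

Splitting on the last '-' keeps the device function intact even though it contains ':' and '.' separators of its own.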
212. ll be removed.

Number of seconds between the observation of a transport layer error and failing all I/O. Increasing this timeout allows more tolerance to transport errors; however, doing so increases the total failover time in case of serious transport failure. Note: the fast_io_fail_tmo value must be smaller than the value of reconnect_delay.

Maximum number of seconds that the SRP transport should insulate transport layer errors. After this time has been exceeded, the SCSI target is removed. Normally it is advised to set this to -1 (disabled), which will never remove the scsi_host. In deployments where different SRP targets are connected and disconnected frequently, it may be required to enable this timeout in order to clean old scsi_hosts representing targets that no longer exist.

Constraints between parameters:
- dev_loss_tmo, fast_io_fail_tmo, and reconnect_delay cannot all be disabled or negative values.
- reconnect_delay must be a positive number.
- fast_io_fail_tmo must be smaller than the SCSI block device timeout.
- fast_io_fail_tmo must be smaller than dev_loss_tmo.

4.1.2.1.2 SRP Remote Ports Parameters
Several SRP remote ports parameters are modifiable online on an existing connection.

To modify dev_loss_tmo to 600 seconds:
echo 600 > /sys/class/srp_remote_ports/port-xxx/dev_loss_tmo

reconnect_delay can be modified in the same way, for example to 10 seconds.

To modify fast_io_fail_tmo to 15 seconds:
echo 15 > /sys/class/srp_remote_ports/port-xxx
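The constraints listed above can be checked before writing values into sysfs. The sketch below encodes them literally; it treats negative values as "disabled", compares fast_io_fail_tmo with the other timeouts only when the values involved are enabled, and takes the SCSI block device timeout as an argument because its value is not given here. Those choices, and the function name check_srp_timeouts, are assumptions for illustration.

```python
def check_srp_timeouts(dev_loss_tmo, fast_io_fail_tmo, reconnect_delay,
                       scsi_block_timeout):
    """Return a list of violated constraints (empty list means OK)."""
    problems = []
    if dev_loss_tmo < 0 and fast_io_fail_tmo < 0 and reconnect_delay < 0:
        problems.append("all three timeouts are disabled or negative")
    if reconnect_delay <= 0:
        problems.append("reconnect_delay must be a positive number")
    if fast_io_fail_tmo >= 0:
        if fast_io_fail_tmo >= scsi_block_timeout:
            problems.append("fast_io_fail_tmo must be smaller than the "
                            "SCSI block device timeout")
        if dev_loss_tmo >= 0 and fast_io_fail_tmo >= dev_loss_tmo:
            problems.append("fast_io_fail_tmo must be smaller than "
                            "dev_loss_tmo")
    return problems


# Values from the examples in this section; the block timeout is made up.
print(check_srp_timeouts(dev_loss_tmo=600, fast_io_fail_tmo=15,
                         reconnect_delay=10, scsi_block_timeout=60))
```

Returning a list of messages rather than a boolean makes it easy to report every violated constraint at once.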
213. ll the default values in "Example of Adaptive Routing Manager Options File" on page 185.

Disabling Adaptive Routing
There are two ways to disable Adaptive Routing Manager:
1. By disabling it explicitly in the Adaptive Routing configuration file.
2. By removing the 'armgr' option from the Subnet Manager options file.

The Adaptive Routing mechanism is automatically disabled once the switch receives a setting of the usual linear routing table (LFT). Therefore, no action is required to clear the Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.

8.8.4 Querying Adaptive Routing Tables
When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid; thus the standard tools that query LFT (e.g. smpquery, dump_lfts.sh, and others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the 'smparquery' tool that is installed as a part of the Adaptive Routing Manager package. To see its usage details, run 'smparquery -h'.

8.8.5 Adaptive Routing Manager Options File
The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following:
1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the event_plugin_options option in the file:
# Options string that would be passed to the plugin(s)
event_plug
214. loaded.
The order of the following commands (for loading modules) is critical.
echo "loading Mellanox ConnectX EN driver"
/sbin/insmod /lib/modules/mlnx_en/mlx4_core.ko
/sbin/insmod /lib/modules/mlnx_en/mlx4_en.ko

Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.
Step 9. Save the init file.
Step 10. Close initrd:
host1$ cd /tmp/initrd_en
host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
host1$ gzip /tmp/new_init_en.img

At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_init_en.img.gz. Copy it to the original initrd location and rename it properly.

A.9 iSCSI Boot
Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd (Linux). There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.

If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver as described in Section A.8.

A.9.1 Configuring an iSCSI Target in Linux Environment
Prerequisites
Step 1. Make sure that an iSCSI Target is installed on your s
215. looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of the Mellanox OFED for Linux installation.

The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:
dhcpd <IB network interface name> -d
Example:
host1# dhcpd ib0 -d

4.3.3.1.2 DHCP Client (Optional)
A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux)".

In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:
dhclient -cf <client conf file> <IB network interface name>

Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier <hexadecimal client identifier>;
}

Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier
216. made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned LIDs is selected.

If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop.
b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC group.
c. If none, prefer those which go through another NodeGuid.
d. Fall back to the number of paths method (if all go to the same node).

8.5.1 Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in the fabric switches, unless the -r (--reassign_lids) option is specified.

-r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.

If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.

In the case of using the file-based rout
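The port decision described at the start of this passage (among ports with the same MinHop, pick the one with the smallest counter) can be sketched in a few lines. This is an illustration under simplifying assumptions: LMC handling is left out, ties are broken by list order, and the names used are hypothetical rather than taken from OpenSM.

```python
def assign_port(candidates, port_load):
    """Pick an output port for one target LID and update the counters.

    `candidates` is a list of (port, min_hop) pairs for this target;
    `port_load` maps port -> number of targets already routed through it.
    """
    best_hop = min(hop for _, hop in candidates)
    eligible = [port for port, hop in candidates if hop == best_hop]
    chosen = min(eligible, key=lambda port: port_load.get(port, 0))
    port_load[chosen] = port_load.get(chosen, 0) + 1
    return chosen


# Ports 1 and 2 are equally short; port 3 is one hop longer.
load = {}
ports = [(1, 2), (2, 2), (3, 3)]
print([assign_port(ports, load) for _ in range(4)])
```

Because the counter is updated on every assignment, repeated targets alternate between the equally short ports and never use the longer one.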
217. mpirun from the OpenMPI package. For further information, please refer to the OpenMPI MCA parameters documentation at: http://www.open-mpi.org/faq/?category=running

Run "shmemrun --help" to obtain ScalableSHMEM job launcher runtime parameters.

ScalableSHMEM contains support for the environment module system (http://modules.sf.net/). The modules configuration file can be found at: /opt/mellanox/openshmem/2.2/etc/shmem_modulefile

5.2 Message Passing Interface

5.2.1 Overview
Mellanox OFED for Linux includes the following Message Passing Interface (MPI) implementations over InfiniBand:
- Open MPI 1.4.6 & 1.6.1: an open source MPI-2 implementation by the Open MPI Project
- OSU MVAPICH2 1.7: an MPI-1 implementation by Ohio State University

These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 14 lists some useful MPI links.

Table 14: Useful MPI Links
MPI Standard: http://www-unix.mcs.anl.gov/mpi
Open MPI: http://www.open-mpi.org
MVAPICH 2 MPI: http://mvapich.cse.ohio-state.edu/
MPI Forum: http://www.mpi-forum.org

This chapter includes the following sections:
- Section 5.2.2, "Prerequisites for Running MPI", on page 118
- Section 5.2.3, "MPI Selector - Which MPI Runs
218. n (for example) reduce to Dimension Order Routing in certain topologies, it is topology agnostic and fares well in the face of faults.

It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path.

The algorithm was developed by Simula Research Laboratory.

Use the '-R lash -Q' option to activate the LASH algorithm.

QoS support has to be turned on in order that SL/VL mappings are used.

LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.

For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order before the rest of the LASH algorithm runs.
219. n BW 50%
- Bottleneck at storage nodes

Administration
- OpenSM QoS policy file
In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.

qos-ulps
    default :0
    ipoib :1
    sdp :1
    srp, target-port-guid SRPT1,SRPT2,SRPT3 :2
end-qos-ulps

- OpenSM options file

qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:32
qos_vlarb_low 0:1
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

8.7.3 EDC (3-tier): IPoIB, RDS, SRP
The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.

QoS Levels
- Management traffic (ssh)
  - IPoIB management VLAN (partition A)
  - Min BW 10%
- Application traffic
  - IPoIB application VLAN (partition B)
  - Isolated from storage and database
  - Min BW of 30%
- Database Cluster traffic
  - RDS
  - Min BW of 30%
- SRP
  - Min BW 30%
  - Bottleneck at storage nodes

Administration
- OpenSM QoS policy file
In the following policy file example, replace SRPT* with the real SRP Initiator port GUIDs.

qos-ulps
    default :0
    ipoib, pkey 0x8001 :1
    ipoib, pkey 0x8002 :2
    rds :3
    srp, target-port-guid SRPT1,SRPT2,SRPT3 :4
end-qos-ulps

- OpenSM options file

qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:96
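The weights in a VL arbitration entry are easiest to read as relative shares. The sketch below converts a list of VL:weight pairs into fractions of the total weight; it assumes the comma-separated VL:weight form of the OpenSM options file, and it deliberately ignores the interaction between the high and low tables and the high limit, so the result is only a rough guide to how bandwidth is split within one table. The function name vlarb_shares is hypothetical.

```python
def vlarb_shares(spec):
    """Turn 'vl:weight,vl:weight,...' into {vl: fraction of total weight}."""
    weights = {}
    for entry in spec.split(","):
        vl_text, weight_text = entry.strip().split(":")
        vl = int(vl_text)
        weights[vl] = weights.get(vl, 0) + int(weight_text)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all weights are zero")
    return {vl: weight / total for vl, weight in weights.items()}


print(vlarb_shares("1:32,2:32"))
print(vlarb_shares("1:32,2:96"))
```

Equal weights split a table evenly between two VLs, while a 32 to 96 ratio gives the second VL three quarters of that table's share.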
220. n as the origin of the coordinate system used to describe switch location. The position parameter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed.

next_seed
If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification. For maximum resiliency, no seed specification should share a switch with any other seed specification. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS can grant path SL values that are constant, regardless of which seed was used to initiate topology discovery.

portgroup_max_ports max_ports
This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails.

port_order p1 p2 p3 ...
This keyword specifies the order in which CA ports on a destination switch are visited when computing routes. When the fabric contains switches connected with multiple parallel links, routes are distributed
221. n can also:
• Establish an SRP connection by itself (without the need to issue the echo command described in Section 4.1.2.2)
• Continue running in background, detecting new targets and establishing SRP connections with them (daemon mode)
• Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N> where <N> is a digit
• Enable High Availability operation (together with Device-Mapper Multipath)
• Have a configuration file that determines the targets to connect to

srp_daemon commands equivalent to ibsrpdm:
    "srp_daemon -a -o" is equivalent to "ibsrpdm"
    "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
Note: These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.

srp_daemon extensions to ibsrpdm:
• To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num> (and to generate output suitable for 'echo'), you may execute:
    host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
  Note: To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'.
• To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
• Executing srp_daemon over a port without the -a option
222. n page 41.

2.6 Uninstalling Mellanox OFED
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.

2.7 Uninstalling Mellanox OFED using the YUM Tool
If MLNX_OFED was installed using the yum tool, it can be uninstalled as follows:
    yum groupremove '<group name>'
Note: The '<group name>' must be the same group name that was previously used to install MLNX_OFED.

3 Configuration Files
For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt at the following location:
    docs/readme_and_user_manual/MLNX_OFED_configuration_files.txt

3.1 Persistent Naming for Network Interfaces
To avoid network interface renaming after boot or driver restart, use the /etc/udev/rules.d/70-persistent-net.rules file.
• Example for Ethernet interfaces:
    # PCI device 0x15b3:0x1003 (mlx4_core)
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:50", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:51", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS
223. n service registration, deregistration, and lease test
    e = run event forwarding test
    f = flood the SA with queries according to the stress mode
    m = multicast flow
    q = QoS info: dump VLArb and SLtoVL tables
    t = run trap 64/65 flow (this flow requires running of external tool)
    Default = all flows except QoS

-w, --wait
This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t, the trap 64/65 flow. Default: 10 sec.

-d, --debug
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:
    OPT   Description
    -d0   Ignore other SM nodes
    -d1   Force single threaded dispatching
    -d2   Force log flushing after each log message
    -d3   Disable multicast support

-m, --max_lid
This option specifies the maximal LID number to be searched for during inventory file build. Default: 100.

-g, --guid
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.

-p, --port
This option displays a menu of possible local port GUID values with which osmtest could bind.

-i, --inventory
This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information
224. n skew on application scalability.
The relevant verbs to be used for CORE-Direct:
• ibv_create_qp_ex
• ibv_modify_cq
• ibv_query_device_ex
• ibv_post_task
Sample programs for reference:
• ibv_task_pingpong
• ibv_cc_pingpong

4.15 Ethtool
ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices. It can be used to:
• Get identification and diagnostic information
• Get extended device statistics
• Control speed, duplex, autonegotiation and flow control for Ethernet devices
• Control checksum offload and other hardware offload features
• Control DMA ring sizes and interrupt moderation

The following are the ethtool supported options:
Table 6 - ethtool Supported Options
Options | Description
ethtool -i eth<x> | Checks driver and device information. For example:
    #> ethtool -i eth2
    driver: mlx4_en (MT_0DD0120009_CX3)
    version: 2.1.6 (Aug 2013)
    firmware-version: 2.30.3000
    bus-info: 0000:1a:00.0
ethtool -k eth<x> | Queries the stateless offload status.
ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off] | Sets the stateless offload status. TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) increase outbound throughput by reducing CPU overhead. It works by queuing up large buffers and letting the net
225. nce to the previous CC configuration. For further information on how to turn OFF CC, please refer to Section 8.9.3, "Configuring Congestion Control Manager", on page 186.

Configuring Congestion Control Manager
Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and CC Manager behavior by modifying some of the options. To do so, perform the following:
1. Find the 'event_plugin_options' option in the SM options file, and add the following: conf_file <cc-mgr-options-file-name>
    # Options string that would be passed to the plugin(s)
    event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>
2. Run the SM with the new options file: 'opensm -F <options-file-name>'
Note: To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.
For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" on page 186. For further details on the list of CC Manager options, please refer to the IB spec.

8.9.4 Configuring Congestion Control Manager Main Settings
To fine-tune the CC mechanism and CC Manager behavior, and set the CC Manager main settings, perform the following:
• To enable/disable the Congestion Control mechanism on the fabric nodes, set the following parameter
226. nd Subnet

List of Tables
Table 1: Document Revision History
Table 2: Abbreviations and Acronyms
Table 3: Glossary
Table 4: Reference Documents
Table 5: Software and Hardware Requirements
Table 6: mlnxofedinstall Return Codes
Table 7: Buffer Values
Table 8: Parameters Used to Control Error Cases / Contiguity
Table 9: Flow Specific Parameters
Table 10: ethtool Supported Options
Table 11: Port IN Counters
Table 12: Port OUT Counters
Table 13: Port VLAN Priority Tagging (where <i> is in the range 0...7)
Table 14: Port Pause (where <i> is in the range 0...7)
Table 15: VPort Statistics (where <i>=<empty_string> is the PF, and ranges 1...NumOfVf per VF)
Table 16: SW Statistics
Table 17: Per Ring (SW) Statistics (where <i> is the ring I - per configuration)
Table 18: Useful MPI Links
Table 19: Runt
227. nd reports errors in counters above threshold. Check specified port (or node) and report errors that surpassed their predefined threshold. Port address is lid unless the -G option is used to specify a GUID address. The predefined thresholds can be dumped using the -s option, and a user-defined threshold file (using the same format as the dump) can be specified using the -T <threshold_file> option.

Synopsis
    ibcheckerrs [-h] [-b] [-v] [-G] [-T <threshold_file>] [-s] [-N | -nocolor] [-C ca_name] [-P ca_port] [-t timeout_ms] <lid|guid> [<port>]

Output Files
Table 35 lists the various flags of the command.

Table 35 - ibcheckerrs Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h (help) | Optional | | Print the help menu
-b | Optional | | Print in brief mode. Reduce the output to show only if errors are present, not what they are
-v (verbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-G (uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-T <threshold_file> | Optional | | Use specified threshold file
-s | Optional | | Show the predefined thresholds
-N | -nocolor | Optional | color mode | Use mono mode rather than color mode
228. nel. For example, to Intel systems, add:
    default=0
    timeout=5
    splashimage=(hd0,0)/grub/splash.xpm.gz
    hiddenmenu
    title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
    root (hd0,0)
    kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
    initrd /initrd-2.6.32-36.x86-64.img
Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.

The remaining steps (Step 5 to Step 8 in the manual) continue as follows:
Install the MLNX_OFED driver for Linux that supports SR-IOV.
Use the --enable-sriov installation parameter to burn firmware with SR-IOV support. The number of virtual functions (VFs) will be set to 16.
Verify the HCA is configured to support SR-IOV:
    [root@selene ~]# mstflint -dev <PCI Device> dc
1. Verify in the [HCA] section that the following fields appear:
    [HCA]
    num_pfs = 1
    total_vfs = <0-63>
    sriov_en = true

    Parameter | Recommended Value
    num_pfs | 1 (Note: This field is optional and might not always appear.)
    total_vfs | 63
    sriov_en | true
2. Add the above fields to the INI if they are missing.
3. Set the total_vfs parameter to the desired number if you need to change the number of total VFs.
4. Reburn the firmware using the mlxburn tool if the fields above were added to the INI, or the total_vfs parameter was modified.
    mlxburn -fw ./fw-Conne
229. received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.

-s, --stress
This option runs the specified stress test instead of the normal test suite. Stress test options are as follows:
    OPT   Description
    -s1   Single-MAD response SA queries
    -s2   Multi-MAD (RMPP) response SA queries
    -s3   Multi-MAD (RMPP) Path Record SA queries
Without -s, stress testing is not performed.

-M, --Multicast_Mode
This option specifies the length of the Multicast test:
    OPT   Description
    -M1   Short Multicast Flow (default) - single mode
    -M2   Short Multicast Flow - multiple mode
    -M3   Long Multicast Flow - single mode
    -M4   Long Multicast Flow - multiple mode
Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC.
Multiple mode: could be run with other apps using MC with OpenSM.
Without -M, default flow testing is performed.

-t, --timeout
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.

-l, --log_file
This option defines the log to be the given file. By default, the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout.

-v, --verbose
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosi
230. … 2.5 Gbps

3. Change the speed of a port.
First query for current configuration:
    > ibportstate -C mlx4_0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
    LinkSpeedActive:.................5.0 Gbps
Now change the enabled link speed:
    > ibportstate -C mlx4_0 -D 0 1 speed 2
    ibportstate -C mlx4_0 -D 0 1 speed 2
    Initial PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkSpeedEnabled:................2.5 Gbps
    After PortInfo set:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkSpeedEnabled:................5.0 Gbps (IBA extension)
Show the new configuration:
    > ibportstate -C mlx4_0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................5.0 Gbps (IBA extension)
    LinkSpeedActive:.................5.0 Gbps

9.11 ibroute
Uses SMPs to display the forwarding tables (unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardin
231. o have any number of CAs, the closer the tree is to be fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology to a pure and symmetrical fat-tree, the more optimal the routing will be.
The algorithm also dumps the compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.
1. Ports that are connected to the same remote switch are referenced as 'port group'.
2. List of compute nodes (CNs) can be specified by the '-u' or '--cn_guid_file' OpenSM options.

8.5.4.1 Routing between non-CN Nodes
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.
[Figure: Spine1, Spine2 and Spine 3 on the top row; N1, Switch, N2, Switch, N3 on the row below; the switches going down to compute nodes]
To solve this problem, a list of non-C
232. oS Policy File
The QoS policy file has the following sections:

I) Port Groups (denoted by port-groups)
This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to the partition with a given PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to the partition with a given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)

II) QoS Setup (denoted by qos-setup)
This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).

III) QoS Levels (denoted by qos-levels)
Each QoS Level defines a Service Level (SL) and a few optional fields:
• MTU limit
• Rate limit
• PKey
• Packet lifetime
When a path(s) search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied
233. of its loops, may experience a deadlock due, for example, to high pressure.
The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.
   Note: The user can override the node list manually.
   Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process: all root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.
At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Note: Up/Down routing does
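The ranking stage described above (roots get rank 0, every other switch is ranked incrementally by BFS) can be sketched as follows. This is only an illustration under stated assumptions: the adjacency dictionary, the switch names and the function name `rank_switches` are hypothetical, and the sketch is not OpenSM's actual implementation.

```python
from collections import deque


def rank_switches(links, roots):
    """Rank switches by BFS distance from the root switches.

    links: hypothetical adjacency map {switch: [neighbor switches]}
    roots: switches chosen as roots (auto-detected or user supplied)
    """
    rank = {root: 0 for root in roots}  # all roots start at rank 0
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in rank:  # first visit gives the lowest rank
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


# Hypothetical three-level fabric used only for illustration.
example_links = {
    "root1": ["mid1", "mid2"],
    "root2": ["mid1", "mid2"],
    "mid1": ["root1", "root2", "leaf1"],
    "mid2": ["root1", "root2", "leaf2"],
    "leaf1": ["mid1"],
    "leaf2": ["mid2"],
}
example_ranks = rank_switches(example_links, ["root1", "root2"])
```

With ranks in hand, a link can be classified as "up" (toward a lower rank) or "down" (toward a higher rank), which is the information the loop-free path rules rely on.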
234. of received 1523 to 1548 octet frames
rx_gt_1548_bytes_packets | Number of received 1549 or greater octet frames

Table 8 - Port OUT Counters
Counter | Description
tx_packets | Total packets successfully transmitted
tx_bytes | Total bytes in successfully transmitted packets
tx_multicast_packets | Total multicast packets successfully transmitted
tx_broadcast_packets | Total broadcast packets successfully transmitted
tx_errors | Number of frames that failed to transmit
tx_dropped | Number of transmitted frames that were dropped
tx_lt_64_bytes_packets | Number of transmitted 64 or less octet frames
tx_127_bytes_packets | Number of transmitted 65 to 127 octet frames
tx_255_bytes_packets | Number of transmitted 128 to 255 octet frames
tx_511_bytes_packets | Number of transmitted 256 to 511 octet frames
tx_1023_bytes_packets | Number of transmitted 512 to 1023 octet frames
tx_1518_bytes_packets | Number of transmitted 1024 to 1518 octet frames
tx_1522_bytes_packets | Number of transmitted 1519 to 1522 octet frames
tx_1548_bytes_packets | Number of transmitted 1523 to 1548 octet frames
tx_gt_1548_bytes_packets | Number of transmitted 1549 or greater octet frames

Table 9 - Port VLAN Priority Tagg
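To make the size ranges of the Port OUT counters concrete, the sketch below maps a frame length in octets to the counter that would account for it. The bucket boundaries come from the table; the underscore spelling of the counter names and the helper name `tx_size_counter` are assumptions made for this illustration.

```python
import bisect

# Inclusive upper bound, in octets, of each size bucket in the table.
UPPER_BOUNDS = [64, 127, 255, 511, 1023, 1518, 1522, 1548]

# Counter names, assumed to use underscores; the last one is open ended.
SIZE_COUNTERS = [
    "tx_lt_64_bytes_packets",
    "tx_127_bytes_packets",
    "tx_255_bytes_packets",
    "tx_511_bytes_packets",
    "tx_1023_bytes_packets",
    "tx_1518_bytes_packets",
    "tx_1522_bytes_packets",
    "tx_1548_bytes_packets",
    "tx_gt_1548_bytes_packets",
]


def tx_size_counter(frame_octets):
    """Return the name of the size counter covering a frame of this length."""
    # bisect_left finds the first bucket whose upper bound is >= the length.
    return SIZE_COUNTERS[bisect.bisect_left(UPPER_BOUNDS, frame_octets)]
```

For example, a 1518-octet frame falls in the 1024 to 1518 bucket, while a 1519-octet frame (such as one carrying a VLAN tag) falls in the 1519 to 1522 bucket.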
235. of reporting it. To verify whether RoCE Time Stamping is available, run ibv_exp_query_device. For example:
    struct ibv_exp_device_attr attr;
    ibv_exp_query_device(context, &attr);
    if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_TIMESTAMP_MASK) {
        if (attr.timestamp_mask) {
            /* Time stamping is supported with mask attr.timestamp_mask */
        }
    }
    if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_HCA_CORE_CLOCK) {
        if (attr.hca_core_clock) {
            /* reporting the device's clock is supported. */
            /* attr.hca_core_clock is the frequency in MHZ */
        }
    }

4.6.2.2 Creating Time Stamping Completion Queue
To get time stamps, a suitable extended Completion Queue (CQ) must be created via a special call to the ibv_create_cq_ex verb:
    cq_init_attr.flags = IBV_CQ_TIMESTAMP;
    cq_init_attr.comp_mask = IBV_CQ_INIT_ATTR_FLAGS;
    cq = ibv_create_cq_ex(context, cqe, node, NULL, 0, &cq_init_attr);
Note: This CQ cannot report SL or SLID information. The values of the sl and sl_id fields in struct ibv_wc_ex are invalid. Only the fields indicated by the wc_flags field in struct ibv_wc_ex contain a valid and usable value.
Note: When using Time Stamping, several fields of struct ibv_wc_ex are not available, resulting in RoCE UD / RoCE traffic with VLANs failure.

4.6.2.3 Polling a Completion Queue
Polling a CQ for time stamp is done via the ibv_poll_cq_ex verb:
    ibv_poll_cq_ex(cq, …
236. og/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.
Note: If a fatal, non-recoverable error occurs, opensm exits.

Running OpenSM As Daemon
OpenSM can also run as a daemon. To run OpenSM in this mode, enter:
    host1# /etc/init.d/opensmd start

osmtest Description
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it to a pre-saved one. See Section 8.3.2.
osmtest has the following test flows:
• Multicast Compliancy test
• Event Forwarding test
• Service Record registration test
• RMPP stress test
• Small SA Queries stress test

8.3.1 Syntax
    osmtest [OPTIONS]
where OPTIONS are:

-f, --flow
This option directs osmtest to run a specific flow:
    Flow   Description
    c = create an inventory file with all nodes, ports and paths
    a = run all validation tests (expecting an input inventory)
    v = only validate the given inventory file
    s = ru
237. om. When removing the expansion ROM image, you also remove FlexBoot from the boot device list.

A.4 Preparing the DHCP Server in Linux Environment
The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB.

A.4.1 Installing the DHCP Server
Install the DHCP client/server embedded within the Linux distribution.

A.4.2 Configuring the DHCP Server
A.4.2.1 For ConnectX Family Devices
When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions.
Note: Depending on the OS, the device name may be superseded with a prefix.
The value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID (all separated by colons and represented in hexadecimal digits).

Extracting the Port GUID - Method
To obtain the port GUID:
Step 1. Start mst.
    host1# mst start
    host1# mst status
Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
Step 2. Obtain the Port GUID using the device name. The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}.
    flin
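The composition of the client identifier (fixed prefix followed by the 8-byte port GUID, colon separated, in hexadecimal) can be illustrated with a short sketch. The prefix is the one quoted in the text; the function name, the accepted input formats and the example GUID are assumptions made for this illustration only.

```python
# Prefix quoted in the manual for ConnectX family devices.
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def dhcp_client_identifier(port_guid):
    """Build the DHCP client identifier for an 8-byte port GUID.

    Accepts the GUID as 16 hex digits, optionally prefixed with "0x"
    or already separated by colons (assumed input formats).
    """
    digits = port_guid.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    digits = digits.replace(":", "")
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("port GUID must be 8 bytes (16 hex digits)")
    octets = [digits[i:i + 2] for i in range(0, 16, 2)]
    return CLIENT_ID_PREFIX + ":" + ":".join(octets)


# Made-up GUID, for illustration only.
example_id = dhcp_client_identifier("0x0002c90300001039")
```

The resulting string is what a DHCP server configuration would match against to tell FlexBoot sessions apart.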
238. on that is not the last dimension routed by DOR (here, the failed switches are O and T):
[Figure: 2D torus grid with x = 0 to 5 on the horizontal axis and y on the vertical axis, showing switches S, n, I, q, r, D, m and p and the failed switches O and T]
In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

8.5.7.2 Multicast Routing
Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root s
239. one of the following two options:
1. On the command line, specify the file name using the option '-t <topology file name>'
2. Define the environment variable IBDIAG_TOPO_FILE
To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option '-s <local system name>'
2. Define the environment variable IBDIAG_SYS_NAME

InfiniBand Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option '-p <local port number>' (see below)
2. Define the environment variable IBDIAG_PORT_NUM
In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the option '-i <index of local device>'
2. Define the environment variable IBDIAG_DEV_IDX

9.2.3 Addressing
Note: This section applies to the ibdiagpath tool only.
A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define th
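Each setting above can come either from a command-line option or from an environment variable. The sketch below shows one way a wrapper script might resolve them. The environment variable names are the ones listed in the text; the setting keys, the function name and, importantly, the precedence (command line wins over environment) are assumptions, since the text does not state which source takes priority.

```python
import os

# Environment variables named in the manual, keyed by a hypothetical setting name.
ENV_FALLBACKS = {
    "topology_file": "IBDIAG_TOPO_FILE",
    "system_name": "IBDIAG_SYS_NAME",
    "port_num": "IBDIAG_PORT_NUM",
    "dev_idx": "IBDIAG_DEV_IDX",
}


def resolve_setting(name, cli_value=None, environ=None):
    """Return the command-line value if given, else the environment value.

    Assumption: an explicit command-line option overrides the environment.
    Returns None when neither source provides a value.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return environ.get(ENV_FALLBACKS[name])
```

Passing `environ` explicitly keeps the helper easy to exercise without touching the real process environment.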
240. oot guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:
• Tree rank should be between two and eight (inclusively)
• Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all
• Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches
• Switches of the same rank should have the same number of ports in each UP-going port group
• Switches of the same rank should have the same number of ports in each DOWN-going port group
• All the CAs have to be at the same tree level (rank)
If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules:
• Tree rank should be between two and eight (inclusively)
• All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks.
Topologies that do not comply cause a fallback to min hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree.
Note that although the fat-tree algorithm supports trees with non-integer CBB ratio, the routing will not be as balanced as in the case of integer CBB ratio. In addition to this, although the algorithm allows leaf switches t
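Two of the rules above are simple enough to check mechanically: the tree rank must be between two and eight, and all CAs (or all compute nodes, when a root guid file is given) must sit at the same tree level. The following is a minimal sketch of such a check; the function name and the input shape (a rank number plus a map from node name to tree level) are hypothetical and unrelated to how OpenSM represents the fabric.

```python
def check_fat_tree_ranks(tree_rank, node_levels):
    """Check two of the fat-tree rules and return a list of violations.

    tree_rank:   rank of the tree
    node_levels: hypothetical map {CA or compute node name: tree level}
    """
    problems = []
    if not 2 <= tree_rank <= 8:
        problems.append("tree rank must be between 2 and 8 inclusive")
    if len(set(node_levels.values())) > 1:
        problems.append("all nodes must be at the same tree level")
    return problems
```

An empty result only means these two rules hold; the port-group symmetry rules would need per-switch information that this sketch does not model.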
241. opensm also provides an experimental version of a performance manager.
opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.
By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

8.2.1 opensm Syntax
    opensm [OPTIONS]
where OPTIONS are:

--version
Prints OpenSM version and exits.

--config, -F <file-name>
The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists).

--create-config, -c <file-name>
OpenSM will dump
242. or Cases / Contiguity
Parameters | Description
MLX_MR_ALLOC_TYPE | Configures the allocator type.
    • ALL (Default): Uses all possible allocators and selects the most efficient allocator.
    • ANON: Enables the usage of anonymous pages and disables the allocator.
    • CONTIG: Forces the usage of the contiguous pages allocator. If contiguous pages are not available, the allocation fails.
MLX_MR_MAX_LOG2_CONTIG_BSIZE | Sets the maximum contiguous block size order.
    • Values: 12-23
    • Default: 23
MLX_MR_MIN_LOG2_CONTIG_BSIZE | Sets the minimum contiguous block size order.
    • Values: 12-23
    • Default: 12

4.10 Shared Memory Region
Note: Shared Memory Region is only applicable to the mlx4 driver.
Shared Memory Region (MR) enables sharing MR among applications by implementing the "Register Shared MR" verb, which is part of the IB spec.
Sharing MR involves the following steps:
Step 1. Request to create a shared MR.
The application sends a request via the ibv_reg_mr API to create a shared MR. The application supplies the allowed sharing access to that MR. If the MR was created successfully, a unique MR ID is returned as part of the struct ibv_mr, which can be used by other applications to register with that MR.
The underlying physical pages must not be Least Recently Used (LRU) or Anonymous. To disable that, you need to turn on the IBV_ACCE
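The two block size parameters are expressed as an "order" between 12 and 23. Reading "LOG2" in the parameter names as the base-2 logarithm of the block size in bytes (an interpretation, not something the table states outright), the orders translate to sizes as in the sketch below. The helper names are hypothetical; the environment variable names and defaults are the ones from the table.

```python
MIN_ORDER, MAX_ORDER = 12, 23  # value range given in the table


def contig_block_bytes(order):
    """Convert a block size order to bytes, assuming size = 2 ** order."""
    if not MIN_ORDER <= order <= MAX_ORDER:
        raise ValueError("order must be between 12 and 23")
    return 1 << order


def contig_limits(environ):
    """Return (min_bytes, max_bytes) from an environment-like mapping,
    falling back to the defaults listed in the table (12 and 23)."""
    low = int(environ.get("MLX_MR_MIN_LOG2_CONTIG_BSIZE", MIN_ORDER))
    high = int(environ.get("MLX_MR_MAX_LOG2_CONTIG_BSIZE", MAX_ORDER))
    return contig_block_bytes(low), contig_block_bytes(high)
```

Under that reading, the defaults span 4 KB (order 12) to 8 MB (order 23).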
243. ork (e.g. macvtap). For additional information, please refer to the Red Hat User Manual.
The IPoIB daemon (ipoibd) detects the new virtual interface that is attached to the same bridge as the eIPoIB interface and creates a new IPoIB instance for it in order to send/receive data. As a result, a number of IPoIB interfaces (ibX.Y) are shown as being created/destroyed, and are being enslaved to the corresponding ethX interface to serve any active VIF in the system according to the set configuration. This process is done automatically by the ipoibd service.
To see the list of IPoIB interfaces enslaved under the eth_ipoib interface:
    # cat /sys/class/net/ethX/eth/vifs
For example:
    # cat /sys/class/net/eth5/eth/vifs
    SLAVE=ib0.1 MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
    SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
    SLAVE=ib0.3 MAC=52:54:00:60:55:89 VLAN=N/A
Each ethX interface has at least one ibX.Y slave to serve the PIF itself. In the VIFs list of ethX, you will notice that ibX.1 is always created to serve applications running from the Hypervisor on top of the ethX interface directly.
For InfiniBand applications that require native IPoIB interfaces (e.g. CMA), the original IPoIB interfaces ibX can still be used. For example, CMA and ethX drivers can co-exist and make use of IPoIB ports; CMA can use ib0, while the eth0.ipoib interface will use ibX.Y interfaces.
To see the list of eIPoIB interfaces:
    # cat /sys/class/net/eth_ipoib_interfaces
For example:
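The VIF list is plain text with one enslaved interface per line, so it is easy to post-process. The sketch below parses such output into dictionaries. It assumes each line consists of whitespace-separated KEY=VALUE fields (SLAVE, MAC, VLAN); that layout is inferred from the example and may differ between driver versions, and the function name is hypothetical.

```python
def parse_vifs(text):
    """Parse eth_ipoib VIF listing lines into a list of dictionaries.

    Assumed line format: "SLAVE=ib0.1 MAC=9a:c2:1f:d7:3b:63 VLAN=N/A"
    """
    entries = []
    for line in text.splitlines():
        # Keep only KEY=VALUE tokens; split on the first "=" only.
        fields = dict(item.split("=", 1) for item in line.split() if "=" in item)
        if fields:
            entries.append(fields)
    return entries


# Sample text mirroring the example in the manual.
sample_vifs = (
    "SLAVE=ib0.1 MAC=9a:c2:1f:d7:3b:63 VLAN=N/A\n"
    "SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A\n"
)
parsed_vifs = parse_vifs(sample_vifs)
```

On a live system the text would come from reading the vifs file shown above rather than from a literal string.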
244. ort1 to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA).

    When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).

    Manual Activation of High Availability

    Initialization (execute after each boot of the driver):

    1. Execute modprobe dm-multipath
    2. Execute modprobe ib-srp
    3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
    4. Execute for each port and each HCA:

        srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>

    This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.

    Now it is possible to access the SRP LUNs on /dev/mapper/.

    It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.

    occur if the SRP LUNs are in the black list of multipath. Edit the blac
245.
        host1$ cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib

    Step 5: IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:

        host1$ cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules

    Step 6: To load the modules, you need the insmod executable. If yo
246. overlap. Setting a lower priority value will result in higher priority. In addition to the domain, there is priority within each of the domains. Each domain can have at most 2^12 priorities in accordance with its needs.

    The following are the domains at a descending order of priority:

    User Verbs: allows a user application QP to be attached into a specified flow when using the ibv_create_flow and ibv_destroy_flow verbs.

    ibv_create_flow

        struct ibv_flow *ibv_create_flow(struct ibv_qp *qp, struct ibv_flow_attr *flow)

    Input parameters:
      - struct ibv_qp: the attached QP.
      - struct ibv_flow_attr: attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers. The optional headers are detected by setting the size and num_of_specs fields. struct ibv_flow_attr can be followed by the optional flow headers structs:

        struct ibv_flow_spec_ib
        struct ibv_flow_spec_eth
        struct ibv_flow_spec_ipv4
        struct ibv_flow_spec_tcp_udp

    For further information, please refer to the ibv_create_flow man page.

    Be advised that from MLNX_OFED v2.0-3.0.0 and higher, the parameters (both the value and the mask) should be set in big-endian format.

    Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter. The supported masks are: All one mask include the
247. ovide unchanging path SL values in the presence of subnet manager failover, provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details.

    Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If no fallback was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case, all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will continue to remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored.

    In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath.

    To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data collected via ibdia
248. packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL).

    The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.

    DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.

    4.4.2 QoS Architecture

    QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the chronology approach to describe how the overall system works.

    1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application or ULP or service uses.
    2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class, Service-ID, PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with appropriate IPoIB broadcast group. This broadcast group
249. part of the sockaddr provided to rdma_resolve_add. The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS-Class. The CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR.

    4.4.4.1 IPoIB

    IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.

    4.4.4.2 SRP

    The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service-ID in the PR/MPR by itself and uses that information in setting up the QP. The SRP Service-ID is defined by the SRP target I/O Controller (it also complies with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.

    4.4.5 OpenSM Features

    The QoS-related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 8) can be split into two main parts:

    I. Fabric Setup
    During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.

    II. PR/MPR Query Handling
    OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first the request is matched against the defined match rules such that the target QoS-Level definition is found. Given the QoS Leve
250. pervisor (Dom0). This interface is under:

        /sys/class/infiniband/<infiniband device>/iov

    Under this directory, the following subdirectories can be found:

      - ports: the actual (physical) port resource tables.
        Port GID tables:
          - ports/<n>/gids/<n>, where 0 <= n <= 127 (the physical port gids)
          - ports/<n>/admin_guids/<n>, where 0 <= n <= 127 (allows examining or changing the administrative state of a given GUID)
          - ports/<n>/pkeys/<n>, where 0 <= n <= 126 (displays the contents of the physical pkey table)
      - <pci id> directories: one for Dom0 and one per guest. Here you may see the mapping between virtual and physical pkey indices, and the virtual to physical gid 0. Currently, the GID mapping cannot be modified, but the pkey virtual-to-physical mapping can. These directories have the structure:
          - <pci id>/port/<m>/gid_idx/0, where m = 1..2 (this is read-only), and
          - <pci id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126

    For instructions on configuring pkey_idx, please see below.

    4.13.7.2.2 Configuring an Alias GUID (under ports/<n>/admin_guids)

    Step 1: Determine the GUID index of the PCI Virtual Function that you want to pass through to a guest. For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially need to see which GUID index is used for this function. To do so:

        cat /sys/class/infini
251. r example, if the adapter's NUMA node is 1, and NUMA 1 cores are 8-15, then an application should run with process affinity that uses 8-15 cores only.

    To run an application, run the following commands:

        taskset -c 8-15 ib_write_bw -a

    or:

        taskset 0xff00 ib_write_bw -a

    IRQ Affinity

    The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.

    The following command turns off the IRQ balancer:

        > /etc/init.d/irqbalance stop

    The following command assigns the affinity of a single interrupt vector:

        > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity

    Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

    IRQ Affinity Configuration

    It is recommended to set each IRQ to a different core. For Sandy Bridge or AMD systems, set the irq affinity to the adapter's NUMA node:

      - For optimizing single-port traffic, run:
            set_irq_affinity_bynode.sh <numa node> <interface>
      - For optimizing dual-port traffic, run:
            set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>

    To show the current irq affini
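The relationship between a core list and the hexadecimal bit mask used by both taskset and smp_affinity can be worked out with a short helper. This is an illustrative sketch only; `cores_to_mask` and `mask_to_cores` are hypothetical names, and the code merely computes masks without touching /proc.

```python
def cores_to_mask(cores):
    """Build an affinity bit mask in which bit i is set for every core i."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError(f"invalid core number: {core}")
        mask |= 1 << core
    return mask


def mask_to_cores(mask):
    """Return the sorted list of core numbers whose bits are set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Cores 8-15 (the NUMA node 1 example above) give the mask used with taskset.
mask = cores_to_mask(range(8, 16))
print(f"0x{mask:x}")        # hexadecimal form, as written to smp_affinity
print(mask_to_cores(mask))  # back to the core list
```

For cores 8 through 15 this yields 0xff00, matching the second taskset command in the example.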
252. r outgoing packets just as HWTSTAMP_TX_ON does, but also enables time stamp insertion directly into Sync packets. In this case, transmitted Sync packets will not receive a time stamp via the socket error queue.

        HWTSTAMP_TX_ONESTEP_SYNC,

    Note: for send side time stamping, currently only HWTSTAMP_TX_OFF and HWTSTAMP_TX_ON are supported.

    Receive side time stamping: enabled by ifreq.hwtstamp_config.rx_filter when

        /* possible values for hwtstamp_config->rx_filter */
        enum hwtstamp_rx_filters {
            /* time stamp no incoming packet at all */
            HWTSTAMP_FILTER_NONE,
            /* time stamp any incoming packet */
            HWTSTAMP_FILTER_ALL,
            /* return value: time stamp all packets requested plus some others */
            HWTSTAMP_FILTER_SOME,
            /* PTP v1, UDP, any kind of event packet */
            HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
            /* PTP v1, UDP, Sync packet */
            HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
            /* PTP v1, UDP, Delay_req packet */
            HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
            /* PTP v2, UDP, any kind of event packet */
            HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
            /* PTP v2, UDP, Sync packet */
            HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
            /* PTP v2, UDP, Delay_req packet */
            HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
            /* 802.AS1, Ethernet, any kind of event packet */
            HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
            /* 802.AS1, Ethernet, Sync packet */
            HWTSTAMP
253. r the BER test. The reciprocal number of the BER should be provided. Example: for 10^-12, the value needs to be 1000000000000 or 0xe8d4a51000 (10^12). If the threshold given is 0, then all BER values for all ports will be reported.

    --extended_speeds <dev-type>   Collect and test port extended speeds counters. dev-type: (sw | all).
    --pm_per_lane                  List all counters per lane (when available).
    --ls <2.5|5|10|14|25|FDR10>    Specifies the expected link speed.
    --lw <1x|4x|8x|12x>            Specifies the expected link width.
    -w|--write_topo_file <file name>  Write out a topology file for the discovered topology.
    --topo_file <file>             Specifies the topology file name.
    --out_ibnl_dir <directory>     The topology file custom system definitions (ibnl) directory.
    --screen_num_errs <num>        Specifies the threshold for printing errors to screen (default = 5).
    --smp_window <num>             Max smp MADs on wire (default = 8).
    --gmp_window <num>             Max gmp MADs on wire (default = 128).
    --max_hops <max-hops>          Specifies the maximum hops for the discovery process (default = 64).
    -V|--version                   Prints the version of the tool.
    -h|--help                      Prints help information (without plugins help, if exists).
    --deep_help                    Prints deep help information (including plugins help).

    Output Files

    Table 26 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.

    Table 26 - ibdiagnet (of ibutils2) Output Files
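The reciprocal rule for the BER threshold can be checked with a couple of lines of arithmetic. The sketch below reproduces the 10^-12 example from the text; the function name `ber_threshold_argument` is hypothetical and is not part of ibdiagnet.

```python
def ber_threshold_argument(ber_exponent: int) -> int:
    """For a target BER of 10**-ber_exponent, return its reciprocal 10**ber_exponent."""
    if ber_exponent < 0:
        raise ValueError("expected the positive exponent n of a BER of 10**-n")
    return 10 ** ber_exponent


value = ber_threshold_argument(12)
print(value)       # decimal form of the reciprocal
print(hex(value))  # hexadecimal form of the same number
```

Both printed forms denote the same number, so either one corresponds to the value quoted in the option description.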
254. r to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).

    Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0 (it is not required to have the lun numbers in ascending order, except that the first lun must always be 0).

    Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the ib_srpt module but does not load scst nor its dev_handlers.

    The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.

    Example 1: Working with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)

        a. modprobe scst
        b. modprobe scst_vdisk
        c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
        d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
        e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
        f. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
        g. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
        h. echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices

    Example 2: Working with scst_vdisk FILEIO mode (using the md0 device and the file 10G-file)

        a. modprobe scst
        b. modprobe scst_vdisk
        c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
        d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
255. receive buffers to allow the driver moving the inline-received message to these buffers. The validity of these addresses is not checked, therefore the result of providing non-valid virtual addresses is unexpected.

    Connect-IB supports Inline-Receive on both the requestor and the responder sides. Since data is copied at the poll CQ verb, Inline-Receive on the requestor side is possible only if the user chooses IBV_SIGNAL_ALL_WR.

    Querying Inline-Receive Capability

    A user application can use the ibv_exp_query_device function to get the maximum possible Inline-Receive size. To get the size, the application needs to set the IBV_EXP_DEVICE_ATTR_INLINE_RECV_SZ bit in the ibv_exp_device_attr comp_mask.

    4.18.2 Activating Inline-Receive

    To activate the Inline-Receive, you need to set the required message size in the max_inl_recv field in the ibv_exp_qp_init_attr struct when calling the ibv_exp_create_qp function. The value returned by the same field is the actual Inline-Receive size applied.

    Setting the message size may affect the WQE/CQE size.

    4.19 Ethernet Performance Counters

    Counters are used to provide information about how well an operating system, an application, a service, or a driver is performing. The counter data helps determine system bottlenecks and fine-tune the system and application performance. The operating system, network, and devices p
256. reparing qperf
    Preparing xm
    Preparing openmpi
    Preparing openmpi
    Preparing bupc
    Preparing infinipath-psm
    Preparing infinipath-psm-devel
    Preparing mvapich2
    Preparing openshmem
    Preparing coll
    Preparing libibprof
    Preparing libvma

    Changing max locked memory to unlimited in /etc/security/limits.conf
    Please log out from the shell and login again in order to update this change
    Read more about this topic in the VMA's User Manual
    VMA README.txt is installed at: /usr/share/doc/libvma-6.5.8-0/README.txt
    Please refer to VMA journal for the latest changes: /usr/share/doc/libvma-6.5.8-0/journal.txt

    Preparing mlnxofed-docs
    Preparing mpitests mvapich2 1 9
    Preparing mpitests openmpi 1 6 5
    Preparing mpitests openmpi 1 7 4
257. river to the initrd File

    The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

    Step 1: Back up your current initrd file.

    Step 2: Make a new working directory and change to it.

        host1$ mkdir /tmp/initrd_en
        host1$ cd /tmp/initrd_en

    Step 3: Normally, the initrd image is zipped. Extract it using the following command:

        host1$ gzip -dc <initrd image> | cpio -id

    The initrd files should now be found under /tmp/initrd_en.

    Step 4: Create a directory for the ConnectX EN modules and copy them.

        host1$ mkdir -p /tmp/initrd_en/lib/modules/mlnx_en
        host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
        host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
        host1$ cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en

    Step 5: To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:

        host1$ cp /sbin/insmod /tmp/initrd_en/sbin/

    Step 6: If you plan to give your Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.

        host1$ cp /sbin/ifconfig /tmp/initrd_en/sbin

    Step 7: Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be
258. rovide counter data that an application can consume to provide users with a graphical view of how well the system is performing.

    The counter index is a QP attribute given in the QP context. Multiple QPs may be associated with the same counter set. If multiple QPs share the same counter, its value represents the cumulative total.

    ConnectX-3 supports 127 different counters, which are allocated as follows:
      - 4 counters reserved for PF: 2 counters for each port
      - 2 counters reserved for VF: 1 counter for each port
      - All other counters, if exist, are allocated by demand

    RoCE counters are available only through sysfs, located under:
      - /sys/class/infiniband/mlx4_*/ports/*/counters/
      - /sys/class/infiniband/mlx4_*/ports/*/counters_ext/

    The Physical Function can also read Virtual Functions' port counters through sysfs, located under:
      - /sys/class/net/eth*/vf*_statistics/

    To display the network device Ethernet statistics, you can run:

        ethtool -S <devname>

    Table 7 - Port IN Counters

    Counter                Description
    rx_packets             Total packets successfully received.
    rx_bytes               Total bytes in successfully received packets.
    rx_multicast_packets   Total multicast packets successfully received.
    rx_broadcast_packets   Total broadcast packets successfully received.
    rx_errors              Number of receive packets that contained errors preventing them from being deliverable to a higher-layer protocol.
    rx_dropped             Number of receive packets which were chosen to be discar
259. ro tolerance - abort configuration on first error. error_window = 0: mechanism disabled (no error checking). Values: [0-48K]. The default is: 5.

    8.9.4.1 Congestion Control Manager Options File

    Table 22 - Congestion Control Manager General Options File

    Option      Description and Values
    enable      Enables/disables the Congestion Control mechanism on the fabric nodes. Values: <TRUE | FALSE>. Default: True.
    num_hosts   Indicates the number of nodes. The CC table values are calculated based on this number. Values: [0-48K]. Default: 0 (base on the CCT calculation on the current subnet size).

    Table 23 - Congestion Control Manager Switch Options File

    Option         Description and Values
    threshold      Indicates how aggressive the congestion marking should be. Values: [0-0xf]. 0: no packet marking. 0xf: very aggressive. Default: 0xf.
    marking_rate   The mean number of packets between marking eligible packets with a FECN. Values: [0-0xffff]. Default: 0xa.
    packet_size    Any packet less than this size [bytes] will not be marked with FECN. Values: [0-0x3fc0]. Default: 0x200.

    Table 24 - Congestion Control Manager CA Options File

    Option         Description and Values
    port_control   Specifies the Congestion Control attribute for this port. Values: 0 = QP based congestion control, 1 = SL/Port based congestion control. Default: 0.
260. rol over remote access to its memory. It is available only on physical functions / native machines. The two types of Memory Windows supported are: type 1 and type 2B.

    Memory Windows are intended for situations where the application wants to:
      - grant and revoke remote access rights to a registered region in a dynamic fashion, with less of a performance penalty
      - grant different remote access rights to different remote agents and/or grant those rights over different ranges within a registered region

    For further information, please refer to the InfiniBand specification document.

    The Memory Windows API cannot co-work with peer memory clients (PeerDirect).

    4.20.1 Query Capabilities

    Memory Windows are available if and only if the hardware supports them. To verify whether Memory Windows are available, run ibv_query_device. For example:

        struct ibv_device_attr device_attr = {0};

        ibv_query_device(context, &device_attr);
        if (device_attr.device_cap_flags & IBV_DEVICE_MEM_WINDOW ||
            device_attr.device_cap_flags & IBV_DEVICE_MW_TYPE_2B) {
            /* Memory window is supported */
        }

    4.20.2 Allocating Memory Window

    Allocating a memory window is done by calling the ibv_alloc_mw verb.

        type_mw = IBV_MW_TYPE_2;   /* or IBV_MW_TYPE_1 */
        mw = ibv_alloc_mw(pd, type_mw);

    4.20.3 Binding Memory Windows

    After being allocated, a memory window should be bound to a registered memory region. Memory Region sho
261. rt groups are not referred anywhere, and there is no need defining them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.

    The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the following example. They explain different keywords and their meaning.

        port-groups

            port-group    # using port GUIDs
                name: Storage
                # "use" is just a description that is used for logging
                # Other than that, it is just a comment
                use: SRP Targets
                port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
                port-guid: 0x1000000000FFFF
            end-port-group

            port-group
                name: Virtual Servers
                # The syntax of the port name is as follows:
                #   "node_description/Pnum".
                # node_description is compared to the NodeDescription of the node,
                # and "Pnum" is a port number on that node.
                port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
            end-port-group

            # using partitions defined in the partition policy
            port-group
                name: Partitions
                partition: Part1
                pkey: 0x1234
            end-port-group

            # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
            # or ALL (for all the nodes in the subnet)
            port-group
                name: CAs and SM
                node-type: CA, SELF
            end-port-group

        end-port-groups
262. rtinfo <addr> <portnum>, switchinfo <addr>, pkeys <addr> <portnum>, sl2vl <addr> <portnum>, vlarb <addr> <portnum>, guids <addr>, mepi <addr> <portnum>

    <dest dr_path | lid | guid>   Optional. Destination's directed path, LID, or GUID.

    Examples

    1. Query PortInfo by LID, with port modifier:

        > smpquery portinfo 1 1
        # Port info: Lid 1 port 1
        Mkey:............................0x0000000000000000
        GidPrefix:.......................0xfe80000000000000
        Lid:.............................0x0001
        SMLid:...........................0x0001
        CapMask:.........................0x251086a
                                         IsSM
                                         IsTrapSupported
                                         IsAutomaticMigrationSupported
                                         IsSLMappingSupported
                                         IsSystemImageGUIDsupported
                                         IsCommunicatonManagementSupported
                                         IsVendorClassSupported
                                         IsCapabilityMaskNoticeSupported
                                         IsClientRegistrationSupported
        DiagCode:........................0x0000
        MkeyLeasePeriod:.................0
        LocalPort:.......................1
        LinkWidthEnabled:................1X or 4X
        LinkWidthSupported:..............1X or 4X
        LinkWidthActive:.................4X
        LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
        LinkState:.......................Active
        PhysLinkState:...................LinkUp
        LinkDownDefState:................Polling
        ProtectBits:.....................0
        LMC:.............................0
        LinkSpeedActive:.................5.0 Gbps
        LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
        NeighborMTU
263. rtitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor.

    In order to express parallelism, UPC extends ISO C 99 with the following constructs:
      - An explicitly parallel execution model
      - A shared address space
      - Synchronization primitives and a memory consistency model
      - Memory management primitives

    The UPC language evolved from experiences with three other earlier languages that proposed parallel extensions to ISO C 99: AC, Split-C, and Parallel C Preprocessor (PCP). UPC is not a superset of these three languages, but rather an attempt to distill the best characteristics of each. UPC combines the programmability advantages of the shared memory programming paradigm and the control over data layout and performance of the message passing programming paradigm.

    Mellanox ScalableUPC is based on the Berkeley UPC package (see http://upc.lbl.gov) and contains the following enhancements:
      - GasNet library used within UPC integrated with Mellanox FCA, which offloads from UPC collective operations. For further information on FCA, please refer to the Mellanox website.
      - GasNet library contains MXM conduit, which offloads from UPC all P2P operations as well as some s
264. run -x MXM_UD_RX_MAX_BUFFERS=128000 <...>

    Modifying the default MXM parameters value from SHELL:

        % export MXM_UD_RX_MAX_BUFFERS=128000
        % mpirun <...>

    5.3.4 Configuring Multi-Rail Support

    Multi-Rail support enables the user to use more than one of the active ports on the card by making better use of the resources. It provides a combined throughput among the used ports.

    To configure dual rail support, specify the list of ports you would like to use to enable multi-rail support:

        -x MXM_RDMA_PORTS=cardName:portNum

    or:

        -x MXM_IB_PORTS=cardName:portNum

    5.3.5 Configuring MXM over the Ethernet Fabric

    To configure MXM over the Ethernet fabric:

    Step 1: Make sure the Ethernet port is active.

        ibv_devinfo

    ibv_devinfo displays the list of cards and ports in the system. Please make sure (in the ibv_devinfo output) that the desired port has Ethernet at the Link layer field and that its state is PORT_ACTIVE.

    Step 2: Specify the ports you would like to use, if there is a non-Ethernet active port in the card.

        -x MXM_RDMA_PORTS=mlx4_0:1

    or:

        -x MXM_IB_PORTS=mlx4_0:1

    5.4 Fabric Collective Accelerator

    The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) process to the server CPUs. As a system-wide solution, FCA does not require any additional hardware. The F
265.
    log_num_mgm_entry_size   log mgm size, that defines the num of qp per mcg, for example: 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12. To activate device managed flow steering when available, set to -1 (int)
    high_rate_steer          Enable steering mode for higher packet rate (default off) (int)
    fast_drop                Enable fast packet drop when no receive WQEs are posted (int)
    enable_64b_cqe_eqe       Enable 64 byte CQEs/EQEs when the FW supports this, if non-zero (default: 1) (int)
    log_num_mac              Log2 max number of MACs per ETH port (1-7) (int)
    log_num_vlan             (Obsolete) Log2 max number of VLANs per ETH port (0-7) (int)
    log_mtts_per_seg         Log2 number of MTT entries per segment (0-7) (default: 0) (int)
    port_type_array          Either pair of values (e.g. '1,2') to define uniform port1/port2 types configuration for all devices functions, or a string to map device function numbers to their pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1'). Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A. If only a single port is available, use the N/A port type for port2 (e.g. '1,4').
    log_num_qp               log maximum number of QPs per HCA (default: 19) (int)
    log_num_srq              log maximum number of SRQs per HCA (default: 16)

    Further parameters listed on this page: log_rdmarc_per_qp, log_num_cq, log_num_mcg, log_num_mpt, log_num_mtt, enable_qos, internal_err_reset.

    C.3 mlx4_en Parameters

    inline_thold, udp_rss, pfctx, pfcrx
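The "10 gives 248" figure for log_num_mgm_entry_size can be reproduced with a small calculation. The formula below is an assumption based on how the mlx4 driver appears to derive the number of QPs per multicast group entry (4 QPs per 16-byte line, with two lines used as a header); it is not stated in this manual, and the function name `qp_per_mcg` is hypothetical.

```python
def qp_per_mcg(log_num_mgm_entry_size: int) -> int:
    """Estimate QPs per MCG entry; assumed formula: 4 * (entry_size / 16 - 2)."""
    # The parameter description above gives 7..12 as the valid range.
    if not 7 <= log_num_mgm_entry_size <= 12:
        raise ValueError("log_num_mgm_entry_size must be in 7..12")
    entry_size = 1 << log_num_mgm_entry_size  # entry size in bytes
    return 4 * (entry_size // 16 - 2)


for log_size in range(7, 13):
    print(log_size, qp_per_mcg(log_size))
```

The assumed formula agrees with the one data point the manual provides (10 gives 248); values for other sizes should be treated as estimates.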
266. s Vlan Switch Tagging. In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag inserted, and incoming packets will have the VLAN tag removed. Any vlan-tagged packets sent by the VF are silently dropped. The default behavior is VGT.

    The feature may be controlled on the Hypervisor from userspace via iproute2/netlink:

        ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]
            ...
            [ vf NUM [ mac LLADDR ] [ vlan VLANID [ qos VLAN-QOS ] ]
            ...
            [ spoofchk { on | off } ] ]
            ...

    use:

        ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]

    where:
      - NUM = 0..max-vf-num
      - vlan_id = 0..4095 (4095 means "set VGT")
      - qos = 0..7

    For example:
      - ip link set dev eth2 vf 2 vlan <vlan_id> qos 3 - sets VST mode for VF #2 belonging to PF eth2, with qos = 3
      - ip link set dev eth2 vf 2 vlan 4095 - sets mode for VF 2 back to VGT

    4.13.7.3.2 Additional Ethernet VF Configuration Options

    Guest MAC configuration

    By default, guest MAC addresses are configured to be all zeroes. In the mlnx_ofed guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, he/she should configure guest MACs before the guest driver comes up. The guest MAC may be configured by using:

        ip link set dev <PF device> vf <NUM> mac <LLADDR>

    For legacy guests, which do not generate random MACs, the administrator should always configure their MAC ad
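The value ranges for the VF VLAN setting lend themselves to a small validating helper. The sketch below only builds the `ip link` argument list and does not execute it; the function name `vf_vlan_command` and the VLAN ID 10 used in the demonstration are hypothetical.

```python
def vf_vlan_command(pf_device, vf_num, vlan_id, qos=None):
    """Build an 'ip link set ... vf ... vlan ...' command as an argument list."""
    if vf_num < 0:
        raise ValueError("VF number must be non-negative")
    if not 0 <= vlan_id <= 4095:  # 4095 means "set VGT" per the text above
        raise ValueError("vlan_id must be in 0..4095")
    if qos is not None and not 0 <= qos <= 7:
        raise ValueError("qos must be in 0..7")
    cmd = ["ip", "link", "set", "dev", pf_device,
           "vf", str(vf_num), "vlan", str(vlan_id)]
    if qos is not None:
        cmd += ["qos", str(qos)]
    return cmd


# VST for VF 2 on eth2 with an illustrative VLAN ID, then back to VGT.
print(" ".join(vf_vlan_command("eth2", 2, 10, qos=3)))
print(" ".join(vf_vlan_command("eth2", 2, 4095)))
```

Returning a list rather than a string makes it straightforward to pass the result to `subprocess.run` later without shell quoting concerns.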
267.
    8.9 Congestion Control

    8.9.1 Congestion Control Overview

    Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation.

    The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).

    8.9.2 Running OpenSM with Congestion Control Manager

    Congestion Control (CC) Manager can be enabled/disabled through the SM options file. To do so, perform the following:

    1. Create the file. Run:

        opensm -c <options-file-name>

    2. Find the event_plugin_name option in the file and add ccmgr to it:

        # Event plugin name(s)
        event_plugin_name ccmgr

    3. Run the SM with the new options file:

        opensm -F <options-file-name>

    Once the Congestion Control is enabled on the fabric nodes, to completely disable Congestion Control you will need to actively turn it off. Running the SM w/o the CC Manager is not sufficient, as the hardware still continues to function in accorda
268. s need to be unique Ifa PKey is repeated then the associated partition configurations will be merged and the first PartitionName will be used see also next note e Itis possible to split a partition configuration in more than one definition but then they PKey should be explicitly specified otherwise different PKey values will be generated for those definitions 152 Mellanox Technologies m 2 1 1 0 0 Examples Default 0x7fff ALL SELF full NewPartition ipoib 0x123456 fu11 0x3456789034 1imi 0x2134af2306 YetAnotherOne 0x300 SELF full YetAnotherOne 0x300 ALL limited ShareIO 0x80 defmember full 0x123451 0x123452 0x123453 0x123454 will be limited ShareIO 0x80 0x123453 0x123454 0x123455 full 0x123456 0x123457 will be limited 3 0 S harelO 0x80 defmember limited 0x123456 0x123457 x123458 full hareIO 0x80 defmember full 0x123459 0x12345a harelO 0x80 defmember full 0x12345b 0x12345c limited 0x12345d The following rule is equivalent to how OpenSM used to run prior to the partition manager Default 0x7fff ipoib ALL full Mellanox Technologies 153 Rev 2 1 1 0 0 OpenSM Subnet Manager 8 5 Routing Algorithms OpenSM offers six routing engines 1 Min Hop Algorithm Based on the minimum hops to each node where the path length is optimized 2 UPDN Algorithm Based on the minimum hops to each node but it
269. se the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results are significantly dependent on the CPU and chipset efficiency.
7.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
The following changes are recommended for improving IPv4 traffic performance:
Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0
Enable the TCP selective acks option for better throughput:
    sysctl -w net.ipv4.tcp_sack=1
Increase the maximum length of processor input queues:
    sysctl -w net.core.netdev_max_backlog=250000
Increase the TCP maximum and default buffer sizes using setsockopt():
    sysctl -w net.core.rmem_max=4194304
    sysctl -w net.core.wmem_max=4194304
    sysctl -w net.core.rmem_default=4194304
    sysctl -w net.core.wmem_default=4194304
    sysctl -w net.core.optmem_max=4194304
Increase memory thresholds to prevent packet dropping:
    sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
Enable low latency mode for TCP:
    sysctl -w net.ipv4.tcp_low_latency=1
7.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
The following changes are recommended for improving IPv6 traffic performance: Disable the
270. selector 1 man page for details Note that MPI selector only affects the default MPI environment for future shells Specifically if you use MPI selector to select MPI implementation ABC this default selection will not take effect until you start a new shell e g logout and login again Other packages such as environ ment modules provide functionality that allows changing your environment to point to a new MPI implementation in the current shell The MPI selector was not meant to duplicate or replace that functionality The MPI selector functionality can be invoked in one of two ways 1 The mpi selector menu command This command is a simple menu based program that allows the selection of the system wide MPI usually only settable by root and a per user MPI selection It also shows what the current selections are This command is recommended for all users 2 The mpi selector command This command is a CLI equivalent of the mpi selector menu allowing for the same functionality as mpi selector menu but without the interactive menus and prompts It is suitable for scripting Mellanox Technologies 119 Rev 2 1 1 0 0 HPC Features 5 2 4 5 3 5 3 1 Compiling MPI Applications Compiling MVAPICH Applications Please refer to http mvapich cse ohio state edu support mvapich user guide html To review the default configuration of the installation check the default configuration file usr mpi lt compiler gt mvapich
271. set the minimal number of processes threshold to activate FCA export GASNET FCA NP CMD LINE 1 ScalableUPC contains modules configuration file http modules sf net which can be found at opt mellanox bupc 2 2 etc bupc modulefile rl 5 5 3 Various Executable Examples The following are various executable examples gt To run a Scalable UPC application without FCA support Q upcrun np 128 fca enable 0 lt executable filename gt gt To run UPC applications with FCA enabled for any number of processes export GASNET FCA ENABLE CMD LINE 1 GASNET FCA NP CMD LINE 0 upcrun np 64 lt executable filename gt oo Torun UPC application on 128 processes verbose mode 2 upcrun np 128 fca enable 1 fca np 10 fca verbose 5 executable filename Torun UPC application offload to FCA Barrier and Broadcast only 2 upcrun np 128 fca ops barrier bt executable filename Mellanox Technologies 125 Rev 2 1 1 0 0 Working With VPI 6 Working With VPI VPI allows ConnectX ports to be independently configured as either IB or Eth 6 1 Port Type Management ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports By default both ConnectX ports are initialized as InfiniBand ports If you wish to change the port type use the connectx port config script after the driver is loaded Running sbin connectx port config s will show current port conf
272. size of the message and inline indication in the packet MLX5 SHUT UP BF Disables blue flame feature Otherwise do not disable e MLX5 SINGLE THREADED All spinlocks are disabled Otherwise spinlocks enabled Used by applications that are single threaded and would like to save the overhead of taking spinlocks MLX5 CQE SIZE 64 completion queue entry size is 64 bytes default 128 completion queue entry size is 128 bytes 22 Mellanox Technologies m 2 1 1 0 0 MLX5 SCATTER TO CQE Small buffers are scattered to the completion queue entry and manipulated by the driver Valid for RC transport Default is 1 otherwise disabled 1 3 3 Mid layer Core Core services include management interface MAD connection manager CM interface and Subnet Administrator SA interface The stack includes components for both user mode and kernel applications The core services run in the kernel and expose an interface to user mode for verbs CM and management 1 3 4 ULPs IPoIB The IP over IB IPoIB driver is a network interface implementation over InfiniBand IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service IPoIB pre appends the IP datagrams with an encapsulation header and sends the outcome over the InfiniBand transport service The transport service is Unreliable Datagram UD by default but it may also be configured to be Reliable Connected RC The interface s
273. ssion algorithm for each TC LIST is comma seperated algorithm names for each TC Possible algorithms strict etc Example ets strict ets sets TCO TC2 to ETS and TC1 to strict The rest are unchanged t LIST tcbw LIST Set minimal guaranteed BW for ETS TCs LIST is comma Seperated percents for each TC Values set to TCs that are not configured to ETS algorithm are ignored but must be present Example if TC0 TC2 are set to ETS then 10 0 90 will set TCO to 10 and TC2 to 90 Percents must sum to 100 r LIST ratelimit LIST Rate limit for TCs in Gbps LIST is a comma seperated Gbps limit for each TC Example 1 8 8 will limit TCO to 1Gbps and TC1 TC2 to 8 Gbps each i INTF interface INTF Interface name a Show all interface s TCs Mellanox Technologies 71 Rev 2 1 1 0 0 Driver Features Get Current Configuration 72 Mellanox Technologies m 2 1 1 0 0 Set ratelimit 3Gbps for tc0 4Gbps for tc1 and 2Gbps for tc2 tc 0 ratelimit 3 Gbps tsa strict up 0 Skprio 0 Skprio 1 Skprio 2 tos 8 Skprio 3 Skprio 4 tos 24 Skprio 5 Skprio 6 tos 16 Skprio 7 Skprio 8 Skprio 9 Skprio 10 Skprio 11 Skprio 12 Skprio 13 Skprio 14 Skprio 15 mos 1 ups 2 woe 3 up 4 up 5 up 6 mos 7 Configure QoS map UP 0 7 to tc0 1 2 3 to tc1 and 4 5 6 to tc 2 set tc0 tc1 as ets and tc2 as strict divide ets 30 for tc0 and 70 for tc1 nii Gos gt erns s ets ets
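The tcbw rule in the excerpt above (one percent value per TC, values for non-ETS TCs ignored but still present, percents summing to 100) can be checked before a list is passed to the tool. The sketch below is illustrative only: the function name `check_tcbw` is hypothetical, and it assumes the sum-to-100 requirement applies to the ETS entries, which is consistent with the 10,0,90 example but is not stated explicitly in the excerpt.

```python
def check_tcbw(tsa, tcbw):
    """Validate a per-TC bandwidth list against a per-TC algorithm list.

    tsa  -- algorithm name per TC, e.g. ["ets", "strict", "ets"]
    tcbw -- percent per TC, same length; non-ETS entries are ignored
    Returns a dict mapping "TC<n>" to its guaranteed percent for ETS TCs.
    """
    if len(tsa) != len(tcbw):
        raise ValueError("one bandwidth value is required for every TC")
    ets = {f"TC{i}": bw
           for i, (alg, bw) in enumerate(zip(tsa, tcbw)) if alg == "ets"}
    # Assumption: only the ETS entries have to add up to 100 percent.
    if ets and sum(ets.values()) != 100:
        raise ValueError("ETS percents must sum to 100")
    return ets


if __name__ == "__main__":
    # The excerpt's example: TC0 and TC2 are ETS, TC1 is strict.
    print(check_tcbw(["ets", "strict", "ets"], [10, 0, 90]))
```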
274. st -f c
Immediately afterwards, run the following command to test OpenSM:
    host1# osmtest -f a
Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.
8.4 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the '--Pconfig' or '-P' flags. The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed. The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end ports are assigned partial membership.
8.4.1 File Format
Notes:
    Line content after the '#' character is a comment and is ignored by the parser.
General File Format:
    <Partition Definition>:<PortGUIDs list> ;
Partition Definition:
    [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where:
    PartitionName    string, will be used with logging. When omitted, an empty string will be used.
    PKey             P_Key value for this partition. Only the low 15 bits will be used. When omitted, the P_Key will be autogenerated.
    flag             used to indicate IPoIB capability of th
275. st be under the complete control of torus 2QoS any con figuration via qos_sl2vl qos swe sl2vl etc must and will be ignored and a warning will be generated Torus 2QoS uses VL values 0 3 to implement one of its supported QoS levels and VL values 4 7 to implement the other Hard to diagnose application issues may arise if traffic 1s not delivered fairly across each of these two VL ranges Torus 2QoS will detect and warn if VL arbi tration is configured unfairly across VLs in the range 0 3 and also in the range 4 7 Note that the default OpenSM VL arbitration configuration does not meet this constraint so all torus 2QoS users should configure VL arbitration via qos vlarb high qos vlarb low etc 8 5 7 5 Operational Considerations Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops As a result all applications run over such fabrics must perform a path record query to obtain the cor rect path SL for connection setup Applications that use rdma cm for connection setup will auto matically meet this requirement If a change in fabric topology causes changes in path SL values required to route without credit loops in general all applications would need to repath to avoid message deadlock Since torus 2QoS has the ability to reroute after a single switch failure without changing path SL values repathing by running applications is not required when the fabric is routed with torus 2QoS Torus 2QoS can pr
276. st of match rules and their QoS Level but in this case a match rule has only one criterion its goal is to match a certain ULP or a certain application on top of this ULP PR MPR request and QoS Level has only one constraint Service Level SL The simple policy section may appear in the policy file in combine with the advanced policy or as a stand alone policy definition See more details and list of match rule criteria below 170 Mellanox Technologies m 2 1 1 0 0 8 6 4 Policy File Syntax Guidelines Leading and trailing blanks as well as empty lines are ignored so the indentation in the example is just for better readability Comments are started with the pound sign and terminated by EOL Any keyword should be the first non blank in the line unless it s a comment Keywords that denote section subsection start have matching closing keywords Having a QoS Level named DEFAULT is a must it is applied to PR MPR requests that didn t match any of the matching rules Any section subsection of the policy file is optional 8 6 5 Examples of Advanced Policy File As mentioned earlier any section ofthe policy file is optional and the only mandatory part of the policy file is a default QoS Level Here s an example of the shortest policy file qos levels qos level name DEFAULT gle t end qos level end qos levels Port groups section is missing because there are no match rules which means that po
277. stripping is disabled For more info please refer to Documentation networking timestamping txt in kernel org 4 6 1 3 Querying Time Stamping Capabilities via ethtool To display Time Stamping capabilities via ethtool Show Time Stamping capabilities ethtool T eth lt x gt Example ethtool T eth0 Time stamping parameters for p2pl Capabilities hardware transmit SOF TI ESTAMPING TX HARDWARE software transmit SOF TIMESTAMPING TX SOFTWARE hardware receive SOF TIMESTAMPING RX HARDWARE software receive SOF TIMESTAMPING RX SOFTWARE software system clock SOF TIMESTAMPING SOFTWARE hardware raw clock SOF TIMESTAMPING RAW HARDWARE PTP Hardware Clock none Hardware Transmit Timestamp Modes off HWTSTAMP TX OFF on HWTSTAMP TX ON Hardware Receive Filter Modes none HWTSTAI P FILTER NONE all HWTSTAI P FILTER ALL 4 6 0 RoCE Time Stamping RoCE Time Stamping is currently at beta level Please be aware that everything listed here 1s subject to change RoCE Time Stamping allows you to stamp packets when they are sent to the wire received from the wire The time stamp is given in a raw hardware cycles but could be easily converted into hardware referenced nanoseconds based time Additionally it enables you to query the hardware for the hardware time thus stamp other application s event and compare time 4 6 2 1 Query Capabilities Time stamping is available if and only the hardware reports it is capable
278. t 0 When the value is set to 0 no statistics are collected Mellanox Technologies 189 9 9 1 9 2 9 2 1 9 2 2 190 Mellanox Technologies Rev 2 1 1 0 0 InfiniBand Fabric Diagnostic Utilities Overview The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand IB devices in a fabric Utilities Usage This section first describes common configuration interface and addressing for all the tools in the package Then it provides detailed descriptions of the tools themselves including operation synopsis and options descriptions error codes and examples Common Configuration Interface and Addressing Topology File Optional An InfiniBand fabric is composed of switches and channel adapter HCA TCA devices To iden tify devices in a fabric or even in one switch system each device is given a GUID a MAC equivalent Since a GUID is a non user friendly string of characters it is better to alias it to a meaningful user given name For this objective the IB Diagnostic Tools can be provided with a topology file which is an optional configuration file specifying the IB fabric topology in user given names For diagnostic tools to fully support the topology file the user may need to provide the local sys tem name if the local hostname is not used in the topology file To specify a topology file to a diagnostic tool use
279. t d lt MST DEVICE NAME gt q Example with ConnectX 2 QDR device Image type ConnectX FW Version 2 9 1000 Rom Info type PXE version 3 4 142 devid 26428 proto VPI Device ID 26428 Description Node Porti Port2 Sys image GUIDs 0002c9030005cffa 0002c9030005cffb 0002c9030005cffc 0002c9030005cffd MACs 0002c905cffa 0002c905cffb Board ID MT 0DD0110009 VSD PSID MT 0DD0110009 Assuming that FlexBoot is connected via Port 1 then the Port GUID is 00 02 c9 03 00 00 10 39 Extracting the Port GUID Method Il An alternative method for obtaining the port GUID involves booting the client machine via Flex Boot This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet The 8 bytes can be captured from the boot session as shown in the figure below Mellanox ConnectX FlexBoot v3 3 400 iPXE 1 0 0 Open Source Network Boot Firmware netO 00 02 c9 03 00 0c 78 11 on PCIOZ 00 0 open Link down TX O TXE O RX 0 RXE 0 Link status The socket is not connected Waiting for link up on netO ok Placing Client Identifiers in etc dhcpd conf The following is an excerpt of a etc dhcpd conf example file showing the format of represent ing a client machine for the DHCP server host hosti nerto erve rm Ll 2 5 Tg filename pxelinux 0 fixed address 11 4 3 130 226 Mellanox Technologies m 2 1 1 0 0 option dhcp client identifier HE 8 010 8010 20 0 8101 8 00022
280. t the Dom0, vm1 and vm2 IPoIB QPs will all use different PKeys.
To partition IPoIB communication using PKeys:
Step 1. Create a file /etc/opensm/partitions.conf on the host on which OpenSM runs, containing the lines:
    Default=0x7fff,ipoib : ALL=full ;
    Pkey1=0x3000,ipoib : ALL=full;
    Pkey3=0x3030,ipoib : ALL=full;
This will cause OpenSM to configure the physical Port PKey tables on all physical ports on the network as follows:
    pkey idx | pkey value
    ---------|-----------
    0        | 0xFFFF
    1        | 0xB000
    2        | 0xB030
(The most significant bit indicates whether a PKey is a full PKey.) The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.
Step 2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
Step a. Check the PCI ID for the Physical Function and the Virtual Functions:
    lspci | grep Mel
Step b. Assume that on Host1 the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0". On Host1, do the following:
    cd /sys/class/infiniband/mlx4_0/iov
    0000:02:00.0  0000:02:00.1  0000:02:00.2 ...
1. 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical function. 0000:02:00.X contain the virtual-to-physical mapping tables for the virtual functions. Do not touch the Dom0 mapping table (under <nnnn>:<nn>:00.0). Modify only tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that
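The relationship between the PKeys written in partitions.conf and the values shown in the port PKey table above is simple bit arithmetic: the low 15 bits carry the partition number and the most significant bit (0x8000) marks full membership. A minimal sketch follows; the helper names are hypothetical.

```python
FULL_MEMBER_BIT = 0x8000  # most significant bit of the 16-bit PKey
PKEY_BASE_MASK = 0x7FFF   # low 15 bits identify the partition


def full_member_pkey(pkey):
    """Return the 16-bit table value for a full member of the partition."""
    return (pkey & PKEY_BASE_MASK) | FULL_MEMBER_BIT


def is_full_member(pkey):
    """True if the PKey table value has the full-membership bit set."""
    return bool(pkey & FULL_MEMBER_BIT)


if __name__ == "__main__":
    for configured in (0x7FFF, 0x3000, 0x3030):
        print(hex(configured), "->", hex(full_member_pkey(configured)))
```

With the three PKeys from the example (0x7fff, 0x3000, 0x3030) this yields the table values 0xffff, 0xb000 and 0xb030.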
281. t is not connected Waiting for link up on netO ok DHCP netO 02 02 c9 0c 78 11 ok neto 11 3 12 2 255 255 255 0 Next server 11 3 12 121 Filename pxeilinux O Root path tftpboot tf ftp 11 3 12 121 pxeilinux 0 Next FlexBoot attempts to boot as directed by the DHCP server 228 Mellanox Technologies m 2 1 1 0 0 A 8 Diskless Machines Mellanox FlexBoot supports booting diskless machines To enable using an IB ETH driver the initrd image must include a device driver module and be configured to load that driver This can be achieved by adding the device driver module into the initrd image and loading it The initrd image of some Linux distributions such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux cannot be edited prior or during the installation process ade If you need to install Linux distributions over Flexboot please replace your initrd images with the images found at www mellanox com gt Products gt Adapter IB VPI SW gt FlexBoot Download Tab A 8 1 Casel InfiniBand Ports The IB driver requires loading the following modules in the specified order see Section A 8 1 1 for an example jb addr ko jb core ko jb mad ko jb sa ko 1b_cm ko jb uverbs ko jb ucm ko jb umad ko iw cm ko rdma cm ko rdma ucm ko mlx4 core ko mlx4 ib ko jb mthca ko e ipoib_helper ko this module is not required for all OS kernels Please
282. t problems in routing tables scatter ports random seed Randomize best port chosen for a route max reverse hops H hop count Set the max number of hops the wrong way around an I O node is allowed to do connectivity for I O nodes on top swithces Biol wile i Yan to le Name of the map file with set of the IDs which will be used by Up Down routing algorithm instead of node GUIDs format lt guid gt lt id gt per line guid routing order file X path to file gt Set the order port guids will be routed for the MinHop and Up Down routing algorithms to the guids provided in the given file one to a line torus config lt path to file gt This option defines the file name for the extra configuration info needed for the torus 2Q0S routing engine The default name is etc opensm torus 2Q0S conf once 0 This option causes OpenSM to configure the subnet once then exit Ports remain in the ACTIVE state sweep s interval This option specifies the number of seconds between subnet sweeps Specifying s 0 disables sweeping Without s OpenSM defaults to a sweep interval of 10 seconds 142 Mellanox Technologies m 2 1 1 0 0 timeout t lt milliseconds gt This option specifies the time in milliseconds used for transaction timeouts Specifying t 0 disables timeouts Without t OpenSM defaults to a timeout value of 200 milliseconds retries number This option specifies the n
283. the one in 00 07 0 Note PFs not included in the above list will not have SR IOV enabled port type array Specifies the protocol type of the ports It is either one array of 2 port types t1 t2 for all devices or list of BDF to port type array bb dd f tl t2 string Valid port types 1 ib 2 eth 3 auto 4 N A If only a single port is available use the N A port type for port2 e g 1 4 probe vf Jfabsent or zero no VF interfaces will be loaded in the Hyper visor host fnumO0 vfs is a number in the range of 1 63 the driver run ning on the Hypervisor will itself activate that number of VFs All these VFs will run on the Hypervisor This number will apply to all ConnectX HCAs on that host fits format is a string the string specifies the probe vf parameter separately per installed HCA The string format is bb dd f v bb dd f v bb dd f bus device function of the PF of the HCA v number of VFs to use in the PF driver for that HCA For example probe vfs 5 The PF driver will activate 5 VFs on the HCA and this will be applied to all ConnectX HCAs on the host probe vfs 00 04 0 5 00 07 0 8 The PF driver will activate 5 VFs on the HCA positioned in BDF 00 04 0 and 8 for the one in 00 07 0 Note PFs not included in the above list will not activate any of their VFs in the PF driver The example above loads the driver with 5 VFs num vfs The standard use of a VF is a single VF
284. tional ver bosity vvv or v v v V ersion Optional Show version info D irect Optional Use directed path address arguments The path is a comma separated list of out ports Examples 0 self port 0 1 2 1 4 out via port 1 then 2 G uid Optional Use GUID address argument In most cases 1t is the Port GUID Example 0x08f1040023 s lt smlid gt Optional Use lt smlid gt as the target lid for SM SA queries C ca name Optional Use the specified channel adapter or router P ca port Optional Use the specified port t timeout ms Optional Override the default timeout for the solicited MADs msec dest dr path lid Optional Destination s directed path LID or guid gt GUID lt portnum gt Optional Destination s port number lt op gt lt value gt Optional query Define the allowed port operations enable disable reset speed and query In case of multiple channel adapters CAs or multiple ports without a CA port being specified a port is chosen by the utility according to the following criteria 1 The first ACTIVE port that is found 2 Ifnot found the first port that is UP physical link state is LinkUp Examples 204 Mellanox Technologies Rev 2 1 1 0 0 1 Query the status of Port 1 of CA mlx4 0 using ibstatus and use its output the LID 3 in this case to obtain additional link information using ibportstate gt ibstatus mlx4 0 1 Infinib
285. tions File m 2 1 1 0 0 Option File ca control map Desctiption An array of sixteen bits one for each SL Each bit indicates whether or not the corresponding SL entry is to be modified Values Values Oxffff ccti Increase Sets the CC Table Index CCTI increase Default 1 trigger threshold Sets the trigger threshold Default 2 ccti min Sets the CC Table Index CCTI minimum Default 0 cct Sets all the CC table entries to a specified value Values lt comma separated The first entry will remain 0 whereas last value list gt will be set to the rest of the table Default 0 When the value is set to 0 the CCT calculation is based on the number of nodes ccti timer Sets for all SL s the given ccti timer Default 0 Table 25 Congestion Control Manager CC MGR Options File When the value is set to 0 the CCT calculation is based on the number of nodes Option File Desctiption Values max errors error window When number of errors exceeds max_errors of send receive errors or timeouts in less than error window seconds the CC MGR will abort and will allow OpenSM to proceed Values max errors 0 zero tollerance abort configuration on first error error window 0 mechanism dis abled no error checking Default 5 cc statistics cycle Enables CC MGR to collect statistics from all nodes every cc statistics cycle seconds Defaul
286. tiples of bytes e g IKB 1024 bytes and 1MB 1048576 bytes b Small b is used to indicate size in bits or multiples of bits e g Kb 1024 bits FW Firmware HCA Host Channel Adapter HW Hardware IB InfiniBand iSER iSCSI RDMA Protocol LSB Least significant byte Isb Least significant bit MSB Most significant byte msb Most significant bit NIC Network Interface Card SW Software VPI Virtual Protocol Interconnect IPoIB IP over InfiniBand PFC Priority Flow Control PR Path Record RDS Reliable Datagram Sockets RoCE RDMA over Converged Ethernet 14 Mellanox Technologies m 2 1 1 0 0 Table 2 Abbreviations and Acronyms Sheet 2 of 2 Abbreviation Acronym Whole Word Description SDP Sockets Direct Protocol SL Service Level SRP SCSI RDMA Protocol MPI Message Passing Interface EoIB Ethernet over Infiniband QoS Quality of Service ULP Upper Level Protocol VL Virtual Lane vHBA Virtual SCSI Host Bus adapter uDAPL User Direct Access Programming Library Glossary The following is a list of concepts and terms related to InfiniBand in general and to Subnet Man agers in particular It 1s included here for ease of reference but the main reference remains the InfiniBand Architecture Specification Table 3 Glossary Sheet 1 of 2 Channel Adapter An IB device that terminates an IB link and executes transport CA Host Channel functions This ma
287. to a PR MPR query that does not match any existing match rule Similar to any other QoS Level it can also be explicitly referred by any match rule IV QoS Matching Rules denoted by qos match rules Each PathRecord MultiPathRecord query that OpenSM receives is matched against the set of matching rules Rules are scanned in order of appearance in the QoS policy file such as the first match takes precedence Each rule has a name of QoS level that will be applied to the matching query A default QoS level is applied to a query that did not match any rule Queries can be matched by Source port group whether a source port is a member of a specified group Destination port group same as above only for destination port e PKey QoSclass Service ID To match a certain matching rule PR MPR query has to match ALL the rule s criteria However not all the fields of the PR MPR query have to appear in the matching rule For instance if the rule has a single criterion Service ID it will match any query that has this Service ID disregarding rest of the query fields However if a certain query has only Service ID which means that this 1s the only bit in the PR MPR component mask that is on it will not match any rule that has other matching criteria besides Service ID 8 6 3 Simple QoS Policy Definition Simple QoS policy definition comprises of a single section denoted by qos ulps Similar to the advanced QoS policy it has a li
288. tself With allow_ext_sg 0 the parameter cmd sg entries defines the maxi mum S G list length for a single SRP_CMD and commands whose S G list length exceeds this limit after S G list collapsing will fail Whether ib_srp is allowed to include a partial memory descriptor list in an SRP CMD instead of the entire list If a partial memory descriptor list has been included in an SRP CMD the remaining memory descrip tors are communicated from initiator to target via an additional RDMA transfer Setting allow_ext_sg to increases the maximum amount of data that can be transferred between initiator and target via a single SCSI command Since not all SRP target implementations support par tial memory descriptor lists the default value for this option is 0 A number in the range 1 2048 specifying the maximum S G list length the SCSI layer is allowed to pass to ib_srp Specifying a value that exceeds cmd sg entries is only safe with partial memory descriptor list support enabled allow_ext_sg 1 A number in the range 0 n 1 specifying the MSI X completion vector Some HCA s allocate multiple n MSI X vectors per HCA port If the IRQ affinity masks of these interrupts have been configured such that each MSI X interrupt is handled by a different CPU then the comp vector parameter can be used to spread the SRP completion workload over multiple CPU s 50 Mellanox Technologies rev 2 1 1 0 0 tl retry count A number in the range 2 7 specifying
289. ttr
    bit_position = 1;
    carry = 0;
    atomic_response = 0;
    for i = 0 to 63
    {
        if (i != 0)
            bit_position = bit_position << 1;
        bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position), MASK_IS_SET(compare_add, bit_position), &new_carry);
        if (bit_add_res)
            atomic_response |= bit_position;
        carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)));
    }
    return atomic_response;
Ethernet Tunneling Over IPoIB Driver (eIPoIB)
The eth_ipoib driver provides a standard Ethernet interface to be used as a Physical Interface (PIF) into the Hypervisor virtual network, and serves one or more Virtual Interfaces (VIF). This driver supports L2 Switching (Direct Bridging) as well as other L3 Switching modes (e.g. NAT). This document explains the configuration and driver behavior when configured in Bridging mode.
In a virtualization environment, a virtual machine can be exposed to the physical network by performing the following steps:
Step 1. Create a virtual bridge.
Step 2. Attach the para-virtualized interface created by the eth_ipoib driver to the bridge.
Step 3. Attach the Ethernet interface in the Virtual Machine to that bridge.
The diagram below describes the topology that was created after these steps.
[Figure: Virtual Interface(s) (vifX), Virtual Bridge(s) (vbrX, aka vSwitch), Bridge Uplink(s), eIPoIB, IPoIB Uplink, InfiniBand Fabric]
The diagram shows how the traffic from
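The bit-by-bit addition loop at the start of the excerpt above can be sketched in Python to make the role of the mask concrete. This is a hedged illustration, not the driver's implementation: the function name `masked_fetch_add` and the `width` parameter are hypothetical, and the sketch assumes that a set bit in the mask stops the carry from propagating past that bit position, so the operand behaves as several independent fields.

```python
def masked_fetch_add(value, add, boundary_mask, width=64):
    """Add `add` to `value` bit by bit, dropping the carry at mask bits.

    Assumption: a set bit in boundary_mask marks the top bit of a field,
    so the carry out of that bit is discarded instead of propagated.
    """
    carry = 0
    result = 0
    for i in range(width):
        bit = 1 << i
        total = carry + ((value >> i) & 1) + ((add >> i) & 1)
        if total & 1:            # sum bit of the one-bit full adder
            result |= bit
        new_carry = total >> 1   # carry-out of the one-bit full adder
        carry = 0 if (boundary_mask & bit) else new_carry
    return result


if __name__ == "__main__":
    # Two independent 8-bit fields in the low 16 bits: the low field wraps
    # from 0xFF to 0x00 without disturbing the high field's own addition.
    print(hex(masked_fetch_add(0x01FF, 0x0101, 0x0080)))
```

With an all-zero mask the function reduces to ordinary addition modulo 2^width.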
290. ty level. See the -vf option for more information about log verbosity.
-V
    This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.
-vf <flags>
    This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:
    BIT     LOG LEVEL ENABLED
    ----    -----------------
    0x01    ERROR (error messages)
    0x02    INFO (basic messages, low volume)
    0x04    VERBOSE (interesting stuff, moderate volume)
    0x08    DEBUG (diagnostic, high volume)
    0x10    FUNCS (function entry/exit, very high volume)
    0x20    FRAMES (dumps all SMP and GMP frames)
    0x40    ROUTING (dump FDB routing information)
    0x80    currently unused
    Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
-h, --help
    Display this usage info, then exit.
8.3.2 Running osmtest
To run osmtest in the default mode, simply enter:
    host1# osmtest
The default mode runs all the flows except for the Quality of Service flow (see Section 8.6). After installing opensm, and if the InfiniBand fabric is stable, it is recommended to run the following command in order to generate the inventory file:
    host1# osmte
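The verbosity flags field in the excerpt above is a plain bitmask, so a value such as 0x3 can be decoded mechanically. The following sketch uses the bit assignments from the table; the function name `decode_vf` is hypothetical and is not part of osmtest.

```python
# Bit assignments as listed in the flags table (0x80 is currently unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_vf(flags):
    """Return the names of the log levels enabled by a flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]


if __name__ == "__main__":
    print(decode_vf(0x03))  # the documented default, ERROR + INFO
```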
291. ty settings, run:
    show_irq_affinity.sh <interface>
7.2.7.2 Auto Tuning Utility
MLNX_OFED 2.0.x introduces a new affinity tool called mlnx_affinity. This tool can automatically adjust your affinity settings for each network interface according to the system architecture.
Usage:
    Start:    mlnx_affinity start
    Stop:     mlnx_affinity stop
    Restart:  mlnx_affinity restart
mlnx_affinity can also be started by driver load/unload.
To enable mlnx_affinity by default, add the line below to the /etc/infiniband/openib.conf file:
    RUN_AFFINITY_TUNER=yes
7.2.7.3 Tuning for Multiple Adapters
When optimizing system performance for more than one adapter, it is recommended to separate the adapters' core utilization so that there is no interleaving between interfaces. The following script can be used to assign each adapter's IRQs to a different set of cores:
    set_irq_affinity_cpulist.sh <cpu list> <interface>
<cpu list> can be either a comma-separated list of single core numbers (0,1,2,3) or core groups (0-3).
Example: If the system has 2 adapters on the same NUMA node (0-7), each with 2 interfaces, run the following:
    /etc/init.d/irqbalancer stop
    set_irq_affinity_cpulist.sh 0-1 eth2
    set_irq_affinity_cpulist.sh 2-3 eth3
    set_irq_affinity_cpulist.sh 4-5 eth4
    set_irq_affinity_cpulist.sh 6-7 eth5
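The cpu list notation used in the multiple-adapter example above (single cores such as 0,1,2,3 or groups such as 0-3) and the idea of giving each interface its own non-overlapping cores can be sketched as follows. The helper names `parse_cpu_list` and `split_cores` are hypothetical; the sketch only computes the assignment and does not touch IRQ settings.

```python
def parse_cpu_list(spec):
    """Expand a cpu list such as "0,2,4-7" into a list of core numbers."""
    cores = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"bad core group: {part}")
            cores.extend(range(low, high + 1))
        else:
            cores.append(int(part))
    return cores


def split_cores(cores, interfaces):
    """Give each interface an equal-sized, non-overlapping slice of cores."""
    per_interface = len(cores) // len(interfaces)
    if per_interface == 0:
        raise ValueError("fewer cores than interfaces")
    return {
        name: cores[i * per_interface:(i + 1) * per_interface]
        for i, name in enumerate(interfaces)
    }


if __name__ == "__main__":
    # Cores 0-7 shared by four interfaces, as in the example.
    plan = split_cores(parse_cpu_list("0-7"), ["eth2", "eth3", "eth4", "eth5"])
    for name, assigned in plan.items():
        print(name, assigned)
```

For cores 0-7 and four interfaces this reproduces the 0-1, 2-3, 4-5, 6-7 split used in the example.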
292. u do not have it in your initrd, please add it using the following command:
host1# cp /sbin/insmod /tmp/initrd_ib/sbin/
Step 7. If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
host1# cp /sbin/ifconfig /tmp/initrd_ib/sbin
Step 8. If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step.
To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with.
Copy the DHCP client v3.1.3 file and all the relevant files as described below.
host1# cp <path to DHCP client v3.1.3>/dhclient /tmp/initrd_ib/sbin
host1# cp <path to DHCP client v3.1.3>/dhclient-script /tmp/initrd_ib/sbin
host1# mkdir -p /tmp/initrd_ib/var/state/dhcp
host1# touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
host1# cp /bin/uname /tmp/initrd_ib/bin
host1# cp /usr/bin/expr /tmp/initrd_ib/bin
host1# cp /sbin/ifconfig /tmp/initrd_ib/bin
host1# cp /bin/hostname /tmp/initrd_ib/bin
Step 9. Create a configuration file for the DHCP client (as described in Section 4.3.3.1) and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):
dhclient.conf
# The value indicates a hexadecimal number
# For a ConnectX device
interface ib0 send dhcp cli
293. uld have been registered using the IBV_ACCESS_MW_BIND access flag.
• Binding Memory Window type 1 is done via the ibv_bind_mw verb.
struct ibv_mw_bind mw_bind;
ret = ibv_bind_mw(qp, mw, &mw_bind);
• Binding Memory Window type 2B is done via the ibv_post_send verb and a specific Work Request (WR) with opcode = IBV_WR_BIND_MW.
Prior to binding, please make sure to update the existing rkey.
ibv_inc_rkey(mw->rkey)
4.20.4 Invalidating Memory Window
Before rebinding Memory Window type 2, it must be invalidated using the ibv_post_send verb and a specific WR with opcode = IBV_WR_LOCAL_INV.
4.20.5 Deallocating Memory Window
Deallocating a memory window is done using the ibv_dealloc_mw verb.
ibv_dealloc_mw(mw);
5 HPC Features
5.1 Shared Memory Access
The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.
The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabiliti
294. ulticast forwarding tables of the fabric switches
ibdiagnet.masks: In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm: List of all the SMs (state and priority) in the fabric
ibdiagnet.pm: A dump of the PM counter values of the fabric links
ibdiagnet.pkey: A dump of the existing partitions and their member host ports
ibdiagnet.mcg: A dump of the multicast groups, their properties, and member host ports
ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option
In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output.
After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.
After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
• SM report
• Number of nodes and systems
• Hop-count information: maximal hop-count, an example path, and a hop-count histogram
• All CA-to-CA paths traced
• Credit loop report
• mgid mlid
295. umber of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.
--maxsmps, -n <number>
This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.
--rereg_on_guid_migr
This option, if enabled, forces OpenSM to send port info with the client reregister bit set to all nodes in the fabric when an alias GUID migrates from one physical port to another.
--aguid_inout_notice
This option enables sending GID IN/OUT notices on Alias GUIDs register/delete request to registered clients.
--sm_assign_guid_func [uniq_count | base_port]
Specifies the algorithm that SM will use when it comes to choose SM assigned alias GUIDs. The default is uniq_count.
--console, -q [off | local]
This option activates the OpenSM console (default off).
--ignore_guids, -i <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by guid) that will be ignored by the link load equalization algorithm.
--hop_weights_file, -w <path to file>
This option provides the means to define a weighting factor per port for customizing the least weight hops for the routing.
--port_search_ordering_file, -O <path to file>
This option provides the means to define a mapping betwe
296. uns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, follow the manual configuration procedure described in Section 4.3.3.3.
Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, is recognized (see Chapter 8, "OpenSM Subnet Manager").
Removing a Subinterface
To remove a child interface (subinterface), run:
echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child
Using the example of Step 2:
echo 0x8001 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most significant bit set (that is, bit 0x8000, as in the value 0x8001 in the example above).
Verifying IPoIB Functionality
To verify your configuration and your IPoIB functionality, perform the following steps:
Step 1. Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:
host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0
Step 2. Enter the ping command from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command:
host1# ping -c 5 11.4.3.176
P
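The note about the most significant bit of the PKey can be illustrated with a short Python sketch. The names PKEY_MSB, with_msb_set, and delete_child_command are hypothetical and exist only for this illustration; the sketch builds the command string and does not touch sysfs.

```python
# Hypothetical illustration of the "most significant bit set" rule for the
# 16-bit PKey value written to delete_child.
PKEY_MSB = 0x8000


def with_msb_set(pkey):
    """Return the 16-bit PKey value with its most significant bit set."""
    return (pkey & 0xFFFF) | PKEY_MSB


def delete_child_command(parent, pkey):
    """Build the echo command that removes the subinterface for a PKey."""
    return "echo 0x%04x > /sys/class/net/%s/delete_child" % (
        with_msb_set(pkey),
        parent,
    )


if __name__ == "__main__":
    # A PKey given as 0x0001 and one already given as 0x8001 yield the
    # same command.
    print(delete_child_command("ib0", 0x0001))
    print(delete_child_command("ib0", 0x8001))
```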
297. upports unicast, multicast, and broadcast. For details, see Chapter 4.3, "IP over InfiniBand".
• iSER
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies. For further information, please refer to Chapter 4.2, "iSCSI Extensions for RDMA (iSER)".
• SRP
SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 4.1, "SCSI RDMA Protocol", and Appendix B, "SRP Target Driver".
• uDAPL
User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative. This release of the uDAPL reference implementation package for both DAT 1.2 and 2.0 specification is timed to coincide with O
298. ut ing on a 3D torus.
Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [a-zA-Z]:
[Figure: 2D 6x5 torus grid, x = 0..5, y = 0..4. Row y = 1 holds switches m, S, n, T, o, p in order of increasing x; switch r is on row y = 2 and switch D is on row y = 3.]
For a pristine fabric, the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value: once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring.
One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics.
Note that in the case where there are multiple parallel links between a pair of switches, torus
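The "long way around" behaviour on a single 1D ring can be sketched in Python as follows. This is an illustration of the idea only, not the torus-2QoS implementation: the function name ring_path is hypothetical, choosing the shorter surviving direction is an assumption, and returning None for one unreachable pair stands in for the engine refusing to route the whole fabric.

```python
def ring_path(ring, src, dst, failed_links=()):
    """Return a path from src to dst on a 1D ring that avoids failed links.

    ring is the list of switches in ring order (the last one links back to
    the first). Returns None when both directions are blocked, i.e. the
    ring has been broken into disjoint segments between src and dst.
    """
    failed = {frozenset(link) for link in failed_links}
    size = len(ring)
    start = ring.index(src)
    candidates = []
    for step in (1, -1):  # walk the ring in each direction
        path = [src]
        position = start
        blocked = False
        while ring[position] != dst:
            following = (position + step) % size
            if frozenset((ring[position], ring[following])) in failed:
                blocked = True
                break
            path.append(ring[following])
            position = following
        if not blocked:
            candidates.append(path)
    # Assumption: prefer the shorter of the surviving directions.
    return min(candidates, key=len) if candidates else None


if __name__ == "__main__":
    ring = ["m", "S", "n", "T", "o", "p"]  # the y = 1 ring from the example
    print(ring_path(ring, "S", "T"))
    print(ring_path(ring, "S", "T", failed_links=[("S", "n")]))
    print(ring_path(ring, "S", "T", failed_links=[("n", "T"), ("T", "o")]))
```

With the S-n link failed, the sketch yields the S-m-p-o-T segment of the path quoted above; with both n-T and T-o failed, T is isolated and no path exists.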
299. will only display the reachable targets via the port and to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
• It is recommended to use the -n option. This option adds the initiator_ext to the connecting string (see Section 4.1.2.5 for more details).
• srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
• A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 4.1.2.4.
4.1.2.4 Automatic Discovery and Connection to Targets
• Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
• To connect to all the existing Targets in the fabric, run "srp_daemon -e -o". This utility will scan the fabric once, connect to every Target it detects, and then exit.
srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file.
• To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by th
300. wing parameter:
MTU=65520
65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 4.3.2, "IPoIB Mode Setting", on page 57) and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU or do not set MTU at all, performance of the interface might decrease.
• In the bonding slave configuration file (e.g., ifcfg-ib0), use the same Linux Network Scripts semantics. In particular:
DEVICE=ib0
• In the bonding slave configuration file (e.g., ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key).
• For RHEL users: In /etc/modprobe.d/bond.conf add the following lines:
alias bond0 bonding
• For SLES users: It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g., ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
Restarting openibd does not keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the net
301. witch and each + is a non-root switch:
[Figure: multicast spanning tree drawn on the 2D torus grid, with axes x and y.]
For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn.
For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop.
In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.
Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree.
Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.
In the presence of link or switch failures that result in a fabric for which torus-2QoS can generate credit-loop-free unicast
302. work interface card split them into separate packets.
Large Receive Offload (LRO) increases inbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. LRO is available in kernel versions < 3.1 for untagged traffic.
Note: LRO will be done whenever possible. Otherwise GRO will be done. Generic Receive Offload (GRO) is available throughout all kernels.
Table 6 - ethtool Supported Options
Options: ethtool -c eth<x>
Description: Queries interrupt coalescing settings.
Options: ethtool -C eth<x> adaptive-rx on|off
Description: Enables/disables adaptive interrupt moderation. By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.
Options: ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
Description: Sets the values for packet rate limits and for moderation time high and low values. Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.
Options: ethtool -C eth<x> rx
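The packet rate limits described above can be read as a clamp on the moderation time. The Python sketch below shows that reading. The function name and the sample numbers are hypothetical, and the linear interpolation between the two limits is an assumption: the text above only states what happens above the upper limit and below the lower limit.

```python
def moderation_usecs(pkt_rate, pkt_rate_low, pkt_rate_high,
                     rx_usecs_low, rx_usecs_high):
    """Pick a receive moderation time (in usecs) for a measured packet rate."""
    if pkt_rate >= pkt_rate_high:
        # At or above the upper packet rate limit: highest moderation time.
        return rx_usecs_high
    if pkt_rate <= pkt_rate_low:
        # At or below the lower packet rate limit: lowest moderation time.
        return rx_usecs_low
    # Assumption (not stated in the text above): interpolate linearly
    # between the two limits, using integer arithmetic.
    span = pkt_rate_high - pkt_rate_low
    return rx_usecs_low + (pkt_rate - pkt_rate_low) * (rx_usecs_high - rx_usecs_low) // span


if __name__ == "__main__":
    # Hypothetical limits, chosen only to make the arithmetic easy to follow.
    for rate in (500, 5500, 20000):
        print(rate, moderation_usecs(rate, 1000, 10000, 0, 128))
```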
303. work service by running:
/etc/init.d/network restart
4.4 Quality of Service InfiniBand
4.4.1 Quality of Service Overview
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
[Figure 2: I/O Consolidation Over InfiniBand. Labels: Servers, Administrator, QoS Manager, IB-Ethernet Gateway, IB-Fibre Channel Gateway, Block Storage.]
QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM Subnet Manager".
The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources.
The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
• Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
• Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served
• Packets carry class of service marking in the range 0 to 15 in their header SL field
• Each switch can map the incoming
304. y be an HCA (Host CA) or a TCA (Target CA).
HCA Card: A network adapter card based on an InfiniBand channel adapter device.
IB Devices: Integrated circuit implementing InfiniBand compliant communication.
IB Cluster/Fabric/Subnet: A set of IB devices connected by IB cables.
In-Band: A term assigned to administration activities traversing the IB connectivity only.
Local Identifier (ID): An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
Local Device/Node/System: The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.
Table 3 - Glossary (Sheet 2 of 2)
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch providing the list of ports to forward received multicast packets. The table is organized by MLID.
Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
Standby Subnet Manager: A Subnet Manager that is currently quiescent and not in t
305. ynchronization routines. For further information on MXM, please refer to the Mellanox website.
Mellanox OFED 1.8 includes ScalableUPC 2.1, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8, you do not need to download and install ScalableUPC.
Mellanox ScalableUPC is distributed as source RPM as well and can be downloaded from the Mellanox website.
5.5.1 Installing ScalableUPC
Mellanox ScalableUPC is installed as part of the MLNX_OFED package.
Mellanox OFED 1.8.5 includes ScalableUPC Rev 2.2, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8.5, you do not need to download and install ScalableUPC.
Mellanox ScalableUPC is distributed as source RPM as well and can be downloaded from the Mellanox website.
Please note, the binary distribution of ScalableUPC is compiled with the following defaults:
• FCA support. FCA is disabled at runtime by default and must be configured prior to using it from ScalableUPC. For further information, please refer to the FCA User Manual.
• MXM support: enabled by default
5.5.2 FCA Runtime Parameters
The following parameters can be passed to upcrun in order to change FCA support behavior:
Table 15 - Runtime Parameters
Parameter: -fca_enable <0|1>
Description: Disables/Enables FCA support at runtime (default: disable).
Parameter: -fca_np <value>
Description: Enables FCA support
