
Mellanox OFED Linux User's Manual


Contents

1. 4.8.1 Enhanced Atomic Operations
   4.9 Huge Pages Support for Queue Resources
   4.10 Auto Sensing
   4.10.1 Enabling Auto Sensing
   Chapter 5 VPI Configuration and Management
   5.1 Port Type Management
   5.2 InfiniBand Driver
   5.3 Ethernet Driver
   5.3.1 Overview
   5.3.2 Loading the Ethernet Driver
   5.3.3 Unloading the Driver
   5.3.4 Ethernet Driver Usage and Configuration
   Chapter 6 Performance
   6.1 General System Configurations
   6.1.1 PCI Express (PCIe) Capabilities
   6.1.2 BIOS Power Management Settings
   6.1.3 Intel Hyper-Threading Technology
   6.2 Performance T
2. 10.12 smpquery
   10.13 perfquery
   10.14 ibcheckerrs
   10.15 mstflint
   10.16 ibv_asyncwatch
   10.17 ibdump
   Appendix A Mellanox FlexBoot
   A.1 Overview
   A.2 Burning the Expansion ROM Image
   A.3 Preparing the DHCP Server in Linux Environment
   A.4 Subnet Manager - OpenSM
   A.5 TFTP Server
   A.6 BIOS Configuration
   A.7 Operation
   A.8 Command Line Interface (CLI)
   A.9 Diskless Machines
   A.10 iSCSI Boot
3. [Figure: 2D 6x5 torus example showing switches S, D, n, I, q, r, O, and T]

In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

9.5.7.2 Multicast Routing
Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where x is the root switch and each + is a non-root switch:

[Figure: full-fabric multicast spanning tree for the 2D 6x5 torus]
4. 4.5.1 Ethernet over IB Topology
   4.5.2 EoIB Configuration
   4.5.3 Retrieving EoIB Information
   4.5.4 Advanced EoIB Settings
   4.6 IP over InfiniBand
   4.6.1 Introduction
   4.6.2 IPoIB Mode Setting
   4.6.3 IPoIB Configuration
   4.6.4 Subinterfaces
   4.6.5 Verifying IPoIB Functionality
   4.6.6 Bonding IPoIB
   4.7 Quality of Service
   4.7.1 Quality of Service Overview
   4.7.2 QoS Architecture
   4.7.3 Supported Policy
   4.7.4 CMA Features
   4.7.5 OpenSM Features
   4.8 Atomic Operations
5. [Figure, continued: full-fabric multicast spanning tree for the 2D 6x5 torus, x axis 0 to 5]

For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop.

In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.

Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.

In the presence of link or switch failures that result in a fabric for which torus-2QoS can generate credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, conside
6. Support and Updates Webpage on page 12

Intended Audience
This manual is intended for system administrators responsible for the installation, configuration, management, and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.

Documentation Conventions
Table 1 - Documentation Conventions

   Description | Convention | Example
   File names | file.extension |
   Directory names | directory |
   Commands and their parameters | command param1 |
   Optional items | |
   Mutually exclusive parameters | p1 p2 p3 |
   Optional mutually exclusive parameters | p1 p2 p3 |
   Prompt of a user command under bash shell | hostname |
   Prompt of a root command under bash shell | hostname |
   Prompt of a user command under tcsh shell | tcsh |
   Environment variables | VARIABLE |
   Code example | if a b |
   Comment at the beginning of a code line | |
   Characters to be typed by users as-is | bold font |
   Keywords | bold font |
   Variables for which users supply specific values | Italic font |
   Emphasized words | Italic font | These are emphasized words
   Pop-up menu sequences | menu1 > menu2 > ... > item |
   Note | <text> | This is a note.
   Warning | <text> | May result in system instability.
7. Once Congestion Control is enabled on the fabric nodes, completely disabling it requires actively turning it off. Running the SM without the CC Manager is not sufficient, because the hardware continues to function according to the previous CC configuration.

For further information on how to turn CC off, refer to Section 9.9.3, "Configuring Congestion Control Manager", on page 152.

9.9.3 Configuring Congestion Control Manager
The Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and the CC Manager behavior by modifying some of the options. To do so, perform the following:

1. Find the event_plugin_options option in the SM options file and add the following: conf_file <cc-mgr-options-file-name>

    # Options string that would be passed to the plugin(s)
    event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>

2. Run the SM with the new options file:

    opensm -F <options-file-name>

Note: To turn CC off, set enable to FALSE in the Congestion Control Manager configuration file and run OpenSM once with this configuration.

For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" on page 152.

For further details on the list of CC Manager options please r
8. 9.5.7.6 Torus-2QoS Configuration File Syntax
The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is # are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.

    [torus|mesh] x_radix[m|M|t|T] y_radix[m|M|t|T] z_radix[m|M|t|T]

Either torus or mesh must be the first keyword in the configuration, and sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology.

Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).

    xp_link sw0_GUID sw1_GUID
    yp_link sw0_GUID sw1_GUID
    zp_link sw0_GUID sw1_GUID
    xm_link sw0_GUID sw1_GUID
    ym_link sw0_GUID sw1_GUID
    zm_link sw0_GUID sw1_GUID

These keywords are used to seed the torus/mesh topology. For example, "xp_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to
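The radix-suffix rule described above can be illustrated with a small sketch. The helper name normalize_topology is hypothetical and is not part of OpenSM; it only expands a topology line into an explicit per-dimension form, assuming the rule stated in the text (the leading keyword sets the default, and an m/M/t/T suffix overrides it for one dimension).

```shell
# Hypothetical helper, not part of OpenSM: expands a torus-2QoS topology
# line into "<radix><m|t>" per dimension, where m = mesh and t = torus.
normalize_topology() {
    echo "$1" | awk '{
        # The leading keyword sets the default kind for every dimension.
        def = ($1 == "torus") ? "t" : "m"
        out = ""
        for (i = 2; i <= 4; i++) {
            radix = $i
            kind = def
            last = substr(radix, length(radix), 1)
            # A trailing m, M, t, or T overrides the default for this dimension.
            if (last ~ /[mMtT]/) {
                kind = tolower(last)
                radix = substr(radix, 1, length(radix) - 1)
            }
            out = out (i > 2 ? " " : "") radix kind
        }
        print out
    }'
}
```

Calling normalize_topology "mesh 3T 4 5" and normalize_topology "torus 3 4M 5M" should print the same expanded form, which is the equivalence the text describes.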
9. 10.2 Utilities Usage
This section first describes the common configuration, interface, and addressing for all the tools in the package. It then provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.

10.2.1 Common Configuration, Interface and Addressing

Topology File (Optional)
An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this purpose, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.

For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).

To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option -t <topology file name>.
2. Define the environment variable IBDIAG_TOPO_FILE.

To specify the local system name to a diagnostic tool, use one o
10. An EoIB interface can report two different link states:
• The physical link state of the interface, which is made up of the actual HCA port link state and the status of the vNic's connection with the BridgeX. If the HCA port link state is down, or the EoIB connection with the BridgeX has failed, the link will be reported as down, because without the connection to the BridgeX the EoIB protocol cannot work and no data can be sent on the wire.
• The link state of the external port associated with the vNic interface.

The mlx4_vnic driver can also report the status of the external BridgeX port by using the mlx4_vnic_info script. If the eport_state_enforce module parameter is set, the external port state will be reported as the vNic interface link state. If the connection between the vNic and the BridgeX is broken (hence the external port state is unknown), the link will be reported as down.

Note: A link state is down on a host-administrated vNic when the BridgeX is connected and the InfiniBand fabric appears to be functional. The issue might result from a misconfiguration of BXADDR and/or BXEPORT in the configuration file.

To query the link state, run the following command and look for "Link detected":

    ethtool <interface name>

Example:

    ethtool eth10
    Settings for eth10:
    Supported ports:
    Supported link modes:
    Supports auto-negotiat
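Looking for the "Link detected" field can be scripted. The following is a minimal sketch; the function name link_detected is hypothetical, and it assumes the ethtool output contains a line of the form "Link detected: yes" or "Link detected: no".

```shell
# Hypothetical helper: reads `ethtool <interface>` output on stdin and
# prints the value of the "Link detected" field.
link_detected() {
    # Split each line on the colon (plus any following spaces) and print
    # the value from the first line that mentions "Link detected".
    awk -F': *' '/Link detected/ { print $2; exit }'
}
```

A possible use is ethtool eth10 | link_detected, where eth10 is the interface name from the example above.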
11. 8.1 Enabling MXM in OpenMPI
   Chapter 9 OpenSM - Subnet Manager
   9.1 Overview
   9.2 opensm Description
   9.2.1 opensm Syntax
   9.2.2 Environment Variables
   9.2.3 Signaling
   9.2.4 Running opensm
   9.3 osmtest Description
   9.3.1 Syntax
   9.3.2 Running osmtest
   9.4 Partitions
   9.4.1 File Format
   9.5 Routing Algorithms
   9.5.1 Effect of Topology Changes
   9.5.2 Min Hop Algorithm
   9.5.3 UPDN Algorithm
   9.5.4 Fat-tree Routing Algorithm
12. initrd images with the images found at www.mellanox.com > Products > Adapter IB/VPI SW > FlexBoot (Download Tab).

A.9.1 Case I: InfiniBand Ports
The IB driver requires loading the following modules in the specified order (see Section A.9.1.1 for an example):
• ib_addr.ko
• ib_core.ko
• ib_mad.ko
• ib_sa.ko
• ib_cm.ko
• ib_uverbs.ko
• ib_ucm.ko
• ib_umad.ko
• iw_cm.ko
• rdma_cm.ko
• rdma_ucm.ko
• mlx4_core.ko
• mlx4_ib.ko
• ib_mthca.ko
• ipoib_helper.ko (this module is not required for all OS kernels; please check the release notes)
• ib_ipoib.ko

A.9.1.1 Example: Adding an IB Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section 4.6.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine.
3. An initrd file.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.

Adding the IB Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may
13. 1. A PXE ROM image file for each of the supported Mellanox network adapter devices. Specifically, the following images are included:
   ConnectX / ConnectX-2 / ConnectX-3 images: ConnectX_FlexBoot_<PCI Device ID>_ROM-<version>.mrom
   where the number after the "ConnectX_FlexBoot_" prefix indicates the corresponding PCI Device ID of the ConnectX / ConnectX-2 / ConnectX-3 device.
2. Additional documents under docs/dhcp.

A.2 Burning the Expansion ROM Image

A.2.1 Burning the Image on ConnectX / ConnectX-2 / ConnectX-3

Note: This section is valid for ConnectX / ConnectX-2 devices with firmware versions 2.8.0600 or later, and for ConnectX-3 firmware.

Prerequisites
1. Expansion ROM Image
   The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the FlexBoot release notes file.
2. Firmware Burning Tools
   You need to install the Mellanox Firmware Tools (MFT) package, version 2.7.0 or later, in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Downloads.

Image Burning Procedure
To burn the composite image, perform the following steps:
1. Obtain the MST device name. Run:

    mst start
    mst status

   The device name will be of the form mt<dev_id>_pci{_cr0|conf0}. (Depending on the OS, the device name may be superseded with a prefix.)
2. Cre
14. The UDEV service is active by default; however, if it is not active, run:

    /sbin/udevd -d

When the vNic MAC address is consistent, you can statically name each interface using the following UDEV rule:

    SUBSYSTEM=="net", SYSFS{address}=="aa:bb:cc:dd:ee:ff", NAME="ethX"

For further information on the UDEV rules syntax, please refer to the udev man pages.

4.5.4.3 Para-Virtualized vNic
EoIB driver interfaces can also be used in Linux-based virtualization environments such as Xen/KVM-based Hypervisors. This section explains how to configure Para-Virtualized (PV) EoIB to work in such an environment.

Driver Configuration
For PV-EoIB to work properly, the following features must be disabled in the driver:
• Large Receive Offload (LRO)
• TX completion polling
• RX fragmented buffers

To disable the features above, edit the modprobe configuration file as follows:

    options mlx4_vnic lro_num=0 tx_polling=0 rx_linear=1

For the full list of mlx4_vnic module parameters, run:

    modinfo mlx4_vnic

Network Configuration
PV-EoIB supports both L2 (bridged) and L3 (routed) network models. The physical interfaces that can be enslaved to the Hypervisor virtual bridge are actually EoIB vNics, and they can be created as on a native Linux machine. The PV-EoIB driver supports both host-administrated and network-administrated vNics. Please refer to Section 4.5.2, "EoIB Configuration", o
15.     host1# cp /bin/uname /tmp/initrd_ib/bin
    host1# cp /usr/bin/expr /tmp/initrd_ib/bin
    host1# cp /sbin/ifconfig /tmp/initrd_ib/bin
    host1# cp /bin/hostname /tmp/initrd_ib/bin

Create a configuration file for the DHCP client (as described in Section 4.6.3.1) and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):

    dhclient.conf:
    # The value indicates a hexadecimal number
    # For a ConnectX device
    interface ib0 send dhcp-client-identifier HESOOSOWSOWsOWsMOsWZsOOsOUsO2Z EIN NT SUNT

Step 9. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded.

Warning: The order of the following commands (for loading modules) is critical.

    echo "loading ipv6"
    /sbin/insmod /lib/modules/ipv6.ko
    echo "loading IB driver"
    /sbin/insmod /lib/modules/ib/ib_addr.ko
    /sbin/insmod /lib/modules/ib/ib_core.ko
    /sbin/insmod /lib/modules/ib/ib_mad.ko
    /sbin/insmod /lib/modules/ib/ib_sa.ko
    /sbin/insmod /lib/modules/ib/ib_cm.ko
    /sbin/insmod /lib/modules/ib/ib_uverbs.ko
    /sbin/insmod /lib/modules/ib/ib_ucm.ko
    /sbin/insmod /lib/modules/ib/ib_umad.ko
    /sbin/insmod /lib/modules/ib/iw_cm.ko
    /sbin/insmod /lib/modules/ib/rdma_cm.ko
    /sbin/insmod /lib/modules/ib/rdma_uc
16. Intel MPI Benchmark, and Presta.

1.4.6 InfiniBand Subnet Manager
All InfiniBand-compliant ULPs require a proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. (OpenSM is disabled by default. See Chapter 9, "OpenSM - Subnet Manager", for details on enabling it.)

1.4.7 Diagnostic Utilities
Mellanox OFED includes the following two diagnostic packages for use by network and data center managers:
• ibutils: Mellanox Technologies diagnostic utilities
• infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools

1.4.8 Mellanox Firmware Tools
The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for:
• Generating a standard or customized Mellanox firmware image
• Querying for firmware information
• Burning a firmware image to a single InfiniBand node

MFT includes the following tools:

mlxburn: This tool provides the following functions:
• Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format
• Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch devi
17. SUBNET UP if opensm was able to set up the subnet correctly.

Note: If a fatal, non-recoverable error occurs, opensm exits.

9.2.4.1 Running OpenSM As Daemon
OpenSM can also run as a daemon. To run OpenSM in this mode, enter:

    host1# /etc/init.d/opensmd start

9.3 osmtest Description
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it to a pre-saved one. See Section 9.3.2.

osmtest has the following test flows:
• Multicast Compliancy test
• Event Forwarding test
• Service Record registration test
• RMPP stress test
• Small SA Queries stress test

9.3.1 Syntax

    osmtest [OPTIONS]

where OPTIONS are:

-f, --flow
   This option directs osmtest to run a specific flow:

   Flow | Description
   c | create an inventory file with all nodes, ports, and paths
   a | run all validation tests (expecting an input inventory)
   v | only validate the given inventory file
   s | run service registration, deregistration, and lease test
   e | run event forwarding test
   f | flood the SA with queries according to the stress mode
   m | multicast flow
   q | QoS info: dump VLArb and SLtoVL tables
   t | run trap 64
18. all values provided by this file

    IPADDR_ib0=11.4.3.175
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1

    # Based on eth0; each '*' will be replaced with a corresponding octet
    # from eth0.
    LAN_INTERFACE_ib0=eth0
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1

    # Based on the first eth<n> interface that is found (for n=0,1,...);
    # each '*' will be replaced with a corresponding octet from eth<n>.
    LAN_INTERFACE_ib0=
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1

Note: The parameters below must be added to the IPoIB interface configuration files (ifcfg-ibX) on RHEL6.3; otherwise, different network interfaces may get the same IP address:

    NM_CONTROLLED=yes
    TYPE=InfiniBand

4.6.3.3 Manually Configuring IPoIB

Note: This manual configuration persists only until the next reboot or driver restart.

To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:

Step 1. To configure the interface, enter the ifconfig command with the following items:
• The appropriate IB interface (ib0, ib1, etc.)
• The IP address that you want to assign to the interface
• The netmask keyword
• The subnet mask that you want to assign to the interface

The following example sh
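The octet replacement used by the LAN_INTERFACE-based settings above can be sketched as follows. This is only an illustration of the substitution rule, not the code that Mellanox OFED actually runs; the function name expand_ipoib_addr is hypothetical, and the wildcard octets are assumed to be written as '*'.

```shell
# Hypothetical helper: expand_ipoib_addr TEMPLATE LAN_ADDR
# Replaces each '*' octet in TEMPLATE with the octet at the same position
# in LAN_ADDR (the address of the reference Ethernet interface).
expand_ipoib_addr() {
    awk -v t="$1" -v l="$2" 'BEGIN {
        n = split(t, to, ".")
        split(l, lo, ".")
        out = ""
        for (i = 1; i <= n; i++) {
            # Keep fixed octets; take wildcard octets from the LAN address.
            o = (to[i] == "*") ? lo[i] : to[i]
            out = out (i > 1 ? "." : "") o
        }
        print out
    }'
}
```

For instance, with a template of 11.4.*.* and a reference address of 192.168.3.175 (an address made up for this example), the last two octets are taken from the reference address. The template must be quoted on the command line so that the shell does not expand the asterisks.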
19. from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller (known as the SRP Target) to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 4.4, "SCSI RDMA Protocol", and Appendix B, "SRP Target Driver".

uDAPL
User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative.

This release of the uDAPL reference implementation package for both DAT 1.2 and 2.0 specification is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack.

For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org

1.4.5 MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:
• Open MPI: an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University

Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT,
20. ibcheckerrs Flags and Options
   Table 25: mstflint Switches
   Table 26: mstflint Commands
   Table 27: ibdump Options

Revision History
Printed on July 11, 2012.

Rev 1.5.3-3.1.0 (July 2012)
• Updated Section 4.6.3.2, "Static IPoIB Configuration", on page 79

Rev 1.5.3-3.0.0 (February 2012)
• Removed FCoE section

Rev 1.5.3-3.0.0 (December 2011)
• Updated Table 1, "mlnxofedinstall Return Codes", on page 24
• Updated Table A.1.1, "Supported Mellanox Adapter Devices and Firmware", on page 190
• Updated the installation script in Section 2.3.3, "Installation Procedure", on page 25
• Removed section "Socket Acceleration"
• Added Section 4.5.4.3, "Para-Virtualized vNic", on page 74 and its subsections
• Added Section 4.5.3.7, "ALL VLAN", on page 70 and its subsections
• Updated sections "mlx4_core Parameters" and "mlx4_en Parameters" on pages 222 and 223
• Updated Section 2.1.1, "Hardware Requirements", on page 20
• Added new options to the Installation Script section on page 24
• Added Section 4.10, "Auto Sensing", on page 89

Rev 1.5.3-1.0.0 (July 2011)
• Added Section 4.9, "Huge Pages Support for Queue Resources", on page 89
• Updated Section 9.9, "Congestion Control"
21. ibdump openmpi intel dump pr rds tools sdpnetstat mstflint libibumad devel libibverbs devel libibcm devel librdmacm devel libibmad devel opensm devel opensm static opensm static compat dapl devel ibsdp devel infinipath psm devel ibipathverbs devel ibipathverbs devel ibnes devel static ibnes devel static ibcxgb3 devel ibcxgb3 devel ibm1x4 devel ibmlx4 devel ibmthca devel static ibmthca devel static dapl devel ibmge devel ibmverbs devel 26 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 libibmad static libibmad static libibumad static libibumad static rds devel mft dapl devel static dapl devel static ibibverbs devel static ibibverbs devel static mlnxofed docs ofed scripts ibibverbs ibrdmacm ibibumad ibmverbs ibmge ibibmad opensm libs compat dapl dapl ibibcm ibsdp ibmthca ibm1x4 ibexgb3 ibnes ibipathverbs ibsdp devel ibibcm devel dapl devel compat dapl devel opensm devel libibmad devel libmge devel libmverbs devel libibumad devel librdmacm devel libibverbs devel Device 15b3 673c 02 00 0 InfiniBand Mellanox Technologies MT26428 ConnectX VPI PCIe
22. ifopen net1

A.8.3.3 ifclose
Closes the network interface net<x>. The list of network interfaces is available via the ifstat command.
Example:

    iPXE> ifclose net1

A.8.3.4 autoboot
Starts the boot process from the device(s).

A.8.3.5 sanboot
Starts the boot process of an iSCSI target.
Example: IPMS Sealer isesi MIRARE s3 sucin A003 1 554 ile asesiloooe

A.8.3.6 echo
Echoes an environment variable.
Example:

    iPXE> echo root path

A.8.3.7 dhcp
Attempts to open the network interface, and then tries to connect to and communicate with the DHCP server to obtain the IP address and the filepath from which the boot will occur.
Example:

    iPXE> dhcp net1

A.8.3.8 help
Displays the available list of commands.

A.8.3.9 exit
Exits from the command line interface.

A.9 Diskless Machines
Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver. This can be achieved by adding the device driver module into the initrd image and loading it.

Note: The initrd image of some Linux distributions, such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux, cannot be edited prior to or during the installation process.

If you need to install Linux distributions over FlexBoot, please replace your
23. on page 151
• Added Section 9.5.7.6, "Torus-2QoS Configuration File Syntax", on page 132
• Updated Section A, "Mellanox FlexBoot", on page 190

Rev 1.5.2-2.1.0 (March 03, 2011)
• New version of MLNX_OFED; no changes to this document

Rev 1.5.2 (October 16, 2010)
• Complete reorganization of the document's chapters
• Removed section Section 10, "NFSoRDMA", on page 86
• Added Section 4.8, "Atomic Operations", on page 87 and its subsections
• Updated Section 2.3, "Installing Mellanox OFED", on page 21
• Removed ibspark tool
• Added Section 9.5.7, "Torus-2QoS Routing Algorithm", on page 126

Rev 1.5.1.3 (September 16, 2010)
• Added Section 5.1.2, "Firmware Dependencies", on page 61
• Updated Section 5.1.7, "Reading Port Counters Statistics", on page 63
• Updated Section 5.1, "Port Type Management", on page 92
• Added Section 4.2.2.4, "Enabling/Disabling FCoE Services", on page 47

Rev 1.5.1.2 (July 04, 2010)
• Updated Figure 1, "Mellanox OFED Stack", on page 16

Rev 1.5.1.1 (May 18, 2010)
• Added Section 4.1.10, "Configuring DAPL over RoCE", on page 42

Rev 1.5.1 (April 22, 2010)
• Added Section 5.1.7, "Reading Port Counters Statistics"; the section "A Detailed Example" was moved to become Section 5.1.8

Rev 1.5 (March 29, 2010)
• Updated Figure 1, "Mellanox OFED Stack", on page 16
• Added support for
24. print 1 sed s
    else
        IRQS=$(cat /proc/interrupts | grep $1 | awk '{print $1}' | sed 's/://')
    fi
    echo Discovered irqs: $IRQS
    mask=1
    for IRQ in $IRQS
    do
        # Write the current CPU mask (in hex) as this IRQ's affinity
        echo `printf "%x" $mask` > /proc/irq/$IRQ/smp_affinity
        # Move to the next CPU; wrap around when the limit is reached
        mask=$(( mask * 2 ))
        if [ $mask -ge $limit ] ; then mask=1; fi
    done
    echo irqs were set OK.

6.2.5 Preserving Your Performance Settings After A Reboot
To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:

    <sysctl name1>=<value1>
    <sysctl name2>=<value2>
    <sysctl name3>=<value3>
    <sysctl name4>=<value4>

For example, Section 6.2.1, "Tuning the Network Adapter for Improved IPv4 Traffic Performance", listed the following setting to disable the TCP timestamps option:

    sysctl -w net.ipv4.tcp_timestamps=0

In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:

    net.ipv4.tcp_timestamps=0

6.3 Performance Troubleshooting

6.3.1 PCI Express Performance Troubleshooting
For the best performance on the PCI Express interface, the adapter card should be installed in an x8 slot with the following BIOS configuration parameters:
• Max_Read_Req, the maximum read request size, is 512 or higher
• MaxPayloadSize, the maximum payload size, is 256 or higher

Note: A Max_Read_Req of 128 and/or in
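The mask rotation in the IRQ affinity loop above can be examined without touching /proc. The sketch below prints the hexadecimal masks that such a loop would assign; the helper names next_mask and irq_masks are hypothetical, and the limit is assumed to be 2 raised to the number of cores, since the part of the script that sets it is not shown in this excerpt.

```shell
# Hypothetical helper: next_mask MASK LIMIT
# Doubles MASK (moves to the next CPU bit) and wraps to 1 at LIMIT.
next_mask() {
    mask=$(( $1 * 2 ))
    if [ "$mask" -ge "$2" ]; then mask=1; fi
    echo "$mask"
}

# Hypothetical helper: irq_masks COUNT LIMIT
# Prints, one per line, the hex affinity mask for each of COUNT IRQs.
irq_masks() {
    count=$1
    limit=$2
    mask=1
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%x\n' "$mask"
        mask=$(next_mask "$mask" "$limit")
        i=$(( i + 1 ))
    done
}
```

With a limit of 8 (three cores under the assumption above), five IRQs would be spread round-robin over CPUs 0, 1, 2, 0, 1.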
25. 0 open C Link down TAU TAESU RA 0 RXE 0
    Link status: The socket is not connected
    Waiting for link-up on net0... ok

Placing Client Identifiers in /etc/dhcpd.conf
The following is an excerpt of an /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server:

    host host1 {
        next-server 11.4.3.7;
        filename "pxelinux.0";
        fixed-address 11.4.3.130;
        option dhcp-client-identifier = 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
    }

A.4 Subnet Manager - OpenSM

Note: This section applies to ports configured as InfiniBand only.

FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM - Subnet Manager" on page 106.

Note: To use OpenSM caching for large InfiniBand clusters (> 100 nodes), it is recommended to use the OpenSM options described in Section 9.2.1, "opensm Syntax", on page 106.

A.5 TFTP Server
When you set the filename parameter in your DHCP configuration file to a non-empty filename, the client will ask for this file to be passed through TFTP. For this reason, you need to install a TFTP server.

A.6 BIOS Configuration
The expansion ROM image
26. 5 VPI Configuration and Management

VPI allows ConnectX ports to be independently configured as either IB or Eth.

5.1 Port Type Management

ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded.

Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices.

Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".

Possible port types are:
• eth - Ethernet
• ib - InfiniBand
• auto - Link sensing mode: detect port type based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.

Table 4 lists the ConnectX port configurations supported by VPI.

Table 4 - Supported ConnectX Port Configurations

    Port 1 Configuration    Port 2 Configuration
    ib                      ib
    ib                      eth
    eth                     eth

Note that the configuration Port1 = eth and Port2 = ib is not supported.

The port link type can be configured for each device in the system at run time using the "/sbin/connectx_port_config" script. This utility will prompt for the PCI device to be modified (if there is only
27. 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
    Link Width: 8x
    PCI Link Speed: 2.5Gb/s
    Installation finished successfully.

Note: In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following:

    Installation finished successfully.
    The firmware version 2.9.1000 is up to date.
    Note: To force firmware update use '--force-fw-update' flag.

Note: In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.

Error message:

    -I- Querying device
    -E- Can't auto detect fw configuration file

Step 4. In case the installation script performed firmware updates to your network adapter hardware, it will ask you to reboot your machine.

Step 5. The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI:

    * soft memlock unlimited
    * hard memlock unlimited

These settings unlimit the amount of memory that c
28. 10.3.3 Return Codes
10.4 ibdiagnet (of ibutils) - IB Net Diagnostic
10.4.1 SYNOPSIS
10.4.2 Output Files
10.4.3 ERROR CODES
10.5 ibdiagpath - IB diagnostic path
10.5.1 SYNOPSIS
10.5.2 Output Files
10.5.3 ERROR CODES
10.6 ibv_devices
10.7 ibv_devinfo
10.8 ibdev2netdev
10.8.1 SYNOPSIS
10.9 ibstatus
10.10 ibportstate
10.11 ibroute
29. Common Abbreviations and Acronyms

Table 2 - Abbreviations and Acronyms

    Abbreviation / Acronym    Whole Word / Description
    B        (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
    b        (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
    FW       Firmware
    HCA      Host Channel Adapter
    HW       Hardware
    IB       InfiniBand
    LSB      Least significant byte
    lsb      Least significant bit
    MSB      Most significant byte
    msb      Most significant bit
    NIC      Network Interface Card
    SW       Software
    VPI      Virtual Protocol Interconnect
    IPoIB    IP over InfiniBand
    PFC      Priority Flow Control
    PR       Path Record
    RDS      Reliable Datagram Sockets
    RoCE     RDMA over Converged Ethernet
    SDP      Sockets Direct Protocol
    SL       Service Level
    SRP      SCSI RDMA Protocol
    MPI      Message Passing Interface
    EoIB     Ethernet over InfiniBand
    QoS      Quality of Service
    ULP      Upper Level Protocol
    VL       Virtual Lane
    vHBA     Virtual SCSI Host Bus adapter
    uDAPL    User Direct Access Programming Library

Glossary

The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specifi
30. ConnectX-2 devices
• Added support for RDMA over Converged Ethernet (RoCE), see Chapter 5, "RoCE"
• Modified Section 7.3.3.1, "How to Know SDP Is Working"
• Added Section 7.3.7, "Using RDMA for Small Buffers"
• Added support for NFS over RDMA (NFSoRDMA), Chapter 10, "NFSoRDMA"
• Added Section 11.5.2, "Important Note on RoCE Support", on page 114 in Chapter 7, "MPI - Message Passing Interface"
• Modified Section 9.2.1, "opensm Syntax", on page 106
• Added Chapter 5
• Added ibdiagnet (of ibutils2) and ibdump to Chapter 10, "InfiniBand Fabric Diagnostic Utilities"
• Appendix B is now called "Mellanox FlexBoot" instead of BoIB. FlexBoot supports Virtual Protocol Interconnect (VPI).
• Added Section 6.3.3, "System Performance Troubleshooting"
• Added the parameter setting VIADEV_RENDEZVOUS_THRESHOLD=8192, Section 11.2.3, "MPI Performance Tuning"

Rev 1.40.1, Changes from 1.40 (March 19, 2009)
• Correction to text in Section 9.3.3, "IPoIB Configuration", on page 93

About this Manual

This Preface provides general information concerning the scope and organization of this User's Manual. It includes the following sections:
• Section "Intended Audience", on page 8
• Section "Documentation Conventions", on page 9
• Section "Glossary", on page 11
• Section "Related Documentation", on page 12
• Section
31. Fabric links in LST format

    ibdiagnet2.sm            Subnet Manager
    ibdiagnet2.pm            Ports Counters
    ibdiagnet2.fdbs          Unicast FDBs
    ibdiagnet2.mcfdbs        Multicast FDBs
    ibdiagnet2.nodes_info    Information on nodes
    ibdiagnet2.db_csv        ibdiagnet internal database

An ibdiagnet run performs the following stages:
• Fabric discovery
• Duplicated GUIDs detection
• Links in INIT state and unresponsive links detection
• Counters fetch
• Error counters check
• Routing checks
• Link width and speed checks

10.3.3 Return Codes

    0 - Success
    1 - Failure (with description)

10.4 ibdiagnet (of ibutils) - IB Net Diagnostic

Note: This version of ibdiagnet is included in the ibutils package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils package, you need to specify the full path: /opt/bin/ibdiagnet

ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).

10.4.1 SYNOPSIS

    ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt] [-pm] [-pc] [-P <<PM>=<Value
32. 4.7.4.2 SDP

SDP uses CMA for building its connections. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hexadecimal digits holding the remote TCP/IP Port Number to connect to.

4.7.4.3 RDS

RDS uses CMA, and thus it is very close to SDP. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hexadecimal digits holding the TCP/IP Port Number that the protocol connects to. The default port number for RDS is 0x48CA, which makes a default Service ID of 0x00000000010648CA.

4.7.4.4 SRP

The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service ID in the PR/MPR by itself and uses that information in setting up the QP. The SRP Service ID is defined by the SRP target I/O Controller (it also complies with IBTA Service ID rules). The Service ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.

4.7.5 OpenSM Features

The QoS related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 9) can be split into two main parts:

I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.

II. PR/MPR Query Handling
OpenSM enforces the provided policy on client request. The overall flow for such requests is: first the request is matched again
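The SDP and RDS Service ID layouts described above can be worked through with a short sketch. The two function names are hypothetical and exist only for illustration; the fixed prefixes and the 4-hex-digit port field are taken from the text.

```shell
#!/bin/bash
# Hypothetical helpers: build a Service ID from a TCP/IP port number.
# Each ID is a fixed 12-hex-digit prefix followed by the port as 4 hex digits.

# SDP: 0x000000000001PPPP
sdp_service_id() {
    printf '0x000000000001%04X\n' "$1"
}

# RDS: 0x000000000106PPPP
rds_service_id() {
    printf '0x000000000106%04X\n' "$1"
}

# Default RDS port 0x48CA is 18634 in decimal.
rds_service_id 18634    # prints 0x00000000010648CA
```

The printed value matches the default RDS Service ID quoted in Section 4.7.4.3.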
33. 7 MPI - Message Passing Interface

7.1 Overview

Note: The PGI compiler does not support RHEL6.0, thus MLNX_OFED v1.5.2 will not include openmpi and mvapich with PGI compiler on RHEL6.

Mellanox OFED for Linux includes the following MPI implementations over InfiniBand and RoCE:
• Open MPI - an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH - an MPI-1 implementation by Ohio State University

These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 6 lists some useful MPI links.

Table 6 - Useful MPI Links

    MPI Standard    http://www.unix.mcs.anl.gov/mpi
    Open MPI        http://www.open-mpi.org
    MVAPICH MPI     http://mvapich.cse.ohio-state.edu
    MPI Forum       http://www.mpi-forum.org

This chapter includes the following sections:
• Prerequisites for Running MPI (page 102)
• MPI Selector - Which MPI Runs (page 104)
• Compiling MPI Applications (page 104)

7.2 Prerequisites for Running MPI

For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging and running commands on remote c
34. Options File

    Option File          Description    Values
    port_control         Specifies the Congestion Control attribute for this port.    Values: 0 - QP based congestion control; 1 - SL/Port based congestion control. Default: 0
    ca_control_map       An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified.    Values: 0xffff
    ccti_increase        Sets the CC Table Index (CCTI) increase.    Default: 1
    trigger_threshold    Sets the trigger threshold.    Default: 2
    ccti_min             Sets the CC Table Index (CCTI) minimum.    Default: 0

Table 11 - Congestion Control Manager CA Options File (continued)

    Option File          Description    Values
    cct                  Sets all the CC table entries to a specified value. The first entry will remain 0, whereas the last value will be set to the rest of the table.    Values: <comma-separated list>. Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
    ccti_timer           Sets for all SLs the given ccti timer.    Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.

Table 12 - Congestion Control Manager CC MGR Options File

    Option File          Description    Values
    max_errors, error_window    When number of errors exceeds 'max_errors' of send/receive errors or timeouts in less than 'error_w
35. Parameters for 10 Gb/s Operation

    Firmware Release Notes for Mellanox adapter devices    See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.
    MFT User's Manual      Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.
    MFT Release Notes      Release Notes for the Mellanox Firmware Tools. See under the docs/ folder of the installed package.

Support and Updates Webpage

Please visit http://www.mellanox.com > Products > IB/VPI SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.

1 Mellanox OFED Overview

1.1 Introduction to Mellanox OFED

Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack that is based on the OpenFabrics Enterprise Distribution (OFED) Linux stack. It operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10Gb/s and 40Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers.

All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions.

Mellanox OFED is certified with the following products:
• Mellanox Messaging Accelerator (VMA) software: Multicast socket acceleration library that
36. Appendix B SRP Target Driver
B.1 Prerequisites and Installation
B.2 How-to run
B.3 How-to Unload/Shutdown
Appendix C mlx4 Module Parameters
C.1 mlx4_core Parameters
C.2 mlx4_ib Parameters
C.3 mlx4_en Parameters
Appendix D ib-bonding Driver for Systems Using SLES10 SP4
D.1 Using the ib-bonding Driver

List of Tables

Table 1: Typographical Conventions
Table 2: Abbreviations and Acronyms
Table 3: Glossary
Table 4: Reference Documents
Table 5: mlnxofedinstall Return Codes
37. Technologies. ConnectX / ConnectX-2 can operate as an InfiniBand adapter or as an Ethernet NIC. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into four modules:

mlx4_core
Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.

mlx4_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

mlx4_en
A 10/40GigE driver under drivers/net/mlx4 that handles Ethernet specific functions and plugs into the netdev mid-layer.

1.4.3 Mid-layer Core

Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.

1.4.4 ULPs

IPoIB

The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB pre-appends the IP datagrams with an encapsulation header, and sends the outcome over the InfiniBand transport service. The transpor
38. a switch port, then ibportstate can be used to:
• disable, enable or reset the port
• validate the port's link width and speed against the peer port

Synopsis

    ibportstate [-d] [-e] [-v] [-V] [-D] [-G] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid>] <portnum> [<op> [<value>]]

Table 18 lists the various flags of the command.

Table 18 - ibportstate Flags and Options

    Flag           Optional / Mandatory    Default (If Not Specified)    Description
    -h(help)       Optional                                              Print the help menu
    -d(ebug)       Optional                                              Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
    -e(rr_show)    Optional                                              Show send and receive errors (timeouts and others)
    -v(erbose)     Optional                                              Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
    -V(ersion)     Optional                                              Show version info
    -D(irect)      Optional                                              Use directed path address arguments. The path is a comma separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...)
    -G(uid)        Optional                                              Use GUID address argument. In most cases, it is the Port GUID. Example: '0x
39. achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64.

Below is an example of SL2VL and VL Arbitration configuration on subnet:

    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such configuration would suit a VL that needs low latency and uses small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.

9.6.8 Deployment Example

Figure 4 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

[Figure 4: Example QoS Deployment on InfiniBand Subnet. Traffic classes shown: SDP (Service Level 2, policy min 20% BW), Partition A (Service Level 0, policy min 40%), SRP (Service Level 1, policy min 30% BW), IPoIB (Service Level 3)]
40. and low are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64 byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15 or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.

Note that the same VLs may be listed multiple times in the High or Low priority arbitration tables, and, further, it can be listed in both tables.

The limit of high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes. A high_limit value of 255 indicates that the byte limit is unbounded.

Note: If the 255 value is used, the low priority VLs may be starved.

A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.

Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to
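The credit and high-limit arithmetic described above can be checked with a small sketch. The function names are hypothetical, and rounding a partial 64-byte unit up to a whole credit is an assumption made for the illustration; the 64-byte credit size, the 4K-byte multiplier, and the special values 0 and 255 come from the text.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the VL arbitration arithmetic.

# Credits (64-byte units) needed to send one packet of MTU bytes.
# Assumption: a partial unit is rounded up to a whole credit.
vl_credits_for_mtu() {
    echo $(( ($1 + 63) / 64 ))
}

# Byte budget implied by a high_limit value:
# 255 means unbounded, 0 means a single packet, otherwise high_limit * 4 KB.
vl_high_limit_budget() {
    case "$1" in
        255) echo "unbounded" ;;
        0)   echo "single packet" ;;
        *)   echo $(( $1 * 4096 )) ;;
    esac
}

vl_credits_for_mtu 4096     # prints 64
vl_high_limit_budget 6      # prints 24576
```

A 4KB packet therefore consumes 64 credits, which is why weights that are multiples of 64 map onto whole packets at that MTU.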
41. and other modules. See Section 2.3, "Installing Mellanox OFED", for details on installation. The list of the modules that will be loaded automatically upon boot can be found in the configuration file /etc/infiniband/openib.conf.

Enter the following command to display the current run of MLNX_OFED:

    ibv_devinfo
    hca_id: mlx4_0
        transport:          InfiniBand (0)
        fw_ver:             2.7.700
        node_guid:          0002:c903:0008:e810
        sys_image_guid:     0002:c903:0008:e813
        vendor_id:          0x02c9
        vendor_part_id:     26428
        hw_ver:             0xB0
        board_id:           MT_0DD0120009
        phys_port_cnt:      2
            port: 1
                state:          PORT_INIT (2)
                max_mtu:        2048 (4)
                active_mtu:     2048 (4)
                sm_lid:         0
                port_lid:       0
                port_lmc:       0x00
                link_layer:     IB
            port: 2
                state:          PORT_ACTIVE (4)
                max_mtu:        2048 (4)
                active_mtu:     1024 (3)
                sm_lid:         0
                port_lid:       0
                port_lmc:       0x00
                link_layer:     Ethernet

Notes regarding the command output:

1. The InfiniBand port (port 1) is in PORT_INIT state, and the Ethernet port (port 2) is in PORT_ACTIVE state. You can also run the following commands to obtain the port state:

    cat /sys/class/infiniband/mlx4_0/ports/1/state
    2: INIT
    cat /sys/class/infiniband/mlx4_0/ports/2/state
    4: ACTIVE

2. Look at the link_layer parameter of each port. In this case, port 1 is IB and port 2 is Ethernet. Nevertheless, port 2 appears in the list of the HCA's ports. You can also run the following commands to obtain the link_layer of th
42. can, for example, reduce to Dimension Order Routing in certain topologies, it is topology agnostic and fares well in the face of faults.

It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path.

The algorithm was developed by Simula Research Laboratory.

Use '-R lash -Q' option to activate the LASH algorithm.

Note: QoS support has to be turned on in order that SL/VL mappings are used.

Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.

For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches, and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order before the rest of the LASH algorithm runs.

9.5.6 DO
43. configuration file:

    VNICVLAN=all

For further information on how to create host admin vNics, please see Section 4.5.2.1, "EoIB Host Administered vNic", on page 63.

Checking the Configuration

To verify the gateway / vNic is configured with the All-VLAN mode, use the mlx4_vnic_info script.

• Gateway Support
To verify the gateway is configured to All-VLAN mode, run:

    mlx4_vnic_info -g <GW-NAME>

Example:

    # mlx4_vnic_info -g A2
    IOA_PORT     mlx4_0:1
    BX_NAME      bridge-119c64
    BX_GUID      OOSO2Z seis Msgs Meol so7
    EPORT_NAME   A2
    EPORT_ID     63
    STATE        connected
    GW_TYPE      LEGACY
    PKEY         0xffff
    ALL_VLAN     yes

• vNic Support
To verify the vNic is configured to All-VLAN mode, run:

    mlx4_vnic_info -i <interface>

Example:

    # mlx4_vnic_info -i eth204
    NETDEV_NAME  eth204
    NETDEV_LINK  up
    NETDEV_OPEN  yes
    GW_TYPE      LEGACY
    ALL_VLAN     yes

For further information on the mlx4_vnic_info script, please see Section 4.5.3.1, "mlx4_vnic_info", on page 67.

4.5.4 Advanced EoIB Settings

4.5.4.1 Module Parameters

The mlx4_vnic driver supports the following module parameters. These parameters are intended to enable more specific configuration of the mlx4_vnic driver to customer needs. The mlx4_vnic driver is also affected by module parameters of other modules, such as set_4k_mtu of mlx4_core. These modules are not addressed in this section.

The available modul
44. control opensm behavior:

• OSM_TMP_DIR
Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.

• OSM_CACHE_DIR
opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
guid2lid - stores the LID range assigned to each GUID

9.2.3 Signaling

When opensm receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.

Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.

9.2.4 Running opensm

The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.

To run opensm in the default mode, simply enter:

    host1# opensm

Note that opensm needs to be run on at least one machine in an IB subnet.

By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message
45. each HCA:

    srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>

This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.

Now it is possible to access the SRP LUNs on /dev/mapper/.

Note: It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.

Note: It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the 'blacklist' section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed.

Automatic Activation of High Availability

• Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".

Note: For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart

• From the next loading of the driver it will be possible to access the SRP LUNs on /dev/mapper/.

Note: It is possible that regular (not SRP) LUNs may also be present; the SRP LUNs may be identified by their name.

• It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log.

4.4.2.7 Shutting Down SRP

SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a comple
46. Table 6: mlx4_vnic.conf file format
Table 7: Red Hat Linux mlx4_vnic.conf file format
Table 8: Supported ConnectX Port Configurations
Table 9: Recommended PCIe Configuration
Table 10: Useful MPI Links
Table 11: Congestion Control Manager General Options File
Table 12: Congestion Control Manager Switch Options File
Table 13: Congestion Control Manager CA Options File
Table 14: Congestion Control Manager CC MGR Options File
Table 15: ibdiagnet (of ibutils2) Output Files
Table 16: ibdiagnet (of ibutils) Output Files
Table 17: ibdiagpath Output Files
Table 18: ibv_devinfo Flags and Options
Table 19: ibstatus Flags and Options
Table 20: ibportstate Flags and Options
Table 21: ibportstate Flags and Options
Table 22: smpquery Flags and Options
Table 23: perfquery Flags and Options
Table 24
47. • Uses any CX IB ports (one or two)
• Inserts IP/UDP/TCP checksum on outgoing packets
• Calculates checksum on received packets
• Supports net device TSO through CX LSO capability to defragment large datagrams to MTU quanta
• Dual operation mode - datagram and connected
• Large MTU support through connected mode

IPoIB also supports the following software based enhancements:
• Large Receive Offload
• NAPI
• Ethtool support

This chapter describes the following:
• IPoIB mode setting (Section 4.6.2)
• IPoIB configuration (Section 4.6.3)
• How to create and remove subinterfaces (Section 4.6.4)
• How to verify IPoIB functionality (Section 4.6.5)
• The ib-bonding driver (Section 4.6.6)

4.6.2 IPoIB Mode Setting

IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Connected mode. This can be changed to Datagram mode by editing the file /etc/infiniband/openib.conf and setting 'SET_IPOIB_CM=no'.

After changing the mode, you need to restart the driver by running:

    /etc/init.d/openibd restart

To check the current mode used for outgoing connections, enter:

    cat /sys/class/net/ib<n>/mode

4.6.3 IPoIB Configuration

Unless you have run the installation script mlnxofedinstall with the flag '-n', then IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a
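The mode check from Section 4.6.2 can be wrapped in a small function, sketched below. The function name ipoib_mode is hypothetical, and the second argument (an alternative sysfs root) is an assumption added only so the function can be tried against a scratch directory; on a real system it defaults to /sys.

```shell
#!/bin/bash
# ipoib_mode IFACE [SYSFS_ROOT]
# Hypothetical helper: prints the IPoIB mode of IFACE by reading
# <SYSFS_ROOT>/class/net/<IFACE>/mode (SYSFS_ROOT defaults to /sys).
ipoib_mode() {
    local iface="$1" root="${2:-/sys}"
    local f="${root}/class/net/${iface}/mode"
    if [ ! -r "$f" ]; then
        echo "no mode file for ${iface}" >&2
        return 1
    fi
    cat "$f"
}
```

For example, ipoib_mode ib0 reads /sys/class/net/ib0/mode, the same file as the cat command shown in Section 4.6.2.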
48. frames irq 0

6.2.4 Interrupt Affinity

The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.

The following command turns off the IRQ balancer in RedHat:

    > /etc/init.d/irqbalance stop

The following command turns off the IRQ balancer in SLES:

    > /etc/init.d/irq_balancer stop

The following command assigns the affinity of a single interrupt vector:

    > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity

where bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

6.2.4.1 Example: Script for Setting Interrupt Affinity

Note: On systems that support NUMA, it is recommended to set IRQs from different network devices to processor cores that reside on different physical CPU sockets.

    #! /bin/bash
    CORES=$((`cat /proc/cpuinfo | grep processor | tail -1 | awk '{print $3}'`+1))
    limit=1
    while [ $CORES -gt 0 ]
    do
        limit=$(( $limit * 2 ))
        CORES=$(( $CORES - 1 ))
    done
    if [ -z $1 ]; then
        IRQS=`cat /proc/interrupts | grep eth-mlx | awk
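The bit-mask rule from Section 6.2.4 (bit i selects processor core i) and the round-robin walk over cores can be illustrated without touching /proc. This is a standalone sketch, not the manual's script: both function names are hypothetical, and the IRQ numbers in the sample call are made up.

```shell
#!/bin/bash
# Hypothetical helpers; neither reads nor writes /proc.

# Hexadecimal smp_affinity mask that selects a single processor core.
core_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Print "IRQ MASK" pairs, assigning cores round-robin.
# usage: assign_round_robin NUM_CORES IRQ...
assign_round_robin() {
    local cores="$1"; shift
    local core=0 irq
    for irq in "$@"; do
        echo "$irq $(core_mask "$core")"
        core=$(( (core + 1) % cores ))
    done
}

# Three made-up IRQ numbers spread over two cores.
assign_round_robin 2 50 51 52
```

With two cores the masks alternate between 1 and 2, which is the same sequence produced by doubling a mask and resetting it to 1 once it reaches 2 raised to the number of cores.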
49. >>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-skip <ibdiag_check/s>] [-load_db <db_file>]

OPTIONS

10.4.2 Output Files

Table 14 - ibdiagnet (of ibutils) Output Files

    Output File         Description
    ibdiagnet.log       A dump of all the application reports generated according to the provided flags
    ibdiagnet.lst       List of all the nodes, ports and links in the fabric
    ibdiagnet.fdbs      A dump of the unicast forwarding tables of the fabric switches
    ibdiagnet.mcfdbs    A dump of the multicast forwarding tables of the fabric switches
    ibdiagnet.masks     In case of duplicate port/node Guids, this file includes the map between masked Guids and real Guids
    ibdiagnet.sm        List of all the SM (state and priority) in the fabric
    ibdiagnet.pm        A dump of the pm Counters values of the fabric links
    ibdiagnet.pkey      A dump of the existing partitions and their member host ports
    ibdiagnet.mcg       A dump of the multicast groups, their properties and member host ports
    ibdiagnet.db        A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option

In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error i
50. ipoib - indicates that this partition may be used for IPoIB; as a result, an IPoIB capable MC group will be created
rate=<val> - specifies rate for this IPoIB MC group (default is 3 (10GBps))
mtu=<val> - specifies MTU for this IPoIB MC group (default is 4 (2048))
sl=<val> - specifies SL for this IPoIB MC group (default is 0)
scope=<val> - specifies scope for this IPoIB MC group (default is 2 (link local))

Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).

PortGUIDs list:

PortGUID - GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
full or limited - indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.

There are two useful keywords for PortGUID definition:
• 'ALL' means all end ports in this subnet
• 'SELF' means subnet manager's port

An empty list means that there are no ports in this partition.

Notes:
• White space is permitted between delimiters ('=', ',', ':', ';').
• The line can be wrapped after ':' after a Partition Definition and between.
• A PartitionName does not need to be unique, but PKey does need to be unique.
• If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also nex
51. lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: FAILED

3. Check the LID 2 Port 1 using the specified threshold file.
    > cat thresh1
    SymbolErrors=10
    LinkRecovers=10
    LinkDowned=10
    RcvErrors=10
    RcvRemotePhysErrors=100
    RcvSwRelayErrors=100
    XmtDiscards=100
    XmtConstraintErrors=100
    RcvConstraintErrors=100
    LinkIntegrityErrors=10
    ExcBufOverrunErrors=10
    VL15Dropped=100
    > ibcheckerrs -v -T thresh1 2 1
    Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK

10.15 mstflint

Applicable Hardware
Mellanox InfiniBand and Ethernet devices and network adapter cards.

Description
Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.

Note: If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.

To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.

Synopsis
    mstflint [switches] <command> [parameters...]

Table 23 lists the various switches of the utility, and Table 24 lists its commands.

Table 23 - mstflint Switches
52. line: mlx4_vnicd=<yes|no> [parameters]
b. Start the service mlx4_vnic_confd to read and apply the configuration:
    # /etc/init.d/mlx4_vnic_confd start
• To see the full list of the daemon parameters, run:
    # mlx4_vnicd --help

For example, to enable the mlx4_vnic daemon with GC:
    # cat /etc/infiniband/mlx4_vnic.conf
    mlx4_vnicd=yes gc_enable=yes
    # /etc/init.d/mlx4_vnic_confd start
    Checking configuration file: [ OK ]
    Starting mlx4_vnicd (pid 30920): [ OK ]

The mlx4_vnicd daemon requires xenstore or libvirt to run.

Some Hypervisors may not have enough memory for the driver domain. As a result, the mlx4_vnic driver may fail to initialize or to create more vNics, causing the machine to be unresponsive.
• To avoid this behavior, you can:
a. Allocate more memory for the driver domain. For further information on how to increase dom0_mem, please refer to http://support.citrix.com/article/CTX126531
b. Lower the mlx4_vnic driver memory consumption by decreasing its RX-TX rings number and length. For further information, please refer to Section 4.5.4.1, "Module Parameters".

4.6 IP over InfiniBand

4.6.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following ConnectX/ConnectX-2 capabilities:
53. mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
    host1$ cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en

Step 5. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
    host1$ cp /sbin/insmod /tmp/initrd_en/sbin/

Step 6. If you plan to give your Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.
    host1$ cp /sbin/ifconfig /tmp/initrd_en/sbin

Step 7. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded. The order of the following commands (for loading modules) is critical.
    echo "loading Mellanox ConnectX EN driver"
    /sbin/insmod /lib/modules/mlnx_en/mlx4_core.ko
    /sbin/insmod /lib/modules/mlnx_en/mlx4_en.ko

Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.

Step 9. Save the init file.

Step 10. Close initrd.
    host1$ cd /tmp/initrd_en
    host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
    host1$ gzip /tmp/new_initrd_en.img

At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it properly.

A.10 iSCSI Boot
Mellanox FlexBoot enables an iSCS
54. mlx4_vnic_confd service is used to read these configuration files and pass the relevant data to the mlx4_vnic module. EoIB Host Administered vNic supports two forms of configuration files:
• Central Configuration File - /etc/infiniband/mlx4_vnic.conf
• vNic Specific Configuration Files - ifcfg-ethX

Both forms of configuration supply the same functionality. If both forms of configuration files exist, the central configuration file has precedence and only this file will be used.

Central Configuration File - /etc/infiniband/mlx4_vnic.conf
The mlx4_vnic.conf file consists of lines, each describing one vNic. The following file format is used:
    name=eth47 mac=00:25:8B:27:16:84 ib_port=mlx4_0:1 vid=2 vnic_id=7 bx=BX001 eport=A11

The fields used in the file have the following meaning:

Table 2 - mlx4_vnic.conf file format

Field      Description
name       The name of the interface that is displayed when running ifconfig.
mac        The MAC address to assign to the vNic.
ib_port    The device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value of 1 or 2.
vid        Optional field. If a VLAN ID exists, the vNic will be assigned the specified VLAN ID. This value must be between 0 and 4095.
           • If the vid is set to 'all', the ALL-VLAN mode will be enabled and the vNic will support multiple VLAN tags.
           • If no
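To make the field rules concrete, here is a minimal Python sketch that parses one vNic line and checks the constraints stated in the table (port number 1 or 2, vid between 0 and 4095 or 'all'). It assumes the `key=value` layout with `:` inside the MAC and `ib_port` values; the function name `parse_vnic_line` is hypothetical and is not part of MLNX_OFED.

```python
SAMPLE = ("name=eth47 mac=00:25:8B:27:16:84 ib_port=mlx4_0:1 "
          "vid=2 vnic_id=7 bx=BX001 eport=A11")

def parse_vnic_line(line):
    """Parse one vNic description line into a dict (illustrative only)."""
    fields = dict(token.split("=", 1) for token in line.split())

    # ib_port has the form [device name]:[port number]; the port is 1 or 2.
    device, _, port = fields["ib_port"].partition(":")
    port = int(port)
    if port not in (1, 2):
        raise ValueError("port number must be 1 or 2")
    fields["ib_port"] = (device, port)

    # vid is optional: either 'all' (ALL-VLAN mode) or a VLAN ID in 0..4095.
    vid = fields.get("vid")
    if vid is not None and vid.lower() != "all":
        if not 0 <= int(vid) <= 4095:
            raise ValueError("vid must be between 0 and 4095, or 'all'")
        fields["vid"] = int(vid)
    return fields
```

Calling `parse_vnic_line(SAMPLE)` returns a dictionary whose `ib_port` entry is the pair `("mlx4_0", 1)`.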
55. ... <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]

Table 19 lists the various flags of the command.

Table 19 - ibportstate Flags and Options

Flag            Optional / Mandatory   Description
-h(help)        Optional               Print the help menu
-d(ebug)        Optional               Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-a(ll)          Optional               Show all LIDs in range, including invalid entries
-v(erbose)      Optional               Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-V(ersion)      Optional               Show version info
-n(o_dests)     Optional               Do not try to resolve destinations
-D(irect)       Optional               Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port), '0,1,2,1,4' (out via port 1, then 2, ...)
-G(uid)         Optional               Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-M(ulticast)    Optional               Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
-s <smlid>      Optional               Use <smlid> as the target LID for SM/SA queries
-C <ca_name>    Optional               Us
56. one, it will be selected automatically. In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.

This utility also has a non-interactive mode:
    /sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]

5.2 InfiniBand Driver
The InfiniBand driver, mlx4_ib, handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

5.3 Ethernet Driver

5.3.1 Overview
The MLNX_EN driver is composed of the mlx4_core and mlx4_en kernel modules and exposes the following ConnectX/ConnectX-2 capabilities:
• Single/Dual port
• Up to 16 Rx queues per port
• 16 Tx queues per port
• Rx steering mode: Receive Core Affinity (RCA)
• Tx arbitration mode: VLAN user-priority (off by default)
• MSI-X or INTx
• Adaptive interrupt moderation
• HW Tx/Rx checksum calculation
• Large Send Offload (i.e., TCP Segmentation Offload)
• Large Receive Offload
• Multi-core NAPI support
• VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
• HW VLAN filtering
• HW multicast filtering
• ifconfig up/down + mtu changes (up to 10K)
• Ethtool support
• Net device statistics
• CX4, QSFP and SFP+ connectors
• Wake-on-LAN support
• Ethernet 10/40GigE

Note: The current version of MLNX_OFED supports NC-SI in Ethernet mode only.

5.3.2 Loading the
57. or more VLAN-associated vHubs. A specific gateway can have multiple vHubs distinguishable by their unique VLAN ID. Traffic coming from the Ethernet side on a specific eport will be routed to the relevant vHub group based on its VLAN tag (or to the default vHub for that GW if no VLAN ID is present).

4.5.1.3 Virtual NIC (vNic)
A virtual NIC is a network interface instance on the host side which belongs to a single vHub on a specific GW. The vNic behaves similarly to any regular hardware network interface. The host can have multiple interfaces that belong to the same vHub.

4.5.2 EoIB Configuration
The mlx4_vnic module supports two different modes of configuration, which is passed to the host mlx4_vnic driver using the EoIB protocol:
• host administration, where the vNic is configured on the host side
• network administration, where the configuration is done by the BridgeX

Both modes of operation require the presence of a BridgeX gateway in order to work properly. The EoIB driver supports a mixture of host and network administered vNics.

4.5.2.1 EoIB Host Administered vNic
In the host administered mode, vNics are configured using static configuration files located on the host side. These configuration files define the number of vNics and the vHub that each host administered vNic will belong to (i.e., the vNic's BridgeX box, eport and VLAN id properties). The
58. presents itself to the BIOS as a boot device. As a result, the BIOS will add to the list of boot devices "MLNX FlexBoot <ver>" for a ConnectX device. The priority of this list can be modified through BIOS setup.

A.7 Operation

A.7.1 Prerequisites
• Make sure that your client is connected to the server(s)
• The FlexBoot image is already programmed on the adapter card (see Section A.2)
• For InfiniBand ports only: Start the Subnet Manager as described in Section A.4
• The DHCP server should be configured and started (see Section 4.6.3.1, "IPoIB Configuration Based on DHCP")
• Configure and start at least one of the services iSCSI Target (see Section A.10) and/or TFTP (see Section A.5)

A.7.2 Starting Boot
Boot the client machine and enter BIOS setup to configure MLNX FlexBoot to be the first on the boot device priority list (see Section A.6).

Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.

If MLNX FlexBoot/iPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by th
59. prevent the diskless machine from booting.

Step 1. Back up your current initrd file.

Step 2. Make a new working directory and change to it.
    host1$ mkdir /tmp/initrd_ib
    host1$ cd /tmp/initrd_ib

Step 3. Normally, the initrd image is zipped. Extract it using the following command:
    host1$ gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_ib.

Step 4. Create a directory for the InfiniBand modules and copy them.
    host1$ mkdir -p /tmp/initrd_ib/lib/modules/ib
    host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1$ cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
60. provides storage services. Section 4.4.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target.

4.4.2 SRP Initiator
This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from http://www.t10.org.

The SRP Initiator supports:
• Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
• Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
• Basic functionality, task management and limited error handling

4.4.2.1 Loading SRP Initiator
To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes".

Note: For the changes to take effect, run: /etc/init.d/openibd restart

Note: When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).

4.4.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually establish an SRP connection between the Initiator and an SRP Target. Section 4.4.2.4 explains how to do this automatically.
• Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is runn
61. specified channel adapter or router

Flag                                Optional / Mandatory   Description
-P <ca_port>                        Optional               Use the specified port
-R                                  Optional               Reset the counters
-t <timeout_ms>                     Optional               Override the default timeout for the solicited MADs [msec]
-V(ersion)                          Optional               Show version info
<lid|guid> [[port] [reset_mask]]    Optional               LID or GUID

Examples:
    perfquery -r 32 1            # read performance counters and reset
    perfquery -e -r 32 1         # read extended performance counters and reset
    perfquery -R 0x20 1          # reset performance counters of port 1 only
    perfquery -e -R 0x20 1       # reset extended performance counters of port 1 only
    perfquery -R -a 32           # reset performance counters of all ports
    perfquery -R 32 2 0x0fff     # reset only error counters of port 2

1. Read local port's performance counters.
2. Read performance counters from LID 2, all ports.
3. Read then reset performance counters from LID 2, port 1.
    > perfquery -r 2 1
    # Port counters: Lid 2 port 1
    PortSelect:......................1
    CounterSelect:...................0x0100
    SymbolErrors:....................0
    LinkRecovers:....................0
    LinkDowned:......................0
    RcvErrors:.......................0
    RcvRemotePhysErrors:.............0
    RcvSwRelayErrors:................0
    XmtDiscards:.....................3
    XmtConstraintErrors:
62. srp_daemon, which:
• Detect targets on the fabric reachable by the Initiator (for Step 1)
• Output target attributes in a format suitable for use in the above "echo" command (Step 2)

The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages. Below, several usage scenarios for these utilities are presented.

ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:
    ibsrpdm
This command will output information on each SRP Target detected, in human-readable form. Sample output:
    IO Unit Info:
        port LID:        0103
        port GID:        fe800000000000000002c90200402bd5
        change ID:       0002
        max controllers: 0x10
    controller[ 1]
        GUID:      0002c90200402bd4
        vendor ID: 0002c9
        device ID: 005a44
        IO class : 0100
        ID:        LSI Storage Systems SRP Driver 200400a0b81146a1
        service entries: 1
            service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:
    ibsrpdm -d <umad device>
2. Assistance in creating an SRP connection
a. To generate output suit
63. taking the long way around any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]:

[Figure: 2D 6x5 torus; x = 0 to 5 along the horizontal axis, y = 0 to 4 along the vertical axis, with switches S, D, m, n, o, p, r and T marked]

For a pristine fabric, the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring.

One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics.

Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path destination switch that are active and not use
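The detour around a broken 1D ring can be illustrated with a small Python sketch. This is not the torus-2QoS implementation; it only walks the ring m-S-n-T-o-p-m from the example in both directions and keeps the shortest walk that avoids failed links. The function name `ring_path` is hypothetical.

```python
def ring_path(ring, src, dst, failed=()):
    """Shortest path from src to dst along a 1D ring, avoiding failed links.

    ring   -- switch names in ring order (the last one links back to the first)
    failed -- iterable of (a, b) pairs naming failed links
    Returns a list of switch names, or None if the failures isolate dst.
    """
    n = len(ring)
    broken = {frozenset(link) for link in failed}
    start = ring.index(src)
    ring.index(dst)  # raises ValueError if dst is not on the ring
    candidates = []
    for step in (1, -1):  # walk the ring in each direction
        path, i = [src], start
        while path[-1] != dst:
            j = (i + step) % n
            if frozenset((ring[i], ring[j])) in broken:
                path = None  # this direction is cut by a failed link
                break
            path.append(ring[j])
            i = j
        if path:
            candidates.append(path)
    return min(candidates, key=len) if candidates else None

RING = ["m", "S", "n", "T", "o", "p"]
```

With no failures, `ring_path(RING, "S", "T")` follows S-n-T. With link S-n failed it takes the long way S-m-p-o-T, and with both n-T and T-o failed it returns `None`, matching the case that torus-2QoS refuses to route.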
64. the OS. This can be done at runtime by running:
    echo <nr_pages> > /proc/sys/vm/nr_hugepages
to reserve nr_pages.

Note: If the system memory is too fragmented, the operation may fail. Therefore, we recommend performing this action after rebooting the system.

Note: Since we are using IPC shared memory for allocating huge pages, occasionally resources might not be freed. To delete old, unused shared memory resources, use ipcrm.

4.10 Auto Sensing
Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet). For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet.

4.10.1 Enabling Auto Sensing
Upon driver start up:
1. Sense the adapter card's port type:
   If a valid cable or module is connected (QSFP, SFP+, or SFP with EEPROM in the cable/module):
   • Set the port type to the sensed link type (IB/Ethernet)
   Otherwise:
   • Set the port type as default (Ethernet)

During driver run time:
• Sense a link every 3 seconds if no link is sensed/detected
• If sensed, set the port type as sensed
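The auto-sensing rules above can be summarized as a short Python sketch. It is a model of the described decision logic only, not driver code: the function names, the `"ib"`/`"eth"` labels and the injectable `sense` callback are all hypothetical.

```python
import time

DEFAULT_PORT_TYPE = "eth"   # default when nothing can be sensed (Ethernet)
SENSE_INTERVAL_S = 3        # run-time sensing period, in seconds

def port_type_at_startup(sensed):
    """sensed is 'ib' or 'eth' when a valid cable/module is connected, else None."""
    return sensed if sensed in ("ib", "eth") else DEFAULT_PORT_TYPE

def sense_until_link(sense, max_polls, sleep=time.sleep, current=DEFAULT_PORT_TYPE):
    """Poll sense() every SENSE_INTERVAL_S seconds while no link is detected.

    Returns the sensed type as soon as a link appears, or the current type
    if max_polls attempts pass without one.
    """
    for _ in range(max_polls):
        sensed = sense()
        if sensed in ("ib", "eth"):
            return sensed       # if sensed, set the port type as sensed
        sleep(SENSE_INTERVAL_S)
    return current
```

Passing the sleep function as a parameter keeps the sketch testable without actually waiting three seconds between polls.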
65. the following command:
    # lsmod | grep mlx4_en
If the module is loaded, mlx4_en should be displayed as shown in the example below:
    # lsmod | grep mlx4_en
    mlx4_en 75276 0
• Run ibv_devinfo. There is a new field named link_layer which can be either Ethernet or IB. If the value is IB, then you need to use connectx_port_config to change the ConnectX/ConnectX-2 port's designation to eth (see mlx4_release_notes.txt for details).
• Configure the IP address of the interface so that the link will become active.
• All applications which run over IB verbs should work on RoCE links as long as they use GRH headers (that is, as long as they specify use of GRH in their address vector).

4.1.5 Ported Applications
The following applications are ported with RoCE:
• ibv_*_pingpong examples are ported. The user must specify the GID of the remote peer using the new '-g' option. The GID has the same format as that in /sys/class/infiniband/mlx4_0/ports/1/gids/0.

Note: Care should be taken when using ibv_ud_pingpong. The default message size is 2K, which is likely to exceed the MTU of the RoCE link. Use ibv_devinfo to inspect the link MTU and specify an appropriate message size.

• All rdma_cm applications should work seamlessly without any change
• libsdp works without any change
• Performance tests

4.1.6 GID Tables
With RoC
66. the message disappears (see figure):
    Press Ctrl-B for the iPXE command line...
Alternatively, you may skip invoking the CLI right after POST and invoke it, instead, right after FlexBoot starts booting. Once the CLI is invoked, you will see the following prompt:
    iPXE>

Operation
The CLI resembles a Linux shell, where the user can run commands to configure and manage one or more PXE port network interfaces. Each port is assigned a network interface called neti, where i is 0, 1, 2, ..., <# of interface>. Some commands are general and are applied to all network interfaces. Other commands are port specific; therefore, the relevant network interface is specified in the command.

A.8.3 Command Reference

A.8.3.1 ifstat
Displays the available network interfaces (in a similar manner to Linux's ifconfig).
    iPXE> ifstat
    net0: ... on PCI02:00.0
      [Link:down, TX:8 TXE:2 RX:11 RXE:11]
      [Link status: The socket is not connected]
      [TXE: 2 x "No such file or directory"]
      [RXE: 3 x "The socket is not connected"]
      [RXE: 8 x "Operation canceled"]
    net1: ... on PCI02:00.0 (open)
      [Link:up, TX:12 TXE:0 RX:0 RXE:0]
    iPXE>

A.8.3.2 ifopen
Opens the network interface net<x>. The list of network interfaces is available via the ifstat command.
Example:
    iPXE>
67. tual Local Area Networks (VLAN) tag. The VLAN tag is used in the VLAN header within the EoIB packets and is enforced by EoIB hosts when handling the EoIB packets. The tag is also extended to the Ethernet fabric when packets pass through the BridgeX. This model of operation ensures a high level of security; however, it requires each VLAN tag used to have its own individual vNic to be created, and each vHub requires InfiniBand fabric resources like multicast groups (MGIDs). If many VLANs are needed, the resources required to create and manage them are large. ALL VLAN vHub enables the user to use its resources efficiently by creating a vNic that can support multiple VLAN tags without creating multiple vNics. However, it reduces VLAN separation compared to the vNic/vHub model.

ALL VLAN Functionality
When ALL VLAN is enabled, the address lookup on the BridgeX consists of the MAC address only (without the VLAN), so all packets with the same MAC, regardless of the VLAN, are sent to the same InfiniBand address. The same behavior can be expected from the host EoIB driver, which also sends packets to the relevant InfiniBand addresses while disregarding the VLAN. In both scenarios, the Ethernet packet that is embedded in the EoIB packet includes the VLAN header, enabling VLAN enforcement either in the Ethernet fabric or at the receiving EoIB host.

ALL VLAN must
68. y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops, as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example.

9.5.7.3 Torus Topology Discovery
The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus, this requires configuring four switches that define the three coordinate directions of the torus.

Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of interswitch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations.

Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and report when it has insufficient configuration for a torus with radix-4 dimensions.

In the event
69. 9.5.5 LASH Routing Algorithm
9.5.6 DOR Routing Algorithm
9.5.7 Torus-2QoS Routing Algorithm
9.6 Quality of Service Management in OpenSM
9.6.1 Overview
9.6.2 Advanced QoS Policy File
9.6.3 Simple QoS Policy Definition
9.6.4 Policy File Syntax Guidelines
9.6.5 Examples of Advanced Policy File
9.6.6 Simple QoS Policy - Details and Examples
9.6.7 SL2VL Mapping and VL Arbitration
9.6.8 Deployment Example
9.7 QoS Configuration Examples
9.7.1 Typical HPC Example: MPI and Lustre
9.7.2 EDC SOA (2-tier): IPoIB and SRP
9.7.3 EDC (3-tier): IPoIB, RDS, SRP
9.8 Adaptive Routing
9.8.1 Overview
70. 000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006 023 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007 020 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008 024 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
    5 valid lids dumped

5. Dump all non-empty mlids of switch with Lid 3.
    > ibroute -M 3
    Multicast mlids [0xc000-0xc3ff] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
    [Table: MLid rows with per-port membership marks]
    12 valid mlids dumped

10.12 smpquery

Applicable Hardware
All InfiniBand devices.

Description
Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.

Synopsis
    smpquery ... [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]

Table 20 lists the various flags of the command.

Table 20 - smpquery Flags and Options
O
71. 02 000 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008 008 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
    5 valid lids dumped

3. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2.
    > ibroute 2 3 7
    Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
    Lid Out Destination Port Info
    0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    3 valid lids dumped

4. Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016.
    > ibroute -G 0x000b8cffff004016
    Unicast lids [0x0-0x8] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
    Lid Out Destination Port Info
    0x0002 023 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003 000 : (Switch portguid 0x
72. 08f1040023'

Flag                        Optional / Mandatory   Default (If Not Specified)   Description
-s <smlid>                  Optional                                            Use <smlid> as the target lid for SM/SA queries
-C <ca_name>                Optional                                            Use the specified channel adapter or router
-P <ca_port>                Optional                                            Use the specified port
-t <timeout_ms>             Optional                                            Override the default timeout for the solicited MADs [msec]
<dest dr_path|lid|guid>     Optional                                            Destination's directed path, LID, or GUID
<portnum>                   Optional                                            Destination's port number
<op> [<value>]              Optional               query                        Define the allowed port operations: enable, disable, reset, speed, and query

In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:
1. The first ACTIVE port that is found.
2. If not found, the first port that is UP (physical link state is LinkUp).

Examples:
1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID, 3 in this case) to obtain additional link information using ibportstate.
    > ibstatus mlx4_0:1
    Infiniband device 'mlx4_0' port 1 status:
        default gid: fe80:0000:0000:0000:0000:0000:9289:3895
        base lid:    0x3
        sm lid:      0x3
        state:       2: INIT
2. Query the status of two channel adapters using directed paths.
3. Change the speed of a port.
73. 2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
The following changes are recommended for improving IPv6 traffic performance:
• Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0
• Disable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=0

6.2.3 Interrupt Moderation
Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.
    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: on TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 16
    rx-frames: 88
    rx-usecs-irq: 0
    rx-frames-irq: 0
    > ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0
    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: off TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 0
    rx-frames: 0
    rx-usecs-irq: 0
    rx
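When comparing the settings before and after such a change, it can help to turn the `ethtool -c` listing into a dictionary. The Python sketch below does this for text in the `name: value` layout shown in the example; the exact punctuation of that layout is an assumption (the extracted text lost it), and the function name `parse_coalesce` is hypothetical.

```python
import re

SAMPLE = """\
Coalesce parameters for eth1:
Adaptive RX: on TX: off
pkt-rate-low: 400000
pkt-rate-high: 450000
rx-usecs: 16
rx-frames: 88
rx-usecs-irq: 0
rx-frames-irq: 0
"""

def parse_coalesce(text):
    """Parse 'ethtool -c' style output into a dict (illustrative only)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Adaptive"):
            # One line carries both directions: "Adaptive RX: on TX: off"
            for direction, state in re.findall(r"\b(RX|TX):\s*(on|off)", line):
                settings["adaptive-" + direction.lower()] = (state == "on")
        elif ":" in line and not line.startswith("Coalesce"):
            key, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():  # skip non-numeric values
                settings[key.strip()] = int(value)
    return settings
```

On the sample text this reports adaptive Rx moderation as enabled and `rx-usecs` as 16, which are the two values the `ethtool -C` command in the example changes.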
74. 6.3.1, "IPoIB Configuration Based on DHCP"). Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target:
    filename "";
    option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn";

The following is an example for configuring an IB/ETH device to boot from an iSCSI target:
    host host1 {
        filename "";
        # For a ConnectX device with ports configured as InfiniBand, comment out
        # the following line
        option dhcp-client-identifier = 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
        # For a ConnectX device with ports configured as Ethernet, comment out
        # the following line
        hardware ethernet 00:02:c9:00:00:bb;
    }

A.11 WinPE
Mellanox FlexBoot enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe.

Appendix B: SRP Target Driver
The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many I/O modes on real or virtual devices in the backend:
1
75. 65 flow (this flow requires running of an external tool). Default: all flows except QoS.
-w, --wait
    This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec.
-d, --debug
    This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:
    OPT   Description
    -d0   Ignore other SM nodes
    -d1   Force single-threaded dispatching
    -d2   Force log flushing after each log message
    -d3   Disable multicast support
-m, --max_lid
    This option specifies the maximal LID number to be searched for during inventory file build. Default: 100.
-g, --guid
    This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-p, --port
    This option displays a menu of possible local port GUID values with which osmtest could bind.
-i, --inventory
    This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.
-s, --stress
    This o
76. 7
• Assign an IP address to the VLAN interface. This should create a new entry in the GID table (as index 1):
    > ifconfig eth2.7 7.10.11.12
• Verbs test.
  On server:
    > ibv_rc_pingpong -g 1
  On client:
    > ibv_rc_pingpong -g 1 server
• For rdma_cm applications, the user needs only to specify an IP address of a VLAN device for the traffic to go with the VLAN-tagged frames.
4.1.8 Reading Port Counters Statistics
It is possible to read port statistics in the same way it is done for regular InfiniBand ports. The information is available from the sysfs at /sys/class/infiniband/<device>/ports/<port number>/counters, and the supported counters are port_rcv_packets, port_xmit_packets, port_rcv_data and port_xmit_data. These counters count InfiniBand data only and do not account for Ethernet traffic. For example, to read the number of transmitted packets, run:
    > cat /sys/class/infiniband/<device>/ports/<port number>/counters/port_xmit_packets
Note: RoCE traffic is not shown in the associated Ethernet device's counters, since it is offloaded by the hardware and does not go through the Ethernet network driver.
4.1.9 A Detailed Example
This section provides a step-by-step example of using InfiniBand over Ethernet (RoCE).
Installation and Driver Loading
The MLNX_OFED installation script installs RoCE as part of mlx4 and mlx4_en
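The four port counters named in Section 4.1.8 can be collected in one pass instead of one cat per file. The following is a minimal Python sketch under the assumption that each counter file holds a single decimal number; the helper names counter_path and read_counters, and the device name mlx4_0 used in the usage note, are illustrative and not taken from the manual.

```python
import os

# Counter names listed in Section 4.1.8.
COUNTERS = (
    "port_rcv_packets",
    "port_xmit_packets",
    "port_rcv_data",
    "port_xmit_data",
)


def counter_path(device, port, name, root="/sys/class/infiniband"):
    """Build <root>/<device>/ports/<port>/counters/<name>."""
    return os.path.join(root, device, "ports", str(port), "counters", name)


def read_counters(device, port, root="/sys/class/infiniband"):
    """Return a dict mapping each counter name to its integer value.

    Assumption: every counter file contains one decimal number.
    """
    values = {}
    for name in COUNTERS:
        with open(counter_path(device, port, name, root)) as handle:
            values[name] = int(handle.read().strip())
    return values
```

For example, read_counters("mlx4_0", 1) would read the counters of port 1 on a device named mlx4_0, if such a device exists on the host. The root parameter exists only so that the function can be pointed at a test directory.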
77. 1.3.4 Directory Structure
1.4 Architecture
1.4.1 mthca HCA (IB) Driver
1.4.2 mlx4 VPI Driver
1.4.3 Mid-layer Core
1.4.4 ULPs
1.4.5 MPI
1.4.6 InfiniBand Subnet Manager
1.4.7 Diagnostic Utilities
1.4.8 Mellanox Firmware Tools
1.5 Quality of Service
Chapter 2 Installation
2.1 Hardware and Software Requirements
2.1.1 Hardware Requirements
2.1.2 Software Requirements
2.2 Downloading Mellanox OFED
2.3 Installing Mellanox OFED
2.3.1 Pre-installation N
78. B FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:
    bit_adder(ci, b1, b2, *co)
    {
        value = ci + b1 + b2
        *co = !!(value & 2)
        return value & 1
    }

    #define MASK_IS_SET(mask, attr) (!!((mask) & (attr)))

    bit_position = 1
    carry = 0
    atomic_response = 0
    for i = 0 to 63
    {
        if (i != 0)
            bit_position = bit_position << 1
        bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position), MASK_IS_SET(compare_add, bit_position), &new_carry)
        if (bit_add_res)
            atomic_response |= bit_position
        carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
    }
    return atomic_response
4.9 Huge Pages Support for Queue Resources
Buffer resources for QPs and CQs can now be set to use huge pages. When using huge pages, the HCA needs fewer MTT resources, which improves performance by reducing cache misses. Huge pages are supported for:
• UD QPs
• RC QPs
• CQs
Huge pages are OFF by default. An application can be instructed to use huge pages by exporting the following environment variables:
• HUGE_UD=y
• HUGE_RC=y
• HUGE_CQ=y
For huge pages allocation to succeed, the system administrator will have to reserve huge pages from
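Returning to the masked FetchAdd pseudocode earlier in this section: the effect of the field boundary mask is easiest to see on small values. The following Python sketch mirrors the bit-by-bit adder under the assumption that a set mask bit marks the most significant bit of a field, so the carry out of that bit is discarded. The function name masked_fetch_add and the width parameter are illustrative; this models the arithmetic only, not the verbs API.

```python
def masked_fetch_add(value, add, boundary_mask, width=64):
    """Add `add` to `value` independently per field.

    Assumption: a bit set in `boundary_mask` marks the most significant
    bit of a field, so the carry produced at that bit is dropped instead
    of propagating into the next field.
    """
    result = 0
    carry = 0
    for i in range(width):
        bit = 1 << i
        # One-bit full adder: carry-in plus the two operand bits.
        total = carry + ((value >> i) & 1) + ((add >> i) & 1)
        if total & 1:
            result |= bit
        # Propagate the carry unless this bit closes a field.
        carry = 0 if (boundary_mask & bit) else (total >> 1)
    return result


if __name__ == "__main__":
    # Without a boundary the carry ripples from byte 0 into byte 1.
    print(hex(masked_fetch_add(0x00FF, 0x0001, 0x00)))
    # With a boundary at bit 7 the low byte wraps on its own.
    print(hex(masked_fetch_add(0x00FF, 0x0001, 0x80)))
```

Setting bit 7 of the mask splits the low byte from the rest of the word, so several independent counters can share one 64-bit target.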
79. CP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier. The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier (see Section A.3.2, "Configuring the DHCP Server," on page 192).
DHCP Server
In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of the Mellanox OFED for Linux installation. The DHCP server must run on a machine which has loaded the IPoIB module. To run the DHCP server from the command line, enter:
    dhcpd <IB network interface name> -d
Example:
    host1# dhcpd ib0 -d
DHCP Client (Optional)
Note: A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux)".
In order to use a DHCP client identifier
80. CP socket family with SDP socket calls. The library also implements a user-level socket switch. Using a configuration file, the system administrator can set up the policy that selects the type of socket to be used. libsdp.so also has the option to allow server sockets to listen on both SDP and TCP interfaces. The various configurations with SDP/TCP sockets are explained inside the /etc/libsdp.conf file.
4.3.3 Configuring SDP
To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set "SDP_LOAD=yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
SDP can work over IPoIB interfaces or RoCE interfaces. In case of IPoIB, SDP uses the same IP addresses and interface names as IPoIB (see IPoIB configuration in Section 4.6.3 and Section 4.6.3.3). In case of RoCE, SDP uses the same IP addresses and interface names as the corresponding mlx4_en interfaces (see mlx4_en configuration in Section 5.3 and Section 5.3.4).
4.3.3.1 How to Know SDP Is Working
Since SDP is a transparent TCP replacement, it can sometimes be difficult to know that it is working correctly. To check whether traffic is passing through SDP or TCP, monitor the file /proc/net/sdpstats and see which counters are running.
Alternative Method: Using the sdpnetstat Program
The sdpnetstat program can be used to verify both that SDP is loaded and that it is being used. The following command shows all active SDP sockets using the same format as the trad
81. Cma1x10u9 SQUAjwONevaMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrodZ1lfCQ <username>@host1
Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine:
    host1$ cat id_rsa.pub | xargs ssh host2 "echo >>/home/<username>/.ssh/authorized_keys2"
    <username>@host2's password:    // Enter password
    host1$
For a local machine, simply add the key to authorized_keys2:
    host1$ cat id_rsa.pub >> authorized_keys2
Step 5. Test.
    host1$ ssh host2 uname
    Linux
7.4 MPI Selector - Which MPI Runs
Mellanox OFED contains a simple mechanism for system administrators and end-users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details. Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new shell (e.g., logout and
82. E there may be several entries in a port's GID table. The first entry always contains the IPv6 link-local address of the corresponding Ethernet interface. The link-local address is formed in the following way:
    gid[0..7] = fe80000000000000
    gid[8] = mac[0] ^ 2
    gid[9] = mac[1]
    gid[10] = mac[2]
    gid[11] = ff
    gid[12] = fe
    gid[13] = mac[3]
    gid[14] = mac[4]
    gid[15] = mac[5]
If VLAN is supported by the kernel and there are VLAN interfaces on the main Ethernet interface (the interface that the IB port is tied to), then each such VLAN will appear as a new GID in the port's GID table. The format of the GID entry will be identical to the one described above, except for the following change:
    gid[11] = VLAN ID high byte (4 MS bits)
    gid[12] = VLAN ID low byte
Please note that the VLAN ID is 12 bits wide.
4.1.6.1 Priority Pause Frames
Tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived from the IB SL field by taking the 3 least significant bits of the SL field.
4.1.7 Using VLANs
In order for RoCE traffic to use VLAN-tagged frames, the user needs to specify GID table entries that are derived from VLAN devices when creating address vectors. Consider the example below:
• Make sure VLAN support is enabled by the kernel. Usually this requires loading the 802.1q module:
    > modprobe 8021q
• Add a VLAN device:
    > vconfig add eth2
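The GID layout and the SL-to-priority rule described in this section can be written out as a short worked example. The following Python sketch follows the byte assignments listed for the link-local GID and its VLAN variant; the function names roce_gid and priority_from_sl are illustrative, and the MAC address used in the demo lines is only a sample value.

```python
def roce_gid(mac, vlan_id=None):
    """Build the 16-byte GID derived from an Ethernet MAC address.

    Without a VLAN, bytes 11 and 12 are ff and fe; with a VLAN they
    carry the high 4 bits and the low byte of the 12-bit VLAN ID.
    """
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("a MAC address has exactly 6 octets")
    gid = bytearray(16)
    gid[0:8] = bytes.fromhex("fe80000000000000")
    gid[8] = octets[0] ^ 2          # flip the universal/local bit
    gid[9] = octets[1]
    gid[10] = octets[2]
    gid[11] = 0xFF
    gid[12] = 0xFE
    gid[13:16] = octets[3:6]
    if vlan_id is not None:
        if not 0 <= vlan_id <= 0xFFF:
            raise ValueError("the VLAN ID is 12 bits wide")
        gid[11] = (vlan_id >> 8) & 0x0F   # 4 most significant bits
        gid[12] = vlan_id & 0xFF          # low byte
    return bytes(gid)


def priority_from_sl(sl):
    """Priority of a tagged frame: the 3 least significant bits of the SL."""
    return sl & 0x7


if __name__ == "__main__":
    print(roce_gid("00:02:c9:00:00:bb").hex())
    print(roce_gid("00:02:c9:00:00:bb", vlan_id=7).hex())
```

Comparing the two printed GIDs shows that only bytes 11 and 12 differ between the untagged entry and the VLAN entry.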
83. ED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.
Note: Pre-existing configuration files will be saved with the extension ".conf.saverpm".
If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).
If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory.
Usage:
    mlnx_add_kernel_support.sh -i|--iso <mlnx iso> [-t|--tmpdir <local work dir>] [-v|--verbose]
Example: The following command will create a MLNX_OFED_LINUX ISO image for RedHat 5.6 under the /tmp directory.
    MLNX_OFED_LINUX-1.5.3-rhel5.6-x86_64/docs/mlnx_add_kernel_support.sh -i /mnt/MLNX_OFED_LINUX-1.5.3-rhel5.6-x86_64.iso
    All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to continue?[y/N]:y
    Removing OFED RPMs...
    Running mkisofs...
    Created /tmp/MLNX_OFED_LINUX-1.5.3-rhel5.6-x86_64.iso
2.3.2 Installation Script
Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Proce
84. Enable Quality of Service support in the HCA (default: off) (bool)
enable_pre_t11_mode: For FCoXX, enable pre-t11 mode if non-zero (default: 0) (int)
internal_err_reset: Reset device on internal errors if non-zero (default: 1) (int)
C.2 mlx4_ib Parameters
debug_level: Enable debug tracing if > 0 (default: 0)
C.3 mlx4_en Parameters
inline_thold: Threshold for using inline data (int)
num_rx_rings: Total number of RX rings (default: 16, range: 1-16, power of 2) (uint)
udp_rss: Enable RSS for incoming UDP traffic, or disabled (0) (bool)
num_lro: Number of LRO sessions per ring, or disabled (0) (uint)
use_napi: Use NAPI (1, default) or process incoming traffic from interrupt context (0) (bool)
use_tx_polling: Use polling for TX processing (default: 1) (bool)
enable_sys_tune: Tune the CPUs for better performance (default: 0) (bool)
Appendix D: ib-bonding Driver for Systems Using SLES10 SP4
D.1 Using the ib-bonding Driver
The ib-bonding driver is a High Availability solution for IPoIB interfaces. It is based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. The ib-bonding package contains a bonding driver and a utility called ib-bond to manage and control the driver operation. The ib-bonding driver comes with the ib-bonding package (run "rpm -qi ib-bonding" to get the package i
85. Ethernet Driver
By default, the Mellanox OFED stack loads mlx4_en. Run 'ifconfig -a' to verify that the module is listed.
5.3.3 Unloading the Driver
If /etc/infiniband/openib.conf had "MLX4_EN_LOAD=yes" at driver start-up, then you can unload the mlx4_en driver by running:
    /etc/init.d/openibd stop
Otherwise, unload mlx4_en by running:
    > modprobe -r mlx4_en
5.3.4 Ethernet Driver Usage and Configuration
• To assign an IP address to the interface, run:
    > ifconfig eth<x> <ip>
  where 'x' is the OS-assigned interface number.
• To check driver and device information, run:
    > ethtool -i eth<x>
  Example:
    > ethtool -i eth2
    driver: mlx4_en (MT_1020110019_CX-3)
    version: ...
    firmware-version: 2.10.0000
    bus-info: 0000:07:00.0
• To query stateless offload status, run:
    > ethtool -k eth<x>
• To set stateless offload status, run:
    > ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
• To query interrupt coalescing settings, run:
    > ethtool -c eth<x>
• By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern. To enable/disable adaptive interrupt moderation, use the following command:
    > ethtool -C eth<x> adaptive-rx on|off
• Above an upper limit of packet rate, adaptive modera
86. I boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd (Linux). There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd. If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver as described in Section A.8.
A.10.1 Configuring an iSCSI Target in Linux Environment
Prerequisites
Step 1. Make sure that an iSCSI Target is installed on your server side. You can download and install an iSCSI Target from the following location: http://sourceforge.net/projects/iscsitarget/files/iscsitarget/
Step 2. Dedicate a partition on your iSCSI Target on which you will later install the operating system.
Step 3. Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:
    Lun 0 Path=/dev/sda5,Type=fileio
Example of an iSCSI Target iqn line:
    Target iqn.2007-08.7.3.4.10:iscsiboot
Step 4. Start your iSCSI Target. Example:
    host1# /etc/init.d/iscsitarget start
Configuring the DHCP Server to Boot From an iSCSI Target
Configure DHCP as described in Section 4
87. If it is not present, the trailer of the configuration file name (e.g. ifcfg-eth47 => "eth47") is used instead.
HWADDR: The MAC address to assign to the vNic.
BXADDR: The BridgeX box system GUID or system name string.
BXEPORT: The string describing the eport name.
VNICVLAN: (Optional field) If it exists, the vNic will be assigned the VLAN ID specified. This value must be between 0 and 4095, or 'all' for the ALL VLAN feature.
VNICIBPORT: The device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value of 1 or 2.
GW_PKEY: (Optional field) If the discovery_pkeys module parameter is set, this value controls which partition is used to discover the gateways. For more information about discovery_pkeys, please refer to Section 4.5.3.6, "Discovery Partitions Configuration," on page 70.
Other fields available for regular eth interfaces in the ifcfg-ethX files may also be used.
mlx4_vnic_confd
Once the configuration files are updated, the host-administered vNics can be created. To manage the host-administered vNics, run the following script:
    Usage: /etc/init.d/mlx4_vnic_confd {start|stop|restart|reload|status}
This script manages host-administered vNics only; to retrieve general inform
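The field rules listed above for a host-administered vNic configuration file can be checked before the file is handed to mlx4_vnic_confd. The following Python sketch is illustrative only. It assumes the usual ifcfg KEY=VALUE syntax, assumes that VNICIBPORT separates device name and port with a colon, and treats every field not marked optional as required; the function names parse_ifcfg and check_vnic are hypothetical.

```python
import re


def parse_ifcfg(text):
    """Read KEY=VALUE lines, skipping blanks and '#' comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip().strip('"')
    return fields


def check_vnic(fields):
    """Return a list of problems found; an empty list means none."""
    problems = []
    # Assumption: fields not described as optional must be present.
    for key in ("HWADDR", "BXADDR", "BXEPORT", "VNICIBPORT"):
        if key not in fields:
            problems.append("missing " + key)
    vlan = fields.get("VNICVLAN")
    if vlan is not None and vlan != "all":
        if not vlan.isdecimal() or not 0 <= int(vlan) <= 4095:
            problems.append("VNICVLAN must be 0-4095 or 'all'")
    ibport = fields.get("VNICIBPORT")
    # Assumption: the form is <device name>:<port>, with port 1 or 2.
    if ibport is not None and not re.fullmatch(r"[^:]+:[12]", ibport):
        problems.append("VNICIBPORT must be <device name>:<1|2>")
    return problems
```

A file that passes these checks can still be rejected by the driver for other reasons, for example when the named eport does not exist on the gateway.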
88. LE <full path>: AR Manager log file. This option can be changed on-the-fly. Default: /var/log/armgr.log
LOG_SIZE <size in MB>: This option defines the maximal AR Manager log file size in MB. The log file will be truncated and restarted upon reaching this limit. This option cannot be changed on-the-fly. 0: unlimited log file size. Default: 5
Per-switch AR Options
A user can provide per-switch configuration options with the following syntax:
    SWITCH <GUID> {
        <switch option 1>;
        <switch option 2>;
    }
The following are the per-switch options:
Table 8 - Adaptive Routing Manager Per-Switch Options File
Option: ENABLE: <true|false>
    Description: Allows you to enable/disable the AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. This option can be changed on-the-fly.
    Values: Default: true
Option: AGEING_TIME: <usec>
    Description: Applicable to bounded AR mode only. Specifies how long there must be no traffic before the switch declares a transmission burst finished and allows changing the output port for the next transmission burst (32-bit value). In the per-switch options file, this option refers to the particular switch only. This option can be changed on-the-fly.
    Values: Default: 30
Example of Adaptive Routing Manager Options Fi
89. LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
• Using port names defined in the topology file (Tool option '-n'): This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the '-l' option.
10.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
Note: This version of ibdiagnet is included in the ibutils2 package, and it is run by default after installing Mellanox OFED. To use this ibdiagnet version, run: ibdiagnet
Please see ibutils2_release_notes.txt for additional information and known issues.
ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
10.3.1 SYNOPSIS
    ibdiagnet [-i <dev-name>] [-p <port-num>] [-pm] [-pc] [-P <<PM>=<Value>>] e al [-lw <1x|4x|8x|12x>] [-ls <2.5|5|10>] [--skip <ibdiag stage>] [-o <out-dir>] [-h] [-V]
90. Mellanox Technologies
Mellanox OFED for Linux User Manual
Rev 1.5.3-3.1.0
www.mellanox.com
NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Mellanox Technologies M
91. R Routing Algorithm
The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.
Use the '-R dor' option to activate the DOR algorithm.
9.5.7 Torus-2QoS Routing Algorithm
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:
• Free of credit loops routing
• Two levels of QoS, assu
92. (Sheet 1 of 3)
Columns: Switch; Affected/Relevant Commands; Description
-h
    Description: Print the help menu.
-hh
    Description: Print an extended help menu.
-d[evice] <device>
    Commands: All
    Description: Specify the device to which the Flash is connected.
Table 23 - mstflint Switches (Sheet 2 of 3)
-guid <GUID>
    Commands: burn, sg
    Description: GUID base value. 4 GUIDs are automatically assigned to the following values: guid -> node GUID; guid+1 -> port1; guid+2 -> port2; guid+3 -> system image GUID. Note: Port2 guid will be assigned even for a single-port HCA; the HCA ignores this value.
-guids <GUIDs...>
    Commands: burn, sg
    Description: 4 GUIDs must be specified here. The specified GUIDs are assigned the following values, respectively: node, port1, port2 and system image GUID. Note: Port2 guid must be specified even for a single-port HCA; the HCA ignores this value. It can be set to 0x0.
-mac <MAC>
    Commands: burn, sg
    Description: MAC address base value. Two MACs are automatically assigned to the following values: mac -> port1; mac+1 -> port2. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-macs <MACs...>
    Commands: burn, sg
    Description: Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-blank_guids
    Commands: burn
    Description: Burn the image with b
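The automatic assignment performed for the -guid and -mac switches is plain base-plus-offset arithmetic, which the following Python sketch makes explicit. It is an illustration of the rule stated in the table, not a reimplementation of mstflint; the function names derive_guids and derive_macs and the base values in the demo lines are hypothetical.

```python
def derive_guids(base):
    """Map a GUID base value to the four GUIDs assigned from it.

    Offsets follow the table: +0 node, +1 port1, +2 port2,
    +3 system image GUID.
    """
    names = ("node", "port1", "port2", "system_image")
    return {name: base + offset for offset, name in enumerate(names)}


def derive_macs(base):
    """Map a MAC base value to the two port MACs (+0 port1, +1 port2)."""
    return {"port1": base, "port2": base + 1}


if __name__ == "__main__":
    # Hypothetical base values, used only to show the offsets.
    for name, guid in derive_guids(0x0002C90300001038).items():
        print(f"{name:13s} 0x{guid:016x}")
    for name, mac in derive_macs(0x0002C90000BB).items():
        print(f"{name:13s} 0x{mac:012x}")
```

As the table notes, the port2 value is assigned even on a single-port HCA, where the HCA simply ignores it.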
93. X installation script e uninstall sh This is the MLNX_OFED_LINUX un installation script e lt RPMS folders gt Directory of binary RPMs for a specific CPU architecture e firmware Directory of the Mellanox IB HCA firmware images including Boot over IB e src Directory of the OFED source tarball and the Mellanox Firmware Tools MFT tarball e docs Directory of Mellanox OFED related documentation 1 4 Architecture Figure 1 shows a diagram of the Mellanox OFED stack and how upper layer protocols ULPs interface with the hardware and with the kernel and user space The application level also shows the versatility of markets that Mellanox OFED applies to Mellanox Technologies 15 Rev 1 5 3 3 1 0 Mellanox OFED Overview Figure 1 Mellanox OFED Stack Back end App Middleware Front end Eth Cluster Config Mgmnt A Life Sciences Block Storage HPC Application Application Mellanox VPI Device HCA NIC Markets gt Linux MEE OFED in Linux Applications Mi OF ED The following sub sections briefly describe the various components of the Mellanox OFED stack 1 4 1 mthca HCA IB Driver mthca is the low level driver implementation for the following Mellanox Technologies HCA InfiniBand devices InfiniHost InfiniHost III Ex and InfiniHost III Lx 1 4 2 mlx4 VPI Driver m1x4 is the low level driver implementation for the ConnectX and ConnectX 2 adapters designed by Mellanox
94. a v2 ethx unique lt ethX gt The associated Ethernet device used by eth3 RoCE lt port gt The port number 1 42 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 The following is an example of the ibdev2netdev utility s output and the entries added per each output line Example sw419 ibdev2netdev mix4 0 port 2 SS era mlx4 0 port 1 lt gt eth3 ofa v2 eth2 u2 0 nonthreadsafe default libdaplofa so 2 dapl 2 0 eth2 2 ofa v2 eth3 u2 0 nonthreadsafe default libdaplofa so 2 dapl 2 0 eth3 1 4 2 Reliable Datagram Sockets 4 2 1 Overview Reliable Datagram Sockets RDS is a socket API that provides reliable in order datagram deliv ery between sockets over RC or TCP IP RDS is intended for use with Oracle RAC 11g For programming details enter host1 man rds 4 2 2 RDS Configuration The RDS ULP is installed as part of Mellanox OFED for Linux To load the RDS module upon boot edit the file etc infiniband openib conf and set RDS LOAD yes For the changes to take effect run etc init d openibd restart aa 4 3 Sockets Direct Protocol 4 3 1 Overview Sockets Direct Protocol SDP is an InfiniBand byte stream transport protocol that provides TCP stream semantics Capable of utilizing InfiniBand s advanced protocol offload capabilities SDP can provide lower latency higher bandwidth and lower CPU utilization than IPoIB or Ethernet running some sockets
95. able for utilization in the echo command of Section 4 4 2 2 add the c option to ibsrpdm ibsrpdm c Sample output id_ext 200400A0B81146A1 ioc guid 0002c90200402bd4 dgid e800000000000000002c90200402bd5 pkey ffff service 1d 200400a0b81146al b To establish a connection with an SRP Target using the output from the libsrpdm c exam ple above execute the following command echo n id ext 200400A0B81146A1 i0c_guid 0002c90200402bd4 dgid e800000000000000002c90200402bd5 pkey ffff service id 200400a0b81146a1 gt sys class infiniband_ srp srp mthca0 1 add_target The SRP connection should now be up the newly created SCSI devices should appear in the listing obtained from the fdisk 1 command srp_daemon The srp daemon utility is based on ibsrpdm and extends its functionality In addition to the ibsr pdm functionality described above srp_daemon can also e Establish an SRP connection by itself without the need to issue the echo command described in Section 4 4 2 2 e Continue running in background detecting new targets and establishing SRP connec tions with them daemon mode 56 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 e Discover reachable SRP Targets given an infiniband HCA name and port rather than just by dev umad lt N gt where lt N gt is a digit e Enable High Availability operation together with Device Mapper Multipath e Have a conf
96. ace SDP debug is controlled by options in the Libsdp conf file You can also have a local version and point to it explicitly using the following command host1 export LIBSDP CONFIG FILE lt path gt libsdp conf To obtain extensive debug information you can modify libsdp conf to have the log directive produce maximum debug output provide the min level flag with the value 1 The log statement enables the user to specify the debug and error messages that are to be sent and their destination The syntax of log is as follows log destination stderr syslog file lt filename gt min level 1 9 where options are Mellanox Technologies 45 Rev 1 5 3 3 1 0 Driver Features destination send log messages to the specified destination stderr forward messages to the STDERR syslog send messages to the syslog service file lt filename gt write messages to the file var log filename for root For a regular user write to tmp lt filename gt lt uid gt if filename is not specified as a full path otherwise write to lt path gt lt filename gt lt uid gt min level verbosity level of the log 9 print errors only 8 print warnings 7 print connect and listen summary useful for tracking SDP usage 4 print positive match summary useful for config file debug 3 print negative match summary useful for config file debug 2 print function calls and return values 1 print debug messages Examples To print SDP usage pe
97. ade of LID GID and QPN This transla tion is totally invisible to the OS and user Thus differentiating EoIB from IPoIB which exposes a 20 Bytes HW address to the OS The mlx4 vnic module is designed for Mellanox s ConnectX family of HCAs and intended to be used with Mellanox s BridgeX gateway family Having a BridgeX gateway is a requirement for using EoIB It performs the following operations e Enables the layer 2 address translation required by the mlx4_vnic module e Enables routing of packets from the InfiniBand fabric to a or 10 GigE Ethernet sub net 4 5 1 Ethernet over IB Topology EoIB is designed to work over an InfiniBand fabric and requires the presence of two entities e Subnet Manager SM The required subnet manager configuration is not unique to EoIB but rather similar to other Infini Band applications and ULPs Mellanox Technologies 61 J Rev 1 5 3 3 1 0 Driver Features e BridgeX gateway The BridgeX gateway is at the heart of EoIB On one side usually referred to as the internal side it is connected to the InfiniBand fabric by one or more links On the other side usually referred to as the external side it is connected to the Ethernet subnet by one or more ports The Ethernet connec tions on the BridgeX s external side are called external ports or eports Every BridgeX that is in use with EoIB needs to have one or more eports connected 4 5 1 1 External Ports eports and Gateway The combinatio
98. aemon log e It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in etc infiniband openib conf to yes However this option also enables SRP High Availability that has some more features see Section 4 4 2 6 For the changes in openib conf to take effect run etc init d openibd restart 4 4 2 5 Multiple Connections from Initiator IB Port to the Target Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target to the same Target IB port or to different IB ports on the same Target HCA In case of a single Target IB port i e SRP connections use the same path the configuration is enabled using a different initiator_ext value for each SRP connection The initiator_ext value is a 16 hexadecimal digit value specified in the connection command Also in case of two physical connections 1 e network paths from a single initiator IB port to two different IB ports on the same Target HCA there is need for a different initiator_ext value on each path The conventions is to use the Target port GUID as the initiator_ext value for the relevant path If you use srp_daemon with n flag it automatically assigns initiator_ext values according to this convention For example id_ext 200500A0B81146A1 ioc guid 0002c90200402bec dgid fe800000000000000002c90200402bed pkey ffff service id 200500a0b81146
99. affic analysis. The following describes a workflow for local HCA (adapter) sniffing:
• Run ibdump with the desired options
• Run the application whose traffic you wish to analyze
• Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode)
• Open Wireshark and load the generated file
How to Get Wireshark
Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details. Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system.
Synopsis
ibdump [options]
Table 25 lists the various flags of the command.
Table 25: ibdump Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h, --help | Optional | | Print the help menu
-d, --ib-dev=<dev> | Optional | First device found | Use IB device <dev>
-i, --ib-port=<port> | Optional | 1 | Use port <port> of IB device
-o, --output=<file> | Optional | sniffer.pcap | Dump file name
-b, --max-burst=<log2 burst> | Optional | 12 (4096 entries) | log2 of the maximal burst size that can be captured with no packet loss. Each entry takes MTU bytes of memory
--mem-mode <size> | Optional | | When specified, packets are written to the dump file only after the capture is stopped. It is faster than the defa
100. a1,initiator_ext=ed2b400002c90200
Notes:
1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRPHA_ENABLE to yes).
4.4.2.6 High Availability (HA)
Overview
High Availability works using the Device Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices. Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.
Operation
When a path from a port to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path t
101. an be pinned by a user space application. If desired, tune the value unlimited to a specific amount of RAM.
Step 6. For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 9, "OpenSM Subnet Manager".
Step 7. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as:
• HCA firmware version
• Kernel architecture
• Driver version
• Number of active HCA ports along with their states
• Node GUID
Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.
host1# /usr/bin/hca_self_test.ofed
---- Performing InfiniBand HCA Self Test ----
Number of HCAs Detected ................ 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-1.5.3 (OFED-1.5.3): 1.5.2-2.6.32.12-0.7-default
Host Driver RPM Check .................. PASS
HCA Firmware on HCA #0 ................. v2.9.1000
HCA Firmware Check on HCA #0 ........... PASS
Host Driver Initialization ............. PASS
Number of HCA Ports Active ............. 1
Port State of Port #1 on HCA #0 ........ UP 4X DDR
Port State of Port #2 on HCA #0 ........ INIT
102. Error Counter Check on HCA #0 .......... PASS
Kernel Syslog Check .................... PASS
Node GUID on HCA #0 .................... 00:02:c9:03:00:00:10:e0
---- DONE ----
Note: After the installer completes, information about the Mellanox OFED installation, such as prefix, kernel version, and installation parameters, can be retrieved by running the command /etc/infiniband/info.
2.3.4 Installation Results
Software
• The OFED and MFT packages are installed under the /usr directory.
• The kernel modules are installed under:
  InfiniBand subsystem: /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/
  mlx4 driver: Under /lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4 you will find mlx4_core.ko, mlx4_en.ko, mlx4_ib.ko, mlx4_vnic.ko and mlx4_fc.ko
  IPoIB: /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
  SDP: /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko
  SRP: /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/srp/ib_srp.ko
  RDS: /lib/modules/`uname -r`/updates/kernel/net/rds/rds.ko, /lib/modules/`uname -r`/updates/kernel/net/rds/rds_rdma.ko, /lib/modules/`uname -r`/updates/kernel/net/rds/rds_tcp.ko
• The package kernel-ib-devel include files are placed under /usr/src/ofa_kernel/include/. These include files should be used when building kernel modu
103. are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes a default Service ID of 0x00000000010648CA. The following two match rules are equivalent:
rds : <SL>
any, service-id 0x00000000010648CA : <SL>
9.6.6.4 SRP
The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent:
srp, target-port-guid 0x1234 : <SL>
any, target-port-guid 0x1234 : <SL>
Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port GUID only) should be placed at the end of the qos-ulps match rules.
9.6.6.5 MPI
The SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on the MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section.
9.6.7 SL2VL Mapping and VL Arbitration
The OpenSM cached options file has a set of QoS-related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
• Max VLs: the maximum number of VLs that will be on the subnet
• High limit: the limit of the High Priority component of the VL Arbitration table (IBA 7.6.9)
• VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template Mellanox Technologi
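The relationship between the RDS port number and the Service ID described in this excerpt can be sketched in a few lines. This is a minimal illustration, not part of the manual: the constant RDS_SERVICE_ID_BASE is an assumption inferred from the quoted default (0x00000000010648CA with port 0x48CA in the low 4 hex digits), and the function names are hypothetical.

```python
# Assumption: the fixed upper bits are inferred from the default Service ID
# quoted in the text (0x00000000010648CA for port 0x48CA), not from a spec.
RDS_SERVICE_ID_BASE = 0x0000000001060000
RDS_DEFAULT_PORT = 0x48CA


def rds_service_id(port: int = RDS_DEFAULT_PORT) -> int:
    """Place the 16-bit TCP/IP port in the low 4 hex digits of the Service ID."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits (0..0xFFFF)")
    return RDS_SERVICE_ID_BASE | port


def format_service_id(service_id: int) -> str:
    """Render a Service ID as a 16-hex-digit string, as written in match rules."""
    return f"0x{service_id:016X}"


if __name__ == "__main__":
    print(format_service_id(rds_service_id()))
```

With the default port, the formatted value is the same Service ID that appears in the "any, service-id" match rule above.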
104. as a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned LIDs is selected. If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop.
b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC group.
c. If none, prefer those which go through another NodeGuid.
d. Fall back to the number-of-paths method (if all go to the same node).
9.5.1 Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in the fabric switches, unless the -r (--reassign_lids) option is specified.
-r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.
If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked. In the case of using the file-based routing, any topology changes are currently ignored. The file routing engin
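The counter-based balancing rule at the start of this excerpt (among ports with the same MinHop, pick the one with the fewest previously assigned target LIDs) can be sketched as follows. This is a simplified illustration only: it ignores the LMC > 0 checks, the data structures and function name are hypothetical, and breaking remaining ties by the lower port number is an assumption, not something the text specifies.

```python
def pick_port(min_hops: dict, assigned: dict) -> int:
    """Choose an output port for one target LID.

    min_hops: port number -> hop count from that port to the target LID.
    assigned: port number -> number of target LIDs already routed through it
              (updated in place, like the per-port counter in the text).
    """
    best = min(min_hops.values())
    # Only ports that reach the target in the minimum number of hops qualify.
    candidates = [port for port, hops in min_hops.items() if hops == best]
    # Fewest previously assigned LIDs wins; the lower port number breaks ties
    # (tie-break is an assumption made for this sketch).
    port = min(candidates, key=lambda p: (assigned.get(p, 0), p))
    assigned[port] = assigned.get(port, 0) + 1
    return port


if __name__ == "__main__":
    counters = {}
    hops = {1: 2, 2: 2, 3: 3}  # ports 1 and 2 tie on MinHop; port 3 is longer
    print([pick_port(hops, counters) for _ in range(4)])
```

Repeated calls alternate between the two equal-MinHop ports, which is the balancing effect the counter is meant to achieve.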
105. as the following sections:
I) Port Groups (denoted by port-groups)
This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to a partition with a given PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to a partition with a given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)
II) QoS Setup (denoted by qos-setup)
This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
III) QoS Levels (denoted by qos-levels)
Each QoS Level defines a Service Level (SL) and a few optional fields:
• MTU limit
• Rate limit
• PKey
• Packet lifetime
When a path(s) search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Level, it can also be explicitly referred to by any match rule.
IV) QoS M
106. atch rule has only one criterion: its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and a QoS Level has only one constraint: Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below.
Policy File Syntax Guidelines
• Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
• Comments are started with the pound sign (#) and terminated by EOL.
• Any keyword should be the first non-blank in the line, unless it is a comment.
• Keywords that denote section/subsection start have matching closing keywords.
• Having a QoS Level named "DEFAULT" is a must; it is applied to PR/MPR requests that did not match any of the matching rules.
• Any section/subsection of the policy file is optional.
Examples of Advanced Policy File
As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here is an example of the shortest policy file:
qos-levels
    qos-level
        name: DEFAULT
        sl: 0
    end-qos-level
end-qos-levels
The port groups section is missing because there are no match rules, which means that port groups are not referred an
107. atching Rules (denoted by qos-match-rules)
Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, such that the first match takes precedence. Each rule has the name of the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule. Queries can be matched by:
• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for the destination port)
• PKey
• QoS class
• Service ID
To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only a Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.
9.6.3 Simple QoS Policy Definition
The simple QoS policy definition consists of a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a m
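The matching semantics described in this excerpt (rules scanned in file order, first match wins, a query must satisfy all of a rule's criteria, and unmatched queries get the default level) can be sketched as below. This is a minimal model, not OpenSM's implementation: the dictionary-based representation of rules and queries and the field names are assumptions made for the illustration. A field absent from the query dictionary stands for a component-mask bit that is off.

```python
def rule_matches(criteria: dict, query: dict) -> bool:
    """A query matches only if it satisfies ALL criteria of the rule.

    criteria: field name -> set of accepted values.
    query:    field name -> value, containing only the fields whose
              component-mask bit is on (hypothetical representation).
    """
    return all(
        field in query and query[field] in accepted
        for field, accepted in criteria.items()
    )


def select_qos_level(rules: list, query: dict, default: str = "DEFAULT") -> str:
    """Scan rules in order of appearance; the first match takes precedence."""
    for criteria, level_name in rules:
        if rule_matches(criteria, query):
            return level_name
    return default  # no rule matched: the DEFAULT QoS level applies


if __name__ == "__main__":
    rules = [
        ({"service_id": {0x10}, "pkey": {0x0F00}}, "WholeSet"),
        ({"service_id": {0x10}}, "ByServiceId"),
    ]
    # Only Service ID is set, so the first rule (which also needs PKey) is skipped.
    print(select_qos_level(rules, {"service_id": 0x10}))
```

Extra fields in the query that no criterion mentions are simply ignored, which mirrors the statement that not all query fields have to appear in the rule.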
108. ate and burn the composite image. Run:
flint -dev <mst device name> brom <expansion ROM image>
Example on Linux:
flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom
Example on Windows:
flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom
Removing the Expansion ROM Image
Remove the expansion ROM image. Run:
flint -dev <mst device name> drom
Note: When removing the expansion ROM image, you also remove FlexBoot from the boot device list.
A.3 Preparing the DHCP Server in Linux Environment
The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB.
A.3.1 Installing the DHCP Server
To add IPoIB support in the DHCP client/server, please refer to docs/dhcp/README.
A.3.2 Configuring the DHCP Server
A.3.2.1 For ConnectX Family Devices
When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID (all separated by colons and represented in hexadecimal digits).
Extracting
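The client identifier layout described above (the fixed prefix followed by the 8-byte port GUID, colon-separated hexadecimal) can be assembled mechanically. The following is a small sketch under stated assumptions: the function name is hypothetical, the accepted input formats (plain hex, 0x-prefixed, or colon-separated) are a convenience chosen for the example, and the GUID used in the demonstration is only an illustrative value.

```python
# Prefix quoted in the text for ConnectX family devices.
FLEXBOOT_CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def dhcp_client_identifier(port_guid: str) -> str:
    """Build the DHCP client identifier from an 8-byte port GUID string."""
    digits = port_guid.strip().lower().replace(":", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("port GUID must be 8 bytes (16 hex digits)")
    # Split the GUID into byte pairs and append them to the fixed prefix.
    guid_bytes = [digits[i:i + 2] for i in range(0, 16, 2)]
    return FLEXBOOT_CLIENT_ID_PREFIX + ":" + ":".join(guid_bytes)


if __name__ == "__main__":
    # Illustrative GUID only; substitute the port GUID of the actual adapter.
    print(dhcp_client_identifier("0002c90200402bed"))
```

The resulting string is in the colon-separated form that a DHCP server configuration would typically compare against the client identifier option.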
109. ation about the vNics on the system (including network administered vNics), refer to Section 4.5.3.1, "mlx4_vnic_info", on page 67.
Note: When using the BXADDR/bx field, all vNics' BX address configuration should be consistent: either all of them use GUID format, or name format.
Note: The MAC and VLAN values are set using the configuration files only; other tools, such as vconfig (for VLAN modification) or ifconfig (for MAC modification), are not supported.
4.5.2.2 EoIB Network Administered vNic
In network administered mode, the configuration of the vNic is done by the BridgeX. If a vNic is configured for a specific host, it will appear on that host once a connection is established between the BridgeX and the mlx4_vnic module. This connection between the mlx4_vnic modules and all available BridgeX boxes is established automatically when the mlx4_vnic module is loaded. If the BridgeX is configured to remove the vNic, or if the connection between the host and BridgeX is lost, the vNic interface will disappear (running ifconfig will not display the interface). Similar to host administered vNics, a network administered vNic resides on a specific vHub. For further information on how to configure a network administered vNic, please refer to the BridgeX documentation. To disable network administered vNics on the host side, load the mlx4_vnic module with the net_admin module parameter set to 0.
4.5.2.3 VLAN Configuration
A vNic instance is associated wi
110. based applications. SDP can be used by applications and improve their performance transparently (that is, without any recompilation). Since SDP has the same socket semantics as TCP, an existing application is able to run using SDP; the difference is that the application's TCP socket gets replaced with an SDP socket. It is also possible to configure the driver to automatically translate TCP to SDP based on the source IP/port, the destination, or the application name. See Section 4.3.5.
The SDP protocol is composed of a kernel module that implements SDP as a new address-family/protocol-family, and a library (see Section 4.3.2) that is used for replacing the TCP address family with SDP according to a policy. This chapter includes the following sections:
• Section 4.3.2, "libsdp.so Library", on page 44
• Section 4.3.3, "Configuring SDP", on page 44
• Section 4.3.4, "Environment Variables", on page 47
• Section 4.3.5, "Converting Socket-based Applications", on page 47
• Section 4.3.6, "BZCopy - Zero Copy Send", on page 53
• Section 4.3.7, "Using RDMA for Small Buffers", on page 53
4.3.2 libsdp.so Library
libsdp.so is a dynamically linked library, which is used for transparent integration of applications with SDP. The library is preloaded, and therefore takes precedence over glibc for certain socket calls. Thus, it can transparently replace the T
111. be chosen for routing.
9.6 Quality of Service Management in OpenSM
9.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the -Q or --qos flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows:
• The request is matched against the defined matching rules such that the QoS Level definition is found.
• Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.
Figure 3: QoS Manager
There are two ways to define QoS policy:
• Advanced: the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to enforce various QoS constraints on the requested PR/MPR
• Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs
9.6.2 Advanced QoS Policy File
The QoS policy file h
112. be supported by both the BridgeX and by the host side.
Note: When enabling ALL VLAN, all gateways (LAG or legacy) that have eports belonging to a gateway group (GWG) must be configured to the same behavior. For example, it is impossible to have gateway A2 configured to all-vlan mode and A3 to regular mode, because both belong to GWG A.
Note: A gateway that is configured to work in ALL VLAN mode cannot accept login requests from:
• vNics that do not support this mode
• host admin vNics that were not configured to work in ALL VLAN mode by setting the vlan-id value to 'all', as described in Section "Creating vNICs that Support ALL VLAN Mode" on page 71
Creating vNICs that Support ALL VLAN Mode
VLANs are created on a vNIC that supports ALL VLAN mode using vconfig.
• net-admin vNics: The net-admin vNic supports ALL VLAN mode once it is created on a gateway configured with ALL VLAN mode.
• host-admin vNics: To create an ALL VLAN vNic, set the VLAN's ID to 'all'. A gateway that is configured to work in ALL VLAN mode can only accept login requests from hosts that are also working in a VLAN mode, e.g., the VLAN ID must be set to 'all'. This is an example of how to create an ALL VLAN vNic using the mlx4_vnic.conf file:
name=eth44 mac=00:25:8B:27:14:78 ib_port=mlx4_0:1 vid=all vnic_id=5 bx=00:00:00:00:00:00:04:B2 eport=A10
To create an ALL VLAN vNic using a specific configuration file, add the following line to the
113. by -A or --ucast_cache options. When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require a new routing calculation, e.g., when one or more CAs/RTRs/leaf switches go down, or one or more of these nodes come back after being down. A very common case that is handled by the unicast routing cache is a host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
OpenSM also supports a file method which can load routes from a table; see Modular Routing Engine below.
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would perform up after a down step was used.
2. Once MinHop matrices exist, each switch is visited, and for each target LID a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port h
114. by destination group and service id
    qos-match-rule
        use: Storage targets
        destination: Storage
        service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
        qos-level-name: WholeSet
    end-qos-match-rule
    qos-match-rule
        source: Storage
        use: match by source group only
        qos-level-name: DEFAULT
    end-qos-match-rule
    qos-match-rule
        use: match by all parameters
        qos-class: 7-9,11
        source: Virtual Servers
        destination: Storage
        service-id: 0x0000000000010000-0x000000000001FFFF
        pkey: 0x0F00-0x0FFF
        qos-level-name: WholeSet
    end-qos-match-rule
end-qos-match-rules
9.6.6 Simple QoS Policy - Details and Examples
Simple QoS policy match rules are tailored for matching ULPs' (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query. Match rules include:
• Default match rule that is applied to a PR/MPR query that did not match any of the other match rules
• SDP
• SDP application with a specific target TCP/IP port range
• SRP with a specific target IB port GUID
• RDS
• IPoIB with a default PKey
• IPoIB with a specific PKey
• Any ULP/application with a specific Service ID in the PR/MPR query
• Any ULP/application with a specific PKey in the PR/MPR query
• Any ULP/application with a specific target IB port GUID in the PR/MPR query
Since any s
115. cation
Table 3: Glossary (Sheet 1 of 2)
Channel Adapter (CA), Host Channel Adapter (HCA): An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).
HCA Card: A network adapter card based on an InfiniBand channel adapter device.
IB Devices: Integrated circuit implementing InfiniBand compliant communication.
IB Cluster/Fabric/Subnet: A set of IB devices connected by IB cables.
In-Band: A term assigned to administration activities traversing the IB connectivity only.
LID: An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
Local Device/Node/System: The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch providing the list of ports to forward received multicast packets. The table is organized by MLID.
Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
Standby Subnet Manager: A Subnet Manager
116. ce
• Querying the firmware version loaded on an HCA board
• Displaying the VPD (Vital Product Data) of an HCA board
flint: This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file.
spark: This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
Debug utilities: A set of debug utilities (e.g., itrace, mstdump, isw, and i2c).
For additional details, please refer to the MFT User's Manual (docs/).
1.5 Quality of Service
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 9, "OpenSM Subnet Manager".
2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed. The chapt
117. ct root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.
Note: The user can override the node list manually.
Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and GUID values.
At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Up/Down routing does not allow LID routing communication between switches that are located insid
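The ranking stage and the "no up after down" rule that it supports can be sketched as follows. This is an illustrative model rather than OpenSM's code: the adjacency-dictionary representation and function names are hypothetical, and comparing node identifiers to decide direction between equally ranked switches is an assumption standing in for the GUID-based rule mentioned in the text.

```python
from collections import deque


def rank_switches(adjacency: dict, roots: list) -> dict:
    """Assign rank 0 to root switches, then rank the rest incrementally by BFS."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def goes_up(rank: dict, src: str, dst: str) -> bool:
    """A hop is 'up' when it moves toward a lower rank (closer to a root)."""
    if rank[dst] != rank[src]:
        return rank[dst] < rank[src]
    return dst < src  # assumption: equal ranks are ordered by identifier


def path_is_legal(rank: dict, path: list) -> bool:
    """Reject any path that performs an up step after a down step."""
    gone_down = False
    for src, dst in zip(path, path[1:]):
        if goes_up(rank, src, dst):
            if gone_down:
                return False
        else:
            gone_down = True
    return True


if __name__ == "__main__":
    fabric = {"r": ["a", "b"], "a": ["r", "l"], "b": ["r", "l"], "l": ["a", "b"]}
    ranks = rank_switches(fabric, ["r"])
    print(ranks, path_is_legal(ranks, ["l", "a", "r", "b"]))
```

Because every legal path climbs first and descends afterwards, no cycle of dependencies can form, which is the loop-free property the ranking is meant to enforce.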
118. d
Flag | Optional / Mandatory | Description
-b | Optional | Print in brief mode. Reduce the output to show only if errors are present, not what they are
-v(erbose) | Optional | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-G(uid) | Optional | Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-T <threshold_file> | Optional | Use specified threshold file
-s | Optional | Show the predefined thresholds
-N, -nocolor | Optional | Use mono mode rather than color mode
-C <ca_name> | Optional | Use the specified channel adapter or router
-P <ca_port> | Optional | Use the specified port
-t <timeout_ms> | Optional | Override the default timeout for the solicited MADs (msec)
<lid|guid> | Mandatory with -G flag | Use the specified port's or node's LID/GUID (with -G option)
<port> | Mandatory without -G flag | Use the specified port
Examples
1. Check aggregated node counter for LID 0x2:
> ibcheckerrs 2
warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
warn: counter LinkRecovers = 255 (threshold 10) lid 2 port 255
warn: counter LinkDowned = 12 (threshold 10) lid 2 port 255
warn: counter RcvErrors = 565 (threshold 10) lid 2 port 255
warn: counter XmtDiscards = 441 (threshold 100) lid 2 port 255
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all
2. Check port counters for LID 2 Port 1:
> ibcheckerrs -v 2 1
Error check on
119. d for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e., not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it. In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch.
For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requir
120. ddress: LID 0x0000, QPN 0x08004f, PSN 0xc9d800, GID fe80::202:c900:708:e811
remote address: LID 0x0000, QPN 0x04004f, PSN 0xbdde2c, GID fe80::202:c900:708:e799
8192000 bytes in 0.01 seconds = 4844.83 Mbit/sec
1000 iters in 0.01 seconds = 13.53 usec/iter
Defining Ethernet Priority (PCP in 802.1q Headers)
On Server:
ibv_rc_pingpong -g 1 -i 2 -l 4
local address: LID 0x0000, QPN 0x1c004f, PSN 0x9daf6c, GID fe80::202:c900:708:e799
remote address: LID 0x0000, QPN 0x1c004f, PSN 0xb0a49b, GID fe80::202:c900:708:e811
8192000 bytes in 0.01 seconds = 4840.89 Mbit/sec
1000 iters in 0.01 seconds = 13.54 usec/iter
On Client:
ibv_rc_pingpong -g 1 -i 2 -l 4 sw419
local address: LID 0x0000, QPN 0x1c004f, PSN 0xb0a49b, GID fe80::202:c900:708:e811
remote address: LID 0x0000, QPN 0x1c004f, PSN 0x9daf6c, GID fe80::202:c900:708:e799
8192000 bytes in 0.01 seconds = 4855.96 Mbit/sec
1000 iters in 0.01 seconds = 13.50 usec/iter
Using rdma_cm Tests
On Server:
ucmatose
cmatose: starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose: disconnecting
disconnected
test complete
return status 0
On Client:
ucmatose -s 20.4.3.219
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
This server-client run is witho
121. dure" on page 25.
Usage
/mnt/mlnxofedinstall [OPTIONS]
Options
-c|--config <packages config_file>    Example of the configuration file can be found under docs
2.3.2.1 mlnxofedinstall Return Codes
Table 1 lists the mlnxofedinstall script return codes and their meanings.
Table 1: mlnxofedinstall Return Codes
Return Code | Meaning
0 | The installation ended successfully
1 | The installation failed
2 | No firmware was found for the adapter device
22 | Invalid parameter
28 | Not enough free space
171 | Not applicable to this system configuration. This can occur when the required hardware is not present on the system
172 | Prerequisites are not met. For example, the required software is not installed or the hardware is not configured correctly
173 | Failed to start the mst driver
2.3.3 Installation Procedure
Step 1. Log in to the installation machine as root.
Step 2. Mount the ISO image on your machine:
host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt
Step 3. Run the installation script mlnxofedinstall. This program will install the MLNX_OFED_LINUX package on your machine. Note that all other Mella
122. e For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com > Products > InfiniBand/VPI SW/Drivers). The binary code is exported by the device as an expansion ROM image.
A.1.1 Supported Mellanox Adapter Devices and Firmware
The package supports all ConnectX, ConnectX-2, and ConnectX-3 network adapter devices and cards. Specifically, adapter products responding to the following PCI Device IDs are supported:
ConnectX / ConnectX-2 / ConnectX-3 devices:
• Decimal 25408 (Hexadecimal: 6340)
• Decimal 25418 (Hexadecimal: 634a)
• Decimal 25448 (Hexadecimal: 6368)
• Decimal 26418 (Hexadecimal: 6732)
• Decimal 26428 (Hexadecimal: 673c)
• Decimal 26438 (Hexadecimal: 6746)
• Decimal 26448 (Hexadecimal: 6750)
• Decimal 25458 (Hexadecimal: 6372)
• Decimal 26458 (Hexadecimal: 675a)
• Decimal 26468 (Hexadecimal: 6764)
• Decimal 26478 (Hexadecimal: 676e)
• Decimal 4099 (Hexadecimal: 1003)
A.1.2 Tested Platforms
See the Mellanox FlexBoot Release Notes (FlexBoot_release_notes.txt).
A.1.3 FlexBoot in Mellanox OFED
The FlexBoot package is provided as a tarball (.tgz extension) containing the files specified in Appendix A.1.1, "Supported Mellanox Adapter Devices and Firmware" (page 190)
123. e <address-family> <role> <program name> <address>:<port range>
where
Note that rules are evaluated in the order of definition, so the first match wins. If no match is made, libsdp will default to "both".
Examples:
• Use SDP by clients connecting to machines that belong to subnet 192.168.1.*:
    use sdp connect * 192.168.1.0/24:*
• Use SDP by ttcp when it connects to port 5001 of any machine:
    use sdp listen ttcp *:5001
• Use TCP for any program with name starting with ttcp* serving ports 22 to 25:
    use tcp server ttcp* *:22-25
• Listen on both TCP and SDP by any server that listens on port 8080:
    use both server * *:8080
• Connect ssh through SDP and fall back to TCP to hosts on 11.4.8.* port 22:
    use both connect * 11.4.8.0/24:22
Explicit (Non-transparent) Conversion
Use explicit conversion if you need to maintain full control from your application while using SDP. To configure an explicit conversion to use SDP, simply recompile the application, replacing PF_INET (or AF_INET) with PF_INET_SDP (or AF_INET_SDP) when calling the socket() system call in the source code. The value of AF_INET_SDP is defined in the file sdp_socket.h, or you can define it inline:
    #define AF_INET_SDP 27
    #define PF_INET_SDP AF_INET_SDP
You can compile and execute the
124. e AR Manager, as it may get timeouts on the AR-related queries to these switches.
9.8.2 Installing the Adaptive Routing
Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of the Mellanox OFED installation.
9.8.3 Running Subnet Manager with Adaptive Routing Manager
Adaptive Routing (AR) Manager can be enabled or disabled through the SM options file.
9.8.3.1 Enabling Adaptive Routing
To enable Adaptive Routing, perform the following:
1. Create the Subnet Manager options file. Run:
    opensm -c <options-file-name>
2. Add 'armgr' to the 'event_plugin_name' option in the file:
    # Event plugin name(s)
    event_plugin_name armgr
3. Run Subnet Manager with the new options file:
    opensm -F <options-file-name>
Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf.
To provide an alternative location, perform the following:
1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the 'event_plugin_options' option in the file:
    # Options string that would be passed to the plugin(s)
    event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
2. Run Sub
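Editing the Subnet Manager options file by hand is easy to get wrong when the option is already present. The sketch below shows one way to set a "key value" line from a script. It is an illustration only: the function name set_sm_option is hypothetical, and it assumes that each option occupies a single line starting with the option name and that the position of the line within the file does not matter.

```shell
#!/bin/bash
# Hypothetical helper: set "key value" in an OpenSM-style options file.
# Any existing line for the key is dropped and the new line is appended,
# which assumes that option order in the file is irrelevant.
set_sm_option() {
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp) || return 1
    # Keep every line that does not start with the key followed by whitespace.
    grep -v -E "^${key}[[:space:]]" "$file" > "$tmp" || true
    printf '%s %s\n' "$key" "$value" >> "$tmp"
    # Write back through the original path so file permissions are preserved.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Possible use for the Adaptive Routing steps:
#   opensm -c <options-file-name>
#   set_sm_option <options-file-name> event_plugin_name armgr
#   opensm -F <options-file-name>
```

Matching on the key followed by whitespace keeps event_plugin_name from clobbering the neighbouring event_plugin_options line.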
125. e Subnet Manager. If sensing the port protocol fails, the port will be configured as an InfiniBand port.
For ConnectX:
    Mellanox ConnectX FlexBoot v3.3.400
    iPXE 1.0.0+ Open Source Network Boot Firmware
    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
    [Link:down, TX:0 TXE:0 RX:0 RXE:0]
    [Link status: The socket is not connected]
    Waiting for link-up on net0... ok
After configuring the IB/ETH port, the client attempts to connect to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.
For ConnectX (InfiniBand):
    Mellanox ConnectX FlexBoot v3.3.400
    iPXE 1.0.0+ Open Source Network Boot Firmware
    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
    [Link:down, TX:0 TXE:0 RX:0 RXE:0]
    [Link status: The socket is not connected]
    Waiting for link-up on net0... ok
    DHCP (net0 02:02:c9:0c:78:11)... ok
    net0: 11.3.12.2/255.255.255.0
    Next server: 11.3.12.121
    Filename: pxelinux.0
    Root path: /tftpboot/
    tftp://11.3.12.121/pxelinux.0... ok
Next, FlexBoot attempts to boot as directed by the DHCP server.
A.8 Command Line Interface (CLI)
A.8.1 Invoking the CLI
When the boot process begins, the computer starts its Power On Self Test (POST) sequence. Shortly after completion of the POST, the user will be prompted to press CTRL-B to invoke the Mellanox FlexBoot CLI. The user has a few seconds to press CTRL-B before
126. e just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).
9.5.2 Min Hop Algorithm
The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'.
The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
    -i <equalize-ignore-guids-file>
    -ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by guids) that will be ignored by the link load equalization algorithm.
LMC awareness routes based on a (remote) system or switch basis.
9.5.3 UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high pressure).
The UPDN algorithm is based on the following main stages:
1. Auto-dete
127. e parameters include:
• tx_rings_num: Number of TX rings, use 0 for #cpus [default 0, max 32]
• tx_rings_len: Length of TX rings, must be a power of two [default 1024, max 8K]
• rx_rings_num: Number of RX rings, use 0 for #cpus [default 0, max 32]
• rx_rings_len: Length of RX rings, must be a power of two [default 2048, max 8K]
• vnic_net_admin: Network administration enabled [default 1]
• lro_num: Number of LRO sessions per ring, use 0 to disable LRO [default 32, max 32]
• eport_state_enforce: Bring vNic up only when the corresponding External Port is up [default 0]
• discovery_pkeys: Vector of up to 24 PKEYs to be used for discovery [default 0xFFFF] (array of int)
For a list and description of all module parameters, run:
    mlx4_vnic_info -I
To check the current module parameters, run:
    mlx4_vnic_info -P
The default number of RX/TX rings is the number of logical CPUs (threads). To set non-default values for module parameters, add the following line to the modprobe configuration file (e.g. the /etc/modprobe.conf file):
    options mlx4_vnic <param_name>=<value> <param_name>=<value>
For additional information about discovery_pkeys, please refer to Section 4.5.3.6, "Discovery Partitions Configuration," on page 70.
4.5.4.2 vNic Interface Naming
The mlx4_vnic driver enables the kernel to determine the name of the registered vNic. By defau
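Because the ring lengths must be a power of two and are capped at 8K, it can help to validate values before writing them into the modprobe configuration file. The following is a minimal sketch under stated assumptions: the function names valid_ring_len and vnic_options_line are hypothetical, and "8K" is taken to mean 8192 entries.

```shell
#!/bin/bash
# Hypothetical check: a ring length is accepted if it is a power of two
# and does not exceed 8192 (assuming "max 8K" means 8192).
valid_ring_len() {
    local n=$1
    (( n > 0 && n <= 8192 && (n & (n - 1)) == 0 ))
}

# Print an "options" line for the modprobe configuration file
# only when both ring lengths pass the check.
vnic_options_line() {
    local tx_len=$1 rx_len=$2
    if valid_ring_len "$tx_len" && valid_ring_len "$rx_len"; then
        echo "options mlx4_vnic tx_rings_len=${tx_len} rx_rings_len=${rx_len}"
    else
        echo "invalid ring length" >&2
        return 1
    fi
}
```

The expression n & (n - 1) clears the lowest set bit of n, so it is zero only when n has exactly one bit set, which is the power-of-two condition.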
128. e spine switch systems. The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.
9.5.3.1 UPDN Algorithm Usage
Activation through OpenSM:
• Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
• Use '-a <root_guid_file>' to add a UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.
Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
9.5.4 Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling non-constant K, cases where not all leafs (CAs) are present, and any Constant Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks.
If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with t
129. e that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was updated to support AR, the AR Manager will need to be restarted (by restarting Subnet Manager) to allow it to configure AR on this switch. This option can be changed on the fly. Default: true
AR_MODE <bounded|free>
    Adaptive Routing Mode:
    free: no constraints on output port selection.
    bounded: the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
    This option can be changed on the fly. Default: bounded
AGEING_TIME <usec>
    Applicable to bounded AR mode only. Specifies how long there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value). This option can be changed on the fly. Default: 30
MAX_ERRORS <N>, ERROR_WINDOW <N>
    When the number of errors exceeds MAX_ERRORS of send/receive errors or timeouts in less than ERROR_WINDOW seconds, the AR Manager will abort, returning control back to the Subnet Manager. This option can be changed on the fly.
    Values for both options: [0-0xffff]
    MAX_ERRORS = 0: zero tolerance, abort configuration on first error. Default: 10
    ERROR_WINDOW = 0: mechanism disabled, no error checking. Default: 5
LOG FI
130. e the specified channel adapter or router
    -P <ca_port>               Optional    Use the specified port
    -t <timeout_ms>            Optional    Override the default timeout for the solicited MADs [msec]
    <dest dr_path|lid|guid>    Optional    Destination's directed path, LID, or GUID
    <startlid>                 Optional    Starting LID in an MLID range
    <endlid>                   Optional    Ending LID in an MLID range
Examples:
1. Dump all Lids with valid out ports of the switch with Lid 2.
    > ibroute 2
    Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
      Lid  Out   Destination
           Port     Info
    0x0002 000 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008 008 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
    5 valid lids dumped
2. Dump all Lids with valid out ports of the switch with Lid 2.
    > ibroute 2
    Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
      Lid  Out   Destination
           Port     Info
    0x00
131. e two ports:
    cat /sys/class/infiniband/mlx4_0/ports/1/link_layer
    InfiniBand
    cat /sys/class/infiniband/mlx4_0/ports/2/link_layer
    Ethernet
3. The firmware version is 2.7.700 (appears at the top). You can also run the following command to obtain the firmware version:
    cat /sys/class/infiniband/mlx4_0/fw_ver
    2.7.700
4. The IB over Ethernet port MTU is 2K bytes at maximum; however, the actual MTU cannot exceed the mlx4_en interface's MTU. Since the mlx4_en interface's MTU is 1560, port 2 will run with an MTU of 1K. Please note that RoCE MTUs are subject to IB MTU restrictions. The RoCE MTU values are 256 bytes, 512 bytes, 1024 bytes and 2K.
Association of IB Ports to Ethernet Ports
It is useful to know how IB ports associate to network ports:
    ibdev2netdev
    mlx4_0 port 2 <===> eth2
    mlx4_0 port 1 <===> ib0
Since both RoCE and mlx4_en use the Ethernet port of the adapter, one of the drivers must carry the task of controlling the port state. In this implementation, it is the task of the mlx4_en driver. The mlx4_ib driver holds a reference to the mlx4_en net device for getting notifications about the state of the port, as well as using the mlx4_en driver to resolve IP addresses to MAC addresses that are required for address vector creation. However, RoCE traffic does not go through the mlx4_en driver; it is completely offloaded by the hardware.
Configure an IP Address to the mlx4_en Interface
Run the following on b
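The MTU rule in item 4 (the port runs with the largest IB MTU value that does not exceed the mlx4_en interface's MTU) can be expressed as a short selection loop. This is a simplified sketch, not the driver's actual logic: the function name roce_mtu_for is hypothetical, and the comparison ignores any header overhead the driver may account for.

```shell
#!/bin/bash
# Hypothetical helper: pick the largest RoCE MTU (256, 512, 1024 or 2048 bytes)
# that does not exceed the MTU of the mlx4_en interface.
roce_mtu_for() {
    local netdev_mtu=$1 mtu best=""
    for mtu in 256 512 1024 2048; do
        if (( mtu <= netdev_mtu )); then
            best=$mtu
        fi
    done
    # Fail (non-zero status, no output) when even 256 bytes does not fit.
    [ -n "$best" ] && echo "$best"
}

# roce_mtu_for 1560 selects 1024, matching the "MTU of 1K" case described above.
```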
132. ection of the policy file is optional, as long as basic rules of the file are kept (such as no referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file. The shortest policy file in this case would be as follows:
    qos-ulps
        default : 0 # default SL
    end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all.
Below is an example of a simple QoS policy with all the possible keywords:
    qos-ulps
        default                              : 0 # default SL
        sdp, port-num 30000                  : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
        sdp, port-num 10000-20000            : 0
        sdp                                  : 1 # default SL for any other application running on top of SDP
        rds                                  : 2 # SL for RDS traffic
        ipoib, pkey 0x0001                   : 0 # SL for IPoIB on partition with pkey 0x0001
        ipoib                                : 4 # default IPoIB partition, pkey=0x7FFF
        any, service-id 0x6234               : 6 # match any PR/MPR query with a specific Service ID
        any, pkey 0x0ABC                     : 6 # match any PR/MPR query with a specific PKey
        srp, target-port-guid 0x1234         : 5 # SRP when SRP Target is located on a specified IB port GUID
        any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with a specific target port GUID
    end-qos-ulps
Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such as the first match takes precedence, except
133. efer to the IB spec.
9.9.4 Configuring Congestion Control Manager Main Settings
To fine-tune the CC mechanism and CC Manager behavior, and to set the CC Manager main settings, perform the following:
• To enable or disable the Congestion Control mechanism on the fabric nodes, set the following parameter:
    enable
    • The values are: <TRUE | FALSE>.
    • The default is: true
• CC Manager configures the CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive the CC mechanism is in its response to congestion. To manually modify CC Manager behavior by providing it with an arbitrary fabric size, set the following parameter:
    num_hosts
    • The values are: [0-48K].
    • The default is: 0 (base on the CCT calculation on the current subnet size)
• The smaller the value of the parameter, the faster HCAs will respond to the congestion and throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter:
    marking_rate
    • The values are: [0-0xffff].
    • The default is: 0xa
• You can set the minimal packet size that can be marked with FECN. Any packet smaller than this size [bytes] will not be marked with FECN. To do so, set the following parameter:
    packet_size
    • The values are: [0-0x3fc0].
    • The default is: 0x200
• When the number of errors exceeds 'max_errors' of send/receive er
134. ellanox Technologies
350 Oakmead Parkway
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
Beit Mellanox
PO Box 586 Yokneam 20692
Israel
www.mellanox.com
Tel: +972 (0)4 909 7200; +972 (0)74 723 7200
Fax: +972 (0)4 959 3245

Copyright 2012. Mellanox Technologies. All rights reserved.
Mellanox, Mellanox Logo, BridgeX, ConnectX, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, PhyX, SwitchX, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd.
Connect-IB, FabricIT, MLNX-OS, ScalableHPC, Unbreakable-Link, UFM and Unified Fabric Manager are trademarks of Mellanox Technologies, Ltd.
All other trademarks are property of their respective owners.
Document Number: 2877

Table of Contents
Chapter 1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
1.2 Introduction to Mellanox VPI Adapters
1.3 Mellanox OFED Package
1.3.1 ISO Image
1.3.2 Software Components
1.3.3 Firmware
135. er:
    cat /sys/class/infiniband/<device>/ports/1/counters/symbol_error
The command above is performed on Port 1 of the device <device>. The output value should be 0 if no symbol errors were recorded.
3. Bandwidth is expected to vary between systems. It heavily depends on the chipset, memory, and CPU. Nevertheless, the full-wire speed should be achieved by the host.
• With IB SDR, the expected unidirectional full-wire speed bandwidth is ~900MB/sec.
• With IB DDR and PCI Express Gen 1, the expected unidirectional full-wire speed bandwidth is ~1400MB/sec.
• With IB DDR and PCI Express Gen 2, the expected unidirectional full-wire speed bandwidth is ~1800MB/sec.
• With IB QDR and PCI Express Gen 2, the expected unidirectional full-wire speed bandwidth is ~3000MB/sec.
• With IB FDR and PCI Express Gen 3, the expected unidirectional full-wire speed bandwidth is ~6000MB/sec.
To check the adapter's maximum bandwidth, use the ib_write_bw utility.
To check the adapter's latency, use the ib_write_lat utility.
The utilities ib_write_bw and ib_write_lat are installed as part of Mellanox OFED.
6.3.3 System Performance Troubleshooting
On some systems, it is recommended to change the power-saving configuration in order to achieve better performance. This configuration is usually handled by the BIOS. Please contact the system vendor for more information.
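When comparing a measured result against the expected figures listed above, a small lookup keeps the numbers in one place. The sketch below is illustrative only: the function name expected_bw_mbps is hypothetical, the values are the approximate MB/sec figures from the list, and the SDR entry ignores the PCI Express generation because the text does not state one.

```shell
#!/bin/bash
# Hypothetical lookup of the expected unidirectional full-wire speed
# bandwidth in MB/sec, keyed by IB link rate and PCI Express generation.
expected_bw_mbps() {
    case "$1:$2" in
        SDR:*) echo 900 ;;   # no PCIe generation is stated for SDR
        DDR:1) echo 1400 ;;
        DDR:2) echo 1800 ;;
        QDR:2) echo 3000 ;;
        FDR:3) echo 6000 ;;
        *)     return 1 ;;   # combination not covered by the list
    esac
}

# Possible use: compare a bandwidth measured with ib_write_bw against
#   expected_bw_mbps QDR 2
```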
136. er includes the following sections:
• Section 2.1, "Hardware and Software Requirements," on page 20
• Section 2.2, "Downloading Mellanox OFED," on page 21
• Section 2.3, "Installing Mellanox OFED," on page 21
• Section 2.5, "Uninstalling Mellanox OFED," on page 32
2.1 Hardware and Software Requirements
2.1.1 Hardware Requirements
Platforms
• A server platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
    • MT25408 ConnectX-2 (VPI, IB, EN) (firmware: fw-ConnectX2)
    • MT25408 ConnectX (VPI, IB, EN) (firmware: fw-25408)
    • ConnectX-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file.
Required Disk Space for Installation
• 500 MB
Device ID
For the latest list of device IDs, please visit the Mellanox website.
2.1.2 Software Requirements
Operating System
• Linux operating system
For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
Installer Privileges
• The installation requires administrator privileges on the target machine.
2.2 Downloading Mellanox OFED
Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHo
137. ered mode. In the host-administered mode, when a vHub with the requested VLAN tag is not available, the vNic's login request will be rejected.
• Host-administered VLAN configuration in the centralized configuration file can be modified as follows: add "vid=<VLAN tag>" or remove the vid property for no VLAN.
• Host-administered VLAN configuration with ifcfg-ethX configuration files can be modified as follows: add "VNICVLAN=<VLAN tag>" or remove the VNICVLAN property for no VLAN.
Using a VLAN tag value of 0 is not recommended because the traffic using it would not be separated from non-VLAN traffic.
For host-administered vNics, the VLAN entry must be set in the BridgeX first. For further information, please refer to the BridgeX documentation.
4.5.2.4 EoIB Multicast Configuration
Configuring multicast for EoIB interfaces is identical to multicast configuration for native Ethernet interfaces.
EoIB maps Ethernet multicast addresses to InfiniBand MGIDs (Multicast GID). It ensures that different vHubs use mutually exclusive MGIDs, thus preventing vNics on different vHubs from communicating with one another.
4.5.2.5 EoIB and Quality of Service
EoIB enables the use of InfiniBand service levels. The configuration of the SL is performed through the BridgeX and lets you set different data/control service level values per BridgeX box. For further information on the use of non-default service levels, please refer to the BridgeX documentat
138. ers.
The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.
Example 1: Working with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)
    a. modprobe scst
    b. modprobe scst_vdisk
    c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    f. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    g. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    h. echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
Example 2: Working with scst_vdisk FILEIO mode (using the md0 device and the file 10G-file)
    a. modprobe scst
    b. modprobe scst_vdisk
    c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
    d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
    e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
2. Run:
For all distributions except SLES 11:
    > modprobe ib_srpt
For SLES 11:
    > modprobe -f ib_srpt
For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:
    ib_srpt: no symbol version for scst_unregister
    ib_srpt: Unknown symbol scst_unregister
    ib_srpt: no symbol version for scst_register
139. [Figure: fabric with servers, administrator, manager, IB-Ethernet gateway, IB-Fibre Channel gateway, and block storage]
QoS over Mellanox OFED for Linux is discussed in Chapter 9, "OpenSM Subnet Manager".
The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources.
The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
• Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner.
• Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served.
• Packets carry a class of service marking in the range 0 to 15 in their header SL field.
• Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL).
• The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.
DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide
140. • VLArb_high: High priority VL Arbitration table (IBA 7.6.9) template
• SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15. (Note that VL15 used here means "drop this SL".)
There are separate QoS configuration parameter sets for various target types: CAs, routers, switch external ports, and switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:
• qos_ca_: QoS configuration parameters set for CAs
• qos_rtr_: parameters set for routers
• qos_sw0_: parameters set for switches' port 0
• qos_swe_: parameters set for switches' external ports
Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):
    qos_ca_max_vls 15
    qos_ca_high_limit 0
    qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 0
    qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
VL arbitration tables (both high
141. es that any credit loop caused by that turn must encircle the failed switch at T. Thus, the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop, because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T. Extending this argument shows that, in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches, on the condition that they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:
[Figure: 6x6 2D torus (x = 0 to 5, y = 0 to 5) with failed switches T and R]
Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.
As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent in a dimension that is not the last dimension routed by DOR; here the failed switches are O and T
142. f the following two options:
1. On the command line, specify the system name using the option '-s <local system name>'.
2. Define the environment variable IBDIAG_SYS_NAME.
10.2.2 IB Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option '-p <local port number>' (see below).
2. Define the environment variable IBDIAG_PORT_NUM.
In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: '-i <index of local device>'.
2. Define the environment variable IBDIAG_DEV_IDX.
10.2.3 Addressing
This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports:
• Using a Directed Route to the destination (tool option '-d'): This option defines a directed route of output port numbers from the local port to the destination.
• Using port LIDs (tool option '-l'): In this mode, the source and destination ports are defined by means of their
143.
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
    LinkSpeedActive:.................5.0 Gbps
Now change the enabled link speed:
    > ibportstate -C mlx4_0 -D 0 1 speed 2
    Initial PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    After PortInfo set:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkSpeedEnabled:................5.0 Gbps (IBA extension)
Show the new configuration:
    > ibportstate -C mlx4_0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................5.0 Gbps (IBA extension)
    LinkSpeedActive:.................5.0 Gbps
10.11 ibroute
Applicable Hardware
InfiniBand switches.
Description
Uses SMPs to display the forwarding tables (unicast: LinearForwardingTable or LFT, or multicast: MulticastForwardingTable or MFT) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.
Synopsis
    ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s <smlid>] [-C <ca
144. following very simple TCP application, which has been converted explicitly to SDP.
Compilation:
    gcc sdp_server.c -o sdp_server
    gcc sdp_client.c -o sdp_client
Usage:
    Server: host1# ./sdp_server
    Client: host1# ./sdp_client <server IP addr>
Example:
Server:
    host1# ./sdp_server
    accepted connection from 15.2.2.42:48710
    read 2048 bytes
    end of test
    host1#
Client:
        printf("accepted connection from %s:%u\n",
               inet_ntoa(client_addr.sin_addr),
               ntohs(client_addr.sin_port));
        nr = read(cd, ...);
        if (nr < 0) {
            perror("read() failed");
            exit(EXIT_FAILURE);
        } else if (nr == 0) {
            printf("socket was closed by remote host\n");
        }
        printf("read %zd bytes\n", nr);
        printf("end of test\n");
        close(cd);
        close(sd);
        return 0;
    }
4.3.6 BZCopy - Zero Copy Send
BZCOPY mode is only effective for large block transfers. By setting the /sys parameter 'sdp_zcopy_thresh' to a non-zero value, a non-standard SDP speedup is enabled. Messages longer than 'sdp_zcopy_thresh' bytes in length cause the user space buffer to be pinned and the data to be sent directly from the original buffer. This results i
145. OPTIONS
    -i|--device <dev-name>          Specify the name of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).
    -p|--port <port-num>            Specify the local device's port number used to connect to the IB fabric.
    -pm                             Dump all pmCounters values into ibdiagnet.pm.
    -pc                             Reset all the fabric links' pmCounters.
    -P|--counter <<PM>=<Value>>     Print any provided pm that is greater than its provided value.
    -r|--routing                    Provide a report of the fabric qualities.
    -u|--fat_tree                   Indicate that UpDown credit loop checking should be done against automatically determined roots.
    -lw <1x|4x|8x|12x>              Specify the expected link width.
    -ls <2.5|5|10>                  Specify the expected link speed.
    -skip <ibdiag check>            Skip the execution of the given stage. Applicable to the following stages: dup_guids | lids | links | sm | nodes_info | all (default = None).
    -o|--output_path <out-dir>      Specify the directory where the output files will be placed (default = /var/tmp/ibdiagnet2).
    --screen_num_errs <num>         Specify the threshold for printing errors to screen (default = 5).
    -h|--help                       Print this help message.
    -V|--version                    Print the version of the tool.
10.3.2 Output Files
Table 13 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.
Table 13: ibdiagnet (of ibutils2) Output Files
    Output File    Description
    ibdiagnet2.lst
146. g networking, storage, clustering, virtualization and RDMA with enhanced quality of service.
• RDMA over Converged Ethernet (RoCE)
The following ULPs can be used over RoCE: uDAPL, SDP, RDS, MPI.
1.3 Mellanox OFED Package
1.3.1 ISO Image
Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images, one per supported Linux distribution and CPU architecture, that include source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that performs the necessary steps to accomplish the following:
• Discover the currently installed kernel
• Uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor's commercial stack
• Install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
• Identify the currently installed InfiniBand HCAs and perform the required firmware updates
1.3.2 Software Components
MLNX_OFED_LINUX contains the following software components:
• Mellanox Host Channel Adapter Drivers
    • mthca (IB only)
    • mlx4 (VPI), which is split into multiple modules:
        mlx4_core (low-level helper)
        mlx4_ib (IB)
        mlx4_en (Ethernet)
        mlx4_vnic (EoIB)
• Mid-layer core
    • Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
    • IPoIB, RDS, SDP, SRP Initiator, iSER
• MPI O
147. g command unless working in ALL VLAN mode. To create interfaces with VLAN, refer to Section "Configuring VLANs" on page 66.

4.5.3 Retrieving EoIB Information

4.5.3.1 mlx4_vnic_info
To retrieve information regarding EoIB interfaces, use the script mlx4_vnic_info. This script provides detailed information about a specific vNic or all EoIB vNic interfaces, such as BX info, IOA info, SL, PKEY, link state, and interface features. If network-administered vNics are enabled, this script can also be used to discover the available BridgeX boxes from the host side.
• To discover the available gateways, run: mlx4_vnic_info -g
• To receive the full vNic information of eth10, run: mlx4_vnic_info -i eth10
• To receive a shorter information report on eth10, run: mlx4_vnic_info -s eth10
• To get help and usage information, run: mlx4_vnic_info --help

4.5.3.2 ethtool
The ethtool application is another way to retrieve interface information and change its configuration. EoIB interfaces support ethtool in the same way as hardware Ethernet interfaces. The supported ethtool options include:
  -c, -C   Show and update interrupt coalesce options
  -g       Query RX/TX ring parameters
  -k, -K   Show and update protocol offloads
  -i       Show driver information
  -S       Show adapter statistics
For more information on ethtool, run: ethtool -h

4.5.3.3 Link State
148. hat all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details.

Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If "no_fallback" was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case, all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval, and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored.

In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath.

To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data collected via ibdiagnet -vlr
149. he following rules:
• Tree rank should be between two and eight (inclusive)
• Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they should not have UP-going ports at all
• Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches
• Switches of the same rank should have the same number of ports in each UP-going port group
• Switches of the same rank should have the same number of ports in each DOWN-going port group
• All the CAs have to be at the same tree level (rank)

If the root guid file is provided, the topology does not have to be a pure fat-tree; it only has to comply with the following rules:
• Tree rank should be between two and eight (inclusive)
• All the Compute Nodes have to be at the same tree level (rank). Note that non-compute-node CAs are allowed here to be at different tree ranks.

Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a pure fat-tree.

Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition, although the algorithm allows leaf switches to have any number of CAs
150. ib srpt Unknown symbol scst_register ib srpt no symbol version for scst_unregister target template ib srpt Unknown symbol scst_unregister target template B On Initiator Machines On Initiator machines manually perform the following steps 1 Run modprobe ib srp 2 Run ibsrpdm c d dev infiniband umadXx to discover a new SRP target umad0 port 1 of the first HCA umadl port 2 of the first HCA umad2 port 1 of the second HCA 3 echo new target info gt sys class infiniband_srp srp mthca0 1 add_target 4 fdisk 1 will show the newly discovered scsi disks Example Assume that you use port 1 of first HCA in the system i e mthca0 root lab104 ibsrpdm c d dev infiniband umad0 id_ext 0002c90200226cf4 ioc guid 0002c90200226cf4 dgid fe800000000000000002c90200226c 5 pkey ffff service id 0002c90200226c 4 root lab104 echo id_ext 0002c90200226cf 4 i0c guid 0002c90200226cf4 dgid fe800000000000000002c90200226c 5 pkey ffff service id 0002c90200226cf4 gt sys class infiniband srp srp mthca0 1 add_target OR e You can edit etc infiniband openib conf to load the SRP driver and SRP High Avail ability HA daemon automatically that is set SRP LOAD yes and SRPHA ENABLE yes To set up and use the HA feature you need the dm multipath driver and multipath tool e Please refer to OFED 1 x SRP s user manual for more detailed instructions on how to enable use the HA feature The fo
151. iguration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
• A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 4.4.2.4.

4.4.2.4 Automatic Discovery and Connection to Targets
• Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
• To connect to all the existing Targets in the fabric, run: srp_daemon -e -o. This utility will scan the fabric once, connect to every Target it detects, and then exit.
  Note: srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file.
• To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute: srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric).
• To execute SRP daemon as a daemon, you may run run_srp_daemon (found under /usr/sbin/), providing it with the same options used for running srp_daemon.
  Note: Make sure only one instance of run_srp_daemon runs per port.
• To execute SRP daemon as a daemon on all the ports, run srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_d
152. iguration file that determines the targets to connect to.

1. srp_daemon commands equivalent to ibsrpdm:
  "srp_daemon -a -o" is equivalent to "ibsrpdm"
  "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
  Note: These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.

2. srp_daemon extensions to ibsrpdm:
• To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num> (and to generate output suitable for 'echo'), you may execute:
  host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
  Note: To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'.
• To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
• Executing srp_daemon over a port without the -a option will only display the targets reachable via the port to which the initiator is not yet connected. If executing with the -e option, it is better to omit -a.
• It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. See Section 4.4.2.5 for more details.
• srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The conf
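The srp_daemon invocations described in excerpt 152 can be sketched as a small command builder. This is only an illustration under stated assumptions: the function name srp_daemon_cmd and the "discover"/"connect" mode words are hypothetical, and the "connect" variant simply applies the excerpt's advice (add -e and -n, omit -a). The script prints the command line rather than running it, so it is safe on a machine without an HCA.

```shell
#!/bin/sh
# Hypothetical helper: print an srp_daemon command line for one HCA port.
#   $1 = mode: "discover" (list targets) or "connect" (also log in to them)
#   $2 = InfiniBand HCA name (see: ls /sys/class/infiniband)
#   $3 = port number
srp_daemon_cmd() {
    case "$1" in
        # -c: output suitable for echo, -a: show all targets, -o: run once
        discover) opts="-c -a -o" ;;
        # add -e to connect and -n to include initiator_ext; -a is omitted
        connect)  opts="-c -e -o -n" ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    echo "srp_daemon $opts -i $2 -p $3"
}

# Dry run with a placeholder device name.
srp_daemon_cmd discover mthca0 1
srp_daemon_cmd connect mthca0 1
```

Printing the command first makes it easy to review the option set before running it as root on the initiator.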
153. ile /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf.

Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps.

8 MellanoX Messaging
MellanoX Messaging (MXM) provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
• Multiple transport support, including RC, XRC, and UD
• Proper management of HCA resources and memory structures
• Efficient memory registration
• One-sided communication semantics
• Connection management
• Receive-side tag matching
• Intra-node shared memory communication
These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.

8.1 Enabling MXM in OpenMPI
To enable MXM in OpenMPI, run the following:
  svn co http://svn.open-mpi.org/svn/ompi/branches/v1.5
  cd v1.5
  ./autogen.sh && ./configure --prefix=$PWD/install --enable-debug --with-mxm=$MXM_HOME/install --with-openib
  make all && make install

9 OpenSM Subne
154. indow max errors 0 zero tollerance seconds the CC MGR will abort and will allow abort configuration on first error OpenSM to proceed e error window 0 mechanism dis abled no error checking Default 5 ec statistics cycle Enables CC MGR to collect statistics from all nodes Default 0 every cc statistics cycle seconds When the value is set to 0 no statistics are collected Mellanox Technologies 155 Rev 1 5 3 3 1 0 InfiniBand Fabric Diagnostic Utilities 10 1 Overview The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand IB devices in a fabric The tools are e Section 10 3 ibdiagnet of ibutils2 IB Net Diagnostic on page 158 e Section 10 4 ibdiagnet of ibutils IB Net Diagnostic on page 160 e Section 10 5 ibdiagpath IB diagnostic path on page 163 e Section 10 6 ibv_devices on page 165 e Section 10 7 ibv_devinfo on page 165 e Section 10 8 ibdev2netdev on page 166 e Section 10 9 ibstatus on page 167 e Section 10 10 ibportstate on page 169 e Section 10 11 ibroute on page 172 e Section 10 12 smpquery on page 175 e Section 10 13 perfquery on page 178 e Section 10 14 ibcheckerrs on page 181 e Section 10 15 mstflint on page 183 e Section 10 16 ibv_asyncwatch on page 187 e Section 10 17 ibdump on page 187
155. ing
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:
  echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target
  See Section 4.4.2.3 for instructions on how the parameters in this echo command may be obtained.
  Notes:
  - Execution of the above "echo" command may take some time.
  - The SM must be running while the command executes.
  - It is possible to include additional parameters in the echo command:
      max_cmd_per_lun: default 63
      max_sect (short for max_sectors): sets the request size of a command
      io_class: default 0x100, as in rev 16A of the specification (in rev 10 the default was 0xff00)
      initiator_ext: please refer to Section 9 (Multiple Connections)
• To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
  - Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
  - Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.

4.4.2.3 SRP Tools: ibsrpdm and srp_daemon
To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and
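The add_target step in excerpt 155 amounts to assembling a comma-separated key=value string and writing it to a per-port sysfs file. The following is a minimal sketch of that assembly. The function names are hypothetical and the GUID/GID values are placeholders, not identifiers from a real fabric. The script only prints the resulting command, so nothing is written to sysfs.

```shell
#!/bin/sh
# Hypothetical helper: build the login string written to add_target.
#   $1=id_ext  $2=ioc_guid  $3=dgid  $4=service_id  $5=pkey (defaults to ffff)
srp_target_string() {
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s' \
        "$1" "$2" "$3" "${5:-ffff}" "$4"
}

# Hypothetical helper: add_target path for HCA number $1 and port number $2,
# following the srp-mthca<hca number>-<port number> pattern quoted in the text.
srp_add_target_path() {
    printf '/sys/class/infiniband_srp/srp-mthca%s-%s/add_target' "$1" "$2"
}

# Dry run with placeholder identifiers: print the command instead of running it.
target=$(srp_target_string 0002c90200226cf4 0002c90200226cf4 \
    fe800000000000000002c90200226cf5 0002c90200226cf4)
echo "echo -n $target > $(srp_add_target_path 0 1)"
```

On a real initiator the values would come from ibsrpdm -c or srp_daemon -c output rather than being typed by hand.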
156. ion

4.5.2.6 IP Configuration Based on DHCP
Setting an EoIB interface configuration based on DHCP (v3.1.2, which is available via www.isc.org) is performed similarly to the configuration of Ethernet interfaces. When setting the EoIB configuration files, verify that they include the following line:
• For RedHat: BOOTPROTO=dhcp
• For SLES: BOOTPROTO=dhcp
Note: If EoIB configuration files are included, ifcfg-eth<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine and under /etc/sysconfig/network/ on a SuSE machine.

DHCP Server
Using a DHCP server with EoIB does not require special configuration. The DHCP server can run on a server located on the Ethernet side (using any Ethernet hardware) or on a server located on the InfiniBand side and running the EoIB module.

4.5.2.7 Static EoIB Configuration
To configure a static EoIB, you can use an EoIB configuration that is not based on DHCP. Static configuration is similar to a typical Ethernet device configuration. For further information on how to configure IP addresses, please refer to your Linux distribution documentation.
Note: Ethernet configuration files are located at /etc/sysconfig/network-scripts/ on a RedHat machine and at /etc/sysconfig/network/ on a SuSE machine.

4.5.2.8 Sub Interfaces (VLAN)
EoIB interfaces do not support creating sub-interfaces via the vconfi
157. ion: No
  Advertised link modes: Not reported
  Advertised auto-negotiation: No
  Speed: Unknown (10000)
  Duplex: Full
  Port: Twisted Pair
  PHYAD: 0
  Transceiver: internal
  Auto-negotiation: off
  Supports Wake-on: d
  Wake-on: d
  Current message level: 0x00000000 (0)
  Link detected: yes

4.5.3.4 Bonding Driver
EoIB uses the standard Linux bonding driver. For more information on the Linux bonding driver, please refer to <kernel-source>/Documentation/networking/bonding.txt. Currently, only fail-over modes are supported by the EoIB driver; load-balancing modes, including static and dynamic (LACP) configurations, are not supported.

4.5.3.5 Jumbo Frames
EoIB supports jumbo frames up to the InfiniBand limit of 4K bytes. The default Maximum Transmit Unit (MTU) for the EoIB driver is 1500 bytes. To configure EoIB to work with jumbo frames:
1. Make sure that the IB HCA and switch hardware support 4K MTU.
2. Configure the Mellanox low-level driver to support 4K MTU: add the mlx4_core module parameter set_4k_mtu=1.
3. Change the MTU value of the vNic; for example, run: ifconfig eth2 mtu 4038
Note: Due to EoIB protocol overhead, the maximum MTU value that can be set for the vNic interface is 4038 bytes. If the vNic is configured to use VLANs, then the maximum MTU is 4034 bytes (due to VLAN header insertion).

4.5.3.6 Discovery Partitions Configuration
EoIB enables map
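The MTU limits in the jumbo frames note of excerpt 157 (4038 bytes, or 4034 with VLANs) can be checked before calling ifconfig. The sketch below is illustrative only: the function names are hypothetical, and the 4-byte figure is the size of an 802.1Q VLAN tag, which matches the difference between the two limits quoted in the note.

```shell
#!/bin/sh
# Limits taken from the note: 4038 bytes without VLANs, 4 bytes less with them.
EOIB_MAX_MTU=4038
VLAN_TAG_BYTES=4

# Hypothetical helper: print the largest vNic MTU for "vlan" or "novlan".
eoib_max_mtu() {
    if [ "$1" = "vlan" ]; then
        echo $(( EOIB_MAX_MTU - VLAN_TAG_BYTES ))
    else
        echo "$EOIB_MAX_MTU"
    fi
}

# Hypothetical helper: succeed only if MTU $1 fits the limit for mode $2.
eoib_mtu_ok() {
    [ "$1" -le "$(eoib_max_mtu "$2")" ]
}

# Dry run: print the ifconfig command only when the requested MTU is valid.
if eoib_mtu_ok 4034 vlan; then
    echo "ifconfig eth2 mtu 4034"
fi
```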
158. itional net 44 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 stat program Without the S option it shows all the information that netstat does plus SDP data host1 sdpnetstat S Assuming that the SDP kernel module is loaded and is being used then the output of the command will be as follows host1 sdpnetstat S Proto Recv Q Send Q Local Address Foreign Address sdp 0 0 193 168 10 144 34216 193 1600 125 1265 sdp 0 884720 193 168 10 144 42724 193 168 10 filenet rmi The example output above shows two active SDP sockets and contains details about the connec tions If the SDP kernel module is not loaded then the output of the command will be something like the following host1 sdpnetstat S Proto Recv Q Send Q Local Address Foreign Address netstat no support for AF INET tcp on this system To verify whether the module is loaded or not you can use the 1smod command ib sdp1250200 The example output above shows that the SDP module is loaded If the SDP module is loaded and the sdpnetstat command did not show SDP sockets then SDP is not being used by any application 4 3 3 2 Monitoring and Troubleshooting Tools SDP has debug support for both the user space 1 ibsdp so library and the ib sdp kernel mod ule Both can be useful to understand why a TCP socket was not redirected over SDP and to help find problems in the SDP implementation User Space SDP Debug User sp
159. ived, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2

4.6.6 Bonding IPoIB
To create an interface configuration script for the ibX and bondX interfaces, use the standard syntax (depending on your OS). Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux bonding driver.
• Network script files for IPoIB slaves are named after the IPoIB interfaces (e.g., ifcfg-ib0)
• The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup)
• The bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces; hence, the only supported value is the default: 0 (or "none" in SLES11)
For a persistent bonding IPoIB network configuration, use the same Linux network scripts semantics, with the following exceptions/additions:
• In the bonding master configuration file (e.g., ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520
  Note: 65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 4.6.2, "IPoIB Mode Setting", on page 77) and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do not set MTU at all, performance of the interface might decrease.
• In the bonding slave configuration file (e.g., ifcfg-ib0), use the same Linux Network Scrip
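The MTU rule for the bonding master in excerpt 159 (65520 when all slaves are in connected mode, 2044 in datagram mode) can be expressed as a small lookup. This is a sketch only; the function name ipoib_bond_mtu is hypothetical. On a live system the mode string would typically be read from the interface's mode file under /sys/class/net (for example, for ib0), but here it is passed in as an argument so the snippet runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: print the bond MTU for the IPoIB mode shared by all slaves.
#   $1 = "connected" or "datagram"
ipoib_bond_mtu() {
    case "$1" in
        connected) echo 65520 ;;
        datagram)  echo 2044 ;;
        *) echo "unknown IPoIB mode: $1" >&2; return 1 ;;
    esac
}

# Print the line to place in the bonding master file (e.g., ifcfg-bond0).
printf 'MTU=%s\n' "$(ipoib_bond_mtu connected)"
```

Failing on an unknown mode, rather than falling back to a default, avoids silently configuring an MTU that would degrade performance.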
160. Detailed Examples
4.1.10 Configuring DAPL over RoCE
4.2 Reliable Datagram Sockets
4.2.1 Overview
4.2.2 RDS Configuration
4.3 Sockets Direct Protocol
4.3.1 Overview
4.3.2 libsdp.so Library
4.3.3 Configuring SDP
4.3.4 Environment Variables
4.3.5 Converting Socket-based Applications
4.3.6 BZCopy Zero Copy Send
4.3.7 Using RDMA for Small Buffers
4.4 SCSI RDMA Protocol
4.4.1 Overview
4.4.2 SRP Initiator
4.5 Ethernet over IB (EoIB) vNic
161. lank GUIDs and MACs where applicable These values can be set later using the sg command see Table 24 below No com Force clear the Flash semaphore on the device No command is clear_semapho re mands allowed allowed when this switch is used Warning May result in system instability or Flash corruption if the device or another application is currently using the Flash i mage burn verify Binary image file lt image gt qq burn query Run a quick query When specified mstflint will not perform full image integrity checks during the query operation This may shorten execution time when running over slow interfaces e g I2C MTUSB 1 nofs burn Burn image in a non failsafe manner skip is burn Allow burning the firmware image without updating the invariant sector This is to ensure failsafe burning even when an invariant sec tor difference is detected 184 Mellanox Technologies InfiniBand Fabric Diagnostic Utilities Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 Table 23 mstflint Switches Sheet 3 of 3 Affected Switch Relevant Description Commands byte_mode burn write Shift address when accessing Flash internal registers May be required for burn write commands when accessing certain Flash types s ilent burn Do not print burn progress messages y es All Non interactive mode Assume the answer is yes to all questions no A
162. le:
  ENABLE: true;
  LOG_FILE: /tmp/ar_mgr.log;
  LOG_SIZE: 100;
  MAX_ERRORS: 10;
  ERROR_WINDOW: 5;
  SWITCH 0x12345 {
    ENABLE: true;
    AGEING_TIME: 77;
  }
  SWITCH 0x0002c902004050f8 {
    AGEING_TIME: 44;
  }
  SWITCH 0xabcde {
    ENABLE: false;
  }

9.9 Congestion Control

9.9.1 Congestion Control Overview
Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e., it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation.
The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).

9.9.2 Running OpenSM with Congestion Control Manager
Congestion Control (CC) Manager can be enabled or disabled through the SM options file. To do so, perform the following:
1. Create the file. Run: opensm -c <options-file-name>
2. Find the event_plugin_name option in the file and add 'ccmgr' to it:
  # Event plugin name(s)
  event_plugin_name ccmgr
3. Run the SM with the new options file: opensm -F <options-file-name>
163. les that use the stack. Note that the include files (if needed) are "backported" to your kernel.
• The raw package (un-backported) source files are placed under /usr/src/ofa_kernel-<ver>
• The script openibd is installed under /etc/init.d/. This script can be used to load and unload the software stack.
• The script connectx_port_config is installed under /sbin. This script can be used to configure the ports of ConnectX network adapter cards to Ethernet and/or InfiniBand. For details on this script, please see Section 5.1, "Port Type Management".
• The directory /etc/infiniband is created with the files info, openib.conf, and connectx.conf. The info script can be used to retrieve Mellanox OFED installation information. The openib.conf file contains the list of modules that are loaded when the openibd script is used. The connectx.conf file saves the ConnectX adapter card's port configuration (Ethernet and/or InfiniBand). This file is used at driver start/restart (/etc/init.d/openibd start).
• The file 90-ib.rules is installed under /etc/udev/rules.d/
• If OpenSM is installed, the daemon opensmd is installed under /etc/init.d/ and opensm.conf is installed under /etc.
• If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
  /etc/sysconfig/network-scripts/ on a RedHat machine
  /etc/sysconfig/network/ on a S
164. 9.7 QoS Configuration Examples
The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.

9.7.1 Typical HPC Example: MPI and Lustre
Assignment of QoS Levels
• MPI: separate from I/O load; min BW of 70%
• Storage Control (Lustre MDS): low latency
• Storage Data (Lustre OST): min BW 30%
Administration
• MPI is assigned an SL via the command line:
  host1# mpirun -sl 0
• OpenSM QoS policy file. In the following policy file example, replace OST* and MDS* with the real port GUIDs.
  qos-ulps
    default :0 # default SL (for MPI)
    any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
    any, target-port-guid MDS1,MDS2 :2 # SL for Lustre MDS
  end-qos-ulps
• OpenSM options file:
  qos_max_vls 8
  qos_high_limit 0
  qos_vlarb_high 2:1
  qos_vlarb_low 0:96,1:224
  qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

9.7.2 EDC SOA (2-tier): IPoIB and SRP
The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.
QoS Levels
• Application traffic
165. ll Non interactive mode Assume the answer is no to all questions vsd lt string gt burn Write this string of up to 208 characters to VSD upon a burn com mand use_image ps burn Burn vsd as it appears in the given image do not keep existing VSD on Flash dual_image burn Make the burn process burn two images on Flash The current default failsafe burn process burns a single image in alternating locations yv Print version info Table 24 mstflint Commands Sheet 1 of 2 Command Description b urn Burn Flash q uery Query miscellaneous Flash firmware characteristics v erify Verify the entire Flash bb Burn Block Burn the given image as is without running any checks sg Set GUIDs ri lt out file gt Read the firmware image on the Flash into the specified file de lt out file gt Dump Configuration Print a firmware configuration file for the given image to the specified output file e rase lt addr gt Erase sector rw lt addr gt Read one DWORD from Flash ww lt addr gt lt data gt Write one DWORD to Flash wwne lt addr gt Write one DWORD to Flash without sector erase wbne lt addr gt lt size gt lt data gt Write a data block to Flash without sector erase Mellanox Technologies 185 Rev 1 5 3 3 1 0 InfiniBand Fabric Diagnostic Utilities Table 24 mstflint Commands Sheet 2 of 2 Command De
166. llowing is an example of an SRP Target setup file kkkxkkkkkkkkkkkkkkkkkkkxk srpt sh kkkxkkkkkkkkkkkkkkkkkkkkkkkkkkxkxk bin sh modprobe scst scst_threads 1 208 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 modprobe scst_vdisk scst_vdisk ID 100 echo open vdisk0 dev cciss cld0 BLOCKIO gt proc scsi_tgt vdisk vdisk echo open vdiskl dev sdb BLOCKIO gt proc scsi_tgt vdisk vdisk echo open vdisk2 dev sdc BLOCKIO gt proc scsi_tgt vdisk vdisk echo open vdisk3 dev sdd BLOCKIO gt proc scsi_tgt vdisk vdisk echo add vdisk0 0 gt proc scsi_tgt groups Default devices echo add vdiskl 1 gt proc scsi_tgt groups Default devices echo add vdisk2 2 gt proc scsi_tgt groups Default devices echo add vdisk3 3 gt proc scsi_tgt groups Default devices modprobe ib srpt echo add mgmt gt proc scsi_tgt trace level echo add mgmt _dbg gt proc scsi_tgt trace level echo add out_of mem gt proc scsi_tgt trace level KKKKKKKKKKKKKKKKKKKKKKK End srpt sh kkkkkkkkkkkkkkkkkkkkkkkkkkkxk B 3 How to Unload Shutdown 1 Unload ib_srpt modprobe r ib srpt 2 Unload scst and its dev_handlers first modprobe r scst_vdisk scst 3 Unload ofed etc rce d openibd stop Mellanox Technologies 209 Rev 1 5 3 3 1 0 Appendix C mlx4 Module Parameters In order to set m1x4 parameters add the following line s to etc modprobe conf options mlx4 core pa
167. login again. Other packages (such as environment modules) provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality.

The MPI selector functionality can be invoked in one of two ways:
1. The mpi-selector-menu command. This command is a simple menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended for all users.
2. The mpi-selector command. This command is a CLI equivalent of mpi-selector-menu, allowing for the same functionality but without the interactive menus and prompts. It is suitable for scripting.

Compiling MPI Applications
Note: A valid Fortran compiler must be present in order to build the MVAPICH MPI stack and tests.
The following compilers are supported by Mellanox OFED's MVAPICH and Open MPI packages: GCC, Intel, and PGI. The install script prompts the user to choose the compiler with which to install the MVAPICH and Open MPI RPMs. Note that more than one compiler can be selected simultaneously, if desired.

Compiling MVAPICH Applications
Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html. To review the default configuration of the installation, check the default configuration f
168. lt the Linux kernel assigns each vNic interface the name eth lt N gt where lt N gt is an incremental num ber that keeps the interface name unique in the system The vNic interface name may not remain consistent among hosts or BridgeX reboots as the vNic creation can happen in a different order each time Therefore the interface name may change because of a first come first served kernel policy In automatic network administered mode the vNic MAC address may also change which makes it difficult to keep the interface configuration persistent To control the interface name you can use standard Linux utilities such as IFRENAME 8 IP 8 or UDEV 7 For example to change the interface eth2 name to eth bx01 a10 run ifrename i eth2 n eth bx01 a10 To generate a unique vNic interface name use the mlx4 vnic info script with the u flag The script will generate a new name based on the scheme eth lt pci id gt lt ib port num gt lt gw port id gt vlan id For example if vNic eth2 resides on an InfiniBand card on the PCI BUS ID 0a 00 0 PORT 1 and is connected to the GW PORT ID 3 without VLAN its unique name will be mlx4 vnic_ info u eth2 Gitna iclail ds You can add your own custom udev rule to use the output of the script and to rename the vNic interfaces automatically To create a new udev rule file under etc udev rules d 61 vnic net rules include the line SUBSYSTEM net PROGRAM sbin mlx4 vnic info u k NAME
169. m ko sbin insmod lib modules ib mlx4 core ko sbin insmod lib modules ib mlx4_ib ko sbin insmod lib modules ib ib mthca ko The following command loading ipoib_helper ko is not required for all OS kernels Please check the release notes Ad sbin insmod lib modules ib ipoib helper ko sbin insmod lib modules ib ib ipoib ko In case of interoperability issues between iSCSI and Large Receive Offload LRO change the last command above as follows to disable LRO Mellanox Technologies 201 Rev 1 5 3 3 1 0 A 9 2 sbin insmod lib modules ib ib ipoib ko lro 0 Step 10 Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules If you wish to use the DHCP client then you need to add a call to the DHCP client in the init file after loading the IB modules For example sbin dhclient cf sbin dhclient conf ibl Step 11 Save the init file Step 12 Close initrd host1 cd tmp initrd_ib host1 find cpio H newc o gt tmp new initrd ib img hostl gzip tmp new init _ib img Step 13 At this stage the modified initrd including the IB driver is ready and located at tmp new init _ib img gz Copy it to the original initrd location and rename it prop erly Case II Ethernet Ports The Ethernet driver requires loading the following modules in the specified order see the exam ple below e mlx4 core ko e mlx4 en k
170. ming switches support 8 data VLs).
• Ability to route around a single failed switch, and/or multiple failed links, without introducing credit loops or changing path SL values
• Very short run times, with good scaling properties as fabric size increases

9.5.7.1 Unicast Routing
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses, as follows:

  sl = 0;
  for (d = 0; d < torus_dimensions; d++)
      /* path_crosses_dateline(d) returns 0 or 1 */
      sl |= path_crosses_dateline(d) << d;

For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows:

  for (sl = 0; sl < 16; sl++)
      /* cdir(port) reports which torus coordinate direction a switch port
       * "points" in, and returns 0, 1, or 2 */
      sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport));

Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by
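The dateline encoding in excerpt 170 can be tried out with plain shell arithmetic. The sketch below is not OpenSM code: the function names are hypothetical, and the dateline-crossing flags are supplied by hand rather than derived from a real torus. It shows how one bit per dimension is packed into the SL and how a switch output port recovers a single VL bit from it.

```shell
#!/bin/sh
# Hypothetical helper: pack dateline crossings into an SL.
# Arguments: one 0/1 flag per torus dimension, dimension 0 first.
torus_path_sl() {
    sl=0
    d=0
    for crosses in "$@"; do
        sl=$(( sl | (crosses << d) ))   # bit d is set if dateline d is crossed
        d=$(( d + 1 ))
    done
    echo "$sl"
}

# Hypothetical helper: VL bit selected by an output port.
#   $1 = SL, $2 = coordinate direction the port points in (0, 1, or 2)
torus_sl2vl() {
    echo $(( 1 & ($1 >> $2) ))
}

# A path crossing the datelines of dimensions 0 and 2 gets SL 5 (binary 101).
sl=$(torus_path_sl 1 0 1)
echo "sl=$sl"
# A port pointing in direction 2 sees bit 2 of the SL; direction 1 sees bit 1.
echo "vl(dir 2)=$(torus_sl2vl "$sl" 2) vl(dir 1)=$(torus_sl2vl "$sl" 1)"
```

Because each port only looks at the bit for its own direction, three SL bits collapse to one VL bit per hop, which is what keeps the VL consumption at two values per QoS level.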
171. mum verbosity level and forces log flushing. The -V option is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.

This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:

BIT    LOG LEVEL ENABLED
0x01   ERROR (error messages)
0x02   INFO (basic messages, low volume)
0x04   VERBOSE (interesting stuff, moderate volume)
0x08   DEBUG (diagnostic, high volume)
0x10   FUNCS (function entry/exit, very high volume)
0x20   FRAMES (dumps all SMP and GMP frames)
0x40   ROUTING (dump FDB routing information)
0x80   currently unused

Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

Display this usage info, then exit.

To run osmtest in the default mode, simply enter:

host1# osmtest

The default mode runs all the flows except for the Quality of Service flow (see Section 9.6).

After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the following command in order to generate the inventory file:

host1# osmtest -f c

Immediately afterwards, run the following command to test opensm:

host1# osmtest -f a

Finally, it is recommended to occasionally run "osmtest -v" (with verbo
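The verbosity flags field is an ordinary bit mask, so a value can be composed or decoded with bitwise operations. The Python sketch below is only an illustration of the bit table above; the names LOG_LEVELS, verbosity_flags and enabled_levels are hypothetical helpers and are not part of osmtest.

```python
# Bit values copied from the log level table; 0x80 is currently unused.
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
}


def verbosity_flags(*levels):
    """OR together the bits of the requested log levels."""
    flags = 0
    for level in levels:
        flags |= LOG_LEVELS[level]
    return flags


def enabled_levels(flags):
    """List the log levels whose bit is set in a flags value."""
    return [name for name, bit in LOG_LEVELS.items() if flags & bit]


default_flags = verbosity_flags("ERROR", "INFO")
print(hex(default_flags), enabled_levels(default_flags))
```

Composing ERROR and INFO reproduces the documented default of 0x3, and decoding 0xFF lists every defined level.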
172. mware After Installation

In case you ran the mlnxofedinstall script with the '--without-fw-update' option and now you wish to manually update firmware on your adapter card(s), perform the following steps:

Step 1. Start mst.

Step 2. Identify your target InfiniBand device for firmware update.
  1. Get the list of InfiniBand device names on your machine.
  2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.

Step 3. Burn firmware.
  1. Burning a firmware binary image using mstflint, which is already installed on your machine. Please refer to MSTFLINT_README.txt under docs/.
  2. Burning a firmware image from a .mlx file using the mlxburn utility, which is already installed on your machine. The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2:

     host1# mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408/fw-25408-rel.mlx

Step 4. Reboot your machine after the firmware burning is completed.

2.5 Uninstalling Mellanox OFED

Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.

3 Configuration Files

For the comple
173. n option in the file Options string that would be passed to the plugin s event_plugin_options armgr conf file lt ar mgr options file name gt 2 Run Subnet Manager with the new options file opensm F lt options file name gt AR Manager options file contains two types of parameters 1 General options Options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric 2 Per switch options Options which describe specific switch behavior 148 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 Note the following e Adaptive Routing configuration file is case sensitive e You can specify options for nonexisting switch GUID These options will be ignored until a switch with a matching GUID will be added to the fabric e Adaptive Routing configuration file is parsed every AR Manager cycle which in turn is executed at every heavy sweep of the Subnet Manager e Ifthe AR Manager fails to parse the options file default settings for all the options will be used Mellanox Technologies 149 Rev 1 5 3 3 1 0 OpenSM Subnet Manager 9 8 5 1 150 Mellanox Technologies General AR Manager Options Table 7 Adaptive Routing Manager Options File Option File Description Values ENABLE lt true false gt Enable disable Adaptive Routing on fabric switches Note that if a switch was identified by AR Manager as devic
174. n as the origin of the coordinate system used to describe switch location The position parameter for a dateline keyword moves the origin and hence the dateline the specified amount relative to the common switch in a torus seed next_seed If any of the switches used to specify a seed were to fail torus 2QoS would be unable to complete topology discovery successfully The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification For maximum resiliency no seed specification should share a switch with any other seed specifi cation Multiple seed specifications should use dateline configuration to ensure that torus 2QoS can grant path SL values that are constant regardless of which seed was used to initiate topology discovery portgroup max ports max ports This keyword specifies the maximum number of parallel inter switch links and also the maximum number of host ports per switch that torus 2QoS can accom modate The default value is 16 Torus 2QoS will log an error message during topology discovery if this parameter needs to be increased If this keyword appears multiple times the last instance prevails port order pl p2 p3 This keyword specifies the order in which CA ports on a destination switch are visited when computing routes When the fabric contains switches connected with mul tiple parallel links routes are distributed in a round robin fashion across such links and so cha
175. n less CPU usage and, on many systems, much higher bandwidth. Note that the default value of sdp_zcopy_thresh is 64KB, but it may be too low for some systems. You will need to experiment with your hardware to find the best value.

4.3.7 Using RDMA for Small Buffers

For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is too big; therefore, it is more efficient to use BCopy. Large buffers can also be sent using RDMA, but they lower CPU utilization. This mode is called "ZCopy combined" mode. The sendmsg syscall is blocked until the buffer is transferred to the socket's peer, and the data is copied directly from the user buffer at the source side to the user buffer at the sink side.

To set the threshold, use the module parameter sdp_zcopy_thresh. This parameter can be accessed through sysfs (/sys/module/ib_sdp/parameters/sdp_zcopy_thresh). Setting it to 0 disables ZCopy.

4.4 SCSI RDMA Protocol

4.4.1 Overview

As described in Section 1.4.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and
176. n of a specific BridgeX box and a specific eport is referred to as a gateway The gateway is an entity that is visible to the EoIB host driver and is used in the configuration of the network interfaces on the host side For example in the host administered vNics the user will request to open an interface on a specific gateway identifying it by the BridgeX box and eport name Distinguishing between gateways is essential because they determine the network topology and affect the path that a packet traverses between hosts A packet that is sent from the host on a spe cific EoIB interface will be routed to the Ethernet subnet through a specific external port connec tion on the BridgeX box 4 5 1 2 Virtual Hubs vHubs Virtual hubs connect zero or more EoIB interfaces on internal hosts and an eport through a vir tual hub Each vHub has a unique virtual LAN VLAN ID Virtual hub participants can send packets to one another directly without the assistance of the Ethernet subnet external side rout ing This means that two EoIB interfaces on the same vHub will communicate solely using the InfiniBand fabric EoIB interfaces residing on two different vHubs whether on the same gateway or not cannot communicate directly There are two types of vHubs e a default vHub one per gateway without a VLAN ID e vHubs with unique different VLAN IDs Each vHub belongs to a specific gateway BridgeX eport and each gateway has one default vHub and zero
177. n page 62 for more information on vNics configuration.

Once an EoIB vNic is enslaved to a virtual bridge, it can be used by any Guest OS that is supported by the Hypervisor. The driver will automatically manage the resources required to serve the Guest OS network virtual interfaces (based on their MAC address).

To see the list of MAC addresses served by an EoIB vNic, log into the Hypervisor and run the command:

mlx4_vnic_info -m <interface>

The driver detects virtual interfaces' MAC addresses based on their outgoing packets, so you may notice that a virtual MAC address is detected by the EoIB driver only after the first packet is sent out by the Guest OS. Cleanup of virtual resources (MAC addresses) is managed by the mlx4_vnic daemon, as explained in Section "Resources Cleanup" on page 76.

Multicast Configuration

Virtual machines' multicast traffic over PV-EoIB is supported in promiscuous mode. Hence, all multicast traffic is sent over the broadcast domain and filtered at the VM level.
• To enable promiscuous multicast, log into the BridgeX CLI and run the command:
  bxm eoib mcast promiscuous
  Please refer to the BridgeX CLI Guide for additional details.

To see the multicast configuration of a vNic from the host, log into the Hypervisor and run:

mlx4_vnic_info -i <interface> | grep MCAST

VLANs

Virtual LANs are suppo
178. n the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 0 will give a PKey with the value 0x8000.

Step 2. Create a child interface by running:

host1# echo <PKey> > /sys/class/net/<IB subinterface>/create_child

Example:

host1# echo 0 > /sys/class/net/ib0/create_child

This will create the interface ib0.8000.

Step 3. Verify the configuration of this interface by running:

host1# ifconfig <subinterface>.<subinterface PKey>

Using the example of Step 2:

host1# ifconfig ib0.8000
ib0.8000  Link encap:UNSPEC  HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
          BROADCAST MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, follow the manual configuration procedure described in Section 4.6.3.3.

Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, is recognized (see Chapter 9, "OpenSM Subnet Manager").

4.6.4.2 Removing a Subinterface

To remove a child in
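The relationship between the value written to create_child and the resulting PKey can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper name child_interface is hypothetical, and the interface naming pattern (parent name, a dot, then the PKey as four lowercase hex digits) is inferred from the single ib0.8000 example in this section.

```python
def child_interface(parent, pkey_value):
    """Return the effective PKey and the expected child interface name.

    The effective PKey is the requested 16-bit value with the most
    significant bit (0x8000) set, as described in Step 1.
    """
    if not 0 <= pkey_value <= 0xFFFF:
        raise ValueError("PKey must be a 16-bit unsigned value")
    pkey = pkey_value | 0x8000
    # Naming pattern assumed from the ib0.8000 example.
    return pkey, f"{parent}.{pkey:04x}"


pkey, name = child_interface("ib0", 0)
print(hex(pkey), name)
```

For the example in Step 2, a requested value of 0 yields the PKey 0x8000 and the interface name ib0.8000.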
179. net Manager with the new options file:

opensm -F <options-file-name>

See an example of an AR Manager options file with all the default values in "Example of Adaptive Routing Manager Options File" on page 151.

9.8.3.2 Disabling Adaptive Routing

There are two ways to disable Adaptive Routing Manager:
1. By disabling it explicitly in the Adaptive Routing configuration file.
2. By removing the 'armgr' option from the Subnet Manager options file.

The Adaptive Routing mechanism is automatically disabled once the switch receives a setting of the usual linear routing table (LFT). Therefore, no action is required to clear the Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.

9.8.4 Querying Adaptive Routing Tables

When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid; thus, the standard tools that query LFT (e.g. smpquery, dump_lfts.sh, and others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the 'smparquery' tool that is installed as a part of the Adaptive Routing Manager package. To see its usage details, run 'smparquery -h'.

9.8.5 Adaptive Routing Manager Options File

The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following:
1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the event_plugin_optio
180. nformation.

The ib-bonding driver can be loaded manually or automatically.

Manual Operation

Use the utility ib-bond to start, query, or stop the driver. For details on this utility, please read the documentation for the ib-bonding package under:
/usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat, and
/usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE.

Automatic Operation

Automatic ib-bonding operation can be configured as follows:
1. Using a standard OS bonding configuration. For details on this, please read the documentation for the ib-bonding package under:
/usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat, and
/usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE.

Notes
• If the bondX name is defined but one of bondX_SLAVES or bondX_IPs is missing, then that specific bond will not be created.
• The bondX name must not contain characters which are disallowed for bash variable names such as lt and P

On all the newer OSes, bonding can be done with the inbox bonding module.
181. ng ing the order that CA ports are visited changes the distribution of routes across such links This may be advantageous for some specific traffic patterns The default is to visit CA ports in increasing port order on destination switches Duplicate values in the list will be ignored EXAMPLE Mellanox Technologies 133 Rev 1 5 3 3 1 0 OpenSM Subnet Manager Look for a 2D since x radix is one 4x5 torus torus 1 4 5 y is radix 4 torus dimension need both ym link and yp link configuration yp link 0x200000 0x200005 sw y 0 z 0 gt sw y 1 z 0 ym link 0x200000 0x20000f sw y 0 z 0 gt sw y 3 z 0 z is not radix 4 torus dimension only need one of zm link or zp link configuration zp_link 0x200000 0x200001 sw y 0 z 0 gt sw y 0 2 1 next_seed yp link 0x20000b 0x200010 sw y 2 z2 1 gt sw y 3 z 1 ym link 0x20000b 0x200006 sw y 2 z2 1 gt sw y 1 z 1 zp_link 0x20000b 0x20000c sw y 2 z 1 gt sw y 2 2 2 y dateline 2 Move the dateline for this seed z dateline 1 back to its original position If OpenSM failover is configured for maximum resiliency one instance should run on a host attached to a switch from the first seed and another instance should run on a host attached to a switch from the second seed Both instances should use this torus 2Q0S conf to ensure path SL values do not change in the event of SM failover port_order defines the order on which the ports would
182. nkUp 20 Gb sec 4X DDR Infiniband device mlx4 0 port 2 status default gid base lid sm lid state phys state Tg licies e80 0000 0000 0000 0000 0000 0007 3897 0x1 0x1 4 ACTIVE 5 LinkUp 20 Gb sec 4X DDR Infiniband device mthca0 port 1 status default gid base lid sm lid SEATER phys state rate e80 0000 0000 0000 0002 c900 0101 d151 0x0 0x0 De SNe 5 LinkUp 10 Gb sec 4X Infiniband device mthca0 port 2 status default gid base lid sm lid state phys state GUESS e80 0000 0000 0000 0002 c900 0101 d152 0x0 0x0 2s UNI 5 LinkUp 10 Gb sec 4X 2 List the status of specific ports of specific devices 168 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 gt ibstatus mthca0 1 mlx4 0 2 Infiniband device mthca0 port 1 status default gid fe80 0000 0000 0000 0002 c900 0101 d151 base lid 0x0 sm lid 0x0 state Ae ENE phys state 5 LinkUp rate 10 Gb sec 4X Infiniband device mlx4 0 port 2 status default gid e80 0000 0000 0000 0000 0000 0007 3897 base lid ozal sm lid 0x1 SLaces 4 ACTIVE phys state 5 LinkUp rate 20 Gb sec 4X DDR 10 10ibportstate Applicable Hardware All InfiniBand devices Description Enables querying the logical link and physical port states of an InfiniBand port It also allows adjusting the link speed that is enabled on any InfiniBand port If the queried port is
183. nox OEM OFED or Distribution IB packages will be removed Uninstalling the previous version of OFED Starting MLNX OFED LINUX 1 5 3 1 0 5 installation nstalling mlnx ofa_ kernel RPM Preparing mlnx ofa_ kernel nstalling kmod mlnx ofa_ kernel RPM Preparing kmod mlnx ofa_kernel nstalling mlnx ofa_kernel devel RPM Preparing mlnx ofa_kernel devel nstalling kernel mft RPM Preparing kernel mft nstalling mlx4 accl sys RPM Preparing mlx4 accl sys nstalling mlx4 accl RPM Preparing mlx4 accl nstalling mpi selector RPM Preparing mpi selector nstall user level RPMs Preparing ibibverbs ibibumad ibrdmacm opensm libs ibibmad ibmverbs ibmqe dapl ibutils2 mvapich gcc libmthca Mellanox Technologies 25 Rev 1 5 3 3 1 0 Installation ibm1x4 ibcxgb3 ibnes ibipathverbs ibibcm infinipath psm openmpi gcc ibsdp compat dapl mpitests openmpi gcc mpitests openmpi intel mpitests mvapich gcc mpitests mvapich intel cc_mgr ar mgr dapl utils ibsim infiniband diags opensm ibutils librdmacm utils perftest qperf ibacm srptools mvapich intel libibverbs utils
184. o

A.9.2.1 Example: Adding an Ethernet Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.6.3.1 on page 78, and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.

Adding the Ethernet Driver to the initrd File

The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.

Step 2. Make a new working directory and change to it.

host1# mkdir /tmp/initrd_en
host1# cd /tmp/initrd_en

Step 3. Normally, the initrd image is zipped. Extract it using the following command:

host1# gzip -dc <initrd image> | cpio -id

The initrd files should now be found under /tmp/initrd_en.

Step 4. Create a directory for the ConnectX EN modules and copy them.

host1# mkdir -p /tmp/initrd_en/lib/modules/mlnx_en
host1# cd /lib/modules/`uname -r`/updates/kernel/drivers
host1# cp net/mlx4
185. o the file /etc/modprobe.conf:

options mlx4_en <param_name>=<value> <param_name>=<value>

6 Performance

6.1 General System Configurations

The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.

6.1.1 PCI Express (PCIe) Capabilities

Table 5 - Recommended PCIe Configuration

PCIe Generation     2.0
Speed               5GT/s
Width               x8
Max Payload size    256
Max Read Request    512

Note: For VPI / Ethernet adapters with ports configured to run 40Gb/s or above, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the system.

6.1.2 BIOS Power Management Settings

• Set BIOS power management to Maximum Performance.
• On Intel processors only: disable C-states of PCI Express.

Note that these performance optimizations may result in higher power consumption.

6.1.3 Intel Hyper-Threading Technology

Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process due to lower frequency of a single-threaded core. This section applies only to Intel processors supporting Hyper-Threading.

For latency and message rate sensitive applications, it is recommended
186. o this target from another port/HCA.

When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).

Prerequisites

Installation for RHEL4/5: (Execute once)
• Verify that the standard device-mapper-multipath rpm is installed. If not, install it from the RHEL distribution.

Installation for SLES10: (Execute once)
• Verify that multipath is installed. If not, take it from the installation (you may use 'yast').
• Update udev: (Execute once, for manual activation of High Availability only)
• Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules). This file should have one line:

ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"

When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High Availability below), this file is created upon each boot of the driver and is deleted when the driver is unloaded.

Manual Activation of High Availability

Initialization: (Execute after each boot of the driver)
1. Execute modprobe dm-multipath
2. Execute modprobe ib-srp
3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and
187. of multiple devices on the local system Specifies the local device s port number used to connect to the IB fabric Specifies the directory where the output files will be placed default tmp Specifies Specifies Dump all t ibdiagnet pm Reset all If any of the expected link width the expected link speed he fabric links pm Counters into he fabric links pmCounters he provided pm is greater then its provided value print it to screen h help Prints the help page information V version Prints the version of the tool vars Prints the tool s environment variables and their values 10 5 2 Output Files Table 15 ibdiagpath Output Files Output File Description ibdiagpath log A dump ofall the application reports generated according to the provided flags ibdiagnet pm A dump of the Performance Counters values of the fabric links 10 5 3 ERROR CODES 1 The path traced is un healthy 2 Failed to parse command line options 3 More then 64 hops are required for traversing the local port to the Source port and then to the Destination port 4 Unable to traverse the LFT data from source to destination 5 Failed to use Topology File 6 Failed to load required Package 164 Mellanox Technologies InfiniBand Fabric Diagnostic Utilities Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 10 6 ibv_devices Applicable Hardware All InfiniBand devices Description List
188. omputers and or servers SSH Configuration The following steps describe how to configure password less access over SSH Step 1 Generate an ssh key on the initiator machine host1 102 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 host1 ssh keygen t rsa Generating public private rsa key pair Enter file in which to save the key home lt username gt ssh id rsa Enter passphrase empty for no passphrase Enter same passphrase again Your identification has been saved in home lt username gt ssh id_rsa Your public key has been saved in home lt username gt ssh id rsa pub The key fingerprint is 38 1b 29 d 4 08 00 4a 0e 50 0 05 44 e7 9 05 lt username gt hostl Step 2 Check that the public and private keys have been generated host1 cd home lt username gt ssh host1 1s host1 ls la total 40 Ca Wi a 2 root root 4096 Mar 5 04 57 drwxr x 13 root root 4096 Mar 4 18 27 NaN ae ates 1 root root 1675 Mar 5 04 57 id rsa rw r r 1 root root 404 Mar 5 04 57 id rsa pub Step 3 Check the public key host1 cat id_rsa pub ssh rsa AAAAB3NzaC 1 yc2EAAAABIWAAAQEAI zVY 8VBHQh 90kZN70A1ibUQ7 4RxXm4zHeczyVxpYHaDPyDmgezbYMKrCIVzd10b H ZkCOrpLYviU0oUHd3fvNT Ms0gcGg08PysUf 12FyYjira2Plxyg6mkHLGGqVut fEMmABZ3wNCUg6J2X3G uiuS WXeubZmbXcMrP w4 IWByfH8ajwo6A5WioNbFZElbYeeNf PZ 4UNcgMOAMWp 64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorXxNC0zZ Be4kTnUqm63nQ2z1qVMdL9Fr
189. orithm optimizes routing for a congestion free shift communication pattern It should be chosen if a subnet is a symmetrical Fat Tree of various types not just a K ary N Tree non constant K not fully staffed and for any CBB ratio Similar to UPDN Fat Tree routing is constrained to rank ing rules 4 LASH Routing Algorithm Uses InfiniBand virtual layers SL to provide deadlock free shortest path routing while also distrib uting the paths between layers LASH is an alternative deadlock free topology agnostic routing algorithm to the non minimal UPDN algorithm It avoids the use of a potentially congested root node Mellanox Technologies 119 Rev 1 5 3 3 1 0 OpenSM Subnet Manager 5 DOR Routing Algorithm Based on the Min Hop algorithm but avoids port equalization except for redundant links between the same two switches This provides deadlock free routes for hypercubes when the fabric is cabled as a hypercube and for meshes when cabled as a mesh Torus 2QoS Routing Algorithm Based on the DOR Unicast routing algorithm specialized for 2D 3D torus topologies Torus 2Qo0S provides deadlock free routing while supporting two quality of service QoS levels Additionally it can route around multiple failed fabric links or a single failed fabric switch without introducing deadlocks and without changing path SLvalues granted before the failure OpenSM provides an optional unicast routing cache enabled
190. ost1# cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib

IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:

host1# cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules

To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:

host1# cp /sbin/insmod /tmp/initrd_ib/sbin/

If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.

host1# cp /sbin/ifconfig /tmp/initrd_ib/sbin

If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step. To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with. Copy the DHCP client v3.1.3 file and all the relevant files as described below:

host1# cp <path to DHCP client v3.1.3>/dhclient /tmp/initrd_ib/sbin
host1# cp <path to DHCP client v3.1.3>/dhclient-script /tmp/initrd_ib/sbin
host1# mkdir -p /tmp/initrd_ib/var/state/dhcp
host1# touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
191. otes
2.3.2 Installation Script
2.3.3 Installation Procedure
2.3.4 Installation Results
2.3.5 Post-installation Notes
2.4 Updating Firmware After Installation
2.5 Uninstalling Mellanox OFED
Chapter 3 Configuration Files
Chapter 4 Driver Features
4.1 RDMA over Converged Ethernet
4.1.1 RoCE Overview
4.1.2 Software Dependencies
4.1.3 Firmware Dependencies
4.1.4 General Guidelines
4.1.5 Ported Applications
4.1.6 GID Tables
4.1.7 Using VLANs
4.1.8 Reading Port Counters Statistics
A
192. oth sides of the link:

# ifconfig eth2 20.4.3.220
# ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 00:02:C9:08:E8:11
          inet addr:20.4.3.220  Bcast:20.255.255.255  Mask:255.0.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Make sure that ping is working:

# ping 20.4.3.219
PING 20.4.3.219 (20.4.3.219) 56(84) bytes of data.
64 bytes from 20.4.3.219: icmp_seq=1 ttl=64 time=0.873 ms
64 bytes from 20.4.3.219: icmp_seq=2 ttl=64 time=0.198 ms
64 bytes from 20.4.3.219: icmp_seq=3 ttl=64 time=0.167 ms
--- 20.4.3.219 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.167/0.412/0.873/0.326 ms

Inspecting the GID Table

# cat /sys/class/infiniband/mlx4_0/ports/2/gids/0
fe80:0000:0000:0000:0202:c9ff:fe08:e811
# cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
0000:0000:0000:0000:0000:0000:0000:0000

According to the output, we currently have one entry only.

Run an Example Test: ibv_rc_pingpong

Start the server first:

# ibv_rc_pingpong -g 0 -i 2
local address:  LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID fe80::202:c9ff:fe08:e799
remote address: LID 0x0000, QPN 0x04004f, PSN 0x2cdede, GID fe80::202:c9ff:fe08:e811
8192000 byte
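The GID at index 0 in the example has the same form as an IPv6 link-local address built from the port's MAC address with the modified EUI-64 rule (flip the universal/local bit of the first octet and insert ff:fe in the middle). The Python sketch below reproduces that derivation for the values shown in this section. It is an illustration only: the function name mac_to_link_local_gid is hypothetical, and whether the driver fills every GID entry this way is an assumption not stated in the text.

```python
def mac_to_link_local_gid(mac):
    """Build a link-local, EUI-64 style GID string from a MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 6-octet MAC address")
    octets[0] ^= 0x02                     # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    words = ["fe80", "0000", "0000", "0000"]
    words += [f"{eui64[i]:02x}{eui64[i + 1]:02x}" for i in range(0, 8, 2)]
    return ":".join(words)


gid = mac_to_link_local_gid("00:02:C9:08:E8:11")
print(gid)
```

For the HWaddr reported by ifconfig above, the result equals the value read from gids/0.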
193. ows how to configure an IB interface:

host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0

Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument.

The following example shows how to verify the configuration:

host1# ifconfig ib0
ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:11.4.3.175  Bcast:11.4.255.255  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:65520  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).

4.6.4 Subinterfaces

You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.

This section describes how to:
• Create a subinterface (Section 4.6.4.1)
• Remove a subinterface (Section 4.6.4.2)

4.6.4.1 Creating a Subinterface

In the following procedure, ib0 is used as an example of an IB subinterface.

To create a child interface (subinterface), follow this procedure:

Step 1. Decide o
194. pen MPI stack supporting the InfiniBand RoCE and Ethernet interfaces OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces MPI benchmark tests OSU BW LAT Intel MPI Benchmark Presta e OpenSM InfiniBand Subnet Manager e Utilities Diagnostic tools Performance tests e Firmware tools MFT e Source code for all the OFED software modules for use under the conditions men tioned in the modules LICENSE files gt QIB 14 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 3 1 0 Low level driver implementation for all QLogic InfiniPath PCI Express HCAs This driver was not tested by Mellanox Technologies e CXGB3 Provide RDMA and NIC support for the Chelsio S series adapters This driver was not tested by Mellanox Technologies e NES Support for the NetEffect Ethernet Cluster Server Adapters This driver was not tested by Mellanox Technologies e Documentation 1 3 3 Firmware The ISO image includes the following firmware items e Firmware images mlx format for ConnectX and ConnectX 2 network adapters e Firmware configuration INI files for Mellanox standard network adapter cards and custom cards e FlexBoot for ConnectX ConnectX 2 HCA devices e ConnectX EN PXE gPXE boot for ConnectX EN and ConnectX 2 EN devices 1 3 4 Directory Structure The ISO image of MLNX_OFED_ LINUX contains the following files and directories e milnxofedinstall This is the MLNX_OFED_ LINU
195. performs OS bypass for standard socket-based applications.
• Mellanox Unified Fabric Manager (UFM) software: Powerful platform for managing demanding scale-out computing fabric environments, built on top of the OpenSM industry standard routing engine.
• Fabric Collective Accelerator (FCA): FCA is a Mellanox MPI-integrated software package that utilizes CORE-Direct technology for implementing the MPI collectives communications.
1.2 Introduction to Mellanox VPI Adapters
Mellanox VPI adapters, which are based on Mellanox ConnectX family adapter devices, provide leading server and storage I/O performance with flexibility to support the myriad of communication protocols and network fabrics over a single device, without sacrificing functionality when consolidating I/O. For example, VPI-enabled adapters can support:
• Connectivity to 10, 20, 40 and 56Gb/s InfiniBand switches, Ethernet switches, and emerging Data Center Ethernet switches
• A single firmware image for dual-port ConnectX/ConnectX-2/ConnectX-3 adapters that supports independent access to different convergence networks (InfiniBand, Ethernet or Data Center Ethernet) per port
• A unified application programming interface with access to communication protocols including Networking (TCP, IP, UDP, sockets), Storage (NFS, CIFS, iSCSI, SRP, Clustered Storage), Clustering (MPI, DAPL, RDS, sockets), and Management (SNMP, SMI-S)
• Communication protocol acceleration engines includin
196. ping of VLANs to InfiniBand partitions. Mapping VLANs to partitions causes all EoIB data traffic and all vNic-related control traffic to be sent to the mapped partitions. In rare cases, it might be useful to ensure that EoIB discovery packets (packets used for discovery of Gateways (GWs) and vice versa) are sent to a non-default partition. This might be used to limit and enforce the visibility of GWs by different hosts. The discovery_pkeys module parameter can be used to define which partitions are used to discover the GWs. The module parameter accepts up to 24 different PKeys. If it is not set, the default PKey is used, and only GWs using the default PKey are discovered. For example, to configure a host to discover GWs on three partitions (0xffff, 0xfff1 and 0x3), add the following line to the modprobe configuration file:
    options mlx4_vnic discovery_pkeys=0xffff,0xfff1,0x3
When using this feature combined with host-administrated vNics, each vNic should also be configured with the partition it should be created on. For example, to create a host-admin vNic on I/F eth20 with PKey 0xfff1, add the following line to ifcfg-eth20:
    GW_PKEY=0xfff1
Note: When using a non-default partition, the GW partitions should also be configured on the GW in the BridgeX. Additionally, the Subnet Manager must be configured accordingly.
4.5.3.7 ALL VLAN
In Ethernet over InfiniBand (EoIB), a vNic is a member of a vHUB that uniquely defines its Vir
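The discovery_pkeys configuration described above can be sketched as a small shell helper that builds the modprobe line and enforces the 24-PKey limit. This is only an illustration: the function name discovery_pkeys_line is hypothetical, and the `=` and comma separators are assumed from the usual modprobe `options` syntax rather than taken from the driver documentation.

```shell
# Build an "options mlx4_vnic discovery_pkeys=..." line from a list of PKeys.
# discovery_pkeys_line is a hypothetical helper, not part of MLNX_OFED.
discovery_pkeys_line() {
    # The text above states that up to 24 PKeys may be given.
    if [ "$#" -lt 1 ] || [ "$#" -gt 24 ]; then
        echo "expected 1 to 24 PKeys, got $#" >&2
        return 1
    fi
    list=$1
    shift
    # Join the remaining PKeys with commas.
    for pkey in "$@"; do
        list="$list,$pkey"
    done
    echo "options mlx4_vnic discovery_pkeys=$list"
}
```

For instance, `discovery_pkeys_line 0xffff 0xfff1 0x3` prints the line used in the example above; the output would then be appended to the modprobe configuration file by the administrator.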
197. port ==> eth5 (Down)
    mlx4_0 port 1 ==> ib0 (Down)
    mlx4_0 port 2 ==> ib1 (Down)
    mlx4_1 port 1 ==> eth2 (Down)
    mlx4_1 port 2 ==> eth3 (Down)
10.9 ibstatus
Applicable Hardware: All InfiniBand devices.
Description: Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.
Synopsis:
    ibstatus [-h] [<device name>[:<port>]]
Table 17 lists the various flags of the command.
Table 17 - ibstatus Flags and Options
Flag | Optional / Mandatory | Default If Not Specified | Description
-h | Optional | | Print the help menu.
<device> | Optional | All devices | Print information for the specified device. May specify more than one device.
<port> | Optional, but requires specifying a device name | All ports of the specified device | Print information for the specified port only (of the specified device).
Examples
1. List the status of all available InfiniBand devices and their ports.
    > ibstatus
    Infiniband device 'mlx4_0' port 1 status:
        default gid:  fe80:0000:0000:0000:0000:0000:0007:3896
        base lid:     0x3
        sm lid:       0x3
        state:        4: ACTIVE
        phys state:   5: Li
198. ption runs the specified stress test instead of the normal test suite. Stress test options are as follows:
    OPT   Description
    -s1   Single-MAD response SA queries
    -s2   Multi-MAD (RMPP) response SA queries
    -s3   Multi-MAD (RMPP) Path Record SA queries
  Without -s, stress testing is not performed.
-M, --Multicast_Mode
  This option specifies the length of the Multicast test:
    OPT   Description
    -M1   Short Multicast Flow (default), single mode
    -M2   Short Multicast Flow, multiple mode
    -M3   Long Multicast Flow, single mode
    -M4   Long Multicast Flow, multiple mode
  Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC.
  Multiple mode: could be run with other apps using MC with OpenSM.
  Without -M, default flow testing is performed.
-t, --timeout
  This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
-l, --log_file
  This option defines the log to be the given file. By default, the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout.
-v, --verbose
  This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.
-V
  This option sets the maxi
199. Flag | Optional / Mandatory | Default If Not Specified | Description
-h(help) | Optional | | Print the help menu.
-d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-e(rr_show) | Optional | | Show send and receive errors (timeouts and others).
-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" is the self port; "0,1,2,1,4" goes out via port 1, then 2, and so on.
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
-s <smlid> | Optional | | Use <smlid> as the target LID for SM/SA queries.
-V(ersion) | Optional | | Show version info.
-C <ca_name> | Optional | | Use the specified channel adapter or router.
-P <ca_port> | Optional | | Use the specified port.
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec].
<op> | Mandatory | | Supported operations: nodeinfo <addr>, nodedesc <addr>, portinfo <addr> [<portnum>], switchinfo <addr>, pkeys <addr> [<portnum>], sl2vl <addr> [<portnum>], vlarb <addr> [<portnum>], guids <addr>.
<dest dr_path | lid | guid> | Optional | | Destination's directed path, LID, or GUID.
Examples
1. Query PortInfo by LID, with port modifier.
    > smp
200. query portinfo 1 1
    # Port info: Lid 1 port 1
2. Query SwitchInfo by GUID.
3. Query NodeInfo by direct route.
10.13 perfquery
Applicable Hardware: All InfiniBand devices.
Description: Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them or simply reset them.
Synopsis:
    perfquery [-h] [-d] [-G] [-a] [-l] [-r] [-C <ca_name>] [-P <ca_port>] [-R] [-t <timeout_ms>] [-V] [<lid|guid> [[port] [reset_mask]]]
Table 21 lists the various flags of the command.
Table 21 - perfquery Flags and Options
Flag | Optional / Mandatory | Default If Not Specified | Description
-h(help) | Optional | | Print the help menu.
-d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
-a | Optional | | Apply query to all ports.
-l | Optional | | Loop ports.
-r | Optional | | Reset the counters after reading them.
-C <ca_name> | Optional | | Use the
201. r connect and listen to STDERR, include the following statement:
    log min-level 7 destination stderr
A non-root user can configure libsdp.so to record function calls and return values in the file /tmp/libsdp.log.<pid> (root log goes to /var/log/libsdp.log for this example) by including the following statement in libsdp.conf:
    log min-level 2 destination file libsdp.log
To print errors only to syslog, include the following statement:
    log min-level 9 destination syslog
To print maximum output to the file /tmp/sdp_debug.log.<pid>, include the following statement:
    log min-level 1 destination file sdp_debug.log
Kernel Space SDP Debug
The SDP kernel module can log detailed trace information if you enable it using the debug_level variable in the sysfs filesystem. The following command performs this:
    host1# echo 1 > /sys/module/ib_sdp/parameters/sdp_debug_level
Note: Depending on the operating system distribution on your machine, you may need an extra level (parameters) in the directory structure, so you may need to direct the echo command to /sys/module/ib_sdp/parameters/debug_level.
Turning off kernel debug is done by setting the sysfs variable to zero using the following command:
    host1# echo 0 > /sys/module/ib_sdp/parameters/sdp_debug_level
To display debug information, use the dmesg command:
    host1# dmesg
4.3.4 En
202. r that same 2D 6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:
[Figure: master spanning tree for the 2D 6x5 torus with the (2,2)-(3,2) link failed; x axis 0 to 5, y axis 0 to 4]
Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal, even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing.
In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no extra turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case:
[Figure: master spanning tree for the 2D 6x5 torus with the switch at (3,2) failed; x axis 0 to 5, y axis 0 to 4]
Assuming the y dateline was between y=4 and
203. rameter=<value>
and/or
    options mlx4_ib parameter=<value>
and/or
    options mlx4_en parameter=<value>
and/or
    options mlx4_fc parameter=<value>
The following sections list the available mlx4 parameters.
C.1 mlx4_core Parameters
set_4k_mtu: Attempt to set 4K MTU to all ConnectX ports (int)
pfctx: Priority-based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
pfcrx: Priority-based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
debug_level: Enable debug tracing if > 0 (int)
block_loopback: Block multicast loopback packets if > 0 (int)
msi_x: Attempt to use MSI-X if nonzero (int)
[parameter name illegible in source]: Use MAC table steering for Ethernet ports (default 0) (int)
log_num_mac: Log2 max number of MACs per ETH port (1-7) (int)
use_prio: Enable steering by VLAN priority on ETH ports (0/1, default 0) (bool)
fast_drop: Enable fast packet drop when no receive WQEs are posted (int)
log_num_qp: Log maximum number of QPs per HCA (int)
log_num_srq: Log maximum number of SRQs per HCA (int)
log_rdmarc_per_qp: Log number of RDMARC buffers per QP (int)
log_num_cq: Log maximum number of CQs per HCA (int)
log_num_mcg: Log maximum number of multicast groups per HCA (int)
log_num_mpt: Log maximum number of memory protection table entries per HCA (int)
log_num_mtt: Log maximum number of memory translation table segments per HCA (int)
log_mtts_per_seg: Log2 number of MTT entries per segment (0-7) (int)
enable_qos
204. register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.
9.2.1 opensm Syntax
    opensm [OPTIONS]
where OPTIONS are:
--version
  Prints OpenSM version and exits.
--config, -F <file-name>
  The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists).
--create-config, -c <file-name>
  OpenSM will dump its configuration to the specified file and exit.
--help, -h
  Display this usage info, then exit.
9.2.2 Environment Variables
The following environment variables
205. ressing, in which case the local port on the machine running the tool is assumed to be the source.
10.5.1 SYNOPSIS
    ibdiagpath {-n <[src-name,]dst-name> | -l <[src-lid,]dst-lid> | -d <p1,p2,p3,...>} [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]
OPTIONS
-n <[src-name,]dst-name>
  Names of the source and destination ports, as defined in the topology file. The source may be omitted, in which case the local port is assumed to be the source.
-l <[src-lid,]dst-lid>
  Source and destination LIDs. The source may be omitted, in which case the local port is assumed to be the source.
-d <p1,p2,p3,...>
  Directed route from the local node (which is the source) to the destination node.
-c <count>
  The minimal number of packets to be sent across each link (default = 100).
-v
  Enable verbose mode.
-t <topo-file>
  Specifies the topology file name.
-s <sys-name>
  Specifies the local system name. Meaningful only if a topology file is specified.
-i <dev-index>
  Specifies the index of the device of the port used to connect to the IB fabric (in case
206. rors or timeouts in less than error_window seconds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters:
• max_errors
• error_window
The values are:
• max_errors = 0: zero tolerance, abort configuration on first error
• error_window = 0: mechanism disabled, no error checking
• Range: [0-48K]
• The default is 5
Congestion Control Manager Options File
Table 9 - Congestion Control Manager General Options File
Option | Description | Values
enable | Enables/disables the Congestion Control mechanism on the fabric nodes. | Values: <TRUE | FALSE>. Default: True
num_hosts | Indicates the number of nodes. The CC table values are calculated based on this number. | Values: [0-48K]. Default: 0 (base the CCT calculation on the current subnet size)
Table 10 - Congestion Control Manager Switch Options File
Option | Description | Values
threshold | Indicates how aggressive the congestion marking should be. | [0-0xf]; 0: no packet marking; 0xf: very aggressive. Default: 0xf
marking_rate | The mean number of packets between marking eligible packets with a FECN. | Values: [0-0xffff]. Default: 0xa
packet_size | Any packet smaller than this size [bytes] will not be marked with FECN. | Values: [0-0x3fc0]. Default: 0x200
Table 11 - Congestion Control Manager CA
207. rted at the EoIB vNic level, where VLAN tagging/untagging is done by the EoIB driver.
• To enable VLANs on top of an EoIB vNic:
  a. Create a new vNic interface with the corresponding VLAN ID.
  b. Enslave it to a virtual bridge to be used by the Guest OS. The VLAN tagging/untagging is transparent to the Guest and managed at the EoIB driver level.
Note: The vconfig utility is not supported by the EoIB driver; a new vNic instance must be created instead. For further information, see Section 4.5.2.3, "VLAN Configuration".
Note: Virtual Guest Tagging (VGT) is not supported. The model explained above applies to Virtual Switch Tagging (VST) only.
Migration
Some Hypervisors provide the ability to migrate a virtual machine from one physical server to another. This feature is seamlessly supported by PV-EoIB. Any network connectivity over EoIB will automatically be resumed on the new physical server. The downtime that may occur during this process is minor.
Resources Cleanup
When a virtual interface within the Guest OS is no longer connected to an EoIB link, its MAC address needs to be removed from the EoIB driver. The cleaning is managed by the Garbage Collector (GC) service. The GC functionality is included in the mlx4_vnic daemon (python script): /sbin/mlx4_vnicd
• To enable/disable the mlx4_vnic daemon:
  a. Edit the /etc/infiniband/mlx4_vnic.conf file by including the
208. s InfiniBand devices available for use from userspace, including node GUIDs.
Synopsis:
    ibv_devices
Examples
1. List the names of all available InfiniBand devices.
    > ibv_devices
    device      node GUID
    mthca0      0002c9000101d150
    mlx4_0      0000000000073895
10.7 ibv_devinfo
Applicable Hardware: All InfiniBand devices.
Description: Queries InfiniBand devices and prints the information about them that is available for use from userspace.
Synopsis:
    ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]
Table 16 lists the various flags of the command.
Table 16 - ibv_devinfo Flags and Options
Flag | Optional / Mandatory | Default If Not Specified | Description
-d <device>, --ib-dev=<device> | Optional | First found device | Run the command for the provided IB device 'device'.
-i <port>, --ib-port=<port> | Optional | All device ports | Query the specified device port <port>.
-l, --list | Optional | Inactive | Only list the names of InfiniBand devices.
-v, --verbose | Optional | Inactive | Print all available information about the InfiniBand device(s).
Examples
1. List the names of all available InfiniBand devices.
    > ibv_devinfo -l
    2 HCAs found:
        mthca0
        mlx4_0
2. Query
209. s detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output. After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
• SM report
• Number of nodes and systems
• Hop-count information: maximal hop-count, an example path, and a hop-count histogram
• All CA-to-CA paths traced
• Credit loop report
• mgid-mlid-HCAs multicast group and report
• Partitions report
• IPoIB report
Note: In case the IB fabric includes only one CA, CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.
10.4.3 ERROR CODES
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package
10.5 ibdiagpath - IB diagnostic path
ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utili
210. 9.8.2 Installing the Adaptive Routing
9.8.3 Running Subnet Manager with Adaptive Routing Manager
9.8.4 Querying Adaptive Routing Tables
9.8.5 Adaptive Routing Manager Options File
9.9 Congestion Control
9.9.1 Congestion Control Overview
9.9.2 Running OpenSM with Congestion Control Manager
9.9.3 Configuring Congestion Control Manager
9.9.4 Configuring Congestion Control Manager Main Settings
Chapter 10 InfiniBand Fabric Diagnostic Utilities
10.1 Overview
10.2 Utilities Usage
10.2.1 Common Configuration, Interface and Addressing
10.2.2 IB Interface Definition
10.2.3 Addressing
10.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
10.3.1 SYNOPSIS
10.3.2 Output Files
211. s file:
    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:96,3:96,4:96
    qos_vlarb_low 0:1
    qos_sl2vl 0,1,2,[illegible],15,15,15,15,15
• Partition configuration file:
    Default=0x7fff, ipoib : ALL=full;
    PartA=0x8001, sl=1, ipoib : ALL=full;
9.8 Adaptive Routing
9.8.1 Overview
Note: Adaptive Routing is at beta stage.
Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
• Free AR: No constraints on output port selection.
• Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches. Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select an output port out of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. It especially suits 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis.
Note: If some switches do not support AR, they will slow down th
212. s in 0.01 seconds = 4730.13 Mbit/sec
    1000 iters in 0.01 seconds = 13.85 usec/iter
Then start the client:
    ibv_rc_pingpong -g 0 -i 2 sw419
    local address:  LID 0x0000, QPN 0x04004f, PSN 0x2cdede, GID fe80::202:c9ff:fe08:e811
    remote address: LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID fe80::202:c9ff:fe08:e799
    8192000 bytes in 0.01 seconds = 4787.84 Mbit/sec
    1000 iters in 0.01 seconds = 13.69 usec/iter
Add VLANs
Make sure that the 8021q module is loaded:
    modprobe 8021q
Add the VLAN device:
    vconfig add eth2 7
    Added VLAN with VID == 7 to IF -:eth2:-
Configure an IP address for it:
    ifconfig eth2.7 7.4.3.220
Examine the GID table:
    cat /sys/class/infiniband/mlx4_0/ports/2/gids/0
    fe80:0000:0000:0000:0202:c9ff:fe08:e811
    cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
    fe80:0000:0000:0000:0202:c900:0708:e811
According to the output, we now have two entries.
Run the Example Again, Now on VLAN
On Server:
    ibv_rc_pingpong -g 1 -i 2
    local address:  LID 0x0000, QPN 0x04004f, PSN 0xbdde2c, GID fe80::202:c900:708:e799
    remote address: LID 0x0000, QPN 0x08004f, PSN 0xc9d800, GID fe80::202:c900:708:e811
    8192000 bytes in 0.01 seconds = 4824.50 Mbit/sec
    1000 iters in 0.01 seconds = 13.58 usec/iter
On Client:
    ibv_rc_pingpong -g 1 -i 2 sw419
    local a
213. scription
rb <addr> <size> [out-file]: Read a data block from Flash.
swreset: SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.
Possible command return values are:
    0 - successful completion
    1 - error has occurred
    7 - the burn command was aborted because firmware is current
Examples
1. Find Mellanox Technologies' ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE.
    > /sbin/lspci -d 15b3:634a
    04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
In the example above, 15b3 is Mellanox Technologies' vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn.
Note: The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3.
2. Verify the ConnectX firmware using its ID (using the results of the example above).
    > mstflint -d 04:00.0 v
    ConnectX failsafe image. Start address: 80000. Chunk size 80000:
    NOTE: The addresses below are contiguous logical addresses. Physical addresses on flash may be different, based on the image start address and chunk size
    /0x00000038-0x000010db (0x0010a4)/ (BOOT2) - OK
    /0x000010dc-0
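The three documented mstflint return values can be turned into readable messages in a wrapper script. The following is a minimal sketch; mstflint_result is a hypothetical helper name, and any return value other than 0, 1 and 7 is simply reported as undocumented rather than interpreted.

```shell
# Translate an mstflint exit status into the meaning listed above.
# mstflint_result is a hypothetical helper, not part of the mstflint tool.
mstflint_result() {
    case "$1" in
        0) echo "successful completion" ;;
        1) echo "error has occurred" ;;
        7) echo "burn aborted: firmware is current" ;;
        *) echo "undocumented return value $1" ;;
    esac
}
```

In a script, the helper would be called with `$?` immediately after the mstflint command of interest, so that a status of 7 (firmware already current) can be treated differently from a real failure.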
214. scst_vdisk (fileio and blockio modes). This allows turning software RAID volumes, LVM volumes, IDE disks, block devices and normal files into SRP LUNs.
2. NULLIO mode allows measuring the performance without sending IOs to real devices.
B.1 Prerequisites and Installation
1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target.
Note: On distribution default kernels, you can run scst_vdisk blockio mode to obtain good performance.
2. Download and install the SCST driver. The supported version is 1.0.1.1.
  a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloads.html
  b. Untar scst-1.0.1.1:
    $ tar zxvf scst-1.0.1.1.tar.gz
    $ cd scst-1.0.1.1
  c. Install scst-1.0.1.1 as follows:
    $ make && make install
B.2 How to Run
A. On an SRP Target machine:
1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).
Note: Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0. It is not required to have the lun numbers in ascending order, except that the first lun must always be 0.
Note: Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the ib_srpt module but does not load scst nor its dev_handl
215. should be set up.
Note: In OFED, this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).
III. QoS Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and, optionally, Max MTU, Max Rate, Packet Lifetime and Path Bits.
Note: Path Bits are not implemented in OFED.
IV. Matching Rules
A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions, which should all match for the rule to apply. The matching expressions are defined for the following fields:
• SRC and DST to lists of port groups
• Service-ID to a list of Service-ID values or ranges
• QoS Class to a list of QoS Class values or ranges
4.7.4 CMA Features
The CMA interface supports Service-ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS Class. The CMA uses the provided QoS Class and Service-ID in the sent PR/MPR.
4.7.4.1 IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.
216. sity to verify that nothing in the fabric has changed.
9.4 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the --Pconfig or -P flags.
The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed. The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end-ports are assigned partial membership.
9.4.1 File Format
Notes:
• Line content following a '#' character is a comment and is ignored by the parser.
General File Format:
    <Partition Definition>:<PortGUIDs list> ;
Partition Definition:
    [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where:
    PartitionName: string, will be used with logging. When omitted, an empty string will be used.
    PKey: P_Key value for this partition. Only the low 15 bits will be used. When omitted, the P_Key will be autogenerated.
    flag: used to indicate IPoIB capability of this partition.
    defmember=full|limited: specifies the default membership for the port GUID list. Default is limited.
Currently recognized flags are
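Since only the low 15 bits of a configured PKey are used, it can help to see which base P_Key a given value maps to. The sketch below illustrates that masking only; pkey_base is a hypothetical helper name and is not an OpenSM tool.

```shell
# Print the 15-bit base P_Key that a configured PKey value maps to.
# pkey_base is a hypothetical helper, not part of OpenSM.
pkey_base() {
    # Keep only the low 15 bits, as described for the PKey field above.
    printf '0x%04x\n' "$(( $1 & 0x7fff ))"
}
```

For example, `pkey_base 0x8001` prints 0x0001, and `pkey_base 0xffff` prints 0x7fff, the default partition's P_Key value.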
217. specified by the -G or --io_guid_file option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by -H or --max_reverse_hops). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.
Note: Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.
9.5.4.2 Activation through OpenSM
• Use the -R ftree option to activate the fat-tree algorithm.
Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.
9.5.5 LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology-agnostic, deadlock-free routing within communication networks. When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as
218. st entries in the display. The following example shows a system with an installed Mellanox HCA:
    host1# lspci -v | grep Mellanox
    02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
Step 2. Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label><CPU arch>.iso. You can download it from http://www.mellanox.com > Products > IB SW/Drivers.
Step 3. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page:
    host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
2.3 Installing Mellanox OFED
The installation script, mlnxofedinstall, performs the following:
• Discovers the currently installed kernel
• Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
• Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
• Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware (the firmware will not be updated if you run the install script with the --without-fw-update option)
2.3.1 Pre-installation Notes
The installation script removes all previously installed Mellanox OF
219. st the defined match rules such that the target QoS-Level definition is found. Given the QoS-Level, a path(s) search is performed with the given restrictions imposed by that level.
4.8 Atomic Operations
4.8.1 Enhanced Atomic Operations
ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations are the same as those for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).
4.8.1.1 Masked Compare and Swap (MskCmpSwap)
The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64 bit target data for the "compare" check as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:
atomic_response = *va
if (!((compare_add ^ *va) & compare_add_mask)) then
    *va = (*va & ~(swap_mask)) | (swap & swap_mask)
return atomic_response
The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap are as for standard IB Atomic operations.
4.8.1.2 Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard I
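The masked compare-and-swap semantics of section 4.8.1.1 can be modeled in software to check how the two masks interact. The following is only a sketch of the arithmetic on a 64-bit word, not the ConnectX implementation or a verbs API; the function name `msk_cmp_swap` and the dictionary standing in for memory are hypothetical.

```python
MASK64 = (1 << 64) - 1  # the target data is 64 bits wide


def msk_cmp_swap(memory, addr, compare, compare_mask, swap, swap_mask):
    """Software model of MskCmpSwap on a dict used as memory (hypothetical helper).

    Only the bits selected by compare_mask take part in the comparison, and
    only the bits selected by swap_mask are replaced. The original value is
    always returned, as in the atomic response.
    """
    old = memory[addr] & MASK64
    if ((compare ^ old) & compare_mask) == 0:
        memory[addr] = ((old & ~swap_mask) | (swap & swap_mask)) & MASK64
    return old
```

For example, comparing only the low byte and swapping only the second byte leaves every other bit of the target untouched; a mismatch in the masked compare leaves the whole word unchanged.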
220. stalling the card in an x4 slot will significantly limit bandwidth.
To obtain the current setting for Max_Read_Req, enter:
setpci -d "15b3:" 68.w
To obtain the PCI Express slot (link) width and speed, enter:
setpci -d "15b3:" 72.B
1. If the output is neither 81 nor 82, then the card is NOT installed in an x8 PCI Express slot.
2. The least significant digit indicates the link speed:
   1 for PCI Express Gen 1 (2.5 GT/s)
   2 for PCI Express Gen 2 (5 GT/s)
Note: If you are running InfiniBand at QDR (40Gb/s 4X IB ports), you must run PCI Express Gen 2.
6.3.2 InfiniBand Performance Troubleshooting
InfiniBand (IB) performance depends on the health of the IB link(s) and on the IB card type. IB link speed (10Gb/s or SDR, 20Gb/s or DDR, 40Gb/s or QDR, 56Gb/s or FDR) also affects performance.
Note: A latency sensitive application should take into account that each switch on the path adds 200nsec at SDR and 150nsec for DDR.
1. To check the IB link speed, enter:
ibstat
Check the value indicated after the "Rate:" string: 10 indicates SDR, 20 indicates DDR, 40 indicates QDR, and 56 indicates FDR.
2. Check that the link has NO symbol errors, since these errors result in the re-transmission of packets and therefore in bandwidth loss. This check should be conducted for each port after the driver is loaded. To check for symbol errors ent
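The reading of the link width and speed byte described above (81 or 82 for an x8 slot, least significant digit giving the speed) can be sketched as a small decoder. This is an illustration only: the function name is hypothetical, and treating the upper hex digit as the link width is an assumption that agrees with the 81/82 rule but is not stated in the text.

```python
def decode_link_byte(hex_byte):
    """Decode the byte printed for the link width/speed query (hypothetical helper).

    Assumption: the upper hex digit is the link width (8 for x8) and the lower
    hex digit is the speed code described in the text (1 = Gen 1, 2 = Gen 2).
    """
    value = int(hex_byte, 16)
    width = value >> 4
    speed_code = value & 0xF
    speeds = {1: "Gen 1 (2.5 GT/s)", 2: "Gen 2 (5 GT/s)"}
    return width, speeds.get(speed_code, "unknown")


def is_x8(hex_byte):
    """True when the byte is 81 or 82, i.e. the card sits in an x8 slot."""
    return decode_link_byte(hex_byte)[0] == 8
```

A value of 82 decodes to an x8 link at Gen 2, which is the configuration the note requires for QDR.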
221. subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
An IPoIB configuration can be based on DHCP (Section 4.6.3.1) or on a static configuration (Section 4.6.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.6.3.3).
4.6.3.1 IPoIB Configuration Based on DHCP
Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:
For RedHat: BOOTPROTO=dhcp
For SLES: BOOTPROTO='dhcp'
If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine
Note: A patch for DHCP is required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DH
222. t Manager
9.1 Overview
OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).
9.2 opensm Description
opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet.
opensm also provides an experimental version of a performance manager.
opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.
By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will
223. t for the default rule, which is applied only if the query didn't match any other rule.
All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section.
Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.
9.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition.
The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
ipoib : <SL>
ipoib, pkey 0x7fff : <SL>
any, pkey 0x7fff : <SL>
9.6.6.2 SDP
An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:
sdp : <SL>
any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
9.6.6.3 RDS
Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP
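The SDP Service ID layout described in section 9.6.6.2 (0x000000000001PPPP, with PPPP holding the remote TCP/IP port) can be illustrated with a short sketch. The helper names below are hypothetical and are not part of OpenSM; the code only reproduces the arithmetic implied by the stated format and range.

```python
SDP_SERVICE_ID_BASE = 0x0000000000010000  # 0x000000000001PPPP with PPPP = 0000


def sdp_service_id(tcp_port):
    """Build the SDP Service ID for a remote TCP/IP port (hypothetical helper)."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("TCP port must fit in 4 hex digits")
    return SDP_SERVICE_ID_BASE | tcp_port


def is_sdp_service_id(service_id):
    """Check membership in the range 0x0000000000010000-0x000000000001ffff."""
    return SDP_SERVICE_ID_BASE <= service_id <= (SDP_SERVICE_ID_BASE | 0xFFFF)
```

Every value produced by `sdp_service_id` falls inside the range used by the second match rule, which is why the two SDP rules are equivalent.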
224. t note
• It is possible to split a partition configuration into more than one definition, but then the PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions).
Examples:
Default=0x7fff : ALL, SELF=full ;
NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
YetAnotherOne = 0x300 : SELF=full ;
YetAnotherOne = 0x300 : ALL=limited ;
ShareIO = 0x80 , defmember=full : 0x123451, 0x123452 ;
# 0x123453, 0x123454 will be limited
ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full ;
# 0x123456, 0x123457 will be limited
ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full ;
ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a ;
ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d ;
The following rule is equivalent to how OpenSM used to run prior to the partition manager:
Default=0x7fff,ipoib:ALL=full ;
9.5 Routing Algorithms
OpenSM offers six routing engines:
1. Min Hop Algorithm
Based on the minimum hops to each node, where the path length is optimized.
2. UPDN Algorithm
Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree and a deadlock may occur due to a loop in the subnet.
3. Fat-tree Routing Algorithm
This alg
225. t service is Reliable Connected (RC) by default, but it may also be configured to be Unreliable Datagram (UD). The interface supports unicast, multicast and broadcast. For details, see Chapter 4.6, "IP over InfiniBand".
RoCE: RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over Ethernet networks. It encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type.
RDS: Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. For more details, see Chapter 4.2, "Reliable Datagram Sockets".
SDP: Sockets Direct Protocol (SDP) is a byte-stream transport protocol that provides TCP stream semantics. SDP utilizes InfiniBand's advanced protocol offload capabilities. Because of this, SDP can have lower CPU and memory bandwidth utilization when compared to conventional implementations of TCP, while preserving the TCP APIs and semantics upon which most current network applications depend. For more details, see Chapter 4.3, "Sockets Direct Protocol".
SRP: SRP (SCSI RDMA Protocol) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs
226. te list of configuration files, please refer to MLNX_OFED_configuration_files.txt.
4 Driver Features
4.1 RDMA over Converged Ethernet
4.1.1 RoCE Overview
RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over Ethernet networks. It encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type. While the use of GRH is optional within IB subnets, it is mandatory when using RoCE. Verbs applications written over IB verbs should work seamlessly, but they require provisioning of GRH information when creating address vectors. The library and driver are modified to provide for mapping from GID to MAC addresses required by the hardware.
4.1.2 Software Dependencies
In order to use RoCE over Mellanox ConnectX(R) hardware, the mlx4_en driver must be loaded. Please refer to MLNX_EN_README.txt for further details.
4.1.3 Firmware Dependencies
RoCE over Mellanox ConnectX(R) hardware requires ConnectX firmware version 2.7.000 or higher. Features such as loopback require higher firmware versions.
4.1.4 General Guidelines
Since RoCE encapsulates InfiniBand traffic in Ethernet frames, the corresponding net device must be up and running. In the case of Mellanox hardware, mlx4_en must be loaded and the corresponding interface configured.
• Make sure that mlx4_en.ko is loaded. To verify the module is loaded run
227. te system shutdown. Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases:
1. Without High Availability
When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.
2. After Manual Activation of High Availability
If you manually activated SRP High Availability, perform the following steps:
a. Unmount all SRP partitions that were mounted.
b. Kill the SRP daemon instances.
c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
d. Run: multipath -F
3. After Automatic Activation of High Availability
If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
4.5 Ethernet over IB (EoIB) vNic
The Ethernet over IB (EoIB) mlx4_vnic module is a network interface implementation over InfiniBand. EoIB encapsulates Layer 2 datagrams over an InfiniBand Datagram (UD) transport service. The InfiniBand UD datagram encapsulates the entire Ethernet L2 datagram and its payload.
To perform this operation, the module performs an address translation from Ethernet layer 2 MAC addresses (48 bits long) to InfiniBand layer 2 addresses m
228. terface (subinterface), run:
echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child
Using the example of Step 2:
echo 0x8000 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8000 in the example above).
4.6.5 Verifying IPoIB Functionality
To verify your configuration and your IPoIB functionality, perform the following steps:
Step 1. Verify the IPoIB functionality by using the ifconfig command.
The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:
host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0
Step 2. Enter the ping command from 11.4.3.175 to 11.4.3.176.
The following example shows how to enter the ping command:
host1# ping -c 5 11.4.3.176
PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
--- 11.4.3.176 ping statistics ---
5 packets transmitted, 5 rece
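The rule for deleting a child interface (the PKey must be written with its most significant bit set) can be sketched as follows. The helper names are hypothetical, and the 16-bit PKey width is an assumption taken from the InfiniBand definition of a PKey rather than from this section; the sysfs path is the one shown in the delete_child example.

```python
def pkey_with_msb(pkey):
    """Return the 16-bit PKey with its most significant bit set (hypothetical helper)."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKey is assumed to be a 16-bit value")
    return pkey | 0x8000


def delete_child_command(ib_interface, pkey):
    """Format the shell command that removes the child interface for a PKey."""
    return "echo 0x{:04x} > /sys/class/net/{}/delete_child".format(
        pkey_with_msb(pkey), ib_interface
    )
```

For instance, `delete_child_command("ib0", 0x0)` yields the same command as the ib0 example in the text.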
229. th a specific vHub group. This vHub group is connected to a BridgeX external port and has a VLAN tag attribute. When creating/configuring a vNic, you define the VLAN tag it will use via the vid or the VNICVLAN fields (if these fields are absent, the vNic will not have a VLAN tag). The vNic's VLAN tag will be present in all EoIB packets sent by the vNic and will be verified on all packets received on the vNic. When a packet is passed from InfiniBand to Ethernet, the EoIB encapsulation will be disassembled but the VLAN tag will remain.
For example, if the vNic "eth23" is associated with a vHub that uses BridgeX "bridge01", eport "A10" and VLAN tag 8, all incoming and outgoing traffic on eth23 will use a VLAN tag of 8. This will be enforced by both BridgeX and destination hosts. When a packet is passed from the internal fabric to the Ethernet subnet through the BridgeX, it will have a "true" Ethernet VLAN tag of 8.
The VLAN implementation used by EoIB uses operating systems unaware of VLANs. This is in many ways similar to switch tagging, in which an external Ethernet switch adds/strips tags on traffic, preventing the need for OS intervention. EoIB does not support OS-aware VLANs in the form of vconfig.
Configuring VLANs
To configure a VLAN tag for a vNic, add the VLAN tag property to the configuration file in host administrated mode, or configure the vNic on the appropriate vHub in network administ
230. that is currently quiescent and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
Subnet Manager (SM): One of several entities involved in the configuration and control of the subnet.
Unicast Linear Forwarding Tables (LFT): A table that exists in every switch providing the port through which packets should be sent to each LID.
Virtual Protocol Interconnect (VPI): A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).
Related Documentation
Table 4 - Reference Documents
Document Name | Description
InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 | The InfiniBand Architecture Specification that is provided by IBTA
IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002), Document # PDF: SS94996 | Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers and Management
231. the Port GUID: Method I
To obtain the port GUID, run the following commands. (The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.)
host1# mst start
host1# mst status
The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via the following query command:
flint -d <MST_DEVICE_NAME> q
Example with ConnectX-2 QDR (MHJH29B-XTR Dual 4X IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R6 HCA Card, CX4 Connectors) as the adapter device:
Image type: ConnectX
FW Version: 2.9.1000
Rom Info: type=PXE version=3.3.400 devid=26428 proto=VPI
Device ID: 26428
Description: Node Port1 Port2 Sys image
GUIDs: 0002c9030005cffa 0002c9030005cffb 0002c9030005cffc 0002c9030005cffd
MACs: 0002c905cffa 0002c905cffb
Board ID: (MT_0DD0110009)
VSD:
PSID: MT_0DD0110009
Assuming that FlexBoot is connected via Port 1, then the Port GUID is 00:02:c9:03:00:05:cf:fb.
Extracting the Port GUID: Method II
An alternative method for obtaining the port GUID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 8 bytes can be captured from the boot session as shown in the figure below.
Mellanox ConnectX FlexBoot v3.3.400
iPXE 1.0.0+ -- Open Source Network Boot Firmware
net0: 00:02:c9:03:00:0c:78:11 on PCI02:00
232. the closer the tree is to be fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.
The algorithm also dumps the compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.
9.5.4.1 Routing between non-CN Nodes
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.
[Figure: Spine1, Spine2 and Spine3 above switches with the non-CN nodes N1, N2 and N3 attached; links going down to compute nodes]
Footnotes:
1. Ports that are connected to the same remote switch are referenced as a "port group".
2. The list of compute nodes (CNs) can be specified by the '-u' or '--cn_guid_file' OpenSM options.
To solve this problem, a list of non-CN nodes can be
233. the device mlx4_0 and print user-available information for its Port 2.
> ibv_devinfo -d mlx4_0 -i 2
hca_id: mlx4_0
    fw_ver: 2.5.944
    node_guid: 0000:0000:0007:3895
    sys_image_guid: 0000:0000:0007:3898
    vendor_id: 0x02c9
    vendor_part_id: 25418
    hw_ver: 0xA0
    board_id: MT_04A0140005
    phys_port_cnt: 2
        port: 2
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 1
            port_lid: 1
            port_lmc: 0x00
10.8 ibdev2netdev
ibdev2netdev enables association between IB devices and ports and the associated net device. Additionally, it reports the state of the net device link.
10.8.1 SYNOPSIS
ibdev2netdev [-v] [-h]
OPTIONS
-v  Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
-h  Print help messages.
Example:
sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> eth5 (Down)
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> ib0 (Down)
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 2 (DOWN) ==> ib1 (Down)
mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 1 (DOWN) ==> eth2 (Down)
mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN) ==> eth3 (Down)
sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev
mlx4_0
234. the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.
4.7.2 QoS Architecture
QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the "chronology approach" to describe how the overall system works.
1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application or ULP or service uses.
2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class, Service-ID, PKey against the policy, so clients (ULPs, programs) can obtain a policy enforced QoS. The SM may also set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.
4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
5. ULPs that use the CM interface (like SRP) have their own pre-assigned Ser
235. the switch with node GUID 0x2001 would point in the positive x direction, while
xm_link 0x2000 0x2001
specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch.
In general, it is not necessary to configure both the positive and negative directions for a given coordinate; either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases, both the positive and negative coordinate directions must be specified.
Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration.
x_dateline position
y_dateline position
z_dateline position
In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior.
The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinate radix-1 for that dimension. By default, the common switch in a torus seed is take
236. the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and it will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.
9.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines. Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations.
For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS.
Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl qos s
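The statement in section 9.5.7.4 that torus-2QoS honors only the high-order bit of a unicast SL value can be illustrated with a one-line mapping. This is a sketch, not OpenSM code: the function name is hypothetical, and it assumes the SL is the 4-bit field defined by the InfiniBand specification, so the high-order bit is bit 3.

```python
def torus_2qos_level(sl):
    """Map a 4-bit SL to the QoS level torus-2QoS would honor (hypothetical helper).

    Only the high-order bit of the SL is kept, which leaves two levels: 0 and 1.
    """
    if not 0 <= sl <= 15:
        raise ValueError("SL is assumed to be a 4-bit value (0-15)")
    return (sl >> 3) & 1
```

Under this mapping SL values 0 through 7 collapse to one level and 8 through 15 to the other, which is consistent with the recommendation to use only SL 0 and SL 8 for multicast.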
237. tion will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.
• To set the values for packet rate limits and for moderation time high and low values, use the following command:
> ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
• To set interrupt coalescing settings when adaptive moderation is disabled, use:
> ethtool -C eth<x> [rx-usecs N] [rx-frames N]
Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.
• To query pause frame settings, run:
> ethtool -a eth<x>
• To set pause frame settings, run:
> ethtool -A eth<x> [rx on|off] [tx on|off]
• To query ring size values, run:
> ethtool -g eth<x>
• To modify ring sizes, run:
> ethtool -G eth<x> [rx <N>] [tx <N>]
• To obtain additional device statistics, run:
> ethtool -S eth<x>
• To perform a self diagnostics test, run:
> ethtool -t eth<x>
• The mlx4_en parameters can be found under /sys/module/mlx4_en (or /sys/module/mlx4_en/parameters, depending on the OS) and can be listed using the command:
> modinfo mlx4_en
To set non-default values to module parameters, the following line should be added t
238. tn EE 0 REV SOME SEE LON Si Sco e oletens 0 MINKIN TE Oj SAS ESO S e neboo de 0 EXCBUOVE REUNA TRON SE n 0 VIPS DOOD eCENaayeve ee E E screens 0 HALENS BEIRA arter OA ener TORSO 0 OV Ao bons OOOH OUOS HCO OOO DUE 0 KM EP KES A avihes wate Seve en Woe NAGE WAG 0 NOMAD ACS Hind oo Ao OOOO OUI OG A ONS 0 10 14ibcheckerrs Applicable Hardware All InfiniBand devices Description Validates an IB port or node and reports errors in counters above threshold Check specified port or node and report errors that surpassed their predefined threshold Port address is lid unless G option is used to specify a GUID address The predefined thresholds can be dumped using the s option and a user defined threshold_file using the same format as the dump can be specified using the t lt file gt option Synopsis iacheckerrto Fel S wl Ie Fr Schimasinolel Files l el SN lt meeollorel aC Ca nawel Er ca_port t timeout_ms lt lid guid gt lt port gt Table 22 lists the various flags of the command Table 22 ibcheckerrs Flags and Options Optional Default Flag cere r If Not Description Y Specified h hel Optional Print the help menu p p p Mellanox Technologies 181 Rev 1 5 3 3 1 0 InfiniBand Fabric Diagnostic Utilities Table 22 ibcheckerrs Flags and Options Optional Default Flag ote tak If Not Description y Specifie
239. to avoid deadlock.
Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between switch and HCA.
In more detail, the algorithm works as follows:
1. LASH determines the shortest path between all pairs of source/destination switches. Note: LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.
2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.
3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.
Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available.
In general, LASH is a very flexible algorithm. It
240. to disable Hyper-Threading.
6.2 Performance Tuning for Linux
You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results are significantly dependent on the CPU and chipset efficiency.
6.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
The following changes are recommended for improving IPv4 traffic performance:
• Disable the TCP timestamps option for better CPU utilization:
sysctl -w net.ipv4.tcp_timestamps=0
• Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0
• Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000
• Increase the TCP maximum and default buffer sizes using setsockopt():
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216
sysctl -w net.core.optmem_max=16777216
• Increase Linux's auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are:
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
6.2
241. ts semantics. In particular: DEVICE=ib0
• In the bonding slave configuration file (e.g., ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key).
• For RHEL users: In /etc/modprobe.d/bond.conf, add the following line:
alias bond0 bonding
• For SLES users: It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g., ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
Note: Restarting openibd does not keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart
4.7 Quality of Service
4.7.1 Quality of Service Overview
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
Figure 2: I/O Consolidation Over InfiniBand S
242. uSE machine
• The installation process unlimits the amount of memory that can be pinned by a user space application. See Step 5.
• Man pages will be installed under /usr/share/man/.
Firmware
• The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
1. You run the installation script in default mode, that is, without the option '--without-fw-update'.
2. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.
Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.
• In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
-I- Querying device
-E- Can't auto detect fw configuration file
2.3.5 Post-installation Notes
• Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.
• The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.
2.4 Updating Fir
243. IPoIB (UD and CM) and SDP: isolated from storage; Min BW of 50%
- SRP: Min BW 50%; bottleneck at storage nodes
Administration
- OpenSM QoS policy file. In the following policy file example, replace the SRPT placeholders with the real SRP Target port GUIDs.
qos-ulps
    default
    ipoib
    sdp
    srp, target-port-guid SRPT1,SRPT2,SRPT3
end-qos-ulps
- OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:32
qos_vlarb_low 0:1
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
9.7.3 EDC (3-tier): IPoIB, RDS, SRP
The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
QoS Levels
- Management traffic (ssh): IPoIB management VLAN (partition A); Min BW 10%
- Application traffic: IPoIB application VLAN (partition B); isolated from storage and database; Min BW of 30%
- Database Cluster traffic: RDS; Min BW of 30%
- SRP: Min BW 30%; bottleneck at storage nodes
Administration
- OpenSM QoS policy file. In the following policy file example, replace the SRPT placeholders with the real SRP Initiator port GUIDs.
qos-ulps
    default
    ipoib, pkey 0x8001
    ipoib, pkey 0x8002
    rds
    srp, target-port-guid SRPT1,SRPT2,SRPT3
end-qos-ulps
- OpenSM option
244. ult mode: less chance for packet loss, but it uses more memory. In this mode, ibdump stops after <size> bytes are captured.
--decap (Optional): Decapsulate port mirroring headers. Should be used when capturing RSPAN traffic.
Examples
1. Run ibdump.
> ibdump
IB device: "mlx4_0"
IB port: 1
Dump file: sniffer.pcap
Sniffer WQEs (max burst size): 4096
Appendix A: Mellanox FlexBoot
A.1 Overview
Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet. Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products. FlexBoot is based on the open source project iPXE, available at http://www.ipxe.org.
FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other servic
245. uning for Linux
6.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
6.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
6.2.3 Interrupt Moderation
6.2.4 Interrupt Affinity
6.2.5 Preserving Your Performance Settings After A Reboot
6.3 Performance Troubleshooting
6.3.1 PCI Express Performance Troubleshooting
6.3.2 InfiniBand Performance Troubleshooting
6.3.3 System Performance Troubleshooting
Chapter 7 MPI (Message Passing Interface)
7.1 Overview
7.2 Prerequisites for Running MPI
7.2.1 SSH Configuration
7.3 MPI Selector: Which MPI Runs
7.4 Compiling MPI Applications
Chapter 8 Mellanox Messaging
246. ut PCP or VLAN because the IP address used does not belong to a VLAN interface. If you specify a VLAN IP address, then traffic should go over VLAN.
Type Of Service (TOS)
The TOS field for rdma_cm sockets can be set using the rdma_set_option API, just as it is set for regular sockets. If the user does not set a TOS, the default value (0) will be used. Within the rdma_cm kernel driver, the TOS field is converted into an SL field. The conversion formula is as follows: SL = TOS >> 5 (i.e., take the 3 most significant bits of the TOS field). In the hardware driver, the SL field is converted into PCP by the following formula: PCP = SL & 7 (take the 3 least significant bits of the SL field).
Note: SL affects the PCP only when the traffic goes over tagged VLAN frames.
4.1.10 Configuring DAPL over RoCE
The default dat.conf file, which contains entries for the DAPL devices, does not contain entries for the DAPL over RDMA_CM over RoCE devices. To add the missing entries, perform the following:
Step 1. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the IB devices/ports.
Step 2. Add a new entry line, according to the format below, to the dat.conf file for each output line of the ibdev2netdev utility:
<IA Name> u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 <ethX> <port>
Parameter | Description | Example
<IA Name> | The device's IA name. The name must be _ of
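The TOS-to-SL-to-PCP conversion described in excerpt 246 can be illustrated with a short Python sketch. This is only an illustration of the two formulas quoted there (SL = TOS >> 5, PCP = SL & 7); the function names tos_to_sl and sl_to_pcp are hypothetical and are not part of rdma_cm or of any Mellanox API.

```python
def tos_to_sl(tos: int) -> int:
    """Map an 8-bit TOS value to an SL by keeping its 3 most significant bits."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in 8 bits")
    return tos >> 5


def sl_to_pcp(sl: int) -> int:
    """Map an SL to a VLAN PCP by keeping its 3 least significant bits."""
    if sl < 0:
        raise ValueError("SL must be non-negative")
    return sl & 7


if __name__ == "__main__":
    # Print the mapping for a few sample TOS values (hypothetical inputs).
    for tos in (0x00, 0x20, 0x60, 0xE0):
        sl = tos_to_sl(tos)
        print(f"TOS=0x{tos:02x} -> SL={sl} -> PCP={sl_to_pcp(sl)}")
```

Because an 8-bit TOS yields an SL in the range 0 to 7, the mask in the second step leaves the value unchanged in this path; it only matters if an SL above 7 reaches the hardware driver by some other route.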
247. vice ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE, and Lifetime.
6. ULPs and programs (e.g., SDP) use CMA to establish an RC connection and provide the CMA with the target IP and port number. ULPs might also provide a QoS Class. The CMA then creates a Service ID for the ULP and passes this ID and the optional QoS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.
PathRecord and MultiPathRecord Enhancement for QoS
As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided. A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for the existing access layer and ULPs by not introducing a new set of PR/MPR attributes.
4.7.3 Supported Policy
The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:
I. Port Group: A set of CAs, Routers, or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.
II. Fabric Setup: Defines how the SL2VL and VLArb tables
248. vid is specified or value 1 is set, the vNic will be assigned to the default vHub associated with the GW.
vnic_id: A unique number per vNic, between 0 and 16K.
bx: The BridgeX box system GUID or system name string.
eport: The string describing the eport name.
Table 2: mlx4_vnic.conf file format
Field | Description
pkey | Optional field. If the discovery_pkey module parameter is set, this value will control which partitions would be used to discover the gateways. For more information about discovery_pkeys, please refer to Section 4.5.3.6, "Discovery Partitions Configuration," on page 70.
vNic Specific Configuration Files: ifcfg-ethX
EoIB configuration can use the ifcfg-ethX files used by the network service to derive the needed configuration. In such a case, a separate file is required per vNic. Additionally, you need to update the ifcfg-ethX file and add some new attributes to it. On Red Hat Linux, the new file will be of the form:
DEVICE=eth2
HWADDR=00:30:48:7d:de:e4
BOOTPROTO=dhcp
ONBOOT=yes
BXADDR=BX001
BXEPORT=A10
VNICIBPORT=mlx4_0:1
VNICVLAN=3 (Optional field)
GW_PKEY=0xfff1
The fields used in the file for vNic configuration have the following meaning:
Table 3: Red Hat Linux mlx4_vnic.conf file format
Field | Description
DEVICE | An optional field. The name of the interface that is displayed when running ifconfig
249. vironment Variables
For the transparent integration with SDP, the following two environment variables are required:
1. LD_PRELOAD: this environment variable is used to preload libsdp.so, and it should point to the libsdp.so library. The variable should be set by the system administrator to /usr/lib/libsdp.so (or /usr/lib64/libsdp.so).
2. LIBSDP_CONFIG_FILE: this environment variable is used to configure the policy for replacing TCP sockets with SDP sockets. By default, it points to /etc/libsdp.conf.
3. SIMPLE_LIBSDP: ignore libsdp.conf and always use SDP.
4.3.5 Converting Socket-based Applications
You can convert a socket-based application to use SDP instead of TCP in an automatic (also called transparent) mode or in an explicit (also called non-transparent) mode.
Automatic (Transparent) Conversion
The libsdp.conf configuration (policy) file is used to control the automatic transparent replacement of TCP sockets with SDP sockets. In this mode, socket streams are converted based upon a destination port, a listening port, or a program name. Socket control statements in libsdp.conf allow the user to specify when libsdp should replace AF_INET/SOCK_STREAM sockets with AF_SDP/SOCK_STREAM sockets. Each control statement specifies a matching rule that applies if all its subexpressions evaluate as true (logical and). The use statement controls which type of sockets to open. The format of a use statement is as follows: us
250. we_sl2vl, etc.) must and will be ignored, and a warning will be generated.
Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the range 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.
9.5.7.5 Operational Considerations
Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement.
If a change in fabric topology causes changes in the path SL values required to route without credit loops, in general all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing by running applications is not required when the fabric is routed with torus-2QoS.
Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover provided t
251. x00004947 (0x00386c) BOOT2 - OK
0x00004948-0x000052c7 (0x000980) Configuration - OK
0x000052c8-0x0000530b (0x000044) GUID - OK
0x0000530c-0x0000542f (0x000124) Image Info - OK
0x00005430-0x0000634f (0x000f20) DDR - OK
0x00006350-0x0000f29b (0x008f4c) DDR - OK
0x0000f29c-0x0004749b (0x038200) DDR - OK
0x0004749c-0x0005913f (0x011ca4) DDR - OK
0x00059140-0x0007a123 (0x020fe4) DDR - OK
0x0007a124-0x0007bdff (0x001cdc) DDR - OK
0x0007be00-0x0007eb97 (0x002d98) DDR - OK
0x0007eb98-0x0007f0af (0x000518) Configuration - OK
0x0007f0b0-0x0007f0fb (0x00004c) Jump addresses - OK
0x0007f0fc-0x0007f2a7 (0x0001ac) FW Configuration - OK
FW image verification succeeded. Image is bootable.
10.16 ibv_asyncwatch
Applicable Hardware: All InfiniBand devices
Description: Display asynchronous events forwarded to userspace for an InfiniBand device.
Synopsis: ibv_asyncwatch
Examples
1. Display asynchronous events.
> ibv_asyncwatch
mlx4_0: async event FD 4
10.17 ibdump
Applicable Hardware: Mellanox ConnectX/ConnectX-2/ConnectX-3 adapter devices
Description: Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX/ConnectX-2 adapters' InfiniBand ports. The dump file can be loaded by the Wireshark tool for graphical tr
252. you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:
dhclient -cf <client conf file> <IB network interface name>
Example of a configuration file for ConnectX (PCI Device ID 26428), called dhclient.conf. The value indicates a hexadecimal number.
interface "ib1" {
    send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf. The value indicates a hexadecimal number.
interface "ib1" {
    send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
}
In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1
4.6.3.2 Static IPoIB Configuration
If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the '-n' option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
- A static IPoIB configuration
- An IPoIB configuration based on an Ethernet configuration
See your Linux distribution documentation for additional information about configuring IP addresses.
The following code lines are an excerpt from a sample IPoIB configuration file: Static settings
253. ywhere, and there is no need to define them. And since this policy file does not have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.
The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the following example. They explain the different keywords and their meaning.
port-groups
    port-group # using port GUIDs
        name: Storage
        # "use" is just a description that is used for logging
        # Other than that, it is just a comment
        use: SRP Targets
        port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
        port-guid: 0x1000000000FFFF
    end-port-group
    port-group
        name: Virtual Servers
        # The syntax of the port name is as follows:
        # "node_description/Pnum".
        # node_description is compared to the NodeDescription of the node,
        # and "Pnum" is a port number on that node.
        port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
    end-port-group
    # using partitions defined in the partition policy
    port-group
        name: Partitions
        partition: Part1
        pkey: 0x1234
    end-port-group
    # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
    # or ALL (for all the nodes in the subnet)
show matching
254. zes device-specific health queries for the different devices along the path.
The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted, as it may change.
ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in the -p option. Otherwise, an error is reported.
Note: When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If, along the LID route, one or more links are not in the ACTIVE state, ibdiagpath reports an error.
Moreover, the tool allows omitting the source node in LID-route add
