Mellanox OFED Linux User's Manual
Contents
1. [...]
9.17 ibdump
Appendix A: Mellanox FlexBoot
A.1 Overview
A.2 Burning the Expansion ROM Image
A.3 Preparing the DHCP Server in Linux Environment
A.4 Subnet Manager - OpenSM
A.5 TFTP Server
A.6 BIOS Configuration
A.7 Operation
A.8 Command Line Interface (CLI)
A.9 Diskless Machines
A.10 iSCSI Boot
A.11 [...]
Appendix B: SRP Target Driver
B.1 Prerequisites and Installation
B.2 How-to Run
B.3 How-to Unload/Shutdown
Appendix C: mlx4 Module Parameters
C.1 mlx4_core Parameters
C.2 [...]
C.3 [...]
C.4 [...]
Appendix D: ib-bonding Driver for Systems Using SLES10 SP4
D.1 Using the ib-bonding Driver
Mellanox OFED for Linux User's Manual, Rev 1.5.3-1.0.0
List of Tables [...]
2. [Figure: master spanning tree on the 2D 6x5 torus; x axis 0 to 5]
Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing.
In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case:
[Figure: master spanning tree on the 2D 6x5 torus with the switch at (3,2) failed]
Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops, as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example.
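The dateline condition used in this passage can be made concrete with a small sketch. It is not taken from the torus-2QoS implementation: the function name and the convention that the dateline sits between coordinate radix-1 and coordinate 0 are assumptions chosen to match the two examples in the text (x dateline between x=5 and x=0, y dateline between y=4 and y=0).

```python
def crosses_dateline(src, dst, radix):
    """Return True if the single hop src -> dst on a 1D ring wraps around
    between coordinate radix-1 and coordinate 0 (the assumed dateline).

    Hypothetical helper for illustration; assumes radix >= 3 and that
    src and dst are neighbours on the ring.
    """
    if radix < 3:
        raise ValueError("sketch assumes a ring with at least 3 switches")
    if (dst - src) % radix not in (1, radix - 1):
        raise ValueError("src and dst are not adjacent on the ring")
    # Only the wrap-around link joins coordinates 0 and radix-1.
    return {src, dst} == {0, radix - 1}
```

For the 6x5 torus in the text, a hop from x=5 to x=0 crosses the x dateline, while a hop from x=2 to x=3 does not.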
3. [...]
• To obtain additional device statistics, run:
  > ethtool -S eth<x>
• To perform a self-diagnostics test, run:
  > ethtool -t eth<x>
• The mlx4_en parameters can be found under /sys/module/mlx4_en (or /sys/module/mlx4_en/parameters, depending on the OS) and can be listed using the command:
  > modinfo mlx4_en
  To set non-default values to module parameters, the following line should be added to the file /etc/modprobe.conf:
  "options mlx4_en <param_name>=<value> <param_name>=<value> ..."
6 Performance
6.1 General System Configurations
The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.
6.1.1 PCI Express (PCIe) Capabilities
Table 5 - Recommended PCIe Configuration
  PCIe Generation     [...]
  Speed               [...]
  Width               [...]
  Max Payload size    [...]
  Max Read Request    [...]
Note: For VPI/Ethernet adapters with ports configured to run 40Gb/s or above, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the system.
6.1.2 BIOS Power Management Settings
• Set BIOS power management to Maximum Performance
• On Intel Processors Only: Disable C-states of PCI Express. Note that
4. Step 13. At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_init_ib.img.gz. Copy it to the original initrd location and rename it properly.
A.9.2 Case II: Ethernet Ports
The Ethernet driver requires loading the following modules in the specified order (see the example below):
• mlx4_core.ko
• mlx4_en.ko
A.9.2.1 Example: Adding an Ethernet Driver to initrd (Linux)
Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.7.3.1 on page 78, and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.
Adding the Ethernet Driver to the initrd File
Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.
Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it.
  host1$ mkdir /tmp/initrd_en
  host1$ cd /tmp/initrd_en
Step 3. Normally, the initrd image is zipped. Extract it using the fol
5. The default is to visit CA ports in increasing port order on destination switches. Duplicate values in the list will be ignored.
EXAMPLE
  # Look for a 2D (since x radix is one) 4x5 torus.
  torus 1 4 5
  # y is radix-4 torus dimension, need both
  # ym_link and yp_link configuration.
  yp_link 0x200000 0x200005  # sw @ y=0,z=0 -> sw @ y=1,z=0
  ym_link 0x200000 0x20000f  # sw @ y=0,z=0 -> sw @ y=3,z=0
  # z is not radix-4 torus dimension, only need one of
  # zm_link or zp_link configuration.
  zp_link 0x200000 0x200001  # sw @ y=0,z=0 -> sw @ y=0,z=1
  next_seed
  yp_link 0x20000b 0x200010  # sw @ y=2,z=1 -> sw @ y=3,z=1
  ym_link 0x20000b 0x200006  # sw @ y=2,z=1 -> sw @ y=1,z=1
  zp_link 0x20000b 0x20000c  # sw @ y=2,z=1 -> sw @ y=2,z=2
  y_dateline -2  # Move the dateline for this seed
  z_dateline -1  # back to its original orientation.
  # If OpenSM failover is configured, for maximum resiliency
  # one instance should run on a host attached to a switch
  # from the first seed, and another instance should run
  # on a host attached to a switch from the second seed.
  # Both instances should use this torus-2QoS.conf to ensure
  # path SL values do not change in the event of SM failover.
  # port_order defines the order in which the ports would be
  # chosen for routing.
  port_order [...]
8.6 Quality of Service Management in OpenSM
8.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the -Q or --qos flags), OpenSM looks for a QoS Poli
6. To create the vHBA, enter:
  > echo [...]
4.2.3.3 Creating vHBAs That Use Link Pause
The mlx4_en Ethernet driver supports link pause by default. To change this setting, you can use the following command:
  > [...]
To create a vHBA, run:
  > echo [...]
4.3 Reliable Datagram Sockets
4.3.1 Overview
Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. RDS is intended for use with Oracle RAC 11g. For programming details, enter:
  host1$ man rds
4.3.2 RDS Configuration
The RDS ULP is installed as part of Mellanox OFED for Linux. To load the RDS module upon boot, edit the file /etc/infiniband/openib.conf and set "RDS_LOAD=yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
4.4 Sockets Direct Protocol
4.4.1 Overview
Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced protocol offload capabilities, SDP can provide lower latency, higher bandwidth, and lower CPU utilization than IPoIB or Ethernet running some sockets-based applications. SDP can be used by applications and improve their performance transparently (that is, without any recompilation). Since SDP has the same socket semantics as TCP, an exi
7. /etc/infiniband/mlx4_vnic.conf
The mlx4_vnic.conf file consists of lines, each describing one vNic. The following file format is used:
  [...]
The fields used in the file have the following meaning:
Table 2 - mlx4_vnic.conf file format
  Field     Description
  name      The name of the interface that is displayed when running ifconfig.
  mac       The MAC address to assign to the vNic.
  ib_port   The device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value of 1 or 2.
  vid       VLAN ID (an optional field). If it exists, the vNic will be assigned the VLAN ID specified. This value must be between 0 and 4095. If no vid is specified or value -1 is set, the vNic will be assigned to the default vHub associated with the GW.
  vnic_id   A unique number per vNic, between 0 and 16K.
  bx        The BridgeX box system GUID or system name string.
  eport     The string describing the eport name.
vNic Specific Configuration Files - ifcfg-ethX
EoIB configuration can use the ifcfg-ethX files used by the network service to derive the needed configuration. In such case, a separate file is required per vNic. Additionally, you need to update the ifcfg-ethX file and add some new attributes to it.
On Red Hat Linux, the new file will be of the form:
  DEVICE=eth2
  HWADDR=00:30:48:7d:de:e4
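The field constraints in the table above lend themselves to a quick sanity check before a file is deployed. The sketch below is illustrative only: the exact line syntax is not legible in this excerpt, so a whitespace-separated `key=value` layout is assumed, "16K" is read as 16384, and the function name `parse_vnic_line` is hypothetical. Only a missing `vid` is treated as "default vHub"; any explicit sentinel value is not modelled.

```python
MAX_VID = 4095            # from the field table: vid is between 0 and 4095
MAX_VNIC_ID = 16 * 1024   # assumption: "16K" means 16384

def parse_vnic_line(line):
    """Parse one assumed 'key=value key=value ...' vNic line and validate
    the fields whose ranges are stated in the field table."""
    fields = dict(token.split("=", 1) for token in line.split())

    # ib_port has the form <device name>:<port number>, port 1 or 2.
    device, _, port = fields["ib_port"].partition(":")
    if not device or port not in ("1", "2"):
        raise ValueError("ib_port must be <device name>:<1|2>")

    # vid is optional; when absent the vNic uses the gateway's default vHub.
    vid = fields.get("vid")
    if vid is not None and not 0 <= int(vid) <= MAX_VID:
        raise ValueError("vid must be between 0 and 4095")

    if not 0 <= int(fields["vnic_id"]) <= MAX_VNIC_ID:
        raise ValueError("vnic_id must be between 0 and 16K")

    fields["default_vhub"] = vid is None
    return fields
```

A line such as `name=eth44 mac=00:25:8b:27:14:78 ib_port=mlx4_0:1 vid=3 vnic_id=5 bx=BX001 eport=A10` (all values made up for the example) passes the checks, while a `vid` of 5000 is rejected.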
8. 2.3.2.1 mlnxofedinstall Return Codes
Table 1 lists the mlnxofedinstall script return codes and their meanings.
Table 1 - mlnxofedinstall Return Codes
  Return Code   Meaning
  [...]         The installation ended successfully
  [...]         No firmware was found for the adapter device
  [...]         Failed to start the mst driver
  [...]         The installation failed
2.3.3 Installation Procedure
Step 1. Log in to the installation machine as root.
Step 2. Mount the ISO image on your machine:
  host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt
Step 3. Run the installation script:
  mlnxofedinstall
  This program will install the MLNX_OFED_LINUX package on your machine.
  Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
  Uninstalling the previous version of OFED
  [...]
  Installing mlnx-ofa_kernel RPM
  Preparing...
  mlnx-ofa_kernel
  Installing kmod-mlnx-ofa_kernel RPM
  Preparing...
  kmod-mlnx-ofa_kernel
  Installing mlnx-ofa_kernel-devel RPM
  Preparing...
  mlnx-ofa_kernel-devel
  Installing kernel-mft RPM
  Preparing...
  kernel-mft
  Installing mlx4_accl_sys RPM
  Preparing...
  [...]
  Installing mlx4_accl RPM
  Preparing...
  [...]
  Installing mpi
9. 4. If some switches do not support AR, they will slow down the AR Manager, as it may get timeouts on the AR-related queries to these switches.
8.8.2 Installing the Adaptive Routing
Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of the Mellanox OFED installation.
8.8.3 Running Subnet Manager with Adaptive Routing Manager
Adaptive Routing (AR) Manager can be enabled or disabled through the SM options file.
8.8.3.1 Enabling Adaptive Routing
To enable Adaptive Routing, perform the following:
1. Create the Subnet Manager options file. Run:
  opensm -c <options-file-name>
2. Add 'armgr' to the 'event_plugin_name' option in the file:
  # Event plugin name(s)
  event_plugin_name armgr
3. Run Subnet Manager with the new options file:
  opensm -F <options-file-name>
Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf.
To provide an alternative location, perform the following:
1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the 'event_plugin_options' option in the file:
  # Options string that would be passed to the plugin(s)
  event pl
10. A.1.2 Tested Platforms
See the Mellanox FlexBoot Release Notes (FlexBoot_release_notes.txt).
A.1.3 FlexBoot in Mellanox OFED
The FlexBoot package is provided as a tarball (.tgz extension) containing the files specified in Appendix A.1.1, "Supported Mellanox Adapter Devices and Firmware" (page 198):
1. A PXE ROM image file for each of the supported Mellanox network adapter devices. Specifically, the following images are included:
  ConnectX / ConnectX-2 / ConnectX-3 images:
  ConnectX_FlexBoot_<PCI Device ID>_ROM-<version>.mrom
  where the number after the "ConnectX_FlexBoot" prefix indicates the corresponding PCI Device ID of the ConnectX / ConnectX-2 / ConnectX-3 device.
2. Additional documents under docs/dhcp.
A.2 Burning the Expansion ROM Image
A.2.1 Burning the Image on ConnectX / ConnectX-2 / ConnectX-3
Note: This section is valid for ConnectX / ConnectX-2 devices with firmware versions 2.8.0600 or later, and ConnectX-3 firmware.
Prerequisites
1. Expansion ROM Image
  The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file FlexBoot_release_notes.txt.
2. Firmware Burning Tools
  You need to install the Mellanox Firmware Tools (MFT) package (version 2.7.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under
11. /tmp/libsdp.log.<pid> (root log goes to /var/log/libsdp.log for this example) by including the following statement in libsdp.conf:
  log min-level 2 destination file libsdp.log
To print errors only to syslog, include the following statement:
  log min-level 9 destination syslog
To print maximum output to the file /tmp/sdp_debug.log.<pid>, include the following statement:
  log min-level 1 destination file sdp_debug.log
Kernel Space SDP Debug
The SDP kernel module can log detailed trace information if you enable it using the 'debug_level' variable in the sysfs filesystem. The following command performs this:
  host1$ echo 1 > /sys/module/ib_sdp/debug_level
Note: Depending on the operating system distribution on your machine, you may need an extra level (parameters) in the directory structure, so you may need to direct the echo command to /sys/module/ib_sdp/parameters/debug_level.
Turning off kernel debug is done by setting the sysfs variable to zero using the following command:
  host1$ echo 0 > /sys/module/ib_sdp/debug_level
To display debug information, use the dmesg command:
  host1$ dmesg
4.4.4 Environment Variables
For the transparent integration with SDP, the following two environment variables are required:
1. LD_PRELOAD - this environment variable is used to preload libsdp.so and it should point to the libsdp.so library. The variable should be set by the system administrator to /usr/lib/libsdp.so (or /usr/lib64
12. 2. If not found, the first port that is UP (physical link state is LinkUp).
Examples
1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID, 3 in this case) to obtain additional link information using ibportstate.
  [...]
  default gid:  fe80:0000:0000:0000:0000:0000:9289:3895
  base lid:     0x3
  sm lid:       0x3
  state:        [...]
  phys state:   [...]
  rate:         20 Gb/sec (4X DDR)
  [...]
  (ibportstate output: link state Initialize, physical link state LinkUp, link widths "1X or 4X" supported and enabled with 4X active, link speeds of 2.5 Gbps or 5.0 Gbps; the field-by-field listing is not recoverable)
2. Query the status of two channel adapters using directed paths.
  [...]
3. Change the speed of a port.
  [...]
Show the new configuration:
  [...]
13. [...]
• To query stateless offload status, run:
  > ethtool -k eth<x>
• To set stateless offload status, run:
  > ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off]
• To query interrupt coalescing settings, run:
  > ethtool -c eth<x>
• By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern. To enable or disable adaptive interrupt moderation, use the following command:
  > ethtool -C eth<x> adaptive-rx on|off
• Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value. To set the values for packet rate limits and for moderation time high and low values, use the following command:
  > ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
• To set interrupt coalescing settings when adaptive moderation is disabled, use:
  > ethtool -C eth<x> [rx-usecs N] [rx-frames N]
  Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.
• To query pause frame settings, run:
  > ethtool -a eth<x>
• To set pause frame settings, run:
  > ethtool -A eth<x> [rx on|off] [tx on|off]
• To query ring size values, run:
  > ethtool -g eth<x>
• To modify rings size, run:
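The relationship between the packet-rate limits and the moderation time can be sketched as a small function. This is not the driver's code: the text only states the behaviour above the upper limit and below the lower limit, so the linear interpolation between the two limits is an assumption, and the function and parameter names are hypothetical (they mirror the ethtool option names pkt-rate-low, pkt-rate-high, rx-usecs-low and rx-usecs-high).

```python
def moderation_usecs(pkt_rate, pkt_rate_low, pkt_rate_high,
                     rx_usecs_low, rx_usecs_high):
    """Pick a receive moderation time (in microseconds) for a measured
    packet rate (packets per second)."""
    if pkt_rate >= pkt_rate_high:
        # At or above the upper rate limit: use the highest moderation time.
        return rx_usecs_high
    if pkt_rate <= pkt_rate_low:
        # At or below the lower rate limit: use the lowest moderation time.
        return rx_usecs_low
    # Between the limits: linear interpolation (assumption, not from the text).
    fraction = (pkt_rate - pkt_rate_low) / (pkt_rate_high - pkt_rate_low)
    return rx_usecs_low + fraction * (rx_usecs_high - rx_usecs_low)
```

With made-up limits of 100000 and 400000 packets per second and moderation times of 0 and 128 microseconds, a rate halfway between the limits (250000) yields 64 microseconds under this assumption.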
14. 9 InfiniBand Fabric Diagnostic Utilities
9.1 Overview
The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric. The tools are:
• Section 9.3, "ibdiagnet (of ibutils2) - IB Net Diagnostic"
• Section 9.4, "ibdiagnet (of ibutils) - IB Net Diagnostic"
• Section 9.5, "ibdiagpath - IB diagnostic path"
• Section 9.6, "ibv_devices"
• Section 9.7, "ibv_devinfo"
• Section 9.8, "ibdev2netdev"
• Section 9.9, "ibstatus"
• Section 9.10, "ibportstate"
• Section 9.11, "ibroute"
• Section 9.12, "smpquery"
• Section 9.13, "perfquery"
• Section 9.14, "ibcheckerrs"
• Section 9.15, "mstflint"
• Section 9.16, "ibv_asyncwatch"
• Section 9.17, "ibdump"
9.2 Utilities Usage
This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.
9.2.1 Common Configuration, Interface and Addressing
Topology File (Optional)
An InfiniBand fabric is composed of switches and channel adapter
15. A packet that is sent from the host on a specific EoIB interface will be routed to the Ethernet subnet through a specific external port connection on the BridgeX box.
4.6.1.2 Virtual Hubs (vHubs)
Virtual hubs connect zero or more EoIB interfaces (on internal hosts) and an eport through a virtual hub. Each vHub has a unique virtual LAN (VLAN) ID. Virtual hub participants can send packets to one another directly, without the assistance of the Ethernet subnet (external side) routing. This means that two EoIB interfaces on the same vHub will communicate solely using the InfiniBand fabric. EoIB interfaces residing on two different vHubs (whether on the same gateway or not) cannot communicate directly.
There are two types of vHubs:
• a default vHub (one per gateway) without a VLAN ID
• vHubs with unique, different VLAN IDs
Each vHub belongs to a specific gateway (BridgeX + eport), and each gateway has one default vHub and zero or more VLAN-associated vHubs. A specific gateway can have multiple vHubs distinguishable by their unique VLAN ID. Traffic coming from the Ethernet side on a specific eport will be routed to the relevant vHub group based on its VLAN tag (or to the default vHub for that GW if no VLAN ID is present).
4.6.1.3 Virtual NIC (vNic)
A virtual NIC is a network interface instance on the host side which belongs to a single vHub on a specific GW. The vNic behaves similarly to any regular hardware network interface. The host can
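The vHub selection rule described above (VLAN tag picks the vHub, untagged traffic goes to the gateway's default vHub) can be modelled in a few lines. This is a toy model for illustration, not BridgeX or driver code: the `Gateway` class, its method names, and the choice to return `None` for a VLAN with no matching vHub are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Gateway:
    """A gateway (BridgeX + eport) with one default vHub and zero or more
    VLAN-associated vHubs. Hypothetical model for illustration."""
    name: str
    vlan_vhubs: set = field(default_factory=set)

    def vhub_for(self, vlan_tag=None):
        """Return an identifier for the vHub that receives Ethernet-side
        traffic carrying vlan_tag on this gateway."""
        if vlan_tag is None:
            return (self.name, None)      # default vHub: no VLAN ID
        if vlan_tag in self.vlan_vhubs:
            return (self.name, vlan_tag)  # VLAN-associated vHub
        return None                       # assumption: no matching vHub

def can_talk_directly(vhub_a, vhub_b):
    """Two EoIB interfaces communicate directly over the InfiniBand fabric
    only when they sit on the same vHub."""
    return vhub_a is not None and vhub_a == vhub_b
```

Under this model, interfaces on VLAN 10 of the same gateway can talk directly, while VLAN 10 and VLAN 20 on that gateway, or VLAN 10 on two different gateways, cannot.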
16. Extending this argument shows that, in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:
[Figure: 6x6 2D torus with labeled switches, including S, D, T, R, I and the intermediate switches n, q, u]
Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.
As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent in a dimension that is not the last dimension routed by DOR; here the failed switches are O and T:
[Figure: 6x6 2D torus with switches O and T failed]
In a pristine fabric, t
17. [...] MPI and Lustre
8.7.2 EDC SOA (2-tier): IPoIB and SRP
8.7.3 EDC (3-tier): IPoIB, RDS, SRP
8.8 Adaptive Routing
8.8.1 Overview
8.8.2 Installing the Adaptive Routing
8.8.3 Running Subnet Manager with Adaptive Routing Manager
8.8.4 Querying Adaptive Routing Tables
8.8.5 Adaptive Routing Manager Options File
8.9 Congestion Control
8.9.1 Congestion Control Overview
8.9.2 Running OpenSM with Congestion Control Manager
8.9.3 Configuring Congestion Control Manager
8.9.4 Configuring Congestion Control Manager Main Settings
Chapter 9 InfiniBand Fabric Diagnostic Utilities
9.1 Overview
9.2 Utilities Usage
9.2.1 Common Configuration, Interface and Addressing
9.2.2 [...]
9.2.3 Addressing
9.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
9.3.1 [...]
18. [Installation progress output; the progress bars are omitted. Packages listed, in order:]
  libibmad-static
  libibumad-static
  rds-devel
  mft
  dapl-devel-static
  libibverbs-devel-static
  mlnxofed-docs
  ofed-scripts
  libibverbs
  librdmacm
  libibumad
  libmverbs
  opensm-libs
  compat-dapl
  dapl
  [...]
  libsdp
  libmthca
  libmlx4
  [...]
  libnes
  libipathverbs
  libsdp-devel
  libibcm-devel
  dapl-devel
  compat-dapl-devel
  opensm-devel
  libibmad-devel
  libmqe-devel
  libmverbs-devel
  libibumad-devel
  librdmacm-devel
  libibverbs-devel
  Device [...]
19. [...] (ifstat example output, partially legible: "The socket is not connected", "Operation canceled", and an entry for net1 reporting the interface as open with Link:up)
A.8.3.2 ifopen
Opens the network interface net<x>. The list of network interfaces is available via the ifstat command.
Example:
  iPXE> ifopen net1
A.8.3.3 ifclose
Closes the network interface net<x>. The list of network interfaces is available via the ifstat command.
Example:
  iPXE> ifclose net1
A.8.3.4 autoboot
Starts the boot process from the device(s).
A.8.3.5 sanboot
Starts the boot process of an iSCSI target.
Example:
  iPXE> sanboot [...]
A.8.3.6 echo
Echoes an environment variable.
Example:
  iPXE> echo [...]
A.8.3.7 dhcp
Attempts to open the network interface, and then tries to connect to and communicate with the DHCP server to obtain the IP address and filepath from which the boot will occur.
Example:
  iPXE> dhcp [...]
A.8.3.8 help
Displays the available list of commands.
A.8.3.9 exit
Exits from the command line interface.
A.9 Diskless Machines
Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver. This can be achi
20. [...] SRP [...]
4.6 Ethernet over IB (EoIB) vNic
4.6.1 Ethernet over IB Topology
4.6.2 EoIB Configuration
4.6.3 Retrieving EoIB Information
4.6.4 Advanced EoIB Settings
4.7 IP over InfiniBand
4.7.1 Introduction
4.7.2 IPoIB Mode Setting
4.7.3 IPoIB Configuration
4.7.4 [...]
4.7.5 [...] IPoIB [...]
4.7.6 Bonding IPoIB
4.8 Quality of Service
4.8.1 Quality of Service Overview
4.8.2 QoS Architecture
4.8.3 Supported [...]
4.8.4 CMA [...]
4.8.5 [...]
4.9 Atomic Operations
4.9.1 Enhanced Atomic Operations
4.10 Socket Acceleration
4.10.1 Overview
21. • auto: Link sensing mode. Detect port type based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.
Table 4 lists the ConnectX port configurations supported by VPI.
Table 4 - Supported ConnectX Port Configurations
  Port 1 Configuration   Port 2 Configuration
  [...]
Note that the configuration Port1 = eth and Port2 = ib is not supported. Also note that FCoE can run only on a port configured as eth, and the mlx4_en driver must be loaded.
The port link type can be configured for each device in the system at run time using the /sbin/connectx_port_config script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.
This utility also has a non-interactive mode:
  /sbin/connectx_port_config [...]
5.2 InfiniBand Driver
The InfiniBand driver, mlx4_ib, handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
5.3 Ethernet Driver
5.3.1 Overview
The MLNX_EN driver is composed of the mlx4_core and mlx4_en kernel modules and exposes the following ConnectX / ConnectX-2 capabilities:
• Single/Dual port
• Up to 16 Rx queues per port
• 5 Tx queu
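The two constraints stated in the notes above can be captured in a short check. This is a sketch under assumptions: the rows of Table 4 are not legible in this excerpt, so every ib/eth pairing other than the explicitly excluded one (Port1 = eth, Port2 = ib) is assumed to be supported, the `auto` mode is left out, and the function names are hypothetical.

```python
PORT_TYPES = ("ib", "eth")  # "auto" (link sensing) is not modelled here

def is_supported_config(port1, port2):
    """Return False only for the combination the text rules out:
    Port1 = eth together with Port2 = ib.
    Assumption: all other ib/eth pairings are supported."""
    for port in (port1, port2):
        if port not in PORT_TYPES:
            raise ValueError("unknown port type: %r" % (port,))
    return not (port1 == "eth" and port2 == "ib")

def fcoe_can_run(port_type, mlx4_en_loaded):
    """FCoE needs a port configured as eth and the mlx4_en driver loaded."""
    return port_type == "eth" and bool(mlx4_en_loaded)
```

A pre-flight script could call `is_supported_config` before applying a port configuration and refuse the eth/ib ordering early.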
22. echo command may take some time. The SM must be running while the command executes.
It is possible to include additional parameters in the echo command:
• max_cmd_per_lun - Default: 63
• max_sect (short for max_sectors) - sets the request size of a command
• io_class - Default: 0x100 as in rev 16A of the specification (in rev 10 the default was 0xff00)
• initiator_ext - Please refer to Section 9 (Multiple Connections...)
• To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
  - Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
  - Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.
4.5.2.3 SRP Tools - ibsrpdm and srp_daemon
To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
• Detect targets on the fabric reachable by the Initiator (for Step 1)
• Output target attributes in a format suitable for use in the above "echo" command (Step 2)
The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages.
Below, several usage sc
23. [...] f = flood the SA with queries according to the stress mode
  [...]
  t = run trap 64/65 flow (this flow requires running of external tool)
  Default: all flows except QoS
-w, --wait
  This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec
-d
  This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:
  Opt   Description
  [...]
8.3.2 Running osmtest
To run osmtest in the default mode, simply enter:
  host1# osmtest
The default mode runs all the flows except for the Quality of Service flow (see Section 8.6).
After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the following command in order to generate the inventory file:
  host1# osmtest -f c
Immediately afterwards, run the following command to test opensm:
  host1# osmtest -f a
Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.
8.4 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric
24. vendor for more information.
7 MPI - Message Passing Interface
7.1 Overview
Note: The PGI compiler does not support RHEL6.0, thus MLNX_OFED v1.5.2 will not include openmpi and mvapich with the PGI compiler on RHEL6.
Mellanox OFED for Linux includes the following MPI implementations over InfiniBand and RoCE:
• Open MPI - an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH - an MPI-1 implementation by Ohio State University
These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 6 lists some useful MPI links.
Table 6 - Useful MPI Links
  MPI Standard   http://www-unix.mcs.anl.gov/mpi
  Open MPI       http://www.open-mpi.org
  MVAPICH MPI    http://mvapich.cse.ohio-state.edu
  MPI Forum      http://www.mpi-forum.org
This chapter includes the following sections:
• Prerequisites for Running MPI (page 105)
• MPI Selector - Which MPI Runs (page 107)
• Compiling MPI Applications (page 107)
7.2 Prerequisites for Running MPI
For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that
25. [...] 1.2 (July 04, 2010)
• Updated Figure 1, "Mellanox OFED Stack," on page 16
Rev 1.5.1.1 (May 18, 2010)
• Added Section 4.1.10, "Configuring DAPL over RoCE," on page 43
Rev 1.5.1 (April 22, 2010)
• Added Section 5.1.7, "Reading Port Counters Statistics" (the section "A Detailed Example" was moved to become Section 5.1.8)
Rev 1.5 (March 29, 2010)
• Updated Figure 1, "Mellanox OFED Stack," on page 16
• Added support for ConnectX-2 devices
• Added support for RDMA over Converged Ethernet (RoCE); see Chapter 5, "RoCE"
• Modified Section 7.3.3.1, "How to Know SDP Is Working"
• Added Section 7.3.7, "Using RDMA for Small Buffers"
• Added support for NFS over RDMA (NFSoRDMA): Chapter 10, "NFSoRDMA"
• Added Section 11.5.2, "Important Note on RoCE Support," on page 114 in Chapter 7, "MPI - Message Passing Interface"
• Modified Section 8.2.1, "opensm Syntax," on page 109
• Added Chapter 5
• Added ibdiagnet (of ibutils2) and ibdump to Chapter 9, "InfiniBand Fabric Diagnostic Utilities"
• Appendix B is now called "Mellanox FlexBoot" instead of BoIB. FlexBoot supports Virtual Protocol Interconnect (VPI)
• Added Section 6.3.3, "System Performance Troubleshooting"
• Added the parameter setting VIADEV_RENDEZVOUS_THRESHOLD=8192 (Section 11.2.3, "MPI Performance Tuning")
Rev 1.40.1 Chan
26. 2 Environment Variables

The following environment variables control opensm behavior:

• OSM_TMP_DIR
  Controls the directory in which the temporary files generated by opensm are created. These files are opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.

• OSM_CACHE_DIR
  opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
  guid2lid - stores the LID range assigned to each GUID

8.2.3 Signaling

When opensm receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.

Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.

8.2.4 Running opensm

The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.

To run opensm in the default mode, simply enter:

    host1# opensm

Note that opensm needs to be run on at least one machine in an IB subnet.

By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported error
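The environment-variable and signal behavior described in this excerpt can be sketched in shell. This is a minimal illustration, not part of opensm: the function name osm_tmp_files is hypothetical, and the commented kill lines assume a single running opensm process and the availability of pidof.

```shell
#!/bin/sh
# Hypothetical helper: list where opensm's temporary files are expected,
# honoring OSM_TMP_DIR and falling back to the documented default /var/log.
osm_tmp_files() {
    dir=${OSM_TMP_DIR:-/var/log}
    for f in opensm-subnet.lst opensm.fdbs opensm.mcfdbs; do
        # ${dir%/} drops one trailing slash so "/tmp/osm/" and "/tmp/osm" behave alike
        printf '%s\n' "${dir%/}/$f"
    done
}

# Signals described in the text (left commented out; assumes one opensm process):
#   kill -HUP  "$(pidof opensm)"    # request a new heavy sweep
#   kill -USR1 "$(pidof opensm)"    # reopen /var/log/opensm.log (for logrotate)
```

For example, running `OSM_TMP_DIR=/tmp/osm osm_tmp_files` prints the three file paths under /tmp/osm, which can be handy when checking that a non-default temporary directory took effect.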
27. (end of the preceding example output; only fragments survive: ... 4X ... 5.0 Gbps ...)

9.11 ibroute

Applicable Hardware
InfiniBand switches

Description
Uses SMPs to display the forwarding tables, unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT), for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.

Synopsis

    ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]

Table 19 lists the various flags of the command.

Table 19 - ibportstate Flags and Options

    Flag                      Optional/Mandatory   Description
    -d(ebug)                  Optional             Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
    -a(ll)                    Optional             Show all LIDs in range, including invalid entries
    -v(erbose)                Optional             Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)

Table 19 - ibportstate Flags and Options (continued; descriptions not captured in this excerpt)

    Flag                      Optional/Mandatory
    -a(ll)                    Optional
    -G(uid)                   Optional
    -M(ulticast)              Optional
    -P <ca_port>              Optional
    -t <timeout_ms>           Optional
    <dest dr_path|lid|guid>   Optional

Examples
28. Ack generation, ordering rules, and error behavior for this set of extended Atomic operations are the same as for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).

4.9.1.1 Masked Compare and Swap (MskCmpSwap)

The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64-bit target data for the "compare" check, as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:

    atomic_response = *va
    if (!((compare_add ^ *va) & compare_add_mask)) then
        *va = (*va & ~(swap_mask)) | (swap & swap_mask)

    return atomic_response

The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap are as for standard IB Atomic operations.

4.9.1.2 Masked Fetch and Add (MFetchAdd)

The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:

    bit_adder(ci, b1, b2, *co)
    {
        value = ci + b1 + b2
        *co = !!(value & 2)
        return value & 1
    }

    #define MA
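The data transformation performed by MskCmpSwap, as described in Section 4.9.1.1, can be modeled with shell integer arithmetic. This is only a sketch of the semantics under stated assumptions: the function name msk_cmp_swap and its argument order are hypothetical, it works on plain shell integers rather than HCA memory, and it says nothing about atomicity. Values are kept well below 64 bits to stay within the shell's signed arithmetic.

```shell
#!/bin/sh
# Hypothetical model: msk_cmp_swap CUR COMPARE COMPARE_MASK SWAP SWAP_MASK
# Prints "<old value> <new value>" in hex. The old value is what the
# operation would return as the atomic response.
msk_cmp_swap() {
    mcs_cur=$(( $1 ))
    mcs_compare=$(( $2 ))
    mcs_cmask=$(( $3 ))
    mcs_swap=$(( $4 ))
    mcs_smask=$(( $5 ))
    mcs_new=$mcs_cur
    # Compare only the bits selected by the compare mask.
    if [ $(( (mcs_compare ^ mcs_cur) & mcs_cmask )) -eq 0 ]; then
        # Replace only the bits selected by the swap mask.
        mcs_new=$(( (mcs_cur & ~mcs_smask) | (mcs_swap & mcs_smask) ))
    fi
    printf '0x%x 0x%x\n' "$mcs_cur" "$mcs_new"
}

# Compare the low byte (0x34 matches), swap only the second byte:
msk_cmp_swap 0x1234 0x0034 0x00ff 0xab00 0xff00
```

The call above compares under mask 0x00ff and swaps under mask 0xff00, so the two portions are disjoint, which is the "possibly different portion" case the text mentions.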
29. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the --Pconfig or -P flags.

The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed.

The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end ports are assigned partial membership.

File Format

Notes:
• Line content following the '#' character is a comment and is ignored by the parser.

General File Format

    <Partition Definition>:<PortGUIDs list> ;

Partition Definition:

    [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

where:

    PartitionName            string, will be used with logging. When omitted, an empty string will be used.
    PKey                     P_Key value for this partition. Only the low 15 bits will be used. When omitted, it will be autogenerated.
    flag                     used to indicate the IPoIB capability of this partition.
    defmember=full|limited   specifies default membership for the port guid list. Default is limited.

Currently recognized flags are:

    ipoib        indicates that this partition may be used for IPoIB; as a result, an IPoIB capable MC group will be created.
    rate=<val>   specifies rate for this IPoIB MC group (default is 3 (10GBps))
    mtu=<val>    specifies MTU for this IPoIB MC group (default is 4 (2048))
    s
30. ...
6.2.4 Interrupt ...
6.2.5 Preserving Your Performance Settings After A Reboot
6.3 Performance Troubleshooting
6.3.1 PCI Express Performance Troubleshooting
6.3.2 InfiniBand Performance Troubleshooting
6.3.3 System Performance Troubleshooting
Chapter 7 MPI - Message Passing Interface
7.1 Overview
7.2 Prerequisites for Running MPI
7.2.1 ...
7.3 MPI Selector - Which MPI Runs
7.4 Compiling MPI Applications
Chapter 8 OpenSM - Subnet Manager
8.1 Overview
8.2 opensm Description
8.2.1 opensm Syntax
8.2.2 Environment Variables
8.2.3 Signaling
8.2.4 Running opensm
31. HCA/TCA devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.

For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).

To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option -t <topology file name>
2. Define the environment variable IBDIAG_TOPO_FILE

To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option -s <local system name>
2. Define the environment variable IBDIAG_SYS_NAME

9.2.2 IB Interface Definition

The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify t
32. If running the rule as follows:

    ...

the acceleration is applied on sockets of the iperf application that connect to TCP ports ...

3. Remove all the rules: "reset"
   Removes all the rules at once. For example:

    host1# echo "reset" > /proc/net/accl_policy
    host1# cat /proc/net/accl_policy

4. The mlx4_accl module cannot be removed. To disable its rules capabilities, run:

    echo reset > /proc/net/accl_policy

4.10.3 Kernel Space Socket Acceleration Debug

The Socket Acceleration kernel module can log detailed trace information if you enable it using the mlx4_accl_dbg and mlx4_accl_sys_dbg variables in the sysfs file system. To log the information, run the following commands:

    host1# echo 1 > /sys/module/mlx4_accl/parameters/mlx4_accl_dbg
    host1# echo 1 > /sys/module/mlx4_accl_sys/parameters/mlx4_accl_sys_dbg

To turn OFF kernel debug, set the sysfs variables to zero using the following commands:

    host1# echo 0 > /sys/module/mlx4_accl/parameters/mlx4_accl_dbg
    host1# echo 0 > /sys/module/mlx4_accl_sys/parameters/mlx4_accl_sys_dbg

To display debug information, use the dmesg command:

    host1# dmesg

4.11 Huge Pages Support for Queue Resources

Buffer resources for QPs and CQs can now be set to use huge pages. When using huge pages, the HCA needs less MTT resou
33. Installing Mellanox OFED" on page 22
• Section 2.5, "Uninstalling Mellanox OFED," on page 33

2.1 Hardware and Software Requirements

2.1.1 Hardware Requirements

Platforms
• A server platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
  - MT25408 ConnectX-2 (VPI, IB, EN, FCoE) (firmware: fw-ConnectX2)
  - MT25408 ConnectX (VPI, IB, EN, FCoE) (firmware: fw-25408)
  - ConnectX-3 Ready
    Note: To receive ConnectX-3 firmware, please contact your Mellanox representative.

Note: For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file.

Required Disk Space for Installation
• 400 MB

Device ID
Note: For the latest list of device IDs, please visit the Mellanox website.

2.1.2 Software Requirements

Operating System
• Linux operating system

Note: For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.

Installer Privileges
• The installation requires administrator privileges on the target machine

2.2 Downloading Mellanox OFED

Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display. The following example shows a system with an installed Mell
34. ...
1.4.3 Mid-layer Core
...
InfiniBand Subnet Manager
Diagnostic ...
1.4.9 Mellanox Firmware Tools
1.5 Quality of Service
Chapter 2 Installation
2.1 Hardware and Software Requirements
2.1.1 Hardware Requirements
2.1.2 Software Requirements
2.2 Downloading Mellanox OFED
2.3 Installing Mellanox OFED
2.3.1 Pre-installation Notes
2.3.2 ...
2.3.3 Installation Procedure
2.3.4 ... Results
2.3.5 Post-installation Notes
2.4 Updating Firmware After Installation
2.5 Uninstalling Mellanox OFED
Chapter 3 Configuration
Chapter 4 Driver Features
4.1 RDMA over Converged Ethernet
4.1.1 RoCE Overview
4.1.2 Software Dependencies
35. Linux ISO image that is appropriate for the kernel version the diskless image will run.

Adding the IB Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.

Step 2. Make a new working directory and change to it.

    host1# mkdir /tmp/initrd_ib
    host1# cd /tmp/initrd_ib

Step 3. Normally, the initrd image is zipped. Extract it using the following command:

    host1# gzip -dc <initrd image> | cpio -id

The initrd files should now be found under /tmp/initrd_ib.

Step 4. Create a directory for the InfiniBand modules and copy them.

    host1# mkdir -p /tmp/initrd_ib/lib/modules/ib
    host1# cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1# cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/... /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/... /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/... /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_umad.ko /tmp/ini
36. 0x0002 023 : (Switch portguid ...: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003 000 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    ... (Channel Adapter portguid ...)
    0x0007 020 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008 024 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
    5 valid lids dumped

5. Dump all non-empty mlids of switch with Lid 3:

    > ibroute -M 3
    Multicast mlids [0xc000-0xc3ff] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
    Ports: ... (per-port membership columns not recoverable)
    MLid
    0xc000
    0xc001
    0xc002
    0xc003
    0xc020
    0xc021
    0xc022
    0xc023
    0xc024
    0xc040
    0xc041
    0xc042
    12 valid mlids dumped

9.12 smpquery

Applicable Hardware
All InfiniBand devices

Description
Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.

Synopsis

    smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]

Table 20 lists the various flags of the command.

Table 20
37. host1# gzip /tmp/new_init_en.img

At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_init_ib.img.gz. Copy it to the original initrd location and rename it properly.

A.10 iSCSI Boot

Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd (Linux). There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.

If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver as described in Section A.8.

A.10.1 Configuring an iSCSI Target in Linux Environment

Prerequisites

Step 1. Make sure that an iSCSI Target is installed on your server side.
You can download and install an iSCSI Target from the following location: http://sourceforge.net/projects/iscsitarget/files/iscsitarget/

Step 2. Dedicate a partition on your iSCSI Target on which you will later install the operating system.

Step 3. Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:
38. (SPC-3): www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf
• Basic SCSI Block Commands-2 (SBC-2): www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf
• Basic functionality, task management, and limited error handling

4.5.2.1 Loading SRP Initiator

To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes".

Note: For the changes to take effect, run: /etc/init.d/openibd restart

Note: When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).

4.5.2.2 Manually Establishing an SRP Connection

The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.5.2.4 explains how to do this automatically.

• Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:

    echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target

See Section 4.5.2.3 for instructions on how the parameters in this echo command may be obtained.

Notes:
Execution of the above
39. (Table 19, continued; flag names are restored from the command synopsis)

    -a(ll)                    Show all LIDs in range, including invalid entries
    -n                        Do not try to resolve destinations
    -D(irect)                 Use directed path address arguments. The path is a comma separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)
    -G(uid)                   Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
    -s <smlid>                Use <smlid> as the target LID for SM/SA queries
    -C <ca_name>              Use the specified channel adapter or router
    -P <ca_port>              Use the specified port
    -t <timeout_ms>           Override the default timeout for the solicited MADs [msec]
    <dest dr_path|lid|guid>   Destination's directed path, LID, or GUID
    <startlid>                Starting LID in an MLID range
    <endlid>                  Ending LID in an MLID range
    -M(ulticast)              Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.

Examples

1. Dump all Lids with valid out ports of the switch with Lid 2:

    > ibroute 2
    Unicast lids [...] of switch Lid 2 guid ... (MT47396 Infiniscale-III Mellanox Technologies):
      Lid  Out   Destination
           Port     Info
    0x0002 000 : (Switch portguid ...: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006 007 : (Channel Adapter portguid ...)
    0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008 008 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
40. Technologies

2. Query SwitchInfo by GUID:
   (example output not captured in this excerpt)

3. Query NodeInfo by direct route:

    > smpquery -D nodeinfo 0
    # Node info: DR path ...
    BaseVers:........................1
    ClassVers:.......................1
    NodeType:........................Channel Adapter
    NumPorts:........................2
    SystemGuid:......................0x0002c9030000103b
    Guid:............................0x0002c90300001038
    PortGuid:........................0x0002c90300001039
    PartCap:.........................128
    DevId:...........................0x634a
    Revision:........................0x000000a0
    LocalPort:.......................1
    VendorId:........................0x0002c9

9.13 perfquery

Applicable Hardware
All InfiniBand devices

Description
Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them or simply reset them.

Synopsis

    perfquery [options] [<lid|guid> [[port] [reset_mask]]]

Table 21 lists the various flags of the command.

Table 21 - perfquery Flags and Options

    Flag        Optional/Mandatory   Description
    -h                               Print the help menu
    -G(uid)                          Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
    -d(ebug)    Optional             Raise the IB debug level. May be used several times for
41. ...
        sys_image_guid:     ...
        vendor_id:          0x02c9
        vendor_part_id:     26428
        hw_ver:             0xB0
        board_id:           MT_0DD0120009
        phys_port_cnt:      2
            port:   1
                state:          PORT_INIT (2)
                max_mtu:        2048 (4)
                active_mtu:     2048 (4)
                sm_lid:         0
                port_lid:       0
                port_lmc:       0x00
                link_layer:     IB
            port:   2
                state:          PORT_ACTIVE (4)
                max_mtu:        2048 (4)
                active_mtu:     ...
                sm_lid:         0
                port_lid:       0
                port_lmc:       0x00
                link_layer:     Ethernet

Notes regarding the command output:

1. The InfiniBand port (port 1) is in PORT_INIT state, and the Ethernet port (port 2) is in PORT_ACTIVE state. You can also run the following commands to obtain the port state:

    # cat /sys/class/infiniband/mlx4_0/ports/1/state
    2: INIT
    # cat /sys/class/infiniband/mlx4_0/ports/2/state
    4: ACTIVE

2. Look at the link_layer parameter of each port. In this case, port 1 is IB and port 2 is Ethernet. Nevertheless, port 2 appears in the list of the HCA's ports. You can also run the following commands to obtain the link_layer of the two ports:

    # cat /sys/class/infiniband/mlx4_0/ports/1/link_layer
    InfiniBand
    # cat /sys/class/infiniband/mlx4_0/ports/2/link_layer
    Ethernet

3. The firmware version is 2.7.700 (appears at the top). You can also run the following command to obtain the firmware version:

    # cat /sys/class/infiniband/mlx4_0/fw_ver
    2.7.700

4. The IB over Ethernet's Port MTU is 2K byte at maximum; however, the actual MTU cannot exceed the mlx4_en interfa
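The sysfs queries in notes 1 and 2 above can be wrapped in two small shell helpers. This is a hedged sketch: the function names ib_port_attr and ib_port_state_name are hypothetical, and the optional last argument (an alternative sysfs root) is an assumption added only so the helpers can be exercised on a machine without an HCA.

```shell
#!/bin/sh
# Hypothetical helper: ib_port_attr DEVICE PORT ATTR [ROOT]
# Prints the raw content of a per-port sysfs attribute such as
# "state" or "link_layer". ROOT defaults to /sys/class/infiniband.
ib_port_attr() {
    ipa_root=${4:-/sys/class/infiniband}
    cat "$ipa_root/$1/ports/$2/$3"
}

# Hypothetical helper: ib_port_state_name DEVICE PORT [ROOT]
# Strips the numeric prefix from the state file, e.g. "4: ACTIVE" -> "ACTIVE".
ib_port_state_name() {
    ib_port_attr "$1" "$2" state "$3" | sed 's/^[0-9]*: *//'
}

# Usage on a host with the device from the example above:
#   ib_port_state_name mlx4_0 2      # expected to print ACTIVE for an active port
#   ib_port_attr mlx4_0 2 link_layer
```

Splitting the raw read from the state-name parsing keeps the first helper usable for any per-port attribute mentioned in the notes.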
42. The algorithm was developed by Simula Research Laboratory.

Use the '-R lash -Q' option to activate the LASH algorithm.

Note: QoS support has to be turned on in order that SL/VL mappings are used.

Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.

For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyzes the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order before the rest of the LASH algorithm runs.

8.5.6 DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be co
43. Next, FlexBoot attempts to boot as directed by the DHCP server.

A.8 Command Line Interface (CLI)

A.8.1 Invoking the CLI

When the boot process begins, the computer starts its Power On Self Test (POST) sequence. Shortly after completion of the POST, the user will be prompted to press CTRL-B to invoke the Mellanox FlexBoot CLI. The user has a few seconds to press CTRL-B before the message disappears (see figure).

    Press Ctrl-B for the iPXE command line...

Alternatively, you may skip invoking the CLI right after POST and invoke it, instead, right after FlexBoot starts booting.

Once the CLI is invoked, you will see the following prompt:

    iPXE>

A.8.2 Operation

The CLI resembles a Linux shell, where the user can run commands to configure and manage one or more PXE port network interfaces. Each port is assigned a network interface called net<i>, where i is 0, 1, 2, ..., <# of interfaces>. Some commands are general and are applied to all network interfaces. Other commands are port specific; therefore, the relevant network interface is specified in the command.

A.8.3 Command Reference

A.8.3.1 ifstat

Displays the available network interfaces (in a similar manner to Linux's ifconfig).

    iPXE> ifstat
    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (closed)
      [Link:down, ...]
      [Link status: The socket is not connected]
      [TXE: 2 x "No such file or directory"]
      [RXE: 3 x
44. ...
    3 packets transmitted, 3 received, 0% packet loss, time 2000ms
    rtt min/avg/max/mdev = 0.167/0.412/0.873/0.326 ms

Inspecting the GID Table

    # cat /sys/class/infiniband/mlx4_0/ports/2/gids/0
    fe80:0000:0000:0000:0202:c9ff:fe08:e811
    # cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
    0000:0000:0000:0000:0000:0000:0000:0000

According to the output, we currently have one entry only.

Run an Example Test - ibv_rc_pingpong

Start the server first:

    # ibv_rc_pingpong ...
      local address:  LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID fe80::202:c9ff:fe08:e799
      remote address: LID 0x0000, QPN 0x04004f, PSN ..., GID fe80::202:c9ff:fe08:e811
    8192000 bytes in 0.01 seconds = 4730.13 Mbit/sec
    1000 iters in 0.01 seconds = 13.85 usec/iter

Then start the client:

    # ibv_rc_pingpong ...
      local address:  LID 0x0000, QPN 0x04004f, PSN ..., GID fe80::202:c9ff:fe08:e811
      remote address: LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID fe80::202:c9ff:fe08:e799
    8192000 bytes in 0.01 seconds = 4787.84 Mbit/sec
    1000 iters in 0.01 seconds = 13.69 usec/iter

Add VLANs

Make sure that the 8021q module is loaded:

    # modprobe 8021q

Add the VLAN device:

    # vconfig add eth2 7
    Added VLAN with VID == 7 to IF -:eth2:-

Configure an IP address for it:

    # ...

Examine the GID table:

    # cat /sys/class/infiniband/mlx4_0/ports/2/
45. command.

Table 18 - ibportstate Flags and Options

    Flag                      Optional/Mandatory   Default If Not Specified   Description
    -d(ebug)                  Optional                                        Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
    -e(rr_show)               Optional                                        Show send and receive errors (timeouts and others)
    -v(erbose)                Optional                                        Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
    -D(irect)                 Optional                                        Use directed path address arguments. The path is a comma separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)
    -G(uid)                   Optional                                        Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
    -t <timeout_ms>           Optional                                        Override the default timeout for the solicited MADs [msec]
    <dest dr_path|lid|guid>   Optional                                        Destination's directed path, LID, or GUID
    <portnum>                 Optional                                        Destination's port number
    <op> [<value>]            Optional             query                      Define the allowed port operations: enable, disable, reset, speed, and query

In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:

1. The first ACTIVE port that is found.
46. conf file, which contains entries for the DAPL devices, does not contain entries for the DAPL over RDMA_CM over RoCE devices. To add the missing entries, perform the following:

Step 1. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the IB devices/ports.

Step 2. Add a new entry line, according to the format below, to the dat.conf file for each output line of the ibdev2netdev utility.

    <IA Name> u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "<ethX> <port>" ""

    Parameter    Description                                     Example
    <IA Name>    The device's IA name. The name must be unique   ofa-v2-ethx
    <ethX>       The associated Ethernet device used by RoCE     eth3
    <port>       The port number                                 1

The following is an example of the ibdev2netdev utility's output and the entries added per each output line.

Example:

    sw419:~ # ibdev2netdev
    ... port 2 <===> eth2
    ... port 1 <===> eth3

    ofa-v2-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 2" ""
    ofa-v2-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 1" ""

4.2 Fibre Channel over Ethernet

Note: Fibre Channel over Ethernet (FCoE) is still at beta level in this release.

4.2.1 FCoE Overview

The FCoE feature provided by Mellanox OFED allows connecting to Fibre Channel (FC) targets on an FC fabric using an FCoE-capable switch or gateway. Key features incl
47. ...
8.3 osmtest Description
8.3.1 ...
8.3.2 Running osmtest
8.4 Partitions
8.4.1 File Format
8.5 Routing Algorithms
8.5.1 Effect of Topology Changes
8.5.2 Min Hop Algorithm
8.5.3 UPDN Algorithm
8.5.4 Fat-tree Routing Algorithm
8.5.5 LASH Routing Algorithm
8.5.6 DOR Routing Algorithm
8.5.7 Torus-2QoS Routing Algorithm
8.6 Quality of Service Management in OpenSM
8.6.1 ...
8.6.2 Advanced QoS Policy File
8.6.3 Simple QoS Policy Definition
8.6.4 Policy File Syntax Guidelines
8.6.5 Examples of Advanced Policy File
8.6.6 Simple QoS Policy - Details and Examples
8.6.7 SL2VL Mapping and VL Arbitration
8.6.8 Deployment Example
8.7 QoS Configuration Examples
8.7.1 Typical HPC Example
48. gids/0
    fe80:0000:0000:0000:0202:c9ff:fe08:e811
    # cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
    fe80:0000:0000:0000:0202:c900:0708:e811

According to the output, we now have two entries.

Run the Example Again, Now on VLAN

On Server:

    # ibv_rc_pingpong ...
      local address:  LID 0x0000, QPN 0x04004f, PSN 0xbdde7c, GID fe80::202:c900:708:e799
      remote address: LID 0x0000, QPN 0x08004f, PSN 0xc9d800, GID fe80::202:c900:708:e811
    8192000 bytes in 0.01 seconds = 4824.50 Mbit/sec
    1000 iters in 0.01 seconds = 13.58 usec/iter

On Client:

    # ibv_rc_pingpong ...
      local address:  LID 0x0000, QPN 0x08004f, PSN 0xc9d800, GID fe80::202:c900:708:e811
      remote address: LID 0x0000, QPN 0x04004f, PSN 0xbdde7c, GID fe80::202:c900:708:e799
    8192000 bytes in 0.01 seconds = 4844.83 Mbit/sec
    1000 iters in 0.01 seconds = 13.53 usec/iter

Defining Ethernet Priority (PCP in 802.1q Headers)

On Server:

    # ibv_rc_pingpong ...
      local address:  LID 0x0000, QPN ..., PSN ..., GID fe80::202:c900:708:e799
      remote address: LID 0x0000, QPN 0x1c004f, PSN 0xb0a49b, GID fe80::202:c900:708:e811
    8192000 bytes in 0.01 seconds = 4840.89 Mbit/sec
    1000 iters in 0.01 seconds = 13.54 usec/iter

On Client:

    # ibv_rc_pingpong ...
      local address:  LID 0x0000, QPN 0x1c004f, PSN 0xb0a49b, GID fe80::202:c900:708:e811
      remote address: LID 0x0000, QPN ..., PSN ..., GID fe80::202:c900:708:e799
    8192000 bytes in 0.01 sec
49. -> N3

With a max_reverse_hops value of 2, N1, N2, and N3 will all have routes between them.

Note: Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.

8.5.4.2 Activation through OpenSM

• Use the '-R ftree' option to activate the fat-tree algorithm.

Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.

8.5.5 LASH Routing Algorithm

LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology agnostic deadlock-free routing within communication networks.

When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock.

Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between switch and HCA.

In more detail, the algorithm works as follows:

1. LASH determines the shortest path between al
50. have multiple interfaces that belong to the same vHub.

4.6.2 EoIB Configuration

The mlx4_vnic module supports two different modes of configuration, which is passed to the host mlx4_vnic driver using the EoIB protocol:
• host administration, where the vNic is configured on the host side
• network administration, where the configuration is done by the BridgeX

Both modes of operation require the presence of a BridgeX gateway in order to work properly. The EoIB driver supports a mixture of host and network administered vNics.

4.6.2.1 EoIB Host Administered vNic

In the host administered mode, vNics are configured using static configuration files located on the host side. These configuration files define the number of vNics and the vHub that each host administered vNic will belong to (i.e., the vNic's BridgeX box, eport, and VLAN id properties). The mlx4_vnic_confd service is used to read these configuration files and pass the relevant data to the mlx4_vnic module. EoIB Host Administered vNic supports two forms of configuration files:
• Central Configuration File - /etc/infiniband/mlx4_vnic.conf
• vNic Specific Configuration Files - ifcfg-ethX

Both forms of configuration supply the same functionality. If both forms of configuration files exist, the central configuration file has precedence and only this file will be used.

Central Configuration File
51. higher debug levels (-ddd or -d -d -d).
Table 21 - perfquery Flags and Options (table flattened in this copy; only the legible entries are listed)
• -a (Optional): Apply query to all ports
• -t <timeout_ms> (Optional): Override the default timeout for the solicited MADs [msec]
• Positional arguments include the port and the reset_mask
Examples:
perfquery -r 32 1 # read performance counters and reset
perfquery -e -r 32 1 # read extended performance counters and reset
perfquery -R 0x20 1 # reset performance counters of port 1 only
perfquery -e -R 0x20 1 # reset extended performance counters of port 1 only
perfquery -R -a 32 # reset performance counters of all ports
perfquery -R 32 2 0x0fff # reset only error counters of port 2
perfquery -R 32 2 0xf000 # reset only non-error counters of port 2
1. Read local port's performance counters:
> perfquery
[counter listing not legible in this copy]
2. Read performance counters from LID 2, all ports
3. Read then reset performance counters from LID 2, port 1
[counter listing not legible in this copy]
52. level also shows the versatility of markets that Mellanox OFED applies to.
Figure 1: Mellanox OFED Stack
The following sub-sections briefly describe the various components of the Mellanox OFED stack.
1.4.1 mthca HCA (IB) Driver
mthca is the low-level driver implementation for the following Mellanox Technologies HCA (InfiniBand) devices: InfiniHost, InfiniHost III Ex and InfiniHost III Lx.
1.4.2 mlx4 VPI Driver
mlx4 is the low-level driver implementation for the ConnectX and ConnectX-2 adapters designed by Mellanox Technologies. ConnectX/ConnectX-2 can operate as an InfiniBand adapter, as an Ethernet NIC, or as a Fibre Channel HBA. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into four modules:
mlx4_core: Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.
mlx4_ib: Handles
53. libsdp.so
2. LIBSDP_CONFIG_FILE: this environment variable is used to configure the policy for replacing TCP sockets with SDP sockets. By default, it points to /etc/libsdp.conf.
3. SIMPLE_LIBSDP: ignore libsdp.conf and always use SDP.
4.4.5 Converting Socket-based Applications
You can convert a socket-based application to use SDP instead of TCP in an automatic (also called transparent) mode or in an explicit (also called non-transparent) mode.
Automatic (Transparent) Conversion
The libsdp.conf configuration (policy) file is used to control the automatic transparent replacement of TCP sockets with SDP sockets. In this mode, socket streams are converted based upon a destination port, a listening port, or a program name. Socket control statements in libsdp.conf allow the user to specify when libsdp should replace AF_INET/SOCK_STREAM sockets with AF_SDP/SOCK_STREAM sockets. Each control statement specifies a matching rule that applies only if all its subexpressions evaluate as true (logical and). The use statement controls which type of sockets to open. The format of a use statement is as follows:
use <address-family> <role> <program-name|*> <address|*>:<port range|*>
where <address-family> can be one of:
sdp: for specifying when an SDP socket should be used
tcp: for specifying when an SDP socket should not be matched
both: for s
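The matching behaviour described above (a rule applies only when every sub-expression is true) can be sketched in a few lines of Python. This is an illustrative model only, not libsdp's implementation: the function names are hypothetical, the rule layout is assumed to be `use <address-family> <role> <program-name|*> <address|*>:<port range|*>`, and role names are compared literally.

```python
import fnmatch
import ipaddress


def parse_use_rule(line):
    """Split one 'use' statement into its five assumed fields."""
    keyword, family, role, program, target = line.split()
    if keyword != "use":
        raise ValueError("not a use statement: %r" % line)
    # The address and the port range are separated by the last colon.
    address, _, ports = target.rpartition(":")
    return {"family": family, "role": role, "program": program,
            "address": address, "ports": ports}


def rule_matches(rule, role, program, ip, port):
    """Return True only if every sub-expression matches (logical AND)."""
    if rule["role"] != role:
        return False
    if not fnmatch.fnmatchcase(program, rule["program"]):
        return False
    if rule["address"] != "*":
        network = ipaddress.ip_network(rule["address"], strict=False)
        if ipaddress.ip_address(ip) not in network:
            return False
    if rule["ports"] != "*":
        low, _, high = rule["ports"].partition("-")
        if not int(low) <= port <= int(high or low):
            return False
    return True
```

For example, a rule parsed from `use sdp connect * 192.168.1.0/24:*` matches a connect to 192.168.1.7 on any port, and does not match a connect to 10.0.0.1.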
54. <data>
rb <addr> <size> [out file]: Read a data block from Flash.
swreset: SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.
Possible command return values are:
0: successful completion
1: error has occurred
7: the burn command was aborted because firmware is current
Examples:
1. Find Mellanox Technologies' ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE:
> lspci -d 15b3:634a
04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
In the example above, 15b3 is Mellanox Technologies' vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn.
Note: The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3.
2. Verify the ConnectX firmware using its ID (using the results of the example above):
> mstflint -d 04:00.0 v
ConnectX failsafe image. Start address: 80000. Chunk size 80000:
NOTE: The addresses below are contiguous logical addresses. Physical addresses on flash may be different, based on the image start address and chunk size.
0x00000038 0x000010db 0x00
55. made, libsdp will default to both.
Examples:
• Use SDP by clients connecting to machines that belong to subnet 192.168.1.*:
use sdp connect * 192.168.1.0/24:*
• Use SDP by ttcp when it connects to port 5001 of any machine:
use sdp connect ttcp *:5001
• Use TCP for any program with name starting with ttcp* serving ports 22 to 25:
use tcp server ttcp* *:22-25
• Listen on both TCP and SDP by any server that listens on port 8080:
use both server * *:8080
• Connect ssh through SDP and fall back to TCP to hosts on 11.4.8.* port 22:
use both connect * 11.4.8.0/24:22
Explicit (Non-transparent) Conversion
Use explicit conversion if you need to maintain full control from your application while using SDP. To configure an explicit conversion to use SDP, simply recompile the application, replacing PF_INET (or AF_INET) with PF_INET_SDP (or AF_INET_SDP) when calling the socket() system call in the source code. The value of AF_INET_SDP is defined in the file sdp_socket.h, or you can define it inline:
#define AF_INET_SDP 27
#define PF_INET_SDP AF_INET_SDP
You can compile and execute the following very simple TCP application that has been converted explicitly to SDP.
[The compilation and usage command lines are not legible in this copy.]
Example:
Server:
accepted connection from 15.2.2.42:48710
read 2048
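The explicit conversion described above amounts to passing a different address family to socket(). The following Python sketch illustrates the idea under stated assumptions: the constant 27 is the value quoted in the text, the helper name `open_stream_socket` is hypothetical, and the fallback to TCP when the family is unsupported is an addition for illustration rather than behaviour documented for SDP.

```python
import socket

# Value quoted in the text; normally taken from sdp_socket.h.
AF_INET_SDP = 27


def open_stream_socket(prefer_sdp=True):
    """Return (sock, family_name), falling back to TCP if SDP is unavailable."""
    if prefer_sdp:
        try:
            return socket.socket(AF_INET_SDP, socket.SOCK_STREAM), "sdp"
        except OSError:
            # The SDP kernel module is not loaded, or this platform does not
            # provide the address family at all.
            pass
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM), "tcp"
```

On a host without SDP support, the call simply returns an ordinary TCP socket, so the rest of the application code is unchanged.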
56. multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices. Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.
Operation
When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA). When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).
Prerequisites
Installation for RHEL4/5: (Execute once)
• Verify that the standard device-mapper-multipath rpm is installed I
57. of all available InfiniBand devices OVAS noi 2 HCAS found mthca0 mlx4 0 2 Query the device mlx4_ 0 and print user available information for its Port 2 ALOE SO A 7 hca id mlx4 0 Leve si Deo ae node quid II Soho SERI MACRO 0000 0000 0007 3898 vendor id 0x02c9 vena pant aid 25418 hw ver OxA0 Board rd MT 04A0140005 phys port emt 2 DOG 2 STATES PORT ACTIVE 4 max mtu 2048 4 active mtu 2048 4 S 1 poned 1 port Imo 0x00 172 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 9 8 ibdev2netdev ibdev2netdev enables association between IB devices and ports and the associated net device Additionally it reports the state of the net device link 9 8 1 SYNOPSYS Tra qiero ev OPTIONS v Enable verbose mode Adds additional information such as Device ID Part Number Card Name Firmware version IB port state h Print help messages Example sw417 BXOFED 1 5 2 20101128 1524 ibdev2netdev v mixa 0 MTZ264Z0 MTTO06GK00034 FALCON ODR fw 2 7 9288 port 1 ACTIVE gt eth5 Down mlx4 0 MT26428 MT1006X00034 FALCON QDR fw 2 7 9288 port 1 ACTIVE gt 1b0 Down mlx4 0 MT26428 MT1006X00034 FALCON QDR fw 2 7 9288 port 2 DOWN gt 1b1 Down mix KMR MITO SOU O nawk Duel Pore aw 221 9400 pore DOWN gt euhZ Down Dix MI od II Sas Duce Pome DI O Ce DONNE gt a Down sw417 BXOFED 1 5 2 20101128 1524 ibdev2netdev Ie gt erie own m
58. on the machine you are working with Copy the DHCP client v3 1 3 file and all the relevant files as described below hostile cp pati to DHCP client vo iS dhclient mp im trde DS on nosing dois isle LR Re ee o a EE ado siena host1 mkdir p tmp initrd ib var state dhcp hostl touch tmp initrd ib var state dhcp dhclient leases este cp bin uname Em ante di bam hostl cp usr bin expr tmp initrd ib bin NOS Cpr corn NES onto e bin hostl cp bin hostname tmp initrd ib bin Create a configuration file for the DHCP client as described in Section 4 7 3 1 and place it under tmp initrd ib sbin The following is an example of such a file called delien a dont dhclient conf Ine value indicates a hexadecimal number Fora Connects device interface ib send dhep cltent 1dentitier PEZ 008000020 0008022005000 Mellanox Technologies 209 Rev 1 5 3 1 0 0 Step 9 Now you can add the commands for loading the copied modules into the file init Edit the file tmp initrd ib init and add the following lines at the point you wish the IB driver to be loaded The order of the following commands for loading modules is critical M7 echo loading E IA sbin insmod lib modules ipv6 ko lt M echo loading IB driver sbin insmod lib modules ib ib addr ko sbin insmod lib modules ib ib core ko sbin insmod lib modules ib ib mad ko sbin insmod lib modules ib ib sa ko sbin insm
59. re-transmission of packets and therefore in bandwidth loss. This check should be conducted for each port after the driver is loaded. To check for symbol errors, enter:
cat /sys/class/infiniband/<device>/ports/1/counters/symbol_error
The command above is performed on Port 1 of the device <device>. The output value should be 0 if no symbol errors were recorded.
3. Bandwidth is expected to vary between systems. It heavily depends on the chipset, memory and CPU. Nevertheless, the full wire speed should be achieved by the host.
• With IB SDR, the expected unidirectional full-wire speed bandwidth is 900MB/sec.
• With IB DDR and PCI Express Gen 1, the expected unidirectional full-wire speed bandwidth is 1400MB/sec.
• With IB DDR and PCI Express Gen 2, the expected unidirectional full-wire speed bandwidth is 1800MB/sec.
• With IB QDR and PCI Express Gen 2, the expected unidirectional full-wire speed bandwidth is 3000MB/sec.
To check the adapter's maximum bandwidth, use the ib_write_bw utility. To check the adapter's latency, use the ib_write_lat utility.
Note: The utilities ib_write_bw and ib_write_lat are installed as part of Mellanox OFED.
6.3.3 System Performance Troubleshooting
On some systems it is recommended to change the power-saving configuration in order to achieve better performance. This configuration is usually handled by the BIOS. Please contact the system
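The expected bandwidth figures above lend themselves to a simple lookup when comparing a measured ib_write_bw result against the manual's numbers. The sketch below is illustrative only: the function names and the 5% tolerance are assumptions, not part of Mellanox OFED, and the measured value must be supplied by the caller in MB/sec.

```python
# Expected unidirectional full-wire-speed bandwidth in MB/sec, as listed in
# the text. The key is (link speed, PCI Express generation); SDR is listed
# without a PCIe generation.
EXPECTED_MB_PER_SEC = {
    ("SDR", None): 900,
    ("DDR", 1): 1400,
    ("DDR", 2): 1800,
    ("QDR", 2): 3000,
}


def expected_bandwidth(link_speed, pcie_gen=None):
    """Look up the expected bandwidth for a link speed / PCIe generation."""
    key = (link_speed, None) if link_speed == "SDR" else (link_speed, pcie_gen)
    return EXPECTED_MB_PER_SEC[key]


def meets_expectation(measured_mb_per_sec, link_speed, pcie_gen=None,
                      tolerance=0.05):
    """True if the measurement is within `tolerance` of the expected value.

    The tolerance is a hypothetical allowance for run-to-run variation.
    """
    floor = expected_bandwidth(link_speed, pcie_gen) * (1 - tolerance)
    return measured_mb_per_sec >= floor
```

A measurement of 2900 MB/sec on QDR with PCI Express Gen 2 would pass this check, while 1200 MB/sec on DDR with Gen 1 would not.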
60. <root guid file> for adding an UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.
Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
8.5.4 Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports more than K-ary-N-Trees: it handles non-constant K, cases where not all leafs (CAs) are present, and any Constant Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks. If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:
• Tree rank should be between two and eight (inclusively).
• Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all.
• Switches of the same rank should have the same number of DO
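The guid list file rules above (one guid per line, malformed lines discarded) can be modelled with a short Python sketch. This is not OpenSM's parser: the function name is hypothetical, and "valid format" is assumed here to mean 1 to 16 hexadecimal digits with an optional 0x prefix.

```python
import re

# Assumed notion of a well-formed guid: optional 0x prefix followed by
# 1 to 16 hexadecimal digits (a 64-bit value).
GUID_RE = re.compile(r"(0x)?[0-9a-fA-F]{1,16}")


def parse_guid_file(text):
    """Return the guids found in `text`, one per line, skipping bad lines."""
    guids = []
    for line in text.splitlines():
        token = line.strip()
        if GUID_RE.fullmatch(token):
            guids.append(int(token, 16))
        # Lines with an invalid format are discarded, as the notes state.
    return guids
```

Given a file containing a valid guid, a blank line, and a line such as `not-a-guid`, only the valid guid is returned.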
61. then the previous LID of the same LMC group
c. If none, prefer those which go through another NodeGuid
d. Fall back to the number of paths method (if all go to same node).
8.5.1 Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in the fabric switches unless the '-r' (--reassign_lids) option is specified:
-r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.
If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked. In the case of using the file-based routing, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).
8.5.2 Min Hop Algorithm
The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can als
62. these performance optimizations may result in higher power consumption.
6.1.3 Intel Hyper-Threading Technology
Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process, due to the lower frequency of a single-threaded core.
Note: This section applies only to Intel processors supporting Hyper-Threading.
Note: For latency-sensitive and message-rate-sensitive applications, it is recommended to disable Hyper-Threading.
6.2 Performance Tuning for Linux
You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results are significantly dependent on the CPU and chipset efficiency.
6.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
The following changes are recommended for improving IPv4 traffic performance:
• Disable the TCP timestamps option for better CPU utilization:
sysctl -w net.ipv4.tcp_timestamps=0
• Disable the TCP selective acks option for better CPU utilization:
sysctl -w net.ipv4.tcp_sack=0
• Increase the maximum length of processor input queues:
sysctl -w net.core.netdev_max_backlog=250000
• Increase the TCP maximum and default buffer sizes u
63. they are Increase verbosity level May be used several times for Use GUID address argument In most cases it is the Port Use specified threshold file Show the predefined thresholds Optional Use mono mode rather than color mode t Optional lt timeout_ms gt lt port gt Mandatory without G flag Examples Use the specified port I Check aggregated node counter for LID 0x2 gt lbcheckerrs 2 twa Counter warn counter warn counter warn counter warn counter Error check on SymbolErrors 605535 ER ESS T I O E LinkRecovers 255 ene Rota kG Mads 2 ori 755 lankbowned 2 threshold TO Mid 2 pore 235 EGO 0 T mes olor RR XmtDiscards 441 mares E LR L OS OZ MESS E REED GL TLEER D port all 2 Check port counters for LID 2 Port 1 gt ibcheckerrs y 2 I Error check on Ge MRS oC nes eno Sc mao ge TGR RE 3 Check the LID2 Port 1 using the specified threshold file Use the specified channel adapter or router Use the specified port Override the default timeout for the solicited MADs msec lt lid guid gt Mandatory with Use the specified port s or node s LID GUID with G G flag option FAILED OK 190 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 gt Gein AES SymbolErrors 10 LinkRecovers 10 LinkDowned 10 RevErrors 10 RevRemotePhysErrors 100 RcvSwRelayErrors 100 XmtDiscards 100 XmtConstraintErrors 100 RevCo
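The threshold check in the ibcheckerrs example above can be illustrated with a small Python model. This is a sketch under assumptions, not the utility's own logic: the threshold file is assumed to hold one `Name=value` pair per line (the separator is not legible in this copy), a counter is assumed to be flagged when it is strictly greater than its threshold, and both function names are hypothetical.

```python
def parse_thresholds(text):
    """Parse 'Name=value' lines into a dict; '#' starts a comment."""
    thresholds = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            name, value = line.split("=", 1)
            thresholds[name.strip()] = int(value)
    return thresholds


def exceeded(counters, thresholds):
    """Return the counters whose value is above the configured threshold."""
    return {name: value for name, value in counters.items()
            if name in thresholds and value > thresholds[name]}
```

With thresholds of `SymbolErrors=10` and `XmtDiscards=100`, a port reporting 65535 symbol errors and 44 discards would be flagged for SymbolErrors only.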
64. will need to actively turn it off. Running the SM without the CC Manager is not sufficient, as the hardware still continues to function in accordance with the previous CC configuration. For further information on how to turn OFF CC, please refer to Section 8.9.3, "Configuring Congestion Control Manager," on page 158.
8.9.3 Configuring Congestion Control Manager
Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and CC Manager behavior by modifying some of the options. To do so, perform the following:
1. Find the 'event_plugin_options' option in the SM options file, and add the following: conf_file <cc-mgr-options-file-name>
# Options string that would be passed to the plugin(s)
event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>
2. Run the SM with the new options file:
opensm -F <options-file-name>
Note: To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.
For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" on page 158. For further details on the list of CC Manager options, please refer to the IB spec.
8.9.4 Configuring Congestion Control Manager Main Settings
To fine-tune CC mechanism and CC Manager behavior
65. www.mellanox.com > Downloads
Image Burning Procedure
To burn the composite image, perform the following steps:
1. Obtain the MST device name. Run:
mst start
mst status
The device name will be of the form: mt<dev_id>_pci{_cr0|conf0}. [1]
2. Create and burn the composite image. Run:
flint -dev <mst device name> brom <expansion ROM image>
Example on Linux:
flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom
Example on Windows:
flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom
Removing the Expansion ROM Image
Remove the expansion ROM image. Run:
flint -dev <mst device name> drom
[1] Depending on the OS, the device name may be superseded with a prefix.
Note: When removing the expansion ROM image, you also remove FlexBoot from the boot device list.
A.3 Preparing the DHCP Server in Linux Environment
The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB.
A.3.1 Installing the DHCP Server
To add IPoIB support in the DHCP client/server, please refer to docs/dhcp/README
A.3.2 Configuring the DHCP Server
A.3.2.1 For ConnectX Family Devices
When a FlexBoot client boots, it sends the DHC
66. [Figure: QoS example showing application servers, service access points, and per-traffic-class service levels with minimum-bandwidth policies; labels include Traffic class SRP, Traffic class IPoIB, Service Level 1 and Service Level 3]
8.7 QoS Configuration Examples
The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.
8.7.1 Typical HPC Example: MPI and Lustre
Assignment of QoS Levels
• MPI: Separate from I/O load; Min BW of 70%
• Storage Control (Lustre MDS): Low latency
• Storage Data (Lustre OST): Min BW 30%
Administration
• MPI is assigned an SL via the command line:
host1# mpirun -sl 0
• OpenSM QoS policy file
Note: In the following policy file example, replace OST* and MDS* with the real port GUIDs.
qos-ulps
default :0 # default SL (for MPI)
any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
any, target-port-guid MDS1,MDS2 :2 # SL for Lustre MDS
end-qos-ulps
• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 2:1
qos_vlarb_low 0:96,1:224
qos_sl2vl 0,1,2,15,15,15,15,15,15,15,15,15,15,15,15,15
8.7.2 EDC SOA (2-tier): IPoIB and SRP
The following is an example of QoS co
67. 0010dc 0x00004947 0x00004948 0x000052c7 0x000052c8 0x0000530b 0x0000530c 0x0000542f 0x00005430 0x0000634f 0x00006350 0x0000 29b 0x0000 29c 0x0004749b 0x0004749c 0x0005913 0x00059140 0x0007a123 0x0007a124 0x0007bdff 0x0010a4 OMS TT 7 0x000980 0x000044 0x000124 0x000 20 0x008f4c 0x038200 0x011ca4 0x020fe4 0x001cdc POOP Ok POOT Ok Con REG R OK GUID OK Image Info OK DORIS SOK DER OF DDR TOR TURIS TOF DORITE OF DDR OK 194 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 0x0007be00 0x0007eb97 0x002d98 DDR OK 0x0007eb98 0x0007f0af 0x000518 Configuration OK 0x0007 0b0 0x0007f0fb 0x00004c Jump addresses OK 0x0007 0fc 0x0007 2a7 0x000lac FW Configuration OK FW image verification succeeded Image is bootable 9 16 ibv_asyncwatch Applicable Hardware All InfiniBand devices Description Display asynchronous events forwarded to userspace for an InfiniBand device Synopsis ibv_asyncwatch Examples 1 Display asynchronous events gt ibv_asyncwatch SS nu sh Ded 9 17 ibdump Applicable Hardware Mellanox ConnectX ConnectX 2 adapter devices Description Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX ConnectX 2 adapters InfiniBand ports The dump file can be loaded by the Wireshark tool for graphical traffic analys
68. 01 d151 base lid 0x0 eno JLo 0x0 Stater INN phys state pe LinkUp 174 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 AE 10 Gb sec 4X Inftinrband device mentali port 2 Status default gid fe80 0000 0000 0000 0002 c900 0101 d152 base lid 0x0 side 0x0 Skate GIAVERA phys state 5 LinkUp Fares 10 Gb sec 4X 2 List the status of specific ports of specific devices ALO Status tical ib Iii Tata pand device implica pont Status default gid fe80 0000 0000 0000 0002 c900 0101 d151 base lid 0x0 Sud os 0x0 STATER ZO VIA phys state 5 LinkUp Kane 10 Gb sec 4X infame nose R ne uso default gid fe80 0000 0000 0000 0000 0000 0007 3897 base lid 0x1 sida 0x1 States 4 ACTIVE phys state Se aldo rate 20 Gb sec 4X DDR 9 10 ibportstate Applicable Hardware All InfiniBand devices Description Enables querying the logical link and physical port states of an InfiniBand port It also allows adjusting the link speed that is enabled on any InfiniBand port If the queried port is a swich port then ibportstate can be used to e disable enable or reset the port Mellanox Technologies 175 Rev 1 5 3 1 0 0 e validate the port s link width and speed against the peer port Synopsis Porro Mei Ne Cico name E lt a H teo Us lt dest dr path lid guid gt lt portnum gt lt op gt lt value gt Table 18 lists the various flags of the
69. 2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch. In general, it is not necessary to configure both the positive and negative directions for a given coordinate; either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases both the positive and negative coordinate directions must be specified. Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration.
x_dateline position
y_dateline position
z_dateline position
In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior. The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinate radix-1 for that dimension. By default, the common switch in a torus seed is taken as the origin of the coordinate system used to describe switch location. The position parameter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed.
next_seed
If a
70. 8.5.7.3 Torus Topology Discovery
The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus. Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of interswitch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations.
Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and report when it has insufficient configuration for a torus with radix-4 dimensions.
In the event the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and will generate a warning in that case. A similar condi
71. 4.1.8 Reading Port Counters Statistics
It is possible to read port statistics in the same way it is done for regular InfiniBand ports. The information is available from the sysfs at /sys/class/infiniband/<device>/ports/<port number>/counters/ and the supported counters are port_rcv_packets, port_xmit_packets, port_rcv_data and port_xmit_data. These counters count InfiniBand data only and do not account for Ethernet traffic. For example, to read the number of transmitted packets, run:
> cat /sys/class/infiniband/<device>/ports/<port number>/counters/port_xmit_packets
Note: RoCE traffic is not shown in the associated Ethernet device's counters, since it is offloaded by the hardware and does not go through the Ethernet network driver.
4.1.9 A Detailed Example
This section provides a step-by-step example of using InfiniBand over Ethernet (RoCE).
Installation and Driver Loading
The MLNX_OFED installation script installs RoCE as part of mlx4 and mlx4_en and other modules. See Section 2.3, "Installing Mellanox OFED," for details on installation.
Note: The list of the modules that will be loaded automatically upon boot can be found in the configuration file /etc/infiniband/openib.conf.
Enter the following command to display the current run of MLNX_OFED:
ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw ver 2 1100 node gur KEO 0 O0
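Reading the four counters named above can be scripted instead of running cat once per file. The sketch below is a minimal illustration in Python: the function name and the `sysfs_root` parameter are hypothetical conveniences (the latter allows pointing the code at a test directory), and counters that are not present are simply omitted from the result.

```python
from pathlib import Path

# Counter file names as listed in the text.
COUNTERS = ("port_rcv_packets", "port_xmit_packets",
            "port_rcv_data", "port_xmit_data")


def read_port_counters(device, port, sysfs_root="/sys/class/infiniband"):
    """Return {counter_name: value} for the counters that exist."""
    base = Path(sysfs_root) / device / "ports" / str(port) / "counters"
    values = {}
    for name in COUNTERS:
        path = base / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values
```

For example, `read_port_counters("mlx4_0", 1)` would return the current values for port 1 of device mlx4_0 on a host where that device exists, and an empty dictionary otherwise.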
72. 400000 pkt-rate-high: 450000 rx-usecs: 16 rx-frames: 88 rx-usecs-irq: 0 rx-frames-irq: 0
[The remainder of this listing and the command that follows it are not legible in this copy.]
> ethtool -c eth1
Coalesce parameters for eth1: Adaptive RX: off TX: off pkt-rate-low: 400000 pkt-rate-high: 450000 rx-usecs: 0 rx-frames: 0
6.2.4 Interrupt Affinity
The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.
The following command turns off the IRQ balancer in RedHat:
> /etc/init.d/irqbalance stop
The following command turns off the IRQ balancer in SLES:
> /etc/init.d/irq_balancer stop
The following command assigns the affinity of a single interrupt vector:
> echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity
where bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.
6.2.4.1 Example Script for Setting Interrupt Affinity
Note: On systems that support NUMA, it is recommended to set IRQs from different network devices to processor cores that reside on different physical CPU
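The bit-mask rule above (bit i set means core i services the interrupt) is easy to get wrong by hand, so a small helper can be useful. The following Python sketch is an illustration, not the manual's own script: the function names and the `proc_root` parameter are hypothetical, and the helper is limited to cores 0 to 31 so that the mask fits in a single 32-bit group.

```python
def affinity_mask(cores):
    """Return the hexadecimal bit mask with bit i set for each core i."""
    mask = 0
    for core in cores:
        if not 0 <= core < 32:
            # Larger systems need comma-separated 32-bit groups, which this
            # sketch does not produce.
            raise ValueError("core index out of range for this sketch")
        mask |= 1 << core
    return format(mask, "x")


def set_irq_affinity(irq, cores, proc_root="/proc/irq"):
    """Write the mask to <proc_root>/<irq>/smp_affinity (needs root)."""
    with open("%s/%s/smp_affinity" % (proc_root, irq), "w") as handle:
        handle.write(affinity_mask(cores))
```

For example, cores 0 and 2 give the mask 5, and cores 4 through 7 give f0.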
73. 9 perfquery Flags and Options
ibcheckerrs Flags and Options
mstflint Switches
mstflint Commands
ibdump Options
Revision History
Rev 1.5.3-1.0.0 (July 2011)
• Added Section 4.10, "Socket Acceleration," on page 90
• Added Section 4.11, "Huge Pages Support for Queue Resources," on page 91
• Updated Section 8.9, "Congestion Control," on page 158
• Added Section 8.5.7.6, "Torus-2QoS Configuration File Syntax," on page 137
• Updated Section A, "Mellanox FlexBoot," on page 198
Rev 1.5.2-2.1.0 (March 03, 2011)
• New version of MLNX_OFED; no changes to this document
Rev 1.5.2 (October 16, 2010)
• Complete reorganization of the document's chapters
• Removed Section 10, "NFSoRDMA," on page 86
• Added Section 4.9, "Atomic Operations," on page 88 and its subsections
• Updated Section 2.3, "Installing Mellanox OFED," on page 22
• Removed ibspark tool
• Added Section 8.5.7, "Torus-2QoS Routing Algorithm," on page 131
Rev 1.5.1.3 (September 16, 2010)
• Added Section 5.1.2, "Firmware Dependencies," on page 61
• Updated Section 5.1.7, "Reading Port Counters Statistics," on page 63
• Updated Section 5.1, "Port Type Management," on page 93
• Added Section 4.2.2.4, "Enabling/Disabling FCoE Services," on page 46
Rev 1.5
74. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.
[torus|mesh] x_radix[m|M|t|T] y_radix[m|M|t|T] z_radix[m|M|t|T]
Either torus or mesh must be the first keyword in the configuration, and sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology.
Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).
xp_link sw0_GUID sw1_GUID
yp_link sw0_GUID sw1_GUID
zp_link sw0_GUID sw1_GUID
xm_link sw0_GUID sw1_GUID
ym_link sw0_GUID sw1_GUID
zm_link sw0_GUID sw1_GUID
These keywords are used to seed the torus/mesh topology. For example, "xp_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the positive x direction, while "xm_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x
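The radix suffix rule above can be checked with a short Python sketch that parses a torus/mesh line into per-dimension (radix, looped) pairs. This is an illustrative model, not the OpenSM parser: the function name is hypothetical, and the sketch assumes exactly three radix tokens follow the keyword, with any further tokens ignored as the text describes.

```python
def parse_topology_line(line):
    """Return [(radix, looped), ...] for the x, y and z dimensions."""
    tokens = line.split()
    if len(tokens) < 4 or tokens[0] not in ("torus", "mesh"):
        raise ValueError("expected: torus|mesh x_radix y_radix z_radix")
    default_looped = tokens[0] == "torus"
    dims = []
    for token in tokens[1:4]:  # tokens after the three radix values are ignored
        looped = default_looped
        if token[-1] in "mM":
            looped, token = False, token[:-1]   # mesh (open) dimension
        elif token[-1] in "tT":
            looped, token = True, token[:-1]    # torus (looped) dimension
        dims.append((int(token), looped))
    return dims
```

With this model, "mesh 3T 4 5" and "torus 3 4M 5M" both parse to a looped x dimension of radix 3 and open y and z dimensions of radix 4 and 5, which matches the equivalence stated in the text.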
75. Band fabric to a 1 or 10 GigE Ethernet subnet.
4.6.1 Ethernet over IB Topology
EoIB is designed to work over an InfiniBand fabric and requires the presence of two entities:
• Subnet Manager (SM): The required subnet manager configuration is not unique to EoIB but rather similar to other InfiniBand applications and ULPs.
• BridgeX gateway: The BridgeX gateway is at the heart of EoIB. On one side, usually referred to as the "internal" side, it is connected to the InfiniBand fabric by one or more links. On the other side, usually referred to as the "external" side, it is connected to the Ethernet subnet by one or more ports. The Ethernet connections on the BridgeX's external side are called external ports, or eports. Every BridgeX that is in use with EoIB needs to have one or more eports connected.
4.6.1.1 External Ports (eports) and Gateway
The combination of a specific BridgeX box and a specific eport is referred to as a gateway. The gateway is an entity that is visible to the EoIB host driver and is used in the configuration of the network interfaces on the host side. For example, in the host-administered vNics the user will request to open an interface on a specific gateway, identifying it by the BridgeX box and eport name. Distinguishing between gateways is essential because they determine the network topology and affect the path that a packet traverses between hosts
76. CAs. This driver was not tested by Mellanox Technologies.
• CXGB3: Provides RDMA and NIC support for the Chelsio S series adapters. This driver was not tested by Mellanox Technologies.
• NES: Support for the NetEffect Ethernet Cluster Server Adapters. This driver was not tested by Mellanox Technologies.
• Documentation
1.3.3 Firmware
The ISO image includes the following firmware items:
• Firmware images (.mlx format) for ConnectX and ConnectX-2 network adapters
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX/ConnectX-2 HCA devices
• ConnectX EN PXE (gPXE boot) for ConnectX EN and ConnectX-2 EN devices
1.3.4 Directory Structure
The ISO image of MLNX_OFED_LINUX contains the following files and directories:
• mlnxofedinstall: the MLNX_OFED_LINUX installation script
• uninstall.sh: the MLNX_OFED_LINUX un-installation script
• <RPMS folders>: directory of binary RPMs for a specific CPU architecture
• firmware/: directory of the Mellanox IB HCA firmware images (including Boot-over-IB)
• src/: directory of the OFED source tarball and the Mellanox Firmware Tools (MFT) tarball
• docs/: directory of Mellanox OFED related documentation
1.4 Architecture
Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user spaces. The application
77. CoE switch is configured for automatic FCoE negotiation.
• MTU: if the MTU of the Ethernet device is changed from the default 1500, put the correct value here.
Configure the mlx4_en Ethernet driver to support PFC. Add the following line to the file /etc/modprobe.conf and restart the network driver:
options mlx4_en pfctx=0xff pfcrx=0xff
4.2.2.2 Starting FCoE Service
Make sure the network is up:
modprobe mlx4_en
Then run:
> /etc/init.d/mlxfc start
vHBAs will be instantiated on DCBX-monitored interfaces, and SCSI LUNs will get mapped. For manual instantiation of vHBAs, please see Section 4.2.3.1, "Manual vHBA Control".
4.2.2.3 Stopping FCoE Service
Run:
> /etc/init.d/mlxfc stop
Note: Only when the mlxfc service is stopped and the mlx4_en module is removed can the mlx4_core module be removed as well.
4.2.2.4 Enabling/Disabling FCoE Services
To enable or disable FCoE upon boot, edit the file /etc/mlxfc/mlxfc.conf and set the following variables to either YES or NO:
# Start FCoE
FCOE=Yes|No
4.2.3 FCoE Advanced Usage
Advanced usage will probably be needed when connected to FCoE switches that do not support the Cisco-like FCoE DCBX auto-negotiation.
4.2.3.1 Manual vHBA Control
Manual control allows creating and destroying vHBAs, and signaling link-up and link-down to existing vHBAs. This is done using sysfs operations. When using the pre
79. 02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
Link Width: 8x
PCI Link Speed: 2.5Gb/s
Installation finished successfully.
In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following:
> Installation finished successfully. The firmware version 2.9.1000 is up to date.
Note: To force firmware update, use the '--force-fw-update' flag.
In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
> Error message:
-I- Querying device
-E- Can't auto detect fw configuration file
Step 4. In case the installation script performed firmware
80. For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics. Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above. Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as clos
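The disjoint-segment condition from the text above can be illustrated with a small Python sketch. It is not the torus-2QoS implementation; the function name ring_segments is hypothetical, the ring is assumed to have at least three switches, and parallel links are assumed to be collapsed into a single logical link that counts as failed only when the last of them fails.

```python
def ring_segments(ring, failed_links):
    """Split a 1D ring into the segments that remain connected.

    ring: switch names in ring order; the last switch links back to the first.
    failed_links: iterable of (switch_a, switch_b) pairs that are down.
    Hypothetical helper for illustration only.
    """
    n = len(ring)
    failed = {frozenset(link) for link in failed_links}
    # Index i is a cut when the link ring[i] -> ring[i + 1] has failed.
    cuts = [i for i in range(n)
            if frozenset((ring[i], ring[(i + 1) % n])) in failed]
    if not cuts:
        return [list(ring)]          # intact ring
    segments = []
    for k, cut in enumerate(cuts):
        # A segment runs from just after this cut up to the next cut.
        end = cuts[(k + 1) % len(cuts)]
        i = (cut + 1) % n
        segment = [ring[i]]
        while i != end:
            i = (i + 1) % n
            segment.append(ring[i])
        segments.append(segment)
    return segments


# The example from the text: ring m-S-n-T-o-p-m with n-T and T-o failed.
ring = ["m", "S", "n", "T", "o", "p"]
print(ring_segments(ring, [("n", "T"), ("T", "o")]))
```

A single failed link leaves one segment (the ring becomes a line and stays routable); two or more failed links on the same ring yield more than one segment, which is the case the text says torus-2QoS refuses to route.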
81. HA_ENABLE=yes
• To set up and use the HA feature, you need the dm-multipath driver and multipath tool.
• Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable/use the HA feature.
The following is an example of an SRP Target setup file:
*********************** srpt.sh *******************************
#!/bin/sh
PST SEST scst untreads 1 0 Upro eses Eos Es Es ds SRD eco pen a de actos SICURI CIO Deve ceo toy dls desk echo open ars der sdb BlOCkIO gt pros sos Eg vdrsk uds Schou pedis den sc BOC Y pros CORI vas asi echo open vdisk3 dev sdd BLOCKIO gt proc scsi tgt vdisk vdisk Scho def proc eg Groups Morante devices eno add fur EE RT Ses toh groupe Derault devices cl II oo Ss deg e 009 Dota EN RES cano adds OS tr devices modprobe ib srpt so e a peo Seal ec ie ere SE sel Scion ada nto PEDO Ses iia io Vee Sn a eis oa A OO sal ja is el
*********************** End srpt.sh ***************************
B.3 How to Unload/Shutdown
1. Unload ib_srpt: odos H sue one
2. Unload scst and its dev_handlers first: GETO uo Sesto molle asi
3. Unload ofed: /etc/rc.d/openibd stop
Appendix C: mlx4 Module Parameters
In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
options mlx4_core parameter=<value>
and/or
options mlx4
82. HE dos a Ur 0 o carl eg qos ewe vlaro low O0 ls 224 ode 914 080 124 024 od 0st sa ey Los 1474 gostene sla di0 0401
VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15, or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry. Note that the same VLs may be listed multiple times in the High or Low priority arbitration tables and, further, it can be listed in both tables. The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes. A high_limit value of 255 indicates that the byte limit is unbounded.
Note: If the 255 value is used, the low-priority VLs may be starved.
A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low pri
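The arithmetic in the description above can be made concrete with a short Python sketch. It assumes that "4K bytes" means 4096 bytes and takes the table as a list of (VL, weight) tuples rather than parsing any options-file syntax; both function names are hypothetical.

```python
def high_limit_bytes(high_limit):
    """Bytes of high-priority traffic allowed before a low-priority chance.

    Returns None when the limit is unbounded (high_limit == 255).
    A result of 0 stands for the 'single packet' case described in the text.
    Assumes 4K bytes == 4096 bytes.
    """
    if not 0 <= high_limit <= 255:
        raise ValueError("high_limit must be in 0..255")
    if high_limit == 255:
        return None
    return high_limit * 4096


def vl_credit_bytes(entries):
    """Map (vl, weight) table entries to (vl, bytes per arbitration turn).

    Entries with weight 0 are skipped, as the text specifies.
    """
    result = []
    for vl, weight in entries:
        if not 0 <= vl <= 14:
            raise ValueError("VL must be in 0..14")
        if not 0 <= weight <= 255:
            raise ValueError("weight must be in 0..255")
        if weight == 0:
            continue
        result.append((vl, weight * 64))   # weight counts 64-byte units
    return result
```

For instance, under these assumptions a weight of 4 allows 256 bytes per turn, and a high_limit of 2 allows 8192 bytes of high-priority traffic before low-priority VLs get an opportunity.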
83. IP addresses. The following code lines are an excerpt from a sample IPoIB configuration file:
# Static settings; all values provided by this file
IPADDR_ib0=11.4.3.175
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
# Based on eth0; each '*' will be replaced with a corresponding octet from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
# Based on the first eth<n> interface that is found (for n=0,1,...);
# each '*' will be replaced with a corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
4.7.3.3 Manually Configuring IPoIB
Note: This manual configuration persists only until the next reboot or driver restart.
To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:
Step 1. To configure the interface, enter the ifconfig command with the following items:
• The appropriate IB interface (ib0, ib1, etc.)
• The IP address that you want to assign to the interface
• The netmask keyword
• The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. Option
84. IZ oor Ei CEUC AGEING_TIME 77 SWITCH 0x0002c902004050 8 AGEING_TIME 44 SWITCH 0xabcde ENABLE false
8.9 Congestion Control
8.9.1 Congestion Control Overview
Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation. The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).
8.9.2 Running OpenSM with Congestion Control Manager
Congestion Control (CC) Manager can be enabled/disabled through the SM options file. To do so, perform the following:
1. Create the file. Run:
opensm -c <options-file-name>
2. Find the event_plugin_name option in the file, and add ccmgr to it:
# Event plugin name(s)
event_plugin_name ccmgr
3. Run the SM with the new options file:
opensm -F <options-file-name>
Note: Once the Congestion Control is enabled on the fabric nodes, to completely disable Congestion Control you
85. InfiniBand-specific functions and plugs into the InfiniBand midlayer.
mlx4_en: A 10GigE driver under drivers/net/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer.
mlx4_fc: Handles the FCoE functions using ConnectX/ConnectX-2 Fibre Channel hardware offloads.
1.4.3 Mid-layer Core
Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.
1.4.4 Open-FCoE
The FCoE feature is based on and interacts with the Open-FCoE project. Mellanox OFED includes the following open-fcoe.org modules: libfc and fcoe. See Section 4.2, "Fibre Channel over Ethernet".
1.4.5 ULPs
IPoIB: The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB prepends an encapsulation header to the IP datagrams and sends the outcome over the InfiniBand transport service. The transport service is Reliable Connected (RC) by default, but it may also be configured to be Unreliable Datagram (UD). The interface supports unicast, multicast and broadcast. For details, see Chapter 4.7, "IP over InfiniBand".
RoCE: RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) tran
86. LES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:
ts ptos o ERTI esta e aber E EI ios Ste Este Lo peso volevo nese ete quis Lei sip UNO iso Eset regis ter ES episcopato Is EE HI nego DO Ss cs tarea rstertraroentrenplate
B. On Initiator Machines
On Initiator machines, manually perform the following steps:
1. Run: modprobe ib_srp
2. Run: ibsrpdm -c -d /dev/infiniband/umadX (to discover a new SRP target)
umad0: port 1 of the first HCA
umad1: port 2 of the first HCA
umad2: port 1 of the second HCA
3. echo <new target info> > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered scsi disks)
Example: Assume that you use port 1 of the first HCA in the system, i.e. mthca0.
[root@lab104 ~]# ibsrpdm -c -d /dev/infiniband/umad0
id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
[root@lab104 ~]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
OR
• You can edit /etc/infiniband/openib.conf to load the SRP driver and the SRP High Availability (HA) daemon automatically, that is, set SRP_LOAD=yes and SRP
87. Linux
In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:
dhclient -cf <client conf file> <IB network interface name>
Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:
# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf:
# The value indicates a hexadecimal number
HEG 2 T Viol 4 send dhcp-client-identifier 203 003553 reo AO G 00 00s 00 DDT G DUDI DZ GR UDS 1 219 SI lt
In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1
4.7.3.2 Static IPoIB Configuration
If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the '-n' option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
• A static IPoIB configuration
• An IPoIB configuration based on an Ethernet configuration
See your Linux distribution documentation for additional information about configuring
88. Lun 0 Path=/dev/sda5,Type=fileio
Example of an iSCSI Target iqn line:
Target iqn.2007-08.7.3.4.10:iscsiboot
Step 4. Start your iSCSI Target. Example:
host1# /etc/init.d/iscsitarget start
Configuring the DHCP Server to Boot From an iSCSI Target
Configure DHCP as described in Section 4.7.3.1, "IPoIB Configuration Based on DHCP". Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target:
Filename Opinion E eee n
The following is an example for configuring an IB/ETH device to boot from an iSCSI target:
host hostii filename MW
# For a ConnectX device with ports configured as InfiniBand, comment out the following line
option dhcp-client-identifier fe OO S Aa Aa 2 DIU 2 DU 9 Z 6099 9 02 Ok Ore TIS Z
# For a ConnectX device with ports configured as Ethernet, comment out the following line
hardware ethernet 00:02:c9:00:00:bb;
A.11 WinPE
Mellanox FlexBoot enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe
Appendix B: SRP Target Driver
The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or I
89. MACs where applicable. These values can be set later using the "sg" command (see Table 24 below).
-guids <GUIDs> (burn, sg): 4 GUIDs must be specified here. The specified GUIDs are assigned the following values, respectively: node, port1, port2 and system image GUID. Note: Port2 guid must be specified even for a single port HCA; the HCA
Table 23 - mstflint Switches (Sheet 2 of 2)
Switch: Affected/Relevant Commands: Description
-byte_mode: burn, write: Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
-yes: Non-interactive mode. Assume the answer is "yes" to all questions.
-no: Non-interactive mode. Assume the answer is "no" to all questions.
-vsd <string>: burn: Write this string, of up to 208 characters, to VSD upon a burn command.
-use_image_ps: burn: Burn vsd as it appears in the given image; do not keep the existing VSD on Flash.
-dual_image: burn: Make the burn process burn two images on Flash. The current default fail-safe burn process burns a single image (in alternating locations).
Table 24 - mstflint Commands
bb: Burn Block. Burn the given image as is, without running any checks.
dc <out-file>: Dump Configuration. Print a firmware configuration file for the given image to the specified output file.
wbne <addr> <size>: Write a data block to Flash without sector erase
90. Mellanox TECHNOLOGIES Mellanox OFED for Linux User Manual Rev 1.5.3-1.0.0 www.mellanox.com NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Mellanox TECHNOLOGIES Mellano
91. Mellanox OFED
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack based on the OpenFabrics (OFED) Linux stack. It operates across all Mellanox network adapter solutions supporting 10, 20 and 40Gb/s InfiniBand (IB); 10Gb/s Ethernet (10GigE); Fibre Channel over Ethernet (FCoE); and 2.5 or 5.0 GT/s PCI Express 2.0 uplinks to servers. All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions. Mellanox OFED is certified with the following products:
• Mellanox Messaging Accelerator (VMA) software: Multicast socket acceleration library that performs OS bypass for standard socket-based applications.
• Mellanox Unified Fabric Manager (UFM) software: Powerful platform for managing demanding scale-out computing fabric environments, built on top of the OpenSM industry standard routing engine.
• Fabric Collective Accelerator (FCA): FCA is a Mellanox MPI-integrated software package that utilizes CORE-Direct technology for implementing the MPI collectives communications.
1.2 Introduction to Mellanox VPI Adapters
Mellanox VPI adapters, which are based on Mellanox ConnectX and ConnectX-2 adapter devices, provide leading server and storage I/O performance with flexibility to support the myriad of communication protocols and network fabrics over a single device, without sacrificing functionality when consolida
92. O unit and provides storage services. See Chapter 4.5, "SCSI RDMA Protocol" and Appendix B, "SRP Target Driver".
uDAPL: User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative. This release of the uDAPL reference implementation package for both DAT 1.2 and 2.0 specification is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack. For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org
1.4.6 MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:
• Open MPI: an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University
Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.
1.4.7 InfiniBand Subnet Manager
All InfiniBand-compliant ULPs require a proper operation of a Subnet Manager (SM) running on the InfiniBand fabric, at all times. An SM can run on any node or on an IB switch. OpenSM is an
93. P server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID (all separated by colons and represented in hexadecimal digits).
Extracting the Port GUID - Method I
To obtain the port GUID, run the following commands:
Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
host1# mst start
host1# mst status
The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via the following query command:
flint -d <MST_DEVICE_NAME> q
Example with ConnectX-2 QDR (MHJH29B-XTR Dual 4X IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R6 HCA Card, CX4 Connectors) as the adapter device:
Image type: ConnectX
FW Version: 2 OO
Rom Info: type=PXE version=3.3.400 devid=26428 proto=VPI
Device ID: 26428
Description: Node Port1 Port2 Sys image
GUIDs: 000260 000e ETa TOO 90000500002 0998000500 0002c9030005cffd
MACs: IIS rea WAS Semis
Board ID: (MT_0DD0110009)
VSD:
PSID: MT_0DD0110009
Assuming that FlexBoot is connected via Port 1, then the Port GUID is 00:02:c9:03:00:00:10:39.
Extracting the Port GUID - Method II
An alternative method for obtain
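The composition rule above (fixed prefix followed by the 8-byte port GUID, colon-separated hexadecimal) can be sketched in Python as follows. The function name dhcp_client_identifier is hypothetical; the prefix and the example GUID are the ones quoted in the text.

```python
# Prefix quoted in the text for the DHCP client identifier.
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def dhcp_client_identifier(port_guid):
    """Build the DHCP client identifier from an 8-byte port GUID.

    port_guid may be written with or without colons, e.g.
    '00:02:c9:03:00:00:10:39' or '0002c90300001039'.
    Hypothetical helper for illustration only.
    """
    digits = port_guid.lower().replace(":", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("port GUID must be 8 bytes (16 hex digits)")
    octets = [digits[i:i + 2] for i in range(0, 16, 2)]
    return CLIENT_ID_PREFIX + ":" + ":".join(octets)


print(dhcp_client_identifier("00:02:c9:03:00:00:10:39"))
```

The resulting string is the kind of value that would go after send dhcp-client-identifier in a dhclient configuration file, or after option dhcp-client-identifier in the DHCP server's host entry.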
94. PI sarai leleine Aerea ae 164
9.3.2 Output Files 165
9.3.3 Error Codes 166
9.4 ibdiagnet (of ibutils) - IB Net Diagnostic 166
9.4.1 SYNOPSYS 166
9.4.2 Output Files 168
9.4.3 ERROR CODES 169
9.5 ibdiagpath - IB diagnostic path 169
9.5.1 SYNOPSYS 169
9.5.2 Output Files 170
9.5.3 ERROR CODES 170
9.6 ibv_devices 171
9.7 ibv_devinfo 171
9.8 ibdev2netdev 173
9.8.1 SYNOPSYS 173
9.9 ibstatus 173
9.10 ibportstate 175
9.11 ibroute 179
9.12 smpquery 182
9.13 perfquery 186
9.14 ibcheckerrs 189
9.15 mstflint 191
9.16 ibv_asyncwatch
95. PoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on. An IPoIB configuration can be based on DHCP (Section 4.7.3.1) or on a static configuration (Section 4.7.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.7.3.3).
4.7.3.1 IPoIB Configuration Based on DHCP
Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:
For RedHat: BOOTPROTO=dhcp
For SLES: BOOTPROTO=dhcp
Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine, or under /etc/sysconfig/network/ on a SuSE machine.
Note: A patch for DHCP is required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages c
96. Products > InfiniBand/VPI SW/Drivers. The binary code is exported by the device as an expansion ROM image.
A.1.1 Supported Mellanox Adapter Devices and Firmware
The package supports all ConnectX, ConnectX-2 and ConnectX-3 network adapter devices and cards. Specifically, adapter products responding to the following PCI Device IDs are supported:
ConnectX/ConnectX-2/ConnectX-3 devices:
• Decimal 25408 (Hexadecimal 6340)
• Decimal 25418 (Hexadecimal 634a)
• Decimal 25448 (Hexadecimal 6368)
• Decimal 26418 (Hexadecimal 6732)
• Decimal 26428 (Hexadecimal 673c)
• Decimal 26438 (Hexadecimal 6746)
• Decimal 26448 (Hexadecimal 6750)
• Decimal 25458 (Hexadecimal 6372)
• Decimal 26458 (Hexadecimal 675a)
• Decimal 26468 (Hexadecimal 6764)
• Decimal 26478 (Hexadecimal 676e)
• Decimal 4096 (Hexadecimal 1000)
• Decimal 4097 (Hexadecimal 1001)
• Decimal 4099 (Hexadecimal 1003)
• Decimal 4098 (Hexadecimal 1002)
• Decimal 4100 (Hexadecimal 1004)
• Decimal 4101 (Hexadecimal 1005)
• Decimal 4102 (Hexadecimal 1006)
• Decimal 4103 (Hexadecimal 1007)
• Decimal 4104 (Hexadecimal 1008)
• Decimal 4105 (Hexadecimal 1009)
• Decimal 4106 (Hexadecimal 100a)
• Decimal 4107 (Hexadecimal 100b)
• Decimal 4108 (Hexadecimal 100c)
• Decimal 4109 (Hexadecimal 100d)
• Decimal 4110 (Hexadecimal 100e)
• Decimal 4111 (Hexadecimal 100f)
• Decimal 4112 (Hexadecimal 1010)
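Tools such as lspci usually show PCI Device IDs in hexadecimal, while this manual also quotes them in decimal. As a small convenience sketch (the helper name pci_device_id_hex is hypothetical), the conversion between the two notations listed above is:

```python
def pci_device_id_hex(decimal_id):
    """Return the 4-digit lowercase hexadecimal form of a PCI Device ID."""
    if not 0 <= decimal_id <= 0xFFFF:
        raise ValueError("PCI Device IDs are 16-bit values")
    return format(decimal_id, "04x")


# A few of the IDs from the list above, in decimal.
for device_id in (25408, 26428, 4099, 4111):
    print(device_id, pci_device_id_hex(device_id))
```

Going the other way, int("673c", 16) gives the decimal form of a hexadecimal ID.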
97. S Class. The CMA then creates a Service ID for the ULP and passes this ID and the optional QoS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.
PathRecord and MultiPathRecord Enhancement for QoS
As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided. A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for existing access layer and ULPs by not introducing a new set of PR/MPR attributes.
4.8.3 Supported Policy
The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:
I. Port Group
A set of CAs, Routers or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.
II. Fabric Setup
Defines how the SL2VL and VLArb tables should be set up.
Note: In OFED, this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).
III. QoS Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and, optionally, Max MTU, Max Rate, Packet L
98. SDP
SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:
sdp : <SL>
any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
8.6.6.3 RDS
Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes a default Service ID of 0x00000000010648CA. The following two match rules are equivalent:
rds : <SL>
any, service-id 0x00000000010648CA : <SL>
8.6.6.4 SRP
The Service ID for SRP varies from storage vendor to vendor, thus SRP query is matched by the target IB port GUID. The following two match rules are equivalent:
srp, target-port-guid <GUID> : <SL>
any, target-port-guid <GUID> : <SL>
Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port guid only) should be placed at the end of the qos-ulps match rules.
8.6.6.5 MPI
SL for MPI is manually configured by the MPI admin. OpenSM is not forcing any SL on the MPI traffic, and that is why it is the only ULP that did not appear in the qos-ulps section.
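The Service ID layouts described above (a fixed upper part with the 16-bit port number in the low four hex digits) can be illustrated with a short Python sketch. The constant and function names are hypothetical; the base values and the RDS default port are taken from the text.

```python
# Fixed upper parts of the Service IDs quoted in the text.
SDP_SERVICE_ID_BASE = 0x0000000000010000
RDS_SERVICE_ID_BASE = 0x0000000001060000
RDS_DEFAULT_PORT = 0x48CA


def service_id(base, port):
    """Combine a ULP base value with a 16-bit TCP/IP port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return base | port


def format_service_id(value):
    """Render a Service ID as a 64-bit hexadecimal string."""
    return "0x{:016X}".format(value)


# Default RDS Service ID, as stated in the text.
print(format_service_id(service_id(RDS_SERVICE_ID_BASE, RDS_DEFAULT_PORT)))
# Lowest and highest SDP Service IDs, i.e. the range in the 'any' rule.
print(format_service_id(service_id(SDP_SERVICE_ID_BASE, 0x0000)))
print(format_service_id(service_id(SDP_SERVICE_ID_BASE, 0xFFFF)))
```

This also shows why the "sdp" shorthand is equivalent to the explicit service-id range: every possible port value maps into 0x0000000000010000 through 0x000000000001ffff.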
99. SK IS SET mask attr mask amp attr Dit EHET carry 0 gender For L 0 to 63 LT a lea Q PESO Bonos Genter lt lt lt ol Dated ress pit rada encon is oo Va DO LOS E MASK IS SET compare add bit position amp new carry LE DEL ren alice Pon Moon COLON iNew rGatry 4s MASK ls Scompare add mask pit position Tetu tonic nte s pones 4.10 Socket Acceleration. Note: Socket Accelerator is still at beta level in this release. 4.10.1 Overview. Socket acceleration reduces the latency of the supported functions (recv, poll, select, epoll) on all sockets that match the policy rules mechanism. 4.10.2 MLX4 Socket Acceleration Module Configuration. To configure the MLX4 Socket Acceleration module, perform the following steps: 1. Unload the mlx4_en kernel module: modprobe -r mlx4_en. Then add the following at the end of the configuration file (either modprobe.conf or modprobe.d/mlx4_en.conf): opeiono miden tenable ex tace lol 2. Reload the mlx4_en kernel module: modprobe mlx4_en. 3. Load the mlx4 acceleration modules: modprobe mlx4_accl_sys; modprobe mlx4_accl. 4. Set the policy rules according to the usages below: i schio Voglio se gt Vanoise cee joo live Usage (add/remove rule): add remove <app> tcp connect tcp accept udp bind a b c d <n> <x> <y>
100. T11 stack, the sysfs directory is located at /sys/class/mlx4_fc. When using the T11 stack, the sysfs directory is located at /sys/module/fcoe. Both directories contain the same entries. In the following, the sysfs directory will be referred to as $FCSYSFS. To create a new vHBA on an Ethernet interface (e.g., eth3), run: echo eth3 > $FCSYSFS/create. To destroy a previously created vHBA on an interface (e.g., eth3), run: echo eth3 > $FCSYSFS/destroy. To signal link up to an existing vHBA (e.g., on eth3), run: echo eth3 > $FCSYSFS/link_up. To signal link down to an existing vHBA (e.g., on eth3), run: echo eth3 > $FCSYSFS/link_down. 4.2.3.2 Creating vHBAs That Use PFC. To create a vHBA that uses the PFC feature, it is required to configure the Ethernet driver to support PFC, create a VLAN Ethernet interface, assign it a priority, and start a vHBA on the interface. The following steps demonstrate the creation of such a vHBA. To configure the mlx4_en Ethernet driver to support PFC, add the following line to the file /etc/modprobe.conf and restart the network driver: Options Mix4 core prerx Uxtt pierx Uxit To create a VLAN with an ID (e.g., 55) on an interface (e.g., eth3), run: fe vVeonrVg add vets 55 omite pasooo To set the map of skb priority 0 to the requested VLAN priority (e.g., 6), run: Acc aio bres res H o UNO
101. U utilization. This mode is called "ZCopy combined" mode. The sendmsg syscall is blocked until the buffer is transferred to the socket's peer, and the data is copied directly from the user buffer at the source side to the user buffer at the sink side. To set the threshold, use the module parameter sdp_zcopy_thresh. This parameter can be accessed through sysfs (/sys/module/ib_sdp/parameters/sdp_zcopy_thresh). Setting it to 0 disables ZCopy. 4.5 SCSI RDMA Protocol. 4.5.1 Overview. As described in Section 1.4.5, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services. Section 4.5.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target. 4.5.2 SRP Initiator. This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document T10/1524-D, available from http://www.t10.org. The SRP Initiator supports: Basic SCSI Primary Commands 3
102. VLAN tag will be present in all EoIB packets sent by the vNics and will be verified on all packets received on the vNic. When a packet is passed from InfiniBand to Ethernet, the EoIB encapsulation is disassembled but the VLAN tag remains. For example, if the vNic eth23 is associated with a vHub that uses BridgeX "bridge01", eport A10, and VLAN tag 8, all incoming and outgoing traffic on eth23 will use a VLAN tag of 8. This is enforced by both BridgeX and destination hosts. When a packet is passed from the internal fabric to the Ethernet subnet through the BridgeX, it will have a "true" Ethernet VLAN tag of 8. The VLAN implementation used by EoIB leaves the operating system unaware of VLANs. This is in many ways similar to switch tagging, in which an external Ethernet switch adds/strips tags on traffic, removing the need for OS intervention. EoIB does not support OS-aware VLANs in the form of vconfig. Configuring VLANs: To configure a VLAN tag for a vNic, add the VLAN tag property to the configuration file in host administered mode, or configure the vNic on the appropriate vHub in network administered mode. In host administered mode, when a vHub with the requested VLAN tag is not available, the vNic's login request will be rejected. Host administered VLAN configuration in the centralized configuration file can be modified as follows: add "vid=<VLAN tag>", or remove the vid property for no VLAN. Host administered VLAN configuration with ifc
103. Value gt gt Print any provided pm that is greater than its provided value. Ona Provide a report of the fabric qualities. SS Indicate that UpDown credit loop checking should be done against automatically determined roots. SW SAR ex 22 x Specify the expected link width. sd Specify the expected link speed. skip <ibdiag_check>: Skip the execution of the given stage. Applicable to the following stages: dup_guids, lids, links, sm, nodes_info, all (default: None). ONE tierce eld e whe das Specify the directory where the output files will be placed (default: /var/tmp/ibdiagnet2). SS Spec une Entre sola Tor priming reno screen default SIE Mido Print this help message. V, version: Print the version of the tool. 9.3.2 Output Files. Table 13 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2. Table 13: ibdiagnet of ibutils2 Output Files. An ibdiagnet run performs the following stages: fabric discovery; duplicated GUIDs detection; links in INIT state and unresponsive links detection; counters fetch; error counters check; routing checks; link width and speed checks. 9.3.3 Return Codes. 0: Success. 1: Failure (with description). 9.4 ibdiagnet (of ibutils) - IB Net Diagnostic. Note: This version of ibdiagnet is included in the ibutils package, and it is run by default after installi
104. WN-going port groups, unless they are leaf switches. Switches of the same rank should have the same number of ports in each UP-going port group. Switches of the same rank should have the same number of ports in each DOWN-going port group. All the CAs have to be at the same tree level (rank). If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules: the tree rank should be between two and eight (inclusive); all the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks. Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures that cause the topology to no longer be a "pure" fat-tree. Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be. The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering f
105. able 1 Table 2 Table 3 Table 4 Table 5 Table 6 Table 7 Table 8 Table 9 Table 10 Table 11 Table 12 Table 13 Table 14 Table 15 Table 16 Table 17 Table 18 Table 19 Table 20 Table 21 Table 22 Table 23 Table 24 Table 25 Table 26 Table 27 Mellanox Technologies r 4 Typographical Conventions LL 13 Abbreviations and Acronyms 0 13 ClOssor oie Bose ee ote eee ea dete a wre a Oa LETALI 15 Reference DOCS sita paren tS A E a br e ai 16 minxotedinstall Refur Codes 2 ieee sy ean A ekke ea 29 mix4 me CONE TIG TOM E A AN a rB Red Hat Linux mlx4 vnic conf file format rr eee 78 Supported ConnectX Port Configurations rss rss rss ss 101 Recommended PCIe Configuration ida ra it 106 Use MMP EOR TTT 113 Congestion Control Manager General Options File o oooooooooo o o 165 Congestion Control Manager Switch Options File 0 0 0 ccc cee 165 Congestion Control Manager CA Options File oo 165 Congestion Control Manager CC MGR Options File 0 0 00 0c eae 166 ibdiagnet of ibutils2 Output Files 170 ibdiagnet ofibutils Output Files citar sek aed letali haw oe we 173 Ibdiag path Output PINGS 222603 edet sb pati ripio sea Glee 175 iby devinfo Flags and Options veria A Koa ee Reh 177 ibstatus Flags and Options 179 Ibportstate FlassandOpions fcc lea io i 181 ibportstate Flags and Options 00 cece eee e eee ees 185 smpquery Flags and Options 18
106. aded manually or automatically. Manual Operation: Use the utility ib-bond to start, query, or stop the driver. For details on this utility, please read the documentation for the ib-bonding package under /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat, and /usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE. Automatic Operation: Automatic ib-bonding operation can be configured as follows: 1. Using a standard OS bonding configuration. For details on this, please read the documentation for the ib-bonding package under /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat, and /usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE. Notes: If the bondX name is defined but one of bondX_SLAVES or bondX_IPs is missing, then that specific bond will not be created. The bondX name must not contain characters that are disallowed for bash variable names. Note: On all the newer OSes, bonding can be done with the inbox bonding module.
107. al) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib argument. The following example shows how to verify the configuration: T eee ig Loe PO link encap UNSPEC HWaddir 00001 04 FE UI SOS 0000000000 OOO Ea Te e a o as O as O UN UPT PROADCAST MULTICAST IMTU C3320 Metric RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Step 3. Repeat Step 1 and Step 2 on the remaining interface(s). 4.7.4 Subinterfaces. You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface. This section describes how to: create a subinterface (Section 4.7.4.1); remove a subinterface (Section 4.7.4.2). 4.7.4.1 Creating a Subinterface. Note: In the following procedure, ib0 is used as an example of an IB subinterface. To create a child interface (subinterface), follow this procedure: Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For e
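The rule that the actual PKey is a 16-bit number with the most significant bit set can be sketched in Python as follows. The function name is hypothetical and used only for illustration; the sketch assumes the chosen value is simply OR-ed with 0x8000.

```python
def effective_pkey(chosen_pkey: int) -> int:
    """Return the 16-bit PKey with the most significant bit set."""
    if not 0 <= chosen_pkey <= 0xFFFF:
        raise ValueError("PKey must be a 16-bit unsigned value")
    return chosen_pkey | 0x8000  # force bit 15 on


if __name__ == "__main__":
    # Show the effective PKey for a chosen value of 0x0001.
    print(hex(effective_pkey(0x0001)))
```

A value that already has the top bit set, such as the default 0xffff, is returned unchanged.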
108. and set the CC manager main settings, perform the following: To enable/disable the Congestion Control mechanism on the fabric nodes, set the following parameter: enable. The values are <TRUE | FALSE>. The default is True. The CC manager configures the CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive the CC mechanism is in its response to congestion. To manually modify the CC manager behavior by providing it with an arbitrary fabric size, set the following parameter: num_hosts. The values are [0-48K]. The default is 0 (based on the CCT calculation on the current subnet size). The smaller the value of the parameter, the faster HCAs will respond to congestion and throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter: marking_rate. The values are [0-0xffff]. The default is 0xa. You can set the minimal packet size that can be marked with FECN. Any packet smaller than this size (in bytes) will not be marked with FECN. To do so, set the following parameter: packet_size. The values are [0-0x3fc0]. The default is 0x200. When the number of errors exceeds max_errors send/receive errors or timeouts in less than error_window se
109. anox HCA: lspci -v | grep Mellanox 02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev ab) Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso. You can download it from http://www.mellanox.com > Products > IB SW/Drivers. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page: host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso Installing Mellanox OFED. The installation script, mlnxofedinstall, performs the following: discovers the currently installed kernel; uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack; installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel); identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware. The firmware will not be updated if you run the install script with the "--without-fw-update" option. 2.3.1 Pre-installation Notes. The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deleti
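Where the md5sum utility is not at hand, the same integrity check can be approximated in Python. This is a minimal sketch, not part of the Mellanox tooling; the function names are hypothetical, and the expected digest is assumed to be the value copied from the download page.

```python
import hashlib


def md5_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches_published_md5(path: str, published_hex: str) -> bool:
    """Compare a file's MD5 with a published value, ignoring case and whitespace."""
    return md5_hex(path) == published_hex.strip().lower()
```

Reading in chunks keeps memory use flat for a large ISO image. A mismatch means the download should be repeated before running the installation script.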
110. apturing RSPAN traffic: 1. Run ibdump: > ibdump IB device: "mlx4_0"; IB port: 1; Dump file: sniffer.pcap; Sniffer WQEs (max burst size): 4096 Appendix A: Mellanox FlexBoot. A.1 Overview. Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet. Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products. FlexBoot is based on the open source project iPXE, available at http://www.ipxe.org. FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other service. For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com >
111. arget TCP/IP port range; SRP with a specific target IB port GUID; RDS; IPoIB with a default PKey; IPoIB with a specific PKey; any ULP/application with a specific Service ID in the PR/MPR query; any ULP/application with a specific PKey in the PR/MPR query; any ULP/application with a specific target IB port GUID in the PR/MPR query. Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file. The shortest policy file in this case would be as follows: qa ss default end gos ulps dera ll It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all. Below is an example of a simple QoS policy with all the possible keywords: gossulips default sdp port num 30000 sapa POCE NUM L000 7 0 00 0 sdp rds ipoib pkey 0x0001 1poib any service id 0x6234 any pkey 0x0ABC Srp tardet Port guid Wxd734 any end qos ulps ES HEEE E T T T T T T T T T T T T T GITTE L E T default SL Shonen icaro enon top of SDP when a destination TCP IPport is 30000 default SL for any other apps S r elake BROR Hae o ee er o RESP Sie ror RDS E SIE LO SEO IAS RITO pk
112. bytes Sa Gil Westie hosts Client Oc E A connecredsbeo lo 22122227 sent 2048 bytes host2 NIE 00 printf read zd bytes n nr SER Mendo rest inl Close lcd close sd return 0 4.4.6 BZCopy - Zero Copy Send. BZCopy mode is only effective for large block transfers. By setting the sys parameter sdp_zcopy_thresh to a non-zero value, a non-standard SDP speedup is enabled. Messages longer than sdp_zcopy_thresh bytes cause the user space buffer to be pinned and the data to be sent directly from the original buffer. This results in less CPU usage and, on many systems, much higher bandwidth. Note that the default value of sdp_zcopy_thresh is 64KB, but it may be too low for some systems. You will need to experiment with your hardware to find the best value. 4.4.7 Using RDMA for Small Buffers. For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is too big; therefore, it is more efficient to use BCopy. Large buffers can also be sent using RDMA, but they lower CP
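The threshold rule for BZCopy can be summarized in a small Python sketch. It is only a model of the decision described in the text (zero-copy applies to messages longer than sdp_zcopy_thresh, and only when the threshold is non-zero); the function name is hypothetical and the real decision is made inside the SDP kernel module.

```python
DEFAULT_SDP_ZCOPY_THRESH = 64 * 1024  # default of 64KB, as stated in the text


def uses_zero_copy(message_len: int, thresh: int = DEFAULT_SDP_ZCOPY_THRESH) -> bool:
    """Model of the BZCopy decision: strictly longer than a non-zero threshold."""
    if thresh == 0:
        return False  # the speedup is only enabled for a non-zero threshold
    return message_len > thresh


if __name__ == "__main__":
    for size in (4096, 64 * 1024, 64 * 1024 + 1, 1 << 20):
        print(size, uses_zero_copy(size))
```

When experimenting with the threshold on real hardware, a table like the one printed above helps confirm which of the application's message sizes fall on each side of the chosen value.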
113. can occur if the SRP LUNs are in the black list of multipath. Edit the "blacklist" section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed. Automatic Activation of High Availability: Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart. From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/. Note: It is possible that regular (not SRP) LUNs may also be present; the SRP LUNs may be identified by their names. It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log. 4.5.2.7 Shutting Down SRP. SRP can be shut down by using "rmmod ib_srp", by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases: 1. Without High Availability: When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP. 2. After Manual Activation of High Availability: If you manually activated SRP High Availability, perform the following steps: a. Unmount all SRP partitions that were mounted. b. Kill the SRP daemon ins
114. can be used for logging and running commands on remote computers and/or servers. 7.2.1 SSH Configuration. The following steps describe how to configure password-less access over SSH: Step 1. Generate an ssh key on the initiator machine (host1): host1$ ssh-keygen -t rsa Generating public/private rsa key pair. mesta is alia io eo Sens das sy 1 sms siste el AN Enter passphrase (empty for no passphrase): Enter same passphrase again: EEE a tron fas been saved Mono username ssh idursa Monno obio aa ss ias loss ssl da Meme Ss ae a oul The key fingerprint is: SIE USOS eer UE DS dl er OR <username>@host1 Step 2. Check that the public and private keys have been generated: Nossa Mnong lt leername es ny hosti ls hose Ll la total 40 QLWxX A POOE FORE A076 Mare o dare OO GOO Ur Ue Och aire B R A Joe Co e A O O A Be Seno puo Step 3. Check the public key: fiosksafe sei de oe SE oa AAAAB3NZzaC1yc2EAAAABIWAAAOEA1zVY8VBHOh90kZN70A11bUO74RXm4zHeczyVxpYHaDPyDmgezbYMKrCIVzd10b H ZkCOrpLYviU00UHd3fvNT Ms0gcGg08PysUf 12FyYj ira2P1xyg6mkHLGGgVut fEMmABZ3wNCUg6J2X3G uiuS WXeubZmbXcMrP w4IWByfH8a jwo6A5WioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorXNC0Z Be4kTnUgm63n02z1gVMdL9IFrCmalxI0u9 SQUA wONevaMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrodZ1 C0 <username>@host1 Step 4. Now you need to add the pub
115. cations. Please refer to http://www.open-mpi.org/faq/?category=mpi-apps. 8 OpenSM - Subnet Manager. 8.1 Overview. OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15). 8.2 opensm Description. opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet. opensm also provides an experimental version of a performance manager. opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes. opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also pres
116. cause without the connection to the BridgeX, the EoIB protocol cannot work and no data can be sent on the wire. The mlx4_vnic driver can also report the status of the external BridgeX port by using the mlx4_vnic_info script. If the eport_state_enforce module parameter is set, then the external port state will be reported as the vNic interface link state. If the connection between the vNic and the BridgeX is broken (hence the external port state is unknown), the link will be reported as down. The second link state is the link state of the external port associated with the vNic interface. Note: A link state may be down on a host administered vNic even when the BridgeX is connected and the InfiniBand fabric appears to be functional. The issue might result from a misconfiguration of either BXADDR or/and BXEPORT in the configuration file. To query the link state, run the following command and look for "Link detected": ethtool <interface name>. Example: ethtool eth10 Settings for eth10: Supported ports: [ ] Supported link modes: Supports auto-negotiation: No Advertised link modes: Not reported Advertised auto-negotiation: No Speed: Unknown! (10000) Duplex: Full Port: Twisted Pair PHYAD: 0 Transceiver: internal Auto-negotiation: off Supports Wake-on: d Wake-on: d Current message level: 0x00000000 (0) Link detected: yes 4.6.3.4 Bonding Driver. EoIB uses the standard Linux bonding driver. For more information on the Linux Bonding driver, please ref
117. ce's MTU. Since the mlx4_en interface's MTU is 1560, port 2 will run with an MTU of 1K. Please note that RoCE MTU values are subject to IB MTU restrictions. The RoCE MTU values are 256 bytes, 512 bytes, 1024 bytes, and 2K. Association of IB Ports to Ethernet Ports: It is useful to know how IB ports associate to network ports: ibdev2netdev MI ANA ie OO tele OL Since both RoCE and mlx4_en use the Ethernet port of the adapter, one of the drivers must carry the task of controlling the port state. In this implementation, it is the task of the mlx4_en driver. The mlx4_ib driver holds a reference to the mlx4_en net device for getting notifications about the state of the port, as well as using the mlx4_en driver to resolve IP addresses to MAC addresses that are required for address vector creation. However, RoCE traffic does not go through the mlx4_en driver; it is completely offloaded by the hardware. Configure an IP Address on the mlx4_en Interface: Run the following on both sides of the link: PLCN I EEA H E H ifconfig eth2 eth2 Link encap:Ethernet HWaddr 00702 C09 08 E892311 Mere a S E EI UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 LAIA oy P ES s OOTO Make sure that ping is working: FORT TT PIN o T ss cotas O AS ga ao See eee AS A o one
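The relationship between the Ethernet interface MTU and the RoCE MTU in the example above (an interface MTU of 1560 giving a RoCE MTU of 1K) can be sketched as picking the largest permitted IB MTU that still fits. This is an illustrative assumption about the selection rule, not the driver's code; the function name and the header_overhead parameter are hypothetical, and the excerpt does not state how many header bytes the driver reserves.

```python
# RoCE MTU values listed in the text: 256, 512, 1024 bytes and 2K.
ROCE_MTUS = (256, 512, 1024, 2048)


def roce_mtu(eth_mtu: int, header_overhead: int = 0):
    """Pick the largest listed RoCE MTU that fits in the Ethernet MTU.

    header_overhead is a hypothetical allowance for RoCE headers; its real
    value is not given in the text, so it defaults to 0 here.
    """
    payload = eth_mtu - header_overhead
    candidates = [m for m in ROCE_MTUS if m <= payload]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    print(roce_mtu(1560))  # the interface MTU used in the text's example
```

Under this model, reaching the 2K RoCE MTU requires raising the mlx4_en interface MTU above 2048 plus whatever header allowance applies.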
118. concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses, as follows: sl = 0; for (d = 0; d < torus_dimensions; d++) /* path_crosses_dateline(d) returns 0 or 1 */ sl |= path_crosses_dateline(d) << d; For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows: for (sl = 0; sl < 16; sl++) /* cdir(port) reports which torus coordinate direction a switch port "points" in, and returns 0, 1, or 2 */ sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport)); Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]: [Figure: 2D 6x5 torus, x = 0 to 5, y = 0 to 4] T
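The dateline encoding can be made concrete with a small Python sketch. It assumes, as an illustration rather than a statement of the OpenSM source, that the path SL carries one bit per torus dimension and that the VL bit for an output port is the SL bit of the direction in which that port points. The function names are hypothetical.

```python
from typing import Sequence


def path_sl(crosses_dateline: Sequence[bool]) -> int:
    """Encode which datelines a path crosses: bit d is set for dimension d."""
    sl = 0
    for d, crosses in enumerate(crosses_dateline):
        sl |= int(crosses) << d
    return sl


def sl2vl(sl: int, out_port_direction: int) -> int:
    """Select the single VL bit from the SL bit of the output port's direction.

    out_port_direction stands for the torus coordinate direction (0, 1 or 2)
    in which the output port points.
    """
    return 0x1 & (sl >> out_port_direction)


if __name__ == "__main__":
    sl = path_sl([True, False, True])  # crosses the x and z datelines
    print(sl, [sl2vl(sl, d) for d in range(3)])
```

With three dimensions this uses SL bits 0 to 2, which gives the 8 SL values per QoS level mentioned in the text, while each port only ever needs VL bit 0.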
119. conds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters: max_errors, error_window. The values are: max_errors = 0: zero tolerance (abort configuration on first error); error_window = 0: mechanism disabled (no error checking); [0-48K]. The default is 5. 8.9.4.1 Congestion Control Manager Options File. Table 9: Congestion Control Manager General Options File. enable: Enables/disables the Congestion Control mechanism on the fabric nodes. Values: <TRUE | FALSE>. Default: True. num_hosts: Indicates the number of nodes. The CC table values are calculated based on this number. Values: [0-48K]. Default: 0 (based on the CCT calculation on the current subnet size). Table 10: Congestion Control Manager Switch Options File. threshold: Indicates how aggressive the congestion marking should be. Values: [0-0xf], where 0 = no packet marking and 0xf = very aggressive. Default: 0xf. marking_rate: The mean number of packets between marking eligible packets with a FECN. Values: [0-0xffff]. Default: 0xa. packet_size: Any packet smaller than this size (in bytes) will not be marked with FECN. Values: [0-0x3fc0]. Default: 0x200. Table 11: Congestion Control Manager CA Options File. port_control: Specifies the Congestion Control attribute for this port. Values: 0 = QP based congestion control; 1 = SL/Port based congestion control. Default: 0
120. configuration files are included: ifcfg-ib<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine, and /etc/sysconfig/network/ on a SuSE machine. The installation process unlimits the amount of memory that can be pinned by a user space application. See Step 5. Man pages will be installed under /usr/share/man/. Firmware: The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled: 1. You run the installation script in default mode, that is, without the option "--without-fw-update". 2. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image. Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image. In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates. Error message: -I- Querying device ... -E- Can't auto detect fw configuration file: ... 2.3.5 Post-installation Notes. Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this man
121. cted route from the local node (source) and the destination node. The minimal number of packets to be sent across each link (default: 100). Enable verbose mode. Specifies the topology file name. Specifies the local system name; meaningful only if a topology file is specified. Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system). Specifies the local device's port number used to connect to the IB fabric. Specifies the directory where the output files will be placed (default: /tmp). Specifies the expected link width. Specifies the expected link speed. Dump all the fabric links' pm Counters into ibdiagnet.pm. Reset all the fabric links' pmCounters. If any of the provided pm is greater than its provided value, print it to screen. Prints the help page information. Prints the version of the tool. Prints the tool's environment variables and their values. Table 15: ibdiagpath Output Files: a dump of all the application reports generated according to the provided flags; a dump of the Performance Counters values of the fabric links. 9.5.3 Error Codes. 1: The path traced is un-healthy. 2: Failed to parse command line options. 3: More than 64 hops are required for traversing the local port to the Source port and then to the Destination port. 4: Unable to traverse the LFT data from source to destination. 5: Failed to use Topolo 6: Failed to load requi
122. cy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows: The request is matched against the defined matching rules such that the QoS Level definition is found. Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level. [Figure 3: QoS Manager] There are two ways to define QoS policy: Advanced: the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request and to enforce various QoS constraints on the requested PR/MPR. Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs. 8.6.2 Advanced QoS Policy File. The QoS policy file has the following sections: I. Port Groups (denoted by port-groups). This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by: Port GUID; Port name, which is a combination of NodeDescription and IB port number; PKey, which means that all the ports in th
123. d information about a specific vNic or all EoIB vNic interfaces, such as BX info, IOA info, SL, PKEY, link state and interface features. If network administered vNics are enabled, this script can also be used to discover the available BridgeXs from the host side.
• To discover the available BridgeXs, run:
mlx4_vnic_info -g
• To receive the full vNic information of eth10, run:
mlx4_vnic_info -i eth10
• To receive a shorter information report on eth10, run:
mlx4_vnic_info -s eth10
• To get help and usage information, run:
mlx4_vnic_info --help

4.6.3.2 ethtool
The ethtool application is another method to retrieve interface information and change its configuration. EoIB interfaces support ethtool similarly to hardware Ethernet interfaces. The supported ethtool options include the following:
-c, -C    Show and update interrupt coalesce options
-g        Query RX/TX ring parameters
-k, -K    Show and update protocol offloads
-i        Show driver information
-S        Show adapter statistics
For more information on ethtool, run: ethtool -h

4.6.3.3 Link State
An EoIB interface can report two different link states:
• The physical link state of the interface, which is made up of the actual HCA port link state and the status of the vNic's connection with the BridgeX. If the HCA port link state is down or the EoIB connection with the BridgeX has failed, the link will be reported as down be
124. dence. Each rule has a name of the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule. Queries can be matched by:
• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for destination port)
• PKey
• QoS class
• Service ID
To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only a Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.

8.6.3 Simple QoS Policy Definition
The simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion (its goal is to match a certain ULP, or a certain application on top of this ULP, PR/MPR request), and the QoS Level has only one constraint: Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and li
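The matching semantics described above can be sketched in a few lines of Python. This is only an illustrative model, not OpenSM's implementation: the `MatchRule` class, the `match_query` function and the field names are hypothetical, rule criteria are compared by simple equality (real port-group criteria test group membership), and a query is represented by just the fields that are switched on in its component mask.

```python
from dataclasses import dataclass, field


@dataclass
class MatchRule:
    """Hypothetical stand-in for one entry of the matching-rules section."""
    qos_level: str                                 # name of the QoS level to apply
    criteria: dict = field(default_factory=dict)   # e.g. {"service_id": 0x1234}


def match_query(query, rules, default_level="DEFAULT"):
    """Return the QoS level name for a PR/MPR query.

    `query` contains only the fields present in the component mask, so a
    missing key means "this field is not part of the query".
    """
    for rule in rules:  # rules are scanned in order of appearance
        # A rule matches only if ALL of its criteria are present and equal.
        if all(k in query and query[k] == v for k, v in rule.criteria.items()):
            return rule.qos_level
    # No rule matched: the default QoS level applies.
    return default_level


rules = [
    MatchRule("level_a", {"service_id": 0x1234, "pkey": 0x7FFF}),
    MatchRule("level_b", {"service_id": 0x1234}),
]

# Extra query fields are ignored by a rule that does not mention them.
print(match_query({"service_id": 0x1234, "pkey": 0x0001, "qos_class": 3}, rules))
# A query carrying only a Service ID cannot match a rule that also needs a PKey.
print(match_query({"service_id": 0x1234}, rules[:1]))
```

Keeping the rules in a list preserves their order of appearance, which is what makes "first match wins" a one-line loop.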
125. down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below).
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would perform up after a down step was used.
2. Once MinHop matrices exist, each switch is visited and for each target LID a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with the fewest previously assigned target LIDs is selected.
If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop
b. First prefer the ones that go to different systemImageGuid
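The second stage (port selection with per-port counters) can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not OpenSM code: `pick_port` is a hypothetical name, the MinHop data for one switch and one target LID is passed in as a plain dictionary, the LMC > 0 checks are left out, and ties between equally loaded ports are broken by the lower port number, which the text does not specify.

```python
def pick_port(min_hops, assigned):
    """Choose the output port for one target LID on one switch.

    min_hops: port number -> hop count to the target LID through that port
    assigned: port number -> number of target LIDs already routed through it
              (updated in place)
    """
    best = min(min_hops.values())
    # Only ports that reach the LID with the minimal hop count are candidates.
    candidates = [p for p, hops in min_hops.items() if hops == best]
    # Prefer the least-loaded candidate; port number is an assumed tie-break.
    port = min(candidates, key=lambda p: (assigned.get(p, 0), p))
    assigned[port] = assigned.get(port, 0) + 1
    return port


counters = {}
hops_to_lid = {1: 2, 2: 2, 3: 3}   # ports 1 and 2 are equally short
choices = [pick_port(hops_to_lid, counters) for _ in range(3)]
print(choices, counters)
```

Successive target LIDs with the same MinHop alternatives are spread across the equal-cost ports, while the longer port 3 is never chosen.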
126. ds_tcp.ko
• The package kernel-ib-devel include files are placed under /usr/src/ofa_kernel/include/. These include files should be used when building kernel modules that use the stack. (Note that the include files, if needed, are "backported" to your kernel.)
• The raw package (un-backported) source files are placed under /usr/src/ofa_kernel-<ver>
• The script openibd is installed under /etc/init.d/. This script can be used to load and unload the software stack.
• The script connectx_port_config is installed under /sbin. This script can be used to configure the ports of ConnectX network adapter cards to Ethernet and/or InfiniBand. For details on this script, please see Section 5.1, "Port Type Management".
• The directory /etc/infiniband is created with the files info, openib.conf and connectx.conf. The info script can be used to retrieve Mellanox OFED installation information. The openib.conf file contains the list of modules that are loaded when the openibd script is used. The connectx.conf file saves the ConnectX adapter card's ports configuration to Ethernet and/or InfiniBand. This file is used at driver start/restart (/etc/init.d/openibd start).
• The file 90-ib.rules is installed under /etc/udev/rules.d/
• If OpenSM is installed, the daemon opensmd is installed under /etc/init.d/ and opensm.conf is installed under /etc.
• If IPoIB
127. e as possible to the failed switch in order to route around it.
In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch.
For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.
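The turn rule can be made concrete with a short Python sketch. It assumes a path is described simply by the dimension travelled on each hop; the function names and the sample path are hypothetical, and the real mechanism is the SL2VL map keyed on the input port rather than an explicit list like this.

```python
# Dimension-order routing travels x first, then y, then z.
DOR_ORDER = {"x": 0, "y": 1, "z": 2}


def is_illegal_dor_turn(from_dim, to_dim):
    """True for y-x, z-x and z-y turns, i.e. turning back to an earlier dimension."""
    return DOR_ORDER[to_dim] < DOR_ORDER[from_dim]


def hops_needing_vl_bit1(path_dims):
    """Indices of hops that are the first hop after an illegal turn.

    path_dims lists the dimension travelled on each hop of a path.
    """
    return [
        i
        for i in range(1, len(path_dims))
        if is_illegal_dor_turn(path_dims[i - 1], path_dims[i])
    ]


# Hypothetical four-hop path that detours in y and then returns to x.
detour = ["x", "y", "x", "x"]
print(hops_needing_vl_bit1(detour))
```

Only the hop immediately after the y-x turn is flagged; the later hop in the same dimension involves no turn and stays on the normal VL set, matching the reasoning about second and later hops.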
128. e subnet that belong to the partition with a given PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to the partition with a given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)
II) QoS Setup (denoted by qos-setup). This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
III) QoS Levels (denoted by qos-levels). Each QoS Level defines a Service Level (SL) and a few optional fields:
• MTU limit
• Rate limit
• PKey
• Packet lifetime
When a path(s) search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Level, it can also be explicitly referred to by any match rule.
IV) QoS Matching Rules (denoted by qos-match-rules). Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file such that the first match takes prece
129. BOOTPROTO=dhcp
ONBOOT=yes
BXADDR=BX001
BXEPORT=A10
VNICIBPORT=mlx4_0:1
VNICVLAN=3 (Optional field)

The fields used in the file for vNic configuration have the following meaning:

Table 3: Red Hat Linux mlx4_vnic.conf file format
Field        Description
DEVICE       An optional field. The name of the interface that is displayed when running ifconfig. If it is not present, the trailer of the configuration file name (e.g. ifcfg-eth47 => "eth47") is used instead.
HWADDR       The mac address to assign the vNic.
BXADDR       The BridgeX box system GUID or system name string.
BXEPORT      The string describing the eport name.
VNICVLAN     An optional field. If it exists, the vNic will be assigned the VLAN ID specified. This value must be between 0 and 4095.
VNICIBPORT   The device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value of 1 or 2.

Other fields available for regular eth interfaces in the ifcfg-ethX files may also be used.

mlx4_vnic_confd
Once the configuration files are updated, the host administered vNics can be created. To manage the host administered vNics, run the following script:
Usage: /etc/init.d/mlx4_vnic_confd {start|stop|restart|reload|status}
To retrieve general information about the vNics on the system, including netw
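As a rough aid for checking such a file before starting mlx4_vnic_confd, the two value constraints stated in the table (VNICVLAN between 0 and 4095, VNICIBPORT ending in port 1 or 2) can be tested with a short Python sketch. It is not part of Mellanox OFED: the function names are hypothetical, and it assumes the usual ifcfg KEY=VALUE line format and a colon between device name and port number.

```python
def parse_ifcfg(text):
    """Parse KEY=VALUE lines (assumed ifcfg format) into a dictionary."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def check_vnic_fields(fields):
    """Return a list of problems found in the vNic-specific fields."""
    errors = []
    if "VNICVLAN" in fields:                   # optional field
        vlan = fields["VNICVLAN"]
        if not vlan.isdigit() or not 0 <= int(vlan) <= 4095:
            errors.append("VNICVLAN must be between 0 and 4095")
    if "VNICIBPORT" in fields:
        # Assumed form: <device name>:<port number>, e.g. mlx4_0:1
        device, sep, port = fields["VNICIBPORT"].rpartition(":")
        if not sep or not device or port not in ("1", "2"):
            errors.append("VNICIBPORT must be <device name>:<1 or 2>")
    return errors


sample = """BOOTPROTO=dhcp
ONBOOT=yes
BXADDR=BX001
BXEPORT=A10
VNICIBPORT=mlx4_0:1
VNICVLAN=3
"""
print(check_vnic_fields(parse_ifcfg(sample)))
```

The check deliberately says nothing about which fields are mandatory, since the table only marks DEVICE and VNICVLAN as optional without stating requirements for the rest.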
130. eck on HCA 0 PASS Mellanox Technologies 29 Rev 1 5 3 1 0 0 Installation 2 3 4 HO ra R dz PASS Number of HCA Ports ACTIVE e aeee il Contorno al e UP 4X DDR Porre tare o Pore o RAC INIT Ear Coti 21 a t PASS kerne IE ORE PASS NA RR 00502 E 0300200 LET SS DONEs gt gt 4 prefix kernel version and installation parameters can be retrieved by running the com af mand etc infiniband info 74 After the installer completes information about the Mellanox OFED installation such as Installation Results Software e The OFED and MFT packages are installed under the usr directory e The kernel modules are installed under InfiniBand subsystem lib modules uname r updates kernel drivers infiniband mlx4 driver Under 1ib modules uname r updates kernel drivers net ml1x4 you will find mlx4 core ko mlx4_en ko mlx4_ib ko mlx4 vnic ko and mlx4 fc ko IPoIB lib modules uname r updates kernel drivers infiniband ulp ipoib ib ipoib ko SDP lib modules uname SRP K updates kernel drivers infiniband ulp sdp ib sdp ko lib modules uname r updates kernel drivers infiniband ulp srp ib srp ko lib modules uname RDS K Updates kernel a a nd Wp e o E lib modules uname r updates kernel net rds rds ko lib modules uname r updates kernel net rds rds rdma ko iby modules vaene Updates Kernel tet ds r
131. ector was not meant to duplicate or replace that functionality.
The MPI selector functionality can be invoked in one of two ways:
1. The mpi-selector-menu command. This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended for all users.
2. The mpi-selector command. This command is a CLI equivalent of mpi-selector-menu, allowing for the same functionality but without the interactive menus and prompts. It is suitable for scripting.

7.4 Compiling MPI Applications
Note: A valid Fortran compiler must be present in order to build the MVAPICH MPI stack and tests.
The following compilers are supported by Mellanox OFED's MVAPICH and Open MPI packages: Gcc, Intel and PGI. The install script prompts the user to choose the compiler with which to install the MVAPICH and Open MPI RPMs. Note that more than one compiler can be selected simultaneously, if desired.

Compiling MVAPICH Applications
Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html.
To review the default configuration of the installation, check the default configuration file:
/usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf

Compiling Open MPI Appli
132. eee Baw ds 35 ALS FPimware Dependent iE eri iiti ee eed eee tee ed Bh aa 35 Aka General Guides slats 35 Ao N eT Ea App ca OASIS A Dd ee eek 35 Ako GID bia eigen 36 Al Usim VANS ille arie 36 4 1 8 Reading Port Counters Statistics eee eee eens 37 TES AUDetarled Pxaimple sa co Dei o dude due pioli hi Leila 38 41 10 Contiounne DAPL over ROCE sa tevicon aid diria 43 4 2 Fibre Channel over Ethernet 44 del PEOR OYONI OW sile win RI iatale ae 44 42 22 1 COP Basie B A e iatale he 44 4 29 ECoB Advanetd Usapezez cee ob ote Rha ans ta ita ada reos 46 4 3 Reliable Datagram Sockets 47 dol E We abs a tate zal E AN 47 AL ROST OM CULO wis oct eich st ici is Ge ob Sata gii dolo nia dell ia ii 48 4 4 Sockets Direct Protocol 48 Mellanox Techologies 1 J AA OVENI di a Ser tase SN var Goes ceda 48 442 MBSAP SO LIDIA A bra eatin AS 48 4 45 CONTOUR ES DP estadios ais A Lan yt I oe Aroha eee 49 4AA Environment Variables vu i wes b de tbat taped bho insiatbeglospo edi badi SI 4 4 5 Converting Socket based Applications 52 44 6 BZCopy Ze COpy Didi Slice ili lee 59 447 Using RDMA Tor Small Butters isa hie E aes 59 4 5 SCSI RDMA Protocol 59 os WOM c Ws 2 lent olen dt ot hore dete AE Ble Male Ele otters te 59
133. enarios for these utilities are presented.

ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:
ibsrpdm
This command will output information on each SRP Target detected, in human-readable form.
Sample output:
IO Unit Info:
    port LID:        0103
    port GID:        fe800000000000000002c90200402bd5
    change ID:       0002
    max controllers: 0x10
    controller[ 1]
        GUID:      0002c90200402bd4
        vendor ID: 0002c9
        device ID: 005a44
        IO class : 0100
        ID:        LSI Storage Systems SRP Driver 200400a0b81146a1
        service entries: 1
            service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:
ibsrpdm -d <umad device>
2. Assistance in creating an SRP connection
a. To generate output suitable for utilization in the "echo" command of Section 4.5.2.2, add the '-c' option to ibsrpdm:
ibsrpdm -c
Sample output:
id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
b. To establish a connection with an SRP Target using the output from the 'ibsrpdm -c' example above, execute the following command:
echo -n id_ext=200400A0B81146A1,ioc_g
134. ent the available ports and prompt for a port number to attach to. By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

8.2.1 opensm Syntax
opensm [OPTIONS]
where OPTIONS are:
--version
    Prints OpenSM version and exits
--config, -F <file-name>
    The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if exists).

8.2
135. er to <kernel-source>/Documentation/networking/bonding.txt. Currently only fail-over modes are supported by the EoIB driver; load-balancing modes, including static and dynamic (LACP) configurations, are not supported.

4.6.3.5 Jumbo Frames
EoIB supports jumbo frames up to the InfiniBand limit of 4K bytes. The default Maximum Transmit Unit (MTU) for the EoIB driver is 1500 bytes.
To configure EoIB to work with jumbo frames:
1. Make sure that the IB HCA and Switches hardware support 4K MTU.
2. Configure the Mellanox low level driver to support 4K MTU. Add the mlx4_core module parameter set_4k_mtu=1
3. Change the MTU value of the vNic, for example, run: ifconfig eth2 mtu 4038
Note: Due to EoIB protocol overhead, the maximum MTU value that can be set for the vNic interface is 4038 bytes. If the vNic is configured to use VLANs, then the maximum MTU is 4034 bytes (due to VLAN header insertion).

4.6.4 Advanced EoIB Settings
4.6.4.1 Module Parameters
The mlx4_vnic driver supports the following module parameters. These parameters are intended to enable more specific configuration of the mlx4_vnic driver to customer needs. The mlx4_vnic driver is also affected by module parameters of other modules, such as set_4k_mtu of mlx4_core. These modules are not addressed in this section.
The available module parameters include:
• tx_rings_num: Number of TX r
136. erver sockets to listen on both SDP and TCP interfaces. The various configurations with SDP/TCP sockets are explained inside the /etc/libsdp.conf file.

4.4.3 Configuring SDP
To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set "SDP_LOAD=yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
SDP can work over IPoIB interfaces or RoCE interfaces. In case of IPoIB, SDP uses the same IP addresses and interface names as IPoIB (see IPoIB configuration in Section 4.7.3 and Section 4.7.3.3). In case of RoCE, SDP uses the same IP addresses and interface names as the corresponding mlx4_en interfaces (see mlx4_en configuration in Section 5.3 and Section 5.3.4).

4.4.3.1 How to Know SDP Is Working
Since SDP is a transparent TCP replacement, it can sometimes be difficult to know that it is working correctly. To check whether traffic is passing through SDP or TCP, monitor the file /proc/net/sdpstats and see which counters are running.

Alternative Method: Using the sdpnetstat Program
The sdpnetstat program can be used to verify both that SDP is loaded and that it is being used. The following command shows all active SDP sockets using the same format as the traditional netstat program. Without the '-S' option, it shows all the information that netstat does plus SDP data.
host1$ sdpnetstat -S
Assuming that the SDP
137. es per port
• Rx steering mode: Receive Core Affinity (RCA)
• Tx arbitration mode: VLAN user-priority (off by default)
• MSI-X or INTx
• Adaptive interrupt moderation
• HW Tx/Rx checksum calculation
• Large Send Offload (i.e., TCP Segmentation Offload)
• Large Receive Offload
• IP Reassembly Offload
• Multi-core NAPI support
• VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
• HW VLAN filtering
• HW multicast filtering
• ifconfig up/down + mtu changes (up to 10K)
• Ethtool support
• Net device statistics
• CX4, QSFP and SFP connectors
Note: The current version of MLNX_OFED supports NC-SI in Ethernet mode only.

5.3.2 Loading the Ethernet Driver
By default, the Mellanox OFED stack loads mlx4_en. Run 'ifconfig -a' to verify that the module is listed.

5.3.3 Unloading the Driver
If /etc/infiniband/openib.conf had "MLX4_EN_LOAD=yes" at driver start-up, then you can unload the mlx4_en driver by running:
/etc/init.d/openibd stop
Otherwise, unload mlx4_en by running:
> modprobe -r mlx4_en

5.3.4 Ethernet Driver Usage and Configuration
• To assign an IP address to the interface, run:
> ifconfig eth<x> <ip>
where 'x' is the OS assigned interface number.
• To check driver and device information, run:
> ethtool -i eth<x>
Example:
> ethtool -i eth2
driver: mlx4_en (MT_0440140005)
version: 1.5.1 (March 2010)
firmware-version
138. etsh each rad Bee bn dae iaia Ei flo 90 4 10 2 MLX4 Socket Acceleration Module Configuration 90 4 10 3 Kernel Space Socket Accelaration Debug nannan nannaa nanna 91 4 11 Huge Pages Support for Queue Resources 91 Chapter gt Working With VP Ios ai AAA AT ed oe AA 93 5 1 Port Type Management 93 5 2 InfiniBand Driver 94 5 3 Ethernet Driver 94 ese MOC LVI Wet ect sports Arsh aa hea Sr TE SORRISO RAR Pears Meee eg 94 3 32 Loading the Ethernet Drivers tara ri LA bac chee bhe lilla 95 MI IUnloddmethe Divers canada da rias hones Sade de 95 5 3 4 Ethernet Driver Usage and Configuration ocean ar a a rss 95 Chapter 6 Performance unable ed ee 98 6 1 General System Configurations 98 GLL PELExpress PCIe Capabilities pistoni beneteeet ai 98 6 1 2 BIOS Power Management Settings nunnana nnan nannan 98 6 1 3 Intel Hyper Threading Technology naana nnana n aaeeea 98 6 2 Performance Tuning for Linux 99 6 2 1 Tuning the Network Adapter for Improved IPv4 Traffic Performance 99 6 2 2 Tuning the Network Adapter for Improved IPv6 Traffic Performance 99 0 2 2 IntermiptiModerationi ciuda da
139. eved by adding the device driver module into the initrd image and loading it.
Note: The initrd image of some Linux distributions, such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux, cannot be edited prior to or during the installation process. If you need to install Linux distributions over FlexBoot, please replace your initrd images with the images found at: www.mellanox.com > Products > Adapter IB/VPI SW > FlexBoot (Download Tab)

A.9.1 Case I: InfiniBand Ports
The IB driver requires loading the following modules in the specified order (see Section A.9.1.1 for an example):
• ib_addr.ko
• ib_core.ko
• ib_mad.ko
• ib_sa.ko
• ib_cm.ko
• ib_uverbs.ko
• ib_ucm.ko
• ib_umad.ko
• iw_cm.ko
• rdma_cm.ko
• rdma_ucm.ko
• mlx4_core.ko
• mlx4_ib.ko
• ib_mthca.ko
• ipoib_helper.ko (this module is not required for all OS kernels; please check the release notes)
• ib_ipoib.ko

A.9.1.1 Example: Adding an IB Driver to initrd (Linux)
Prerequisites
1. The FlexBoot image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section 4.7.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine.
3. An initrd file.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for
140. ey 0x0001
# default IPoIB partition, pkey=0x7FFF
# match any PR/MPR query with a specific Service ID
# match any PR/MPR query with a specific PKey
# SRP when SRP Target is located on a specified IB port GUID
6 # match any PR/MPR query with a specific target port GUID

Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file such that the first match takes precedence, except for the "default" rule, which is applied only if the query didn't match any other rule.
All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section.
Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.

8.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
ipoib : <SL>
ipoib, pkey 0x7fff : <SL>
any, pkey 0x7fff : <SL>

8.6.6.2
141. f not, install it from the RHEL distribution.

Installation for SLES10 (Execute once)
• Verify that multipath is installed. If not, take it from the installation (you may use 'yast').
• Update udev.

(Execute once) For manual activation of High Availability only:
• Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules). This file should have one line:
ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High Availability below), this file is created upon each boot of the driver and is deleted when the driver is unloaded.

Manual Activation of High Availability
Initialization (execute after each boot of the driver):
1. Execute modprobe dm-multipath
2. Execute modprobe ib-srp
3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute srp_daemon for each port and each HCA. This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.
Now it is possible to access the SRP LUNs on /dev/mapper/.
Note: It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.
Note: It is also possible that the SRP LUNs will not appear under /dev/mapper/. This
142. fg-ethX configuration files can be modified as follows:
• Add VNICVLAN=<VLAN tag>, or remove the VNICVLAN property for no VLAN
Note: Using a VLAN tag value of 0 is not recommended because the traffic using it would not be separated from non-VLAN traffic.
Note: For Host administered vNics, the VLAN entry must be set in the BridgeX first. For further information, please refer to BridgeX documentation.

4.6.2.4 EoIB Multicast Configuration
Configuring Multicast for EoIB interfaces is identical to multicast configuration for native Ethernet interfaces.
Note: EoIB maps Ethernet multicast addresses to InfiniBand MGIDs (Multicast GID). It ensures that different vHubs use mutually exclusive MGIDs, thus preventing vNics on different vHubs from communicating with one another.

4.6.2.5 EoIB and Quality of Service
EoIB enables the use of InfiniBand service levels. The configuration of the SL is performed through the BridgeX and lets you set different data/control service level values per BridgeX box. For further information on the use of non-default service levels, please refer to BridgeX documentation.

4.6.2.6 IP Configuration Based on DHCP
Setting an EoIB interface configuration based on DHCP (v3.1.2, which is available via www.isc.org) is performed similarly to the configuration of Ethernet interfaces. When setting the EoIB configuration files, verify that it includes fo
143. ges from 1 40 March 19 2009 e Correction to text in Section 9 3 3 IPoIB Configuration on page 93 Mellanox Technologies T J Rev 1 5 3 1 0 0 Preface This Preface provides general information concerning the scope and organization of this User s Manual It includes the following sections e Section Intended Audience on page 8 e Section Documentation Conventions on page 9 e Section Related Documentation on page 12 e Section Support and Updates Webpage on page 12 Intended Audience This manual is intended for system administrators responsible for the installation configuration management and maintenance of the software and hardware of VPI InfiniBand Ethernet FCoE adapter cards It 1s also intended for application developers 8 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 Documentation Conventions Typographical Conventions Table 1 Typographical Conventions sem pers f E Variables for which users supply specific val Italic font ues Emphasized words Italic font These are emphasized words Pop up menu sequences menul gt menu2 gt gt item o Warning lt text gt May result in system f l instability Common Abbreviations and Acronyms p f if Table 2 Abbreviations and Acronyms Sheet 1 of 2 Abbreviation Acronym Whole Word Description Capital B is used to indicate s
144. gy then the tool operates as in the 1 option Mellanox Technologies 163 Rev 1 5 3 1 0 0 InfiniBand Fabric Diagnostic Utilities 9 3 ibdiagnet of ibutils2 IB Net Diagnostic This version of ibdiagnet is included in the ibutils2 package and it is not run by default A after installing Mellanox OFED To use this ibdiagnet version and not that of the ibu a tils package you need to specify the full path opt bin ibdiagnet Please see ibutils2_release_notes txt for additional information and known issues ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices It then produces the following files in the output directory which is defined by the o option described below 9 3 1 SYNOPSYS Here benet C dev name Sp Pore am om pe P lt lt PM gt lt Value gt gt Rei Sido LE IE Me THT He skip lt ibdiag stage gt o lt out dir gt ln V 164 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 OPTIONS 1 device lt dev name gt Specify the name of the device of the port used to connect to the IB fabric in case of multiple devices on the local system p poru SpOrT nun Specify the local device s port number used to connect to the IB fabric pm Dump all pm Counters values into ibdiagnet pm 06 Reset all the fabric links pmCounters P counter lt lt PM gt lt
145. he port number using the option '-p <local port number>' (see below).
2. Define the environment variable IBDIAG_PORT_NUM.
In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: '-i <index of local device>'
2. Define the environment variable IBDIAG_DEV_IDX.

9.2.3 Addressing
Note: This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies.
The following addressing modes can be used to define the IB ports:
• Using a Directed Route to the destination (tool option '-d'). This option defines a directed route of output port numbers from the local port to the destination.
• Using port LIDs (tool option '-l'). In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
• Using port names defined in the topology file (tool option '-n'). This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topolo
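A wrapper script that has to decide which local port and device index to hand to the tool could follow the two mechanisms above roughly as in this Python sketch. It is only an illustration: `resolve_setting` is a hypothetical helper, the environment variable names are written with underscores on the assumption that this is their real spelling, and letting the command-line value win over the environment variable is an assumption, since the text does not state a precedence.

```python
import os


def resolve_setting(cli_value, env_name, environ=None):
    """Return the setting as an int, or None if it was given nowhere.

    cli_value: value taken from the command line, or None if absent
    env_name:  environment variable consulted as the fallback
    environ:   mapping to read from (defaults to the process environment)
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return int(cli_value)          # assumed: command line wins
    if env_name in environ:
        return int(environ[env_name])
    return None


env = {"IBDIAG_PORT_NUM": "2"}
port = resolve_setting(None, "IBDIAG_PORT_NUM", env)      # from the environment
device = resolve_setting("1", "IBDIAG_DEV_IDX", env)      # from the command line
print(port, device)
```

Passing the environment in as a mapping keeps the helper easy to exercise without touching the real process environment.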
146. • The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup).
• Bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces; hence, the only supported value is the default: 0 (or "none" in SLES11).
For a persistent bonding IPoIB Network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:
• In the bonding master configuration file (e.g., ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520
Note: 65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 4.7.2, "IPoIB Mode Setting") and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do not set MTU at all, performance of the interface might decrease.
• In the bonding slave configuration file (e.g., ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
• In the bonding slave configuration file (e.g., ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key).
• For RHEL users: In /etc/modprobe.d/bond.conf add the following lines:
alias bond0 bonding
• For SLES users: It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig
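The MTU rule for the bonding master can be written down as a small Python sketch, for instance as part of a script that generates the master configuration file. The function name and the mode strings are hypothetical; falling back to 2044 when the slaves use mixed modes is an assumption drawn from the statement that 65520 is valid only if all slaves operate in Connected mode.

```python
CONNECTED_MTU = 65520   # valid only when every IPoIB slave is in Connected mode
DATAGRAM_MTU = 2044     # value given for slaves working in datagram mode


def bond_mtu(slave_modes):
    """Return the MTU to put in the bonding master configuration file.

    slave_modes: one mode string per IPoIB slave, "connected" or "datagram"
    """
    if not slave_modes:
        raise ValueError("a bond needs at least one slave")
    modes = {mode.lower() for mode in slave_modes}
    if modes == {"connected"}:
        return CONNECTED_MTU
    # Datagram or mixed modes: use the smaller value (mixed case is an assumption).
    return DATAGRAM_MTU


print("MTU=%d" % bond_mtu(["connected", "connected"]))
print("MTU=%d" % bond_mtu(["datagram", "datagram"]))
```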
147. ib options mlx4 en options mlx4 fc parameter lt value gt parameter lt value gt parameter lt value gt The following sections list the available m1x4 parameters C 1 SCORE int debug level block loopback default 1 msi x log num mac int Pec EDE 0 1 default 0 log num qp 17 max is 20 log num srg LOMAS 00 i6 4 max is 7 log num cq Tema log num mcg tog romare perge mlx4_core Parameters bool default is 13 max is 21 log num mpt enuries per HCA default us 1 5 max ws 20 Attempt to set 4K MTU to all ConnectX ports Enable debug tracing 1i gt O0 derault 0 Block multicast loopback packets 1f gt 0 Attempt tO use Mo nonzero dera ul da log maximum number of MACs per ETH port 1 7 Enable steering by VLAN priority on ETH ports log maximum number of QPs per HCA default is log maximum number of SRQs per HCA default is log number of RDMARC buffers per QP default log maximum number of COs per HCA default is log maximum number of multicast groups per HCA log maximum number of memory protection table 220 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 log num mtt log maximum number of memory translation table segments per HCA default is 20 max is 20 legge speso one ore pene ei Mint enan tengos Enable Quality of Service support in the HCA iE gt 0 default 0 enabke pres EEEE For FCoXX e
148. ic libmthca-devel-static dapl-devel libmqe-devel libmverbs-devel
149. InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 8, "OpenSM Subnet Manager". 1.4.8 Diagnostic Utilities. Mellanox OFED includes the following two diagnostic packages for use by network and data center managers: • ibutils: Mellanox Technologies diagnostic utilities • infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools. 1.4.9 Mellanox Firmware Tools. The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for: • Generating a standard or customized Mellanox firmware image • Querying for firmware information • Burning a firmware image to a single InfiniBand node. MFT includes the following tools: mlxburn: This tool provides the following functions: • Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format • Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device • Querying the firmware version loaded on an HCA board • Displaying the VPD (Vital Product Data) of an HCA board. flint: This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file. spark: This tool burns a firmware binar
150. ifetime and Path Bits. Note: Path Bits are not implemented in OFED. IV. Matching Rules: A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions, which should all match for the rule to apply. The matching expressions are defined for the following fields: • SRC and DST to lists of port groups • Service ID to a list of Service ID values or ranges • QoS Class to a list of QoS Class values or ranges. 4.8.4 CMA Features. The CMA interface supports Service ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS Class. The CMA uses the provided QoS Class and Service ID in the sent PR/MPR. 4.8.4.1 IPoIB. IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group. 4.8.4.2 SDP. SDP uses CMA for building its connections. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hexadecimal digits holding the remote TCP/IP Port Number to connect to. 4.8.4.3 RDS. RDS uses CMA and thus it is very close to SDP. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hexadecimal digits holding the TCP/IP Port Number that the protocol connect
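The Service ID layouts given above for SDP and RDS can be illustrated with a short sketch. It only restates the two bit patterns from the text; the constant and function names are hypothetical and are not part of OFED or the CMA API.

```python
# Hypothetical helper names; only the two Service ID patterns come from the text.
SDP_SERVICE_ID_BASE = 0x0000000000010000  # 0x000000000001PPPP
RDS_SERVICE_ID_BASE = 0x0000000001060000  # 0x000000000106PPPP


def service_id(base: int, port: int) -> int:
    """Place a 16-bit TCP/IP port number in the low 4 hex digits (PPPP)."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hexadecimal digits")
    return base | port


def format_service_id(value: int) -> str:
    """Render a Service ID as a 16-digit hexadecimal string."""
    return f"0x{value:016x}"
```

For instance, `format_service_id(service_id(SDP_SERVICE_ID_BASE, 0x1234))` gives the SDP Service ID for port 0x1234 in the same 16-digit form used in the text.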
151. ile provides the CN order that may be used to create an efficient communication pattern that will match the routing tables. 8.5.4.1 Routing between non-CN Nodes. The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around. Footnotes: 1. Ports that are connected to the same remote switch are referenced as a port group. 2. The list of compute nodes (CNs) can be specified by the -u or --cn_guid_file OpenSM options. [Figure: Spine1, Spine2 and Spine3; N1, N2 and N3 attached to switches; going down to compute nodes.] To solve this problem, a list of non-CN nodes can be specified by the -G or --io_guid_file option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by -H or --max_reverse_hops). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a max_reverse_hop of 1, routes will be instantiated between N1 <-> N2 and N2 <
152. ing the port GUID involves booting the client machine via Flex Boot This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet The 8 bytes can be captured from the boot session as shown in the figure below Mellanox Connects FlexBoot v3 3 400 IPXE 1 0 0 Open Source Network Boot Firmware net 100 0Z c9 03 00 0c 78 11 fon PCIOZ 00 0 open Link down TA U TXE U KX O RXE 0 Link status The socket is not connected Waiting for link up on netta ok Placing Client Identifiers in etc dhcpd conf The following is an excerpt of a etc dhcpd conf example file showing the format of represent ing a client machine for the DHCP server host hostl next server 11 4 3 7 filename pxelinux 0 fixed address 11 4 3 130 Operon dicp clivcur i1dentiirer fr OU 2 DD OO OO G Z OO ee 0 Orisa O SO 202 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 A 4 Subnet Manager OpenSM S This section applies to ports configured as InfiniBand only hai FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this Note that OpenSM may be run on the same host running the DHCP server but it is not manda tory For details on OpenSM see OpenSM Subnet Manager on page 109 S To use OpenSM caching for large InfiniBand cl
153. ings use 0 for cpus default 0 max 32 e tx rings len Length of TX rings must be power of two default 1024 max 8K e rx rings num Number of RX rings use 0 for cpus default 0 max 32 e rx rings len Length of RX rings must be power of two default 2048 max 8K e vnic net admin Network administration enabled default 1 e Jro num Number of LRO sessions per ring use 0 to disable LRO default 32 max 32 e eport state enforce Bring vNic up only when corresponding External Port is up default 0 For all module parameters list and description run mlx4 vnic info I Mellanox Technologies 15 Rev 1 5 3 1 0 0 Driver Features To check the current module parameters run eX O AOSE 4 6 4 2 vNic Interface Naming The mlx4 vnic driver enables the kernel to determine the name of the registered vNic By default the Linux kernel assigns each vNic interface the name eth lt N gt where lt N gt is an incremental num ber that keeps the interface name unique in the system The vNic interface name may not remain consistent among hosts or BridgeX reboots as the vNic creation can happen in a different order each time Therefore the interface name may change because of a first come first served kernel policy In automatic network administered mode the vNic MAC address may also change which makes it difficult to keep the interface configuration persistent To control the interface name you can use standard Linux utilities
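The constraints stated above for the ring parameters (ring length a power of two and at most 8K, ring count between 0 and 32) can be checked before loading the module. The sketch below is only an illustration: the function names are hypothetical, and the lower bound on the ring length is an assumption since the text gives none.

```python
MAX_RING_LEN = 8 * 1024  # "max 8K" in the text
MAX_RINGS = 32           # "max 32" in the text


def is_valid_ring_len(length: int) -> bool:
    """True if length is a power of two no larger than 8K.

    Assumption: any positive power of two is accepted; the text states
    no minimum.
    """
    return 0 < length <= MAX_RING_LEN and (length & (length - 1)) == 0


def is_valid_ring_count(count: int) -> bool:
    """True if count is in 0..32 (0 means "use the number of CPUs")."""
    return 0 <= count <= MAX_RINGS
```

The power-of-two test relies on the fact that such a number has exactly one bit set, so clearing its lowest set bit with `length & (length - 1)` leaves zero.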
154. inko Note that when deleting the interface you must use the PKey value with the most significant bit set e g 0x8000 in the example above 4 7 5 Verifying IPoIB Functionality To verify your configuration and your IPoIB functionality perform the following steps Step1 Verify the IPoIB functionality by using the ifconfig command The following example shows how two IB nodes are used to verify IPoIB functionality In the following example IB node 1 is at 11 4 3 175 and IB node 2 is at 11 4 3 176 Nos tie contr AS ne tm ostenta eL Ue Step 2 Enter the ping command from 11 4 3 175 to 11 4 3 176 The following example shows how to enter the ping command MA o Rd Sd PING 11 4 3 176 11 4 3 176 56 84 bytes of data SO ee Mione Is A ene EL oa bytes ron ense tal Aim OMA os som NES ono Eee e es US ws oA bytes tron lit cp se ome Us o com LS O e ia eni gt Md e ping Statistics 5 packets transmitted 5 received 0 packet loss time 399 ms rtt min avg max mdev 0 044 0 058 0 079 0 014 ms pipe 2 4 7 6 Bonding IPoIB To create an interface configuration script for the ibX and bondX interfaces you should use the standard syntax depending on your OS Bonding of IPoIB interfaces is accomplished in the same manner as would bonding of Ethernet interfaces via the Linux Bonding Driver e Network Script files for IPoIB slaves are named after the IPoIB interfaces e g ifcfg 1b0 Mellanox Tec
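The note above about using the PKey value with the most significant bit set can be made concrete with a small sketch. The helper name is hypothetical; it only shows the arithmetic on a 16-bit PKey and does not touch any interface.

```python
PKEY_MSB = 0x8000  # most significant bit of a 16-bit PKey


def pkey_with_msb(pkey: int) -> int:
    """Return the PKey value with its most significant bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("a PKey is a 16-bit value")
    return pkey | PKEY_MSB
```

A value that already has the bit set, such as 0x8000, is returned unchanged, so the helper can be applied to either form.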
155. is. The following describes a workflow for local HCA (adapter) sniffing: • Run ibdump with the desired options • Run the application whose traffic you wish to analyze • Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode) • Open Wireshark and load the generated file. How to Get Wireshark: Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details. Note: Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system. Synopsis: ibdump [options]. Table 25 lists the various flags of the command. Table 25 - ibdump Options (Flag; Optional/Mandatory; Default If Not Specified; Description): -d, --ib-dev=<dev>; Optional; First device found; Use IB device <dev>. -b, --max-burst=<log2 burst>; Optional; 12 (4096 entries); log2 of the maximal burst size that can be captured with no packet loss. Each entry takes MTU bytes of memory. --mem-mode <size>; Optional; When specified, packets are written to the dump file only after the capture is stopped. It is faster than the default mode (less chance for packet loss), but it uses more memory. In this mode, ibdump stops after <size> bytes are captured. --decap; Decapsulate port mirroring headers. Should be used when c
156. is utility will scan the fabric once, connect to every Target it detects, and then exit. Note: srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file. • To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric). • To execute SRP daemon as a daemon, you may run run_srp_daemon (found under /usr/sbin/), providing it with the same options used for running srp_daemon. Note: Make sure only one instance of run_srp_daemon runs per port. • To execute SRP daemon as a daemon on all the ports, run srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log. • It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 4.5.2.6). For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart. 4.5.2.5 Multiple Connections from Initiator IB Port to the Target. Some system configurations may need multi
157. ize in bytes or multiples of bytes e g IKB 1024 bytes and 1MB 1048576 bytes Mellanox Technologies 9 Rev 1 5 3 1 0 0 Table 2 Abbreviations and Acronyms Sheet 2 of 2 Abbreviation Acronym Whole Word Description Small b is used to indicate size in bits or multiples of bits e g 1Kb 1024 bits S S i r vi T O gt T Le DI Z O rG az py Q Y UN iii 10 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 Glossary The following is a list of concepts and terms related to InfiniBand in general and to Subnet Manag ers in particular It is included here for ease of reference but the main reference remains the InfiniBand Architecture Specification Table 3 Clossary Channel Adapter CA An IB device that terminates an IB link and executes transport functions Host Channel Adapter This may be an HCA Host CA or a TCA Target CA HCA HCA Card A network adapter card based on an InfiniBand channel adapter device IB Devices Integrated circuit implementing InfiniBand compliant communication IB Cluster Fabric A set of IB devices connected by IB cables Subnet A term assigned to administration activities traversing the IB connectivity only LID An address assigned to a port data sink or source point by the Subnet Man ager unique within the subnet used for directing packets within the subnet L
158. kernel module is loaded and is being used then the output of the command will be as follows host1 sdpnetstat S Proto Recv Q Send Q Local Address Foreign Address sdp 0 O T99 160 T0114 34216 sms Sos sdp 0 8047207193 168 10 1414 42724 I SE LO rana The example output above shows two active SDP sockets and contains details about the connec tions If the SDP kernel module is not loaded then the output of the command will be something like the following aoste capiet E 5 Proto Recv Q Send Q Local Address Foreign Address me HED EEIT roc O IN eco ola ais HT To verify whether the module is loaded or not you can use the 1smod command Mellanox Technologies 49 Rev 1 5 3 1 0 0 Driver Features ib sdp1250200 The example output above shows that the SDP module is loaded If the SDP module is loaded and the sdpnetstat command did not show SDP sockets then SDP is not being used by any application 4 4 3 2 Monitoring and Troubleshooting Tools SDP has debug support for both the user space 1ibsdp so library and the ib sdp kernel mod ule Both can be useful to understand why a TCP socket was not redirected over SDP and to help find problems in the SDP implementation User Space SDP Debug User space SDP debug is controlled by options in the libsdp conf file You can also have a local version and point to it explicitly using the following command hostl1 export LIBSDP CONFIG FILE lt path gt libsdp conf To ob
159. l ues guid gt node GUID guid 1 gt portl guid 2 gt port2 guid 3 gt system image GUID Note Port2 guid will be assigned even for a single port HCA the HCA ignores this value ignores this value It can be set to 0x0 MAC address base value Two MACs are automatically assigned to the fol lowing values mac gt portl mac l gt port2 Note This switch is applicable only for Mellanox Technologies Ethernet products blank_guids No commands Force clear the Flash semaphore on the device No command is allowed clear semaphor allowed when this switch is used e Warning May result in system instability or Flash corruption if the device or another application is currently using the Flash 1 mage burn verify Binary image file lt image gt burn query Run a quick query When specified mstflint will not perform full image integrity checks during the query operation This may shorten execution time when running over slow interfaces e g I2C MTUSB 1 nofs Burn image in a non failsafe manner Allow burning the firmware image without updating the invariant sector This is to ensure failsafe burning even when an invariant sector difference is detected 192 Mellanox Technologies Two MACs must be specified here The specified MACs are assigned to portl and port2 repectively Note This switch is applicable only for Mellanox Technologies Ethernet products Burn the image with blank GUIDs and
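The automatic assignment described above (guid for the node, guid+1 and guid+2 for the two ports, guid+3 for the system image; mac and mac+1 for the two ports) can be sketched as follows. The function names are hypothetical, this is not mstflint code, and it only reproduces the offsets given in the text.

```python
def derive_guids(base_guid: int) -> dict:
    """Map a base GUID to the four values listed in the text."""
    if not 0 <= base_guid <= 0xFFFFFFFFFFFFFFFF - 3:
        raise ValueError("base GUID plus 3 must fit in 64 bits")
    return {
        "node": base_guid,
        "port1": base_guid + 1,
        "port2": base_guid + 2,
        "system_image": base_guid + 3,
    }


def derive_macs(base_mac: int) -> dict:
    """Map a base MAC to the two per-port values listed in the text."""
    if not 0 <= base_mac <= 0xFFFFFFFFFFFF - 1:
        raise ValueError("base MAC plus 1 must fit in 48 bits")
    return {"port1": base_mac, "port2": base_mac + 1}
```

Rejecting a base value that would overflow is a choice made for this sketch; the text does not say how the tool treats that case.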
160. l=<val>: specifies the SL for this IPoIB MC group (default is 0). scope=<val>: specifies the scope for this IPoIB MC group (default is 2, link local). Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048). PortGUIDs list: PortGUID is the GUID of a partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too. The membership flag indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed. There are two useful keywords for PortGUID definition: • "ALL" means all end ports in this subnet • "SELF" means the subnet manager's port. An empty list means that there are no ports in this partition. Notes: • White space is permitted between delimiters. • The line can be wrapped after ':' after a Partition Definition and between. • A PartitionName does not need to be unique, but PKey does need to be unique. • If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also next note). • It is possible to split a partition configuration in more than one definition, but then the PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions). Examples: Default=0x7fff : ALL, SELF=full ; NewPartiti
161. l pairs of source/destination switches. Note: LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST. 2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process. 3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out. Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available. In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies, it is topology agnostic, and it fares well in the face of faults. It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path
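Step 2 above (assign a route to a layer only if the layer's channel dependency graph stays free of cycles, otherwise open a new layer) can be sketched in a few lines. This is a simplified illustration and not the opensm implementation: the data layout and function names are assumptions, routes are given as ordered lists of channels, and the balancing of step 3 and the limit on available SLs are left out.

```python
def has_cycle(graph: dict) -> bool:
    """Depth-first search for a cycle in {channel: set(next channels)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node) -> bool:
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True  # back edge: dependency cycle, possible deadlock
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


def assign_layers(routes: list) -> list:
    """Return, for each route, the index of the layer it was placed in."""
    layers = []      # one channel dependency graph per layer
    assignment = []
    for route in routes:
        deps = list(zip(route, route[1:]))  # consecutive channels depend on each other
        for index, graph in enumerate(layers):
            trial = {k: set(v) for k, v in graph.items()}
            for a, b in deps:
                trial.setdefault(a, set()).add(b)
            if not has_cycle(trial):
                layers[index] = trial
                assignment.append(index)
                break
        else:
            # No existing layer can take the route: open a new one.
            graph = {}
            for a, b in deps:
                graph.setdefault(a, set()).add(b)
            layers.append(graph)
            assignment.append(len(layers) - 1)
    return assignment
```

On a four-switch ring with two-hop clockwise routes, the first three routes share a layer and the fourth, which would close the dependency cycle, is moved to a second layer.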
162. le and their syntax See the comments in the following example They explain different keywords and their meaning OOE GHE OTG S port group using port GUIDs names Storage se ERE a deser Tor Tor thar rs Usel Eo ooon Other than that it is just a comment set SRP Targets port guid 0x10000000000001 0x10000000000005 0x1000000000FFFA port guid 0x1000000000FFFF 142 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 Mellanox Technologies 143 Rev 1 5 3 1 0 0 OpenSM Subnet Manager 144 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 qos match rule Source storage use match by source group only qos level name DEFAULT end gos match rule gos match rule use match by all parameters eje Glass LI source Virtual Servers destination Storage service id 0x0000000000010000 0x000000000001FFFF pkey 0x0F00 0x0FFF gos level name WholeSet end gos match rule end gos match rules 8 6 6 Simple QoS Policy Details and Examples Simple QoS policy match rules are tailored for matching ULPs or some application on top of a ULP PR MPR requests This section has a list of per ULP or per application match rules and the SL that should be enforced on the matched PR MPR query Match rules include e Default match rule that is applied to PR MPR query that didn t match any of the other match rules e SDP s SDP application with a specific t
163. lic key to the authorized _keys2 file on the target machine ti cata pul args soi Most Scho gt gt home Us craneo ss authorized keys lt username gt host2 s password Ri Enter password Hosts For a local machine simply add the key to authorized keys2 Koori Ses Moa Meas 7 Step 5 Test 106 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 hostis ssh host2 uname Linux 7 3 MPI Selector Which MPI Runs Mellanox OFED contains a simple mechanism for system administrators and end users to select which MPI implementation they want to use The MPI selector functionality is not specific to any MPI implementation it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs Additional MPI s not known by the Mellanox OFED installer can be listed in the MPI selector see the mpi selector 1 man page for details Note that MPI selector only affects the default MPI environment for future shells Specifically if you use MPI selector to select MPI implementation ABC this default selection will not take effect until you start a new shell e g logout and login again Other packages such as environment modules provide functionality that allows changing your environment to point to a new MPI implementation in the current shell The MPI sel
164. llowing lines: • For RedHat: BOOTPROTO=dhcp • For SLES: BOOTPROTO=dhcp. Note: If EoIB configuration files are included, ifcfg-eth<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine and under /etc/sysconfig/network/ on a SuSE machine. DHCP Server: Using a DHCP server with EoIB does not require special configuration. The DHCP server can run on a server located on the Ethernet side (using any Ethernet hardware) or on a server located on the InfiniBand side and running the EoIB module. 4.6.2.7 Static EoIB Configuration. To configure a static EoIB, you can use an EoIB configuration that is not based on DHCP. Static configuration is similar to a typical Ethernet device configuration. For further information on how to configure IP addresses, please refer to your Linux distribution documentation. Note: Ethernet configuration files are located at /etc/sysconfig/network-scripts/ on a RedHat machine and at /etc/sysconfig/network/ on a SuSE machine. 4.6.2.8 Sub Interfaces (VLAN). EoIB interfaces do not support creating sub interfaces via the vconfig command. To create interfaces with VLAN, refer to Section "Configuring VLANs" on page 71. 4.6.3 Retrieving EoIB Information. 4.6.3.1 mlx4_vnic_info. To retrieve information regarding EoIB interfaces, use the script mlx4_vnic_info. This script provides detaile
165. low utilization of fabric resources. The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS: • Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner • Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served • Packets carry class of service marking in the range 0 to 15 in their header SL field • Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table VL = SL-to-VL-MAP(in-port, out-port, SL) • The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries. DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack. 4.8.2 QoS Architecture. QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the "chronology approach" to describe how the overall system works. I. The network manager (human) provides a set of
166. lowing command ie ED cele o es jason alc The initrd files should now be found under tmp initrd_en Step 4 Create a directory for the ConnectX EN modules and copy them host1 mkdir p tmp initrd en lib modules mlnx en host1 cd lib modules uname r updates kernel drivers host1 cp net mlx4 mlx4 core ko tmp initrd en lib modules mlnx en host1 cp net mlx4 mlx4 en ko tmp initrd en lib modules mlnx en Step 5 To load the modules you need the insmod executable If you do not have it in your initrd please add it using the following command host1 cp sbin insmod tmp initrd en sbin Step 6 Ifyou plan to give your Ethernet device a static IP address then copy ifconfig Otherwise skip this step ess ca sota Liegi STE asia s Son Step 7 Now you can add the commands for loading the copied modules into the file init Edit the file tmp initrd en init and add the following lines at the point you wish the Ethernet driver to be loaded The order of the following commands for loading modules is critical echo loading Mellanox ConnectX EN driver sbin insmod lib modules mlnx en mlx4 core ko sbin insmod lib modules mlnx en mlx4 en ko Step 8 Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface Step 9 Save the init file Step 10 Close initrd osi ed o den EIS ge Eo ee E e ong 212 Mellanox Technologies Mellanox OFED for Linux User s Manual
167. lx4 0 port 1 gt 1b0 Down mix4 0 port 2 gt 1b1 Down eo rele Pon A Ge On 9 9 ibstatus Applicable Hardware All InfiniBand devices Description Displays basic information obtained from the local InfiniBand driver Output includes LID SMLID port state port physical state port width and port rate Synopsis ibstatus nl device nane Te Mellanox Technologies 173 Rev 1 5 3 1 0 0 InfiniBand Fabric Diagnostic Utilities Table 17 lists the various flags of the command Table 17 ibstatus Flags and Options Default Flag eC If Not Description Mandatory Specified lt device gt Optional All devices Print information for the specified device May specify more than one device lt port gt Optional but All ports of the Print information for the specified port only of the spec requires specify specified device ified device ing a device name Examples I List the status of all available InfiniBand devices and their ports lo Sale RES RG Nee R LR Mix ON ole se default gid fe80 0000 0000 0000 0000 0000 0007 3896 base lid 0x3 sun Jato 0x3 Suaves 4 ACTIVE phys state AS rate 20 Gb sec 4X DDR Infiniband device mlx4 0 port 2 status default gid fe80 0000 0000 0000 0000 0000 0007 3897 base lid 0x1 sde 0x1 STA 4 ACTIVE phys state Sail y rate 20 Gb sec 4X DDR Infinaband device mMineal port Status default gid fe80 0000 0000 0000 0002 c900 01
168. ms the necessary steps to accomplish the following: • Discover the currently installed kernel • Uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor's commercial stack • Install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel) • Identify the currently installed InfiniBand HCAs and perform the required firmware updates. 1.3.2 Software Components. MLNX_OFED_LINUX contains the following software components: • Mellanox Host Channel Adapter Drivers: mthca (IB only); mlx4 (VPI), which is split into multiple modules: mlx4_core (low-level helper), mlx4_ib (IB), mlx4_en (Ethernet), mlx4_fc (FCoE), mlx4_vnic (EoIB) • Mid-layer core: Verbs, MADs, SA, CM, CMA, uVerbs, uMADs • Upper Layer Protocols (ULPs): IPoIB, RDS, SDP, SRP Initiator, iSER • MPI: Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces; OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces; MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta) • OpenSM: InfiniBand Subnet Manager • Utilities: diagnostic tools; performance tests • Firmware tools (MFT) • Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files) • QIB: Low-level driver implementation for all QLogic InfiniPath PCI Express H
169. n 7 Regardless of the mode you always need to have lun 0 in any group s device list af numbers in ascending order except that the first lun must always be 0 S Setting SRPT_LOAD yes in etc infiniband openib conf is not enough as it only loads the ib srpt module but does not load scst not its dev_handlers 4 The sest_disk module pass thru mode of SCST is not supported by Mellanox OFED Example 1 Working with VDISK BLOCKIO mode Using the md0 device sda and cciss c1d0 a modprobe scst b modprobe scst_vdisk c echo open vdisk0 dev md0 BLOCKIO gt proc scsi tgt vdisk vdisk d echo open vdisk1 dev sda BLOCKIO gt proc scsi_tgt vdisk vdisk echo open vdisk2 dev cciss c1d0 BLOCKIO gt proc scsi tgt vdisk vdisk echo add vdisk0 0 gt proc scsi_tgt groups Default devices echo add vdisk1 1 gt proc scsi_tgt groups Default devices gt ye th o echo add vdisk2 2 gt proc scsi_tgt groups Default devices Example 2 working with scst_vdisk FILEIO mode Using md0 device and file 10G file a modprobe scst b modprobe scst_vdisk c echo open vdisk0 dev md0 gt proc scsi_tgt vdisk vdisk d echo open vdiskl 10G file gt proc scsi_tgt vdisk vdisk e echo add vdisk0 0 gt proc scsi_tgt groups Default devices f echo add vdiskl 1 gt proc scsi_tgt groups Default devices 2 Run Ore OS ao OMS oo i ica robe ml Ome IE Hor SLES IT A preci Ole Mellanox Technologies 217 Rev 1 5 3 1 0 0 For S
170. n an x8 slot with the following BIOS configuration parameters: • Max_Read_Req, the maximum read request size, is 512 or higher • MaxPayloadSize, the maximum payload size, is 128 or higher. Note: A Max_Read_Req of 128 and/or installing the card in an x4 slot will significantly limit bandwidth. To obtain the current setting for Max_Read_Req, enter: [command not recoverable]. To obtain the PCI Express slot (link) width and speed, enter: [command not recoverable]. 1. If the output is neither 81 nor 82, then the card is NOT installed in an x8 PCI Express slot. 2. The least significant digit indicates the link speed: 1 for PCI Express Gen 1 (2.5 GT/s), 2 for PCI Express Gen 2 (5 GT/s). Note: If you are running InfiniBand at QDR (40Gb/s 4X IB ports), you must run PCI Express Gen 2. 6.3.2 InfiniBand Performance Troubleshooting. InfiniBand (IB) performance depends on the health of the IB link(s) and on the IB card type. IB link speed (10Gb/s or SDR, 20Gb/s or DDR, 40Gb/s or QDR) also affects performance. Note: A latency-sensitive application should take into account that each switch on the path adds 200nsec at SDR and 150nsec for DDR. 1. To check the IB link speed, enter: ibstat. Check the value indicated after the "Rate:" string: 10 indicates SDR, 20 indicates DDR, and 40 indicates QDR. 2. Check that the link has NO symbol errors, since these errors result in the
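The interpretation rules above (81 or 82 means an x8 slot, the last digit gives the PCI Express generation, and the ibstat rate maps to SDR, DDR or QDR) can be written down as a small lookup. The function names are hypothetical, and the input is assumed to be the two-digit value that the manual's link width and speed query prints.

```python
PCIE_GENERATION = {
    "1": "PCI Express Gen 1 (2.5 GT/s)",
    "2": "PCI Express Gen 2 (5 GT/s)",
}
IB_RATE_NAME = {10: "SDR", 20: "DDR", 40: "QDR"}


def in_x8_slot(value: str) -> bool:
    """Per the text: anything other than 81 or 82 is not an x8 slot."""
    return value.strip() in ("81", "82")


def pcie_generation(value: str) -> str:
    """Decode the least significant digit; unknown digits are reported as such."""
    return PCIE_GENERATION.get(value.strip()[-1], "unknown")


def ib_rate_name(rate_gbps: int) -> str:
    """Map the number after 'Rate:' in ibstat output to SDR, DDR or QDR."""
    return IB_RATE_NAME.get(rate_gbps, "unknown")


def qdr_needs_gen2(value: str, rate_gbps: int) -> bool:
    """True when the port runs at QDR but the slot is not PCI Express Gen 2."""
    return rate_gbps == 40 and value.strip()[-1] != "2"
```

A reading of 81 together with a 40 Gb/s rate would be flagged by `qdr_needs_gen2`, which corresponds to the note that QDR requires PCI Express Gen 2.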
171. nable pre t11 mode if non zero default 0 nto ga ross ee Reset device on internal errors if non zero default 1 C 2 mlx4_ ib Parameters SI Tene Enable debug tracing 1f gt 0 default 0 C 3 mlx4 core Parameters inline oo Threshold for using inline data default is 128 Pos Enable RSS for Incoming e Poca rio default 1 enabled udp rss Enable RSS for incoming UDP traffic default 1 enabled num lro Number of LRO sessions per ring or disabled 0 default is 32 lp reasm Allow the assembly of fragmented IP packets default 1 enabled DEC Priority based Flow Control policy on ANS ASS dera ie ales 0 PES Priority based Flow Control POLLO On UPS pon oa mask default is 0 C 4 mlx4 fc Parameters ore nm Max outstanding FC exchanges per virtual HBA ogis Dera ee a ones max vhba per port Max vHBAs allowed per port Default 2 int Mellanox Technologies 221 Rev 1 5 3 1 0 0 Appendix D ib bonding Driver for Systems Using SLES10 D 1 SP4 Using the ib bonding Driver The ib bonding driver is a High Availability solution for IPoIB interfaces It is based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB The ib bonding package contains a bonding driver and a utility called ib bond to manage and control the driver operation The 1b bonding driver comes with the ib bonding package run rpm qi ib bonding to get the package information The ib bonding driver can be lo
172. nager as a device that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was updated to support AR, the AR Manager will need to be restarted (by restarting Subnet Manager) to allow it to configure AR on this switch. This option can be changed on-the-fly.
AR_MODE <bounded|free>: Adaptive Routing Mode. Default: bounded.
• free: no constraints on output port selection
• bounded: the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
This option can be changed on-the-fly.
AGEING_TIME <usec>: Applicable to bounded AR mode only. Specifies how long there must be no traffic before the switch declares a transmission burst finished and allows changing the output port for the next transmission burst (32-bit value). Default: 30. This option can be changed on-the-fly.
MAX_ERRORS <N>, ERROR_WINDOW <N>: When the number of send/receive errors or timeouts exceeds MAX_ERRORS in less than ERROR_WINDOW seconds, the AR Manager will abort, returning control back to the Subnet Manager. Values for both options: 0 - 0xffff.
• MAX_ERRORS = 0: zero tolerance, abort configuration on first error. Default: 10
• ERROR_WINDOW = 0: mechanism disabled, no error checking. Default: 5
This option can be changed on-the-fly.
LOG_FILE <full path>: AR Manager l
173. ne of the adapter ports.
Related Documentation
Table 4 - Reference Documents
• InfiniBand Architecture Specification, Vol. 1, Release 1.2.1: The InfiniBand Architecture Specification that is provided by IBTA.
• IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002), Document # PDF: SS94996: Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gb/s Operation.
• Fibre Channel BackBone 5 standard for Fibre Channel over Ethernet, Document # INCITS xxx-200x Fibre Channel Backbone: http://www.t11.org (draft).
• Firmware Release Notes for Mellanox adapter devices: See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.
• MFT User's Manual: Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.
• MFT Release Notes: Release Notes for the Mellanox Firmware Tools. See under the docs/ folder of the installed package.
Support and Updates Webpage
Please visit http://www.mellanox.com > Products > IB/VPI SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.
1 Mellanox OFED Overview
1.1 Introduction to
174. net Manager 8 6 7 SL2VL Mapping and VL Arbitration OpenSM cached options file has a set of QoS related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports These parameters are e Max VLs the maximum number of VLs that will be on the subnet e High limit the limit of High Priority component of VL Arbitration table IBA 7 6 9 e VLArb low table Low priority VL Arbitration table IBA 7 6 9 template e VLArb high table High priority VL Arbitration table IBA 7 6 9 template e SL2VL SL2VL Mapping table IBA 7 6 6 template It is a list of VLs corresponding to SLs 0 15 Note that VL15 used here means drop this SL There are separate QoS configuration parameters sets for various target types CAs routers switch external ports and switch s enhanced port 0 The names of such parameters are prefixed by gos lt type gt _ string Here is a full list of the currently supported sets e qos ca QoS configuration parameters set for CAs e qos rtr parameters set for routers e qos swO parameters set for switches port 0 e qos swe parameters set for switches external ports Here s the example of typical default values for CAs and switches external ports hard coded in OpenSM initialization Gos camas dos ca oa EA cos ica lan oe OU o O OO o LaS que ca laro Loy 420 id 24 304 pad ardor depor IRA 2 O O On 0 RAI qos swe max vis 15 GOs NS ole lima cay CeO Ss We Vel ee OA DE
175. network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding master and Ethernet bonding master. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
Note: Restarting openibd does not keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart
4.8 Quality of Service
4.8.1 Quality of Service Overview
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
Figure 2: I/O Consolidation Over InfiniBand
QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager". The basic need is to differentiate the service levels provided to different traffic flows such that a policy can be enforced and can control each f
176. nfiguration for a typical enterprise data center EDC with service oriented architecture SOA with IPoIB carrying all application traffic and SRP used for storage QoS Levels e Application traffic IPoIB UD and CM and SDP Isolated from storage Min BW of 50 e SRP Min BW 50 Bottleneck at storage nodes Administration e OpenSM QoS policy file S In the following policy file example replace SRPT with the real SRP Target port GUIDs gt gOssulips default Ipo al Mellanox Technologies 151 Rev 1 5 3 1 0 0 OpenSM Subnet Manager sdp zi stor Langer pore oa TEE SENS end gos ulps e OpenSM options file qos max vls 8 GOS aaa IO Cols R RSI eS Sg oY gos vlarb Tow 0 1 GOs SZ py AS toy o lo boy oy Lo 8 7 3 EDC 3 tier IPoIB RDS SRP The following is an example of QoS configuration for an enterprise data center EDC with IPoIB carrying all application traffic RDS for database traffic and SRP used for storage QoS Levels e Management traffic ssh IPoIB management VLAN partition A Min BW 10 e Application traffic IPoIB application VLAN partition B Isolated from storage and database Min BW of 30 e Database Cluster traffic RDS Min BW of 30 e SRP Min BW 30 Bottleneck at storage nodes Administration e OpenSM QoS policy file S In the following policy file example replace SRPT with the real SRP Initiator port GUIDs ha Gos ules defaul
177. nfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many IO modes on real or virtual devices in the backend:
1. scst_vdisk: fileio and blockio modes. This allows turning software raid volumes, LVM volumes, IDE disks, block devices and normal files into SRP luns.
2. NULLIO mode allows measuring the performance without sending IOs to real devices.
B.1 Prerequisites and Installation
1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target.
Note: On distribution default kernels you can run scst_vdisk blockio mode to obtain good performance.
2. Download and install the SCST driver. The supported version is 1.0.1.1.
a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloads.html
b. Untar scst-1.0.1.1: [...] $ cd scst-1.0.1.1
c. Install scst-1.0.1.1 as follows: $ make && make install
B.2 How to Run
A. On an SRP Target machine:
1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).
Then you can have any lun number following lun 0 (it is not required to have the lu
178. ng Mellanox OFED To use this ibdiagnet version run ibdiagnet da ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices It then produces the following files in the output directory which is defined by the o option described below 9 4 1 SYNOPSYS bdiagnet Te count TI E T eT H ame lt copo tule T E lt dev index gt o port nun Al nl pele lt M valhe Am lo aso ia ls lt 2 515110 gt skip lt ibdiag check s gt load db lt db file gt 166 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 OPTIONS UNO Min number of packets to be sent across each link default 10 SV Enable verbose mode ST Provides a report of the fabric qualities E Stopos L Specifies the topology file name s lt sys name gt Specifies the local system name Meaningful only if a topology file is specified i lt dev index gt Specifies the index of the device of the port used to connect to the IB fabric in case of multiple devices on the local system p lt port num species Ene local device sport mum used Eo connect Eo ieee NE ento OE or Specifies the directory where the output files will be placed default tmp A oo Specifies the expected link width AOS 5 05 Specifies the expected link speed SUI Dump abi the fabric nks pP Counters nto ahdiagner pm pe Reset all the fabric links pmCounter
179. not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.
Note: The user can override the node list manually.
Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.
At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Note: Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.
8.5.3.1 UPDN Algorithm Usage
Activation through OpenSM:
• Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
• Use '-a <
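The ranking process described in stage 2 above can be sketched in a few lines of Python. This is only an illustration of BFS ranking from a set of root switches, not OpenSM's implementation: the function name rank_switches, the adjacency-dictionary representation, and the switch names are all hypothetical.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """Assign rank 0 to every root switch, then rank the remaining
    switches incrementally with a breadth-first search.

    adjacency: dict mapping a switch name to its neighbor switches
               (hypothetical representation, not an OpenSM structure).
    roots:     switches selected by the auto-detect stage or by the user.
    """
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in rank:
                # First visit is along a shortest path from some root.
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


# Hypothetical two-level fabric: two spines (roots) and two leaves.
fabric = {
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
}
ranks = rank_switches(fabric, ["spine1", "spine2"])
```

With these ranks, a hop toward a lower rank can be treated as "up" and a hop toward a higher rank as "down", which is the ordering the Up/Down rule relies on.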
180. nsistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.
Use the '-R dor' option to activate the DOR algorithm.
8.5.7 Torus-2QoS Routing Algorithm
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:
• Routing that is free of credit loops
• Two levels of QoS, assuming switches support 8 data VLs
• Ability to route around a single failed switch and/or multiple failed links without introducing credit loops or changing path SL values
• Very short run times, with good scaling properties as fabric size increases
8.5.7.1 Unicast Routing
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the
181. nstraintbrrors 1L00 LinkIntegrityErrors 10 ExcBufOverrunErrors 10 VL15Dropped 100 gt ibcheckerrs v T threshl 2 1 Error check om lid 2 MT47300 Infiniscale TIT Mellanox Technologies port 1 OK 9 15 mstflint Applicable Hardware Mellanox InfiniBand and Ethernet devices and network adapter cards Description Queries and burns a binary firmware image file on non volatile Flash memories of Mellanox InfiniBand and Ethernet network adapters The tool requires root privileges for Flash access If you purchased a standard Mellanox Technologies network adapter card please down 74 load the firmware image from www mellanox com gt Downloads gt Firmware If you purchased a non standard card from a vendor other than Mellanox Technologies please contact your vendor To run mstflint you must know the device location on the PCI bus See Example 1 for details Synopsis mstflint switches lt command gt parameters Mellanox Technologies 191 Rev 1 5 3 1 0 0 InfiniBand Fabric Diagnostic Utilities Table 23 lists the various switches of the utility and Table 24 lists its commands Table 23 mstflint Switches Sheet 1 of 2 Affected Relevant Description Commands Print the help menu Print an extended help menu Specify the device to which the Flash is connected Switch d evice lt device gt guid lt GUID gt GUID base value 4 GUIDs are automatically assigned to the following va
182. nual Rev 1 5 3 1 0 0 5 valid lids dumped 2 Dump all Lids with valid out ports of the switch with Lid 2 gt LOroure 2 Unucast Mas EE Ore Or Swatch Laa guia 0x000Z2ZC002 fir 0 MIS inputs cade DET Mellanox Technologies Lio Out Destination Port Info 00002300002 switch poriguad 0x9 002090288800 MIAI396 mts cole TOR Melltanox ech nologies 0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Tech nologies Ux00 00077 channel Adaprex oe D ko AMS 000202050000 TOS sion ACA 0x0007 021 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1 0x0008 008 lt Channel Adapter portquid 0x0002c902002533200d UswL36 HCA 1 5 valid lids dumped 3 Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2 gt Be 2 2 3 7 Unicast skids 0320 T BO TSAR ESA a TLEER E O GT n e 1d Mellanox Technologies hid Out Destination Port TREO Ox000S UA switch por Equidad OOOD CPE OOO E MIC E Intaniscale lll Me ero Tech nologies 0x0006 007 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1 Ox000r 02 Channel Adapter portquid 0x0002c90200250 7437 TT HCA 1 3 valid lids dumped 4 Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff 004016 gt ib BOU ts COMO 0D caen OA 06 Une astas UZ0S0xZS on swatch had OO RR LERH cr EOT MEA TS o Une rats cade dd Mellanox Technologies Lio OWE Destination Port inte
183. ny of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification. For maximum resiliency, no seed specification should share a switch with any other seed specification. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS can grant path SL values that are constant, regardless of which seed was used to initiate topology discovery.
portgroup_max_ports max_ports: This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails.
port_order p1 p2 p3 ...: This keyword specifies the order in which CA ports on a destination switch are visited when computing routes. When the fabric contains switches connected with multiple parallel links, routes are distributed in a round-robin fashion across such links, and so changing the order in which CA ports are visited changes the distribution of routes across such links. This may be advantageous for some specific traffic patterns.
184. o be invoked by specifying '-R minhop'. The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch and LFT output port assignment. Link subscription is also equalized with the ability to override based on port GUID. The latter is supplied by:
-i <equalize-ignore-guids-file>
-ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by guids) that will be ignored by the link load equalization algorithm.
LMC awareness routes based on (remote) system or switch basis.
8.5.3 UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree, and one of its loops may experience a deadlock (due, for example, to high pressure).
The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may
185. ocal Device/Node/System: The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.
[...]: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch providing the list of ports to which received multicast packets are forwarded. The table is organized by MLID.
Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
Standby Subnet Manager: A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
Subnet Manager (SM): One of several entities involved in the configuration and control of the subnet.
Unicast Linear Forwarding Tables (LFT): A table that exists in every switch providing the port through which packets should be sent to each LID.
Virtual Protocol Interconnect (VPI): A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to o
186. od /lib/modules/ib/ib_cm.ko
/sbin/insmod /lib/modules/ib/ib_uverbs.ko
/sbin/insmod /lib/modules/ib/ib_ucm.ko
/sbin/insmod /lib/modules/ib/ib_umad.ko
/sbin/insmod /lib/modules/ib/iw_cm.ko
/sbin/insmod /lib/modules/ib/rdma_cm.ko
/sbin/insmod /lib/modules/ib/rdma_ucm.ko
/sbin/insmod /lib/modules/ib/mlx4_core.ko
/sbin/insmod /lib/modules/ib/mlx4_ib.ko
/sbin/insmod /lib/modules/ib/ib_mthca.ko
Note: The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release notes.
/sbin/insmod /lib/modules/ib/ipoib_helper.ko
/sbin/insmod /lib/modules/ib/ib_ipoib.ko
In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
/sbin/insmod /lib/modules/ib/ib_ipoib.ko lro=0
Step 10. Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example: [...]
Step 11. Save the init file.
Step 12. Close initrd: [...]
187. og file. Default: /var/log/armgr.log. This option can be changed on-the-fly.
LOG_SIZE <size in MB>: This option defines the maximal AR Manager log file size in MB. The log file will be truncated and restarted upon reaching this limit. 0: unlimited log file size. Default: 5. This option cannot be changed on-the-fly.
Per-switch AR Options
A user can provide per-switch configuration options with the following syntax:
SWITCH <GUID> [...]
The following are the per-switch options:
Table 8 - Adaptive Routing Manager Per-Switch Options File
ENABLE <true|false>: Allows you to enable/disable AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. Default: true. This option can be changed on-the-fly.
AGEING_TIME <usec>: Applicable to bounded AR mode only. Specifies how long there must be no traffic before the switch declares a transmission burst finished and allows changing the output port for the next transmission burst (32-bit value). In the per-switch options file, this option refers to the particular switch only. Default: 30. This option can be changed on-the-fly.
Example of Adaptive Routing Manager Options File:
ENABLE true
LOG_FILE /tmp/ar_mgr.log
LOG_SIZE 100
MAX_ERRORS 10
ERROR_WINDOW 5
[...]
188. 9.4.3 ERROR CODES
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package
9.5 ibdiagpath (IB diagnostic path)
ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path.
The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted as it may change.
ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the op
189. on gy File red Package 170 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 9 6 ibv devices Applicable Hardware All InfiniBand devices Description Lists InfiniBand devices available for use from userspace including node GUIDs Synopsis ASAS Examples 1 List the names of all available InfiniBand devices gt evite device node GUID mthca0 0002c9000101d150 mlx4 0 0000000000073895 9 7 ibv_devinfo Applicable Hardware All InfiniBand devices Description Queries InfiniBand devices and prints about them information that is available for use from user space Synopsis Toe E dee oa L I EE Table 16 lists the various flags of the command Table 16 ibv devinfo Flags and Options Default If Not Description Specified Optional Mandatory d lt device gt Optional First found Run the command for the provided IB device device 1b device dev lt device gt Mellanox Technologies 171 Rev 1 5 3 1 0 0 InfiniBand Fabric Diagnostic Utilities Table 16 ibv devinfo Flags and Options Default If Not Description Specified Optional Mandatory 1 lt port gt Optional All device ports Query the specified device port lt port gt ib port lt port gt Optional Only list the names of InfiniBand devices Optional Inactive Print all available information about the InfiniBand verbose device s Examples 1 List the names
190. on ipoib 0x1231450 tu11 0x77507209034 imi 0x2 b34ar230G YetAnotherOne 0x300 s SELF full YetAnotherOne 0x300 3 ALL limited Sharel0 0x30 detmember tull 3 02123451 0x123452 0x123453 0x123454 will be limited AIA AA 0x12 34547 e VI 0x123456 0x123457 will be limited SharelO 0x80 defmember limited 0x123450 02123457 0x123458 full ShareIO 0x80 defmember full 0x123459 0x12345a ShareIO 0x80 defmember full IMSS 0x12345c limited 021234501 The following rule is equivalent to how OpenSM used to run prior to the partition manager Detail porosa 8 5 Routing Algorithms OpenSM offers six routing engines 1 Min Hop Algorithm Based on the minimum hops to each node where the path length is optimized 2 UPDN Algorithm Based on the minimum hops to each node but it is constrained to ranking rules This algorithm should be chosen if the subnet is not a pure Fat Tree and a deadlock may occur due to a loop in the subnet 3 Fat tree Routing Algorithm This algorithm optimizes routing for a congestion free shift communication pattern It should be chosen if a subnet is a symmetrical Fat Tree of various types not just a K ary N Tree non constant K not fully staffed and for any CBB ratio Similar to UPDN Fat Tree routing is con strained to ranking rules 4 LASH Routing Algorithm Uses InfiniBand virtual layers SL to provide deadlock free shor
191. on of the old packages. Pre-existing configuration files will be saved with the extension ".conf.saverpm".
• If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).
• If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the "mlnx_add_kernel_support.sh" script located under the docs/ directory.
Usage: [...]
Example: The following command will create a MLNX_OFED_LINUX ISO image for RedHat 5.6 under the /tmp directory.
[...] MLNX_OFED_LINUX-1.5.3-rhel5.6-x86_64.iso
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue? [y/N]: y
Removing OFED RPMs...
[...]
Created /tmp/MLNX_OFED_LINUX-1.5.3-rhel5.6-x86_64.iso
2.3.2 Installation Script
Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure," on page 26.
Usage: /mnt/mlnxofedinstall [OPTIONS]
Installation Options
192. onds = 4855.96 Mbit/sec
1000 iters in 0.01 seconds = 13.50 usec/iter
Using rdma_cm Tests
On Server:
ucmatose
cmatose: starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose: disconnecting
disconnected
test complete
return status 0
On Client:
ucmatose -s 20.4.3.219
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
This server-client run is without PCP or VLAN because the IP address used does not belong to a VLAN interface. If you specify a VLAN IP address, then traffic should go over VLAN.
Type Of Service (TOS)
The TOS field for rdma_cm sockets can be set using the rdma_set_option() API, just as it is set for regular sockets. If the user does not set a TOS, the default value (0) will be used. Within the rdma_cm kernel driver, the TOS field is converted into an SL field. The conversion formula is as follows:
SL = TOS >> 5 (i.e., take the 3 most significant bits of the TOS field)
In the hardware driver, the SL field is converted into PCP by the following formula:
PCP = SL & 7 (take the 3 least significant bits of the SL field)
Note: SL affects the PCP only when the traffic goes over tagged VLAN frames.
4.1.10 Configuring DAPL over RoCE
The default dat
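The two conversion formulas given under Type Of Service (TOS), SL = TOS >> 5 and PCP = SL & 7, can be checked with a short Python sketch. The function names tos_to_sl and sl_to_pcp are hypothetical helpers for illustration only; they are not part of rdma_cm or of any Mellanox API.

```python
def tos_to_sl(tos: int) -> int:
    """Return the SL for an 8-bit TOS value: its 3 most significant bits."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in 8 bits")
    return tos >> 5


def sl_to_pcp(sl: int) -> int:
    """Return the PCP for an SL value: its 3 least significant bits."""
    if sl < 0:
        raise ValueError("SL must be non-negative")
    return sl & 7


# Default TOS of 0 maps to SL 0 and PCP 0.
default_pcp = sl_to_pcp(tos_to_sl(0))

# A TOS of 0x60 (binary 011 00000) maps to SL 3, and therefore PCP 3.
example_pcp = sl_to_pcp(tos_to_sl(0x60))
```

Because only the top three TOS bits survive the shift, TOS values that differ only in their lower five bits map to the same SL and therefore the same PCP.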
193. onnectX / ConnectX-2 ports designation to eth (see mlx4_release_notes.txt for details)
• Configure the IP address of the interface so that the link will become active
• All IB verbs applications which run over IB verbs should work on RoCE links as long as they use GRH headers (that is, as long as they specify use of GRH in their address vector)
4.1.5 Ported Applications
The following applications are ported with RoCE:
• ibv_*_pingpong examples are ported. The user must specify the GID of the remote peer using the new '-g' option. The GID has the same format as that in /sys/class/infiniband/mlx4_0/ports/1/gids/0.
Note: Care should be taken when using ibv_ud_pingpong. The default message size is 2K, which is likely to exceed the MTU of the RoCE link. Use ibv_devinfo to inspect the link MTU and specify an appropriate message size.
• All rdma_cm applications should work seamlessly without any change
• libsdp works without any change
• Performance tests
4.1.6 GID Tables
With RoCE, there may be several entries in a port's GID table. The first entry always contains the IPv6 link-local address of the corresponding Ethernet interface. The link-local address is formed in the following way:
gid[0..7] = fe80000000000000
gid[8] = mac[0] ^ 2
gid[9] = mac[1]
gid[10] = mac[2]
gid[11] = ff
gid[12] = fe
gid[13] = mac[3]
gid[14] = mac[4]
gid[15] = mac[5]
If VLAN is suppo
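As a worked illustration of how the first GID table entry is derived from the Ethernet MAC address, the Python sketch below assumes the byte layout gid[0..7] = fe80000000000000, gid[8] = mac[0] ^ 2, gid[9..10] = mac[1..2], gid[11..12] = ff fe, gid[13..15] = mac[3..5]. The function name roce_link_local_gid and the sample MAC address are hypothetical, and the colon-separated output format is an assumption about how the GID is displayed.

```python
def roce_link_local_gid(mac: str) -> str:
    """Build the 16-byte link-local GID for a MAC given as 'aa:bb:cc:dd:ee:ff'.

    Returns the GID as eight colon-separated 16-bit hex groups.
    """
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("a MAC address has exactly 6 octets")

    gid = bytearray(16)
    gid[0:8] = bytes.fromhex("fe80000000000000")  # link-local prefix
    gid[8] = octets[0] ^ 2                        # flip the universal/local bit
    gid[9] = octets[1]
    gid[10] = octets[2]
    gid[11] = 0xFF                                # fixed filler bytes ff fe
    gid[12] = 0xFE
    gid[13] = octets[3]
    gid[14] = octets[4]
    gid[15] = octets[5]

    return ":".join(gid[i:i + 2].hex() for i in range(0, 16, 2))


# Hypothetical MAC address, used only to show the byte placement.
example_gid = roce_link_local_gid("00:02:c9:00:00:01")
```

For the sample MAC this yields fe80:0000:0000:0000:0202:c9ff:fe00:0001, which makes the placement of the ff fe filler and the flipped bit in the first MAC octet easy to see.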
194. onvey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier. The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier; see Section A.3.2, "Configuring the DHCP Server," on page 201.
DHCP Server
In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of the Mellanox OFED for Linux installation.
The DHCP server must run on a machine which has loaded the IPoIB module. To run the DHCP server from the command line, enter:
dhcpd <IB network interface name> -d
Example:
host1# dhcpd ib0 -d
DHCP Client (Optional)
Note: A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd".
195. opensm/ar_mgr.conf. To set an alternative location, please perform the following:

1. Add armgr --conf_file <ar_mgr options file name> to the event_plugin_options option in the file:

    # Options string that would be passed to the plugin(s)
    event_plugin_options armgr --conf_file <ar_mgr options file name>

2. Run Subnet Manager with the new options file:

    opensm -F <options file name>

The AR Manager options file contains two types of parameters:
1. General options: options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric.
2. Per-switch options: options which describe specific switch behavior.

Note the following:
• The Adaptive Routing configuration file is case sensitive.
• You can specify options for a nonexistent switch GUID. These options will be ignored until a switch with a matching GUID is added to the fabric.
• The Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is executed at every heavy sweep of the Subnet Manager.
• If the AR Manager fails to parse the options file, default settings for all the options will be used.

8.8.5.1 General AR Manager Options

Table 7 - Adaptive Routing Manager Options File
ENABLE: <true|false>. Enable/disable Adaptive Routing on fabric switches. Default: true. Note that if a switch was identified by AR Ma
196. ority table. Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64. Below is an example of SL2VL and VL Arbitration configuration on subnet:

    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.

8.6.8 Deployment Example

Figure 4 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

[Figure 4: Example QoS Deployment]
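The credit arithmetic in this snippet can be sketched as follows. This is a minimal illustration, not OpenSM code: it assumes one credit covers 64 bytes (inferred from the statement that a 4KB packet requires 64 credits) and that the high limit is counted in 4KB units (inferred from "6 x 4KB = 24KB"). All function and constant names are hypothetical.

```python
import math

# Assumption: one credit covers 64 bytes, inferred from "4KB MTU -> 64 credits".
BYTES_PER_CREDIT = 64
# Assumption: the high limit is counted in 4KB units, per "6 x 4KB = 24KB".
HIGH_LIMIT_UNIT_BYTES = 4096


def credits_per_packet(mtu_bytes: int) -> int:
    """Number of credits consumed by one MTU-sized packet."""
    return math.ceil(mtu_bytes / BYTES_PER_CREDIT)


def weight_is_effective(weight: int, mtu_bytes: int) -> bool:
    """True if the weight is 0 (VL turned off) or a whole number of packets."""
    return weight == 0 or weight % credits_per_packet(mtu_bytes) == 0


def high_limit_burst_bytes(high_limit: int) -> int:
    """Largest single high-priority burst allowed by the given high limit."""
    return high_limit * HIGH_LIMIT_UNIT_BYTES
```

With a 4KB MTU, weights such as 64, 128, and 192 pass the check, while a weight such as 100 would let a VL stall partway through a packet.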
197. ork administrated vNics, refer to Section 4.6.3.1, "mlx4_vnic_info", on page 73.

4.6.2.2 EoIB Network Administered vNic

In network administered mode, the configuration of the vNic is done by the BridgeX. If a vNic is configured for a specific host, it will appear on that host once a connection is established between the BridgeX and the mlx4_vnic module. This connection between the mlx4_vnic modules and all available BridgeX boxes is established automatically when the mlx4_vnic module is loaded. If the BridgeX is configured to remove the vNic, or if the connection between the host and BridgeX is lost, the vNic interface will disappear (running ifconfig will not display the interface). Similar to host administered vNics, a network administered vNic resides on a specific vHub. For further information on how to configure a network administered vNic, please refer to BridgeX documentation.

To disable network administered vNics on the host side, load the mlx4_vnic module with the net_admin module parameter set to 0.

4.6.2.3 VLAN Configuration

A vNic instance is associated with a specific vHub group. This vHub group is connected to a BridgeX external port and has a VLAN tag attribute. When creating/configuring a vNic, you define the VLAN tag it will use via the vid or the VNICVLAN fields (if these fields are absent, the vNic will not have a VLAN tag). The vNic s
198. orus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

8.5.7.2 Multicast Routing

Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root switch and each "+" is a non-root switch:

[Figure: full-fabric spanning tree on the 6x5 torus; x axis 0 to 5, y axis 0 to 4]

For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. H
199. orwarding tables of the fabric switches
ibdiagnet.mcfdbs: A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks: In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option

In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.

After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
• SM report
• Number of nodes and systems
• Hop-count information: maximal hop-count, an example path, and a hop-count histogram
• All CA-to-CA paths traced
• Credit loop report
• mgid-mlid-HCAs multicast group and report
• Partitions report
• IPoIB report

Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not reported.

Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.
200. owever, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop. In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.

Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.

In the presence of link or switch failures that result in a fabric for which torus-2QoS can generate credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, consider that same 2D 6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:

[Figure: master spanning tree for the 6x5 torus with the failed link]
201. pecitying when both SDP and AP INE sockets should be used Note that both semantics is different for server and client roles For server it means that the server will beis tening On both she end RCP sockeus por Client the connect function will first attempt to use SDP and miles utente ito Ter suk see DP connection ease lt role gt can be one of server or listen for defining the listening port address family client or connect for defining the connected port address family lt program name gt 52 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 1 0 0 Defines the program name the rule applies to not includ ing the path Wildcards with same semantics as ls are supported tana SO Dz oma E On an sorodran with a name starting with db2 t cp would match on ttcp Sues If program name is not provided default the statement matches all programs Salares 2 Either the local address to which the server binds or the remote server address to which the client connects The syntax for address matching is Sera dress peer Hiengtn I IPv4 address 0 9 1 0 9 1 0 9 1 0 9 each sub number lt 255 pone US US aree vale e A pretix length of 24 Matenes the subnet mask US Oa Soe oie PES SEO e Ue selec ra eg oe a rs ln POMBO start port end port where port numbers are gt 0 and lt 65536 Note that rules are evaluated in the order of definition So the first match wins If no match is
202. ple SRP connections from the SRP Initiator to the same SRP Target, to the same Target IB port, or to different IB ports on the same Target HCA. In case of a single Target IB port (i.e., SRP connections use the same path), the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command. Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, a different initiator_ext value is needed on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path. If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:

    id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200

Notes:
1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRPHA_ENABLE to yes).

4.5.2.6 High Availability (HA)

Overview

High Availability works using the Device Mapper DM
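The relationship between the Target port GUID and the initiator_ext value in the example above can be sketched in Python. In that example, the last 16 hex digits of dgid (0002c90200402bed) appear byte-reversed in initiator_ext (ed2b400002c90200). The sketch below assumes this byte reversal is how the convention is applied; that is an inference from the single example, not a documented rule, and the function names are hypothetical.

```python
def port_guid_from_dgid(dgid_hex: str) -> str:
    """Return the low 64 bits (16 hex digits) of a 128-bit GID string."""
    if len(dgid_hex) != 32:
        raise ValueError("expected a 32-hex-digit GID")
    return dgid_hex[16:]


def initiator_ext_from_port_guid(port_guid_hex: str) -> str:
    """Byte-reverse an 8-byte port GUID (assumed convention, see lead-in)."""
    raw = bytes.fromhex(port_guid_hex)
    if len(raw) != 8:
        raise ValueError("expected a 16-hex-digit port GUID")
    return raw[::-1].hex()


# Values taken from the example connection string in the text.
dgid = "fe800000000000000002c90200402bed"
initiator_ext = initiator_ext_from_port_guid(port_guid_from_dgid(dgid))
```

In practice, srp_daemon with the -n flag assigns this value automatically, so a helper like this is only useful for checking a connection string by hand.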
203. prevent routing that is free of credit loops, and will log warnings and refuse to route. If no_fallback was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case, all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will continue to remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval, and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored.

In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath.

To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.

8.5.7.6 Torus-2QoS Configuration File Syntax

The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored
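The line-filtering rule for torus-2QoS.conf can be sketched as a small Python helper. This is an illustration only and does not reproduce the OpenSM parser: it assumes the comment character is "#", and the keywords in the sample text are placeholders, not real torus-2QoS.conf keywords.

```python
def significant_lines(config_text: str, comment_char: str = "#") -> list[str]:
    """Drop blank lines and lines whose first non-whitespace character
    is the comment character; return the remaining lines, stripped."""
    kept = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(comment_char):
            continue  # comment line, possibly indented
        kept.append(stripped)
    return kept


# Placeholder content; the keywords below are hypothetical.
sample = """
# a comment line
   # an indented comment line
example_keyword 1 2 3

another_keyword value
"""
```

Only the two keyword lines of the sample survive the filter; everything else is skipped before any keyword parsing would begin.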
204. r s Manual Rev 1 5 3 1 0 0 Table 11 Congestion Control Manager CA Options File ca control map An array of sixteen bits one for each SL Each bit indicates Values Oxffff whether or not the corresponding SL entry is to be modi fied Sets the CC Table Index CCTI increase Default 1 trigger threshold Sets the trigger threshold Default 2 Sets the CC Table Index CCTI minimum Default 0 Sets all the CC table entries to a specified value The first Values lt comma separated list gt entry will remain 0 whereas last value will be set to the rest Default 0 of the table When the value 1s set to 0 the CCT calculation is based on the number of nodes ceti timer Sets for all SL s the given ccti timer Default 0 When the value 1s set to 0 the CCT calculation is based on the number of nodes Table 12 Congestion Control Manager CC MGR Options File max_ errors When number of errors exceeds max_errors of send Values error_window receive errors or timeouts in less than error_window sec e max errors 0 zero tollerance onds the CC MGR will abort and will allow OpenSM to abort configuration on first error proceed e error window 0 mechanism dis abled no error checking Default 5 cc statistics cycle Enables CC MGR to collect statistics from all nodes every Default 0 cc statistics cycle seconds When the value is set to 0 no sta tistics are collected Mellanox Technologies 161 9 9 1
205. rces, thus improving performance by experiencing fewer cache misses. Huge pages are supported for:
• UD QPs
• RC QPs
• CQs

Huge pages are OFF by default. An application can be instructed to use huge pages by exporting the following environment variables:
• HUGE_UD=y
• HUGE_RC=y
• HUGE_CQ=y

For huge pages allocation to succeed, the system administrator will have to reserve huge pages from the OS. This can be done at runtime by running

5 Working With VPI

VPI allows ConnectX ports to be independently configured as either IB or Eth. If a ConnectX port is configured as Eth, it may also function as a Fibre Channel HBA.

5.1 Port Type Management

ConnectX ports can be individually configured to work as InfiniBand or Ethernet or Fibre Channel over Ethernet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded. Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices. Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".

Possible port types are:
• eth: Ethernet
• ib: InfiniBand
206. red fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the range 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.

8.5.7.5 Operational Considerations

Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement.

If a change in fabric topology causes changes in path SL values required to route without credit loops, in general all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing by running applications is not required when the fabric is routed with torus-2QoS.

Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover, provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details.

Torus-2QoS will detect configurations of failed switches and links that
207. rnet packets bearing a dedicated ether type. While the use of GRH is optional within IB subnets, it is mandatory when using RoCE. Verbs applications written over IB verbs should work seamlessly, but they require provisioning of GRH information when creating address vectors. The library and driver are modified to provide for mapping from GID to MAC addresses required by the hardware.

4.1.2 Software Dependencies

In order to use RoCE over Mellanox ConnectX(R) hardware, the mlx4_en driver must be loaded. Please refer to MLNX_EN_README.txt for further details.

4.1.3 Firmware Dependencies

In order to use RoCE over Mellanox ConnectX(R) hardware, RoCE requires ConnectX firmware version 2.7.000 or higher. Features such as loopback require higher firmware versions.

4.1.4 General Guidelines

Since RoCE encapsulates InfiniBand traffic in Ethernet frames, the corresponding net device must be up and running. In case of Mellanox hardware, mlx4_en must be loaded and the corresponding interface configured.
• Make sure that mlx4_en.ko is loaded. To verify the module is loaded, run the following command:

    lsmod | grep mlx4_en

  If the module is loaded, mlx4_en is listed in the command output.
• Run ibv_devinfo. There is a new field named link_layer which can be either Ethernet or IB. If the value is IB, then you need to use connectx_port_config to change the C
208. rt num> and to generate output suitable for echo, you may execute: Mose Sis Osis moi Se E OG i ao ii ASILI labio sie

To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run "ls /sys/class/infiniband".

• To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
• Executing srp_daemon over a port without the -a option will only display the reachable targets via the port to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
• It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. See Section 4.5.2.5 for more details.
• srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
• A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 4.5.2.4.

4.5.2.4 Automatic Discovery and Connection to Targets

• Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
• To connect to all the existing Targets in the fabric, run "srp_daemon -e -o". Th
209. rted by the kernel, and there are VLAN interfaces on the main Ethernet interface (the interface that the IB port is tied to), then each such VLAN will appear as a new GID in the port's GID table. The format of the GID entry will be identical to the one described above, except for the following change:

    gid[11] = VLAN ID high byte (4 MS bits)
    gid[12] = VLAN ID low byte

Please note that VLAN ID is 12 bits wide.

4.1.6.1 Priority Pause Frames

Tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived from the IB SL field by taking the 3 least significant bits of the SL field.

4.1.7 Using VLANs

In order for RoCE traffic to use VLAN tagged frames, the user needs to specify GID table entries that are derived from VLAN devices when creating address vectors. Consider the example below:
• Make sure VLAN support is enabled by the kernel. Usually this requires loading the 8021q module:

    > modprobe 8021q

• Add a VLAN device:

    > vconfig add eth2 7

• Assign an IP address to the VLAN interface. This should create a new entry in the GID table (as index 1):

    > ifconfig eth2.7 7.10.11.12

• Verbs test:
  On server:

    > ibv_rc_pingpong -g 1

  On client:

    > ibv_rc_pingpong -g 1 server

• For rdma_cm applications, the user needs only to specify an IP address of a VLAN device for the traffic to go with the VLAN tagged frames.
210. rules (policy) that define how the network is being configured and how its resources are split to different QoS Levels. The policy also defines how to decide which QoS Level each application, ULP, or service uses.
2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS Class, Service ID, and PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE, and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.
4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE, and Lifetime.
6. ULPs and programs (e.g. SDP) use CMA to establish RC connection provide the CMA the target IP and port number. ULPs might also provide Qo
211. s P lt PM lt Trash gt gt If any of the provided pm is greater then its provided value print it to screen skip lt skip option s gt Skip the executions of the selected checks Skip puros Monero TEL ie Ape Zero rguras OM Wegicalestate part pola ll wt lt file name gt Write out the discovered topology into the given file This flag is useful if you later want to check for changes from the current state of the fabric A directory named te ms ahs Oe rea teo DATES Op ELO anal holds the IBNL files required to load this topology To use these files you will need to set the environment variable named IBDM IBNL PATH to that directory The directory is located in tmp or in the output directory provided by the o flag iO dda iena Loader tane eg So ie ans Lp subnet discovery stage Note Some of the checks require actual subnet discovery STE HRH Ee LST SEE e e Eee These checks are Duplicated zero guids link state SMs STATUS h help Prints the help page information Y yers ion Prints the version or the tool EVAS Prints the tool s environment variables and their values Mellanox Technologies 167 Rev 1 5 3 1 0 0 InfiniBand Fabric Diagnostic Utilities 9 4 2 Output Files Table 14 ibdiagnet of ibutils Output Files ibdiagnet log A dump of all the application reports generate according to the provided flags ibdiagnet st List of all the nodes ports and links in the fabric ibdiagnet fdbs A dump of the unicast f
212. s. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

Note: If a fatal, non-recoverable error occurs, opensm exits.

8.2.4.1 Running OpenSM As Daemon

OpenSM can also run as a daemon. To run OpenSM in this mode, enter:

    host1# /etc/init.d/opensmd start

8.3 osmtest Description

osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it to a pre-saved one. See Section 8.3.2.

osmtest has the following test flows:
• Multicast Compliancy test
• Event Forwarding test
• Service Record registration test
• RMPP stress test
• Small SA Queries stress test

8.3.1 Syntax

    osmtest [OPTIONS]

where OPTIONS are:

-f, --flow: This option directs osmtest to run a specific flow:

    Flow  Description
    c     create an inventory file with all nodes, ports, and paths
    a     run all validation tests (expecting an input inventory)
    v     only validate the given inventory file
    s     run service registration, deregistration, and lease test
    e     run event forwarding test
    E
213. s to

The default port number for RDS is 0x48CA, which makes a default Service ID 0x00000000010648CA.

4.8.4.4 SRP

The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service ID in the PR/MPR by itself and uses that information in setting up the QP. The SRP Service ID is defined by the SRP target I/O Controller (it also complies with IBTA Service ID rules). The Service ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.

4.8.5 OpenSM Features

The QoS related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 8) can be split into two main parts:

I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.

II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first the request is matched against the defined match rules such that the target QoS Level definition is found. Given the QoS Level, a path search is performed with the given restrictions imposed by that level.

4.9 Atomic Operations

4.9.1 Enhanced Atomic Operations

ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic
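The RDS example above (port 0x48CA giving Service ID 0x00000000010648CA) suggests that the Service ID is a fixed prefix with the 16-bit port number in the low bits. A minimal Python sketch of that mapping follows; the prefix constant is inferred from this single example rather than quoted from a specification, and the names are hypothetical.

```python
# Assumption: prefix inferred from the RDS example (0x48CA -> 0x00000000010648CA).
SERVICE_ID_PREFIX = 0x0000000001060000


def service_id_for_port(port: int) -> int:
    """Combine the assumed prefix with a 16-bit port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return SERVICE_ID_PREFIX | port


def format_service_id(service_id: int) -> str:
    """Render a Service ID as a 64-bit hexadecimal string."""
    return f"0x{service_id:016X}"


rds_default = format_service_id(service_id_for_port(0x48CA))
```

A helper like this can be handy when writing QoS policy match rules that identify a ULP by Service ID rather than by port number.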
214. s to log into the FC fabric. Provided that the FC fabric and FC targets are well configured, LUNs will map to SCSI disk devices (/dev/sdXXX).

vHBAs instantiated automatically by the dcbxd daemon are created on a VLAN 0 interface with VLAN priority set to the value negotiated with the switch. This takes advantage of PFC, which allows pausing FCoE traffic when needed without pausing the entire Ethernet link. Also, with proper configuration of the FCoE switch, the link's maximum bandwidth can be divided as needed between FCoE and regular Ethernet traffic.

Instantiating vHBAs manually allows creating them on VLAN interfaces with any arbitrary VLAN id and priority, as well as on the regular (without VLAN) Ethernet interfaces. Using the regular interface means that PFC cannot be used. In this case, it is highly recommended that both the FCoE switch and the mlx4_en driver be configured to use link pause (regular flow control). Otherwise, any FCoE packet drop will trigger SCSI errors and timeouts.

4.2.2.1 FCoE Configuration

After installation, please edit the file /etc/mlxfc/mlxfc.conf and set the following variables:
• FC_SPEC: set to "T11" or "pre-T11" as supported by your FCoE switch.

  Note: Only pre-T11 format is offloaded in hardware.

• DCBX_IFS: provide a space-separated list of Ethernet devices to monitor the use of the DCBX protocol for the FCoE feature availability. vHBAs are automatically created on these interfaces if the F
215. selector RPM IRE E mpi selector Install user level RPMs fr ae ae a aa a a a ea a aaa a fe ae a A Ae a R f tr TTT fr ae Ae a A rr aaa rae Ae a AE T a a a a aaa a iaa anna AAA AAA ia aa a AAA ai iaa a aia ana rana aaa aaa aia ara aia aran iaa a aaa aaa aa RAT aia a AAA AAN Preparing RER iaa a aa aaa rana ana libibverbs aaa aaa rr ana rana aia libibumad aa iaa a aia ana rana aaa librdmacm ari iaa rai aran aria ERE opensm libs fr Ae a a ae a A AE a a a a a AREA REA aran aran iaa aaa libibmad fr ae a ae a AE T a aa aaa aa a TTT libmverbs fr ae A ae a A T tr TTT libmge ARE rf tr REA REA REA REA dapl AR FE REA REA REA REA RSE sa ai iaa a anna a AAA mvapich gcc ari aaa aa AAA libmthca ai aa tr AAA AAA libmlx4 IE see libnes libipathverbs IBA Em TRENTER sm openmpi gcc libsdp compat dapl mpitests openmpi gcc piece ne RM mpitests mvapich gcc NPLESSte pleno Comer ar mgr depliant ibsim infiniband diags opensm mel EIH utils perftest qperf ibacm srptools mvapich intel prove sais ibdump openmpa intel dump pr rds tools sdpnetstat me Lint libibumad devel libibverbs devel libibcm devel librdmacm devel libibmad devel opensm devel opensm static opensm static compat dapl devel libsdp devel infinipath psm devel libipathverbs devel libipathverbs devel libnes devel static libnes devel static libcxgb3 devel libcxgb3 devel libmlx4 devel libmlx4 devel libmthca devel stat
216. sgo ae a 0 I ANO PA terre mle e SA SAN 0 O A A ne 0 EevkRemot elo SS Eee 0 ROV WRG LE EEO a oa 0 e Sta a Er nr IEEE 3 ATCC ONS PES o EE 0 RENCOR Sd ua 0 I TRD beg rt LS a a NGA 0 A HEI AA 0 A o ee 0 PRT te E acre teeny aaa ANTAA AN 0 Revere een 0 METEO RON 0 E A e A ane te 0 9 14 ibcheckerrs Applicable Hardware All InfmiBand devices Description Validates an IB port or node and reports errors in counters above threshold Check specified port or node and report errors that surpassed their predefined threshold Port address is lid unless G option is used to specify a GUID address The predefined thresholds can be dumped using the s option and a user defined threshold file using the same format as the dump can be specified using the t lt file gt option Synopsis NS its Wife Miu eee MS So o sa ls A HE AE co ns sd ja Table 22 lists the various flags of the command Table 22 ibcheckerrs Flags and Options Default Optional If Not Description rae Mandatory Specified Mellanox Technologies 189 Rev 1 5 3 1 0 0 InfiniBand Fabric Diagnostic Utilities Table 22 ibcheckerrs Flags and Options y Default Flag SR If Not Description Mandatory Specified ad Optional lt threshold file gt G uid Optional GUID Example 0x08f1040023 additional verbosity vvv or v v v Print in brief mode Reduce the output to show only if errors are present not what
217. sing setsockopt ee NS MS e286 S WwW Melscone wen Roio Secu sai Mek Qe seneca lie S SCr wench Core ed tau els ys or MemrcOre opten lieto
• Increase Linux's auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are: Sects Netley Eee ici SS UA oie yace pte gici O

6.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance

The following changes are recommended for improving IPv6 traffic performance:
• Disable the TCP timestamps option for better CPU utilization:

    sysctl -w net.ipv4.tcp_timestamps=0

• Disable the TCP selective acks option for better CPU utilization:

    sysctl -w net.ipv4.tcp_sack=0

6.2.3 Interrupt Moderation

Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.

    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: on  TX: off
    pkt-rate-low
218. smpquery Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h(elp) | | | Print the help menu
-d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-e | Optional | | Show send and receive errors (timeouts and others)
-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...)
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
… | | | …
-P <ca_port> | Optional | | Use the specified port
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
<op> | Mandatory | | Supported operations: nodeinfo <addr>, nodedesc <addr>, portinfo <addr> [<portnum>], switchinfo <addr>, pkeys <addr> [<portnum>], sl2vl <addr> [<portnum>], vlarb <addr> [<portnum>], guids <addr>
<dest dr_path | lid | guid> | Optional | | Destination's directed path, LID, or GUID
Examples
1. Query PortInfo by LID, with port modifier:
    > smpquery portinfo 1 1
    … 0x0000000000000000
219. sockets
    #!/bin/bash
    # Number of cores = index of the last "processor" entry + 1
    CORES=$(( $(cat /proc/cpuinfo | grep processor | tail -1 | awk '{print $3}') + 1 ))
    # limit = 2^CORES, the first affinity mask value that is out of range
    # (the doubling step is illegible in the source and is inferred from the loop below)
    limit=1
    while [ $CORES -gt 0 ]; do
        limit=$(( limit * 2 ))
        CORES=$(( CORES - 1 ))
    done
    # Collect IRQ numbers; the sed expression is illegible in the source,
    # stripping the trailing colon of /proc/interrupts is assumed
    if [ -z "$1" ]; then
        IRQS=$(cat /proc/interrupts | grep eth-mlx | awk '{print $1}' | sed 's/://')
    else
        IRQS=$(cat /proc/interrupts | grep "$1" | awk '{print $1}' | sed 's/://')
    fi
    echo Discovered irqs: $IRQS
    # Assign each IRQ to the next core, wrapping around at the last core
    mask=1
    for IRQ in $IRQS; do
        echo $(printf "%x" $mask) > /proc/irq/$IRQ/smp_affinity
        mask=$(( mask * 2 ))
        if [ $mask -ge $limit ]; then
            mask=1
        fi
    done
    echo irqs were set OK.
6.2.5 Preserving Your Performance Settings After A Reboot
To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:
    <sysctl name1>=<value1>
    <sysctl name2>=<value2>
    <sysctl name3>=<value3>
    <sysctl name4>=<value4>
For example, Section 6.2.1, "Tuning the Network Adapter for Improved IPv4 Traffic Performance", listed the following setting to disable the TCP timestamps option:
    sysctl -w net.ipv4.tcp_timestamps=0
In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:
    net.ipv4.tcp_timestamps=0
6.3 Performance Troubleshooting
6.3.1 PCI Express Performance Troubleshooting
For the best performance on the PCI Express interface, the adapter card should be installed i
220. sport over Ethernet networks. It encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type.
RDS
Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. For more details, see Chapter 4.3, "Reliable Datagram Sockets".
SDP
Sockets Direct Protocol (SDP) is a byte-stream transport protocol that provides TCP stream semantics. SDP utilizes InfiniBand's advanced protocol offload capabilities. Because of this, SDP can have lower CPU and memory bandwidth utilization when compared to conventional implementations of TCP, while preserving the TCP APIs and semantics upon which most current network applications depend. For more details, see Chapter 4.4, "Sockets Direct Protocol".
SRP
SRP (SCSI RDMA Protocol) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I
221. ss.
    /dev/mst/mt25418_pci_uar0
    bus:dev.fn=02:00.0 bar=0xdc800000 size=0x800000
Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.
Step 3. Burn firmware.
1. Burning a firmware binary image using mstflint (that is already installed on your machine). Please refer to MSTFLINT_README.txt under docs/.
2. Burning a firmware image from a .mlx file using the mlxburn utility (that is already installed on your machine). The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2.
    host1# mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408/fw-25408-rel.mlx
Step 4. Reboot your machine after the firmware burning is completed.
2.5 Uninstalling Mellanox OFED
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.
3 Configuration Files
For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt.
4 Driver Features
4.1 RDMA over Converged Ethernet
4.1.1 RoCE Overview
RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over Ethernet networks. It encapsulates IB transport and GRH headers in Ethe
222. st of match rule criteria below.
8.6.4 Policy File Syntax Guidelines
• Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
• Comments are started with the pound sign (#) and terminated by EOL.
• Any keyword should be the first non-blank in the line, unless it's a comment.
• Keywords that denote section/subsection start have matching closing keywords.
• Having a QoS Level named "DEFAULT" is a must: it is applied to PR/MPR requests that didn't match any of the matching rules.
• Any section/subsection of the policy file is optional.
8.6.5 Examples of Advanced Policy File
As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here's an example of the shortest policy file:
    qos-levels
        qos-level
            name: DEFAULT
            sl: 0
        end-qos-level
    end-qos-levels
The port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, and there is no need to define them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.
The following example shows all the possible options and keywords in the policy fi
223. sting application is able to run using SDP; the difference is that the application's TCP socket gets replaced with an SDP socket. It is also possible to configure the driver to automatically translate TCP to SDP based on the source IP/port, the destination, or the application name. See Section 4.4.5.
The SDP protocol is composed of a kernel module that implements the SDP as a new address-family/protocol-family, and a library (see Section 4.4.2) that is used for replacing the TCP address family with SDP according to a policy.
This chapter includes the following sections:
• Section 4.4.2, "libsdp.so Library"
• Section 4.4.3, "Configuring SDP"
• Section 4.4.4, "Environment Variables"
• Section 4.4.5, "Converting Socket-based Applications"
• Section 4.4.6, "BZCopy - Zero Copy Send"
• Section 4.4.7, "Using RDMA for Small Buffers"
4.4.2 libsdp.so Library
libsdp.so is a dynamically linked library, which is used for transparent integration of applications with SDP. The library is preloaded, and therefore takes precedence over glibc for certain socket calls. Thus, it can transparently replace the TCP socket family with SDP socket calls.
The library also implements a user-level socket switch. Using a configuration file, the system administrator can set up the policy that selects the type of socket to be used. libsdp.so also has the option to allow s
224. such as IFRENAME(8), IP(8) or UDEV(7). For example, to change the interface eth2 name to eth.bx01.a10, run:
    ifrename -i eth2 -n eth.bx01.a10
To generate a unique vNic interface name, use the mlx4_vnic_info script with the '-u' flag. The script will generate a new name based on the scheme: …
For example, if vNic eth2 resides on an InfiniBand card on the PCI BUS ID 0a:00.0 PORT #1, and is connected to the GW PORT ID #3 without VLAN, its unique name will be:
    mlx4_vnic_info -u eth2
    eth2 eth10.1.3
You can add your own custom udev rule to use the output of the script and to rename the vNic interfaces automatically. To create a new udev rule file under /etc/udev/rules.d/61-vnic-net.rules, include the line:
    SUBSYSTEM=="net", PROGRAM=="/sbin/mlx4_vnic_info -u %k", NAME="%c{2+}"
The UDEV service is active by default; however, if it is not active, run:
    /sbin/udevd -d
When the vNic MAC address is consistent, you can statically name each interface using the following UDEV rule:
    SUBSYSTEM=="net", SYSFS{address}=="aa:bb:cc:dd:ee:ff", NAME="ethX"
For further information on the UDEV rules syntax, please refer to the udev man pages.
4.7 IP over InfiniBand
4.7.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transpor
225. …
    ipoib, pkey 0x8001 : …
    ipoib, pkey 0x8002 : …
    rds : …
    end-qos-ulps
• OpenSM options file
    qos_max_vls …
    qos_high_limit 0
    qos_vlarb_…
• Partition configuration file
    Default=0x1fff …
8.8 Adaptive Routing
8.8.1 Overview
Note: Adaptive Routing is at beta stage.
Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
• Free AR: No constraints on output port selection.
• Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches.
Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select an output port out of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. In particular, it suits a 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis.
226. t service. The IPoIB driver, ib_ipoib, exploits the following ConnectX/ConnectX-2 capabilities:
• Uses any CX IB ports (one or two)
• Inserts IP/UDP/TCP checksum on outgoing packets
• Calculates checksum on received packets
• Supports net device TSO through CX LSO capability to defragment large datagrams to MTU quantas
• Dual operation mode: datagram and connected
• Large MTU support through connected mode
IPoIB also supports the following software based enhancements:
• Large Receive Offload
• NAPI
• Ethtool support
This chapter describes the following:
• IPoIB mode setting (Section 4.7.2)
• IPoIB configuration (Section 4.7.3)
• How to create and remove subinterfaces (Section 4.7.4)
• How to verify IPoIB functionality (Section 4.7.5)
• The ib-bonding driver (Section 4.7.6)
4.7.2 IPoIB Mode Setting
IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Connected mode. This can be changed to Datagram mode by editing the file /etc/infiniband/openib.conf and setting 'SET_IPOIB_CM=no'.
After changing the mode, you need to restart the driver by running:
    /etc/init.d/openibd restart
To check the current mode used for outgoing connections, enter:
    cat /sys/class/net/ib<n>/mode
4.7.3 IPoIB Configuration
Unless you have run the installation script mlnxofedinstall with the flag '-n', IPoIB has not been configured by the installation. The configuration of I
227. tain extensive debug information, you can modify libsdp.conf to have the log directive produce maximum debug output (provide the min-level flag with the value 1).
The log statement enables the user to specify the debug and error messages that are to be sent and their destination. The syntax of log is as follows:
    log [destination (stderr | syslog | file <filename>)] [min-level 1-9]
where options are:
destination: send log messages to the specified destination:
    stderr: forward messages to the STDERR
    syslog: send messages to the syslog service
    file <filename>: write messages to the file /var/log/<filename> for root. For a regular user, write to /tmp/<filename>.<uid> if filename is not specified as a full path; otherwise, write to <path>/<filename>.<uid>
min-level: verbosity level of the log:
    9: print errors only
    8: print warnings
    7: print connect and listen summary (useful for tracking SDP usage)
    4: print positive match summary (useful for config file debug)
    3: print negative match summary (useful for config file debug)
    2: print function calls and return values
    1: print debug messages
Examples:
To print SDP usage per connect and listen to STDERR, include the following statement:
    log min-level 7 destination stderr
A non-root user can configure libsdp.so to record function calls and return values in the file
228. tances.
c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
d. Run: multipath -F
3. After Automatic Activation of High Availability
If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
4.6 Ethernet over IB (EoIB) vNic
The Ethernet over IB (EoIB) mlx4_vnic module is a network interface implementation over InfiniBand. EoIB encapsulates Layer 2 datagrams over an InfiniBand Datagram (UD) transport service. The InfiniBand UD datagrams encapsulate the entire Ethernet L2 datagram and its payload.
To perform this operation, the module performs an address translation from Ethernet layer 2 MAC addresses (48 bits long) to InfiniBand layer 2 addresses made of LID/GID and QPN. This translation is totally invisible to the OS and user, which differentiates EoIB from IPoIB: IPoIB exposes a 20-byte HW address to the OS.
The mlx4_vnic module is designed for Mellanox's ConnectX family of HCAs and intended to be used with Mellanox's BridgeX gateway family. Having a BridgeX gateway is a requirement for using EoIB. It performs the following operations:
• Enables the layer 2 address translation required by the mlx4_vnic module.
• Enables routing of packets from the Infini
229. test path routing, while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.
5. DOR Routing Algorithm
Based on the Min Hop algorithm, but avoids port equalization, except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh.
6. Torus-2QoS Routing Algorithm
Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Additionally, it can route around multiple failed fabric links or a single failed fabric switch without introducing deadlocks, and without changing path SL values granted before the failure.
OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches are going down, or one or more of these nodes are coming back after being
230. ting I/O. For example, VPI-enabled adapters can support:
• Connectivity to 10, 20 and 40Gb/s InfiniBand switches, Ethernet switches, emerging Data Center Ethernet switches, InfiniBand to Ethernet and Fibre Channel Gateways, and Ethernet to Fibre Channel gateways
• Fibre Channel over Ethernet and Fibre Channel over InfiniBand
• A single firmware image for dual-port ConnectX/ConnectX-2 adapters that supports independent access to different convergence networks (InfiniBand, Ethernet or Data Center Ethernet) per port
• A unified application programming interface with access to communication protocols including: Networking (TCP, IP, UDP, sockets), Storage (NFS, CIFS, iSCSI, SRP, Fibre Channel, Clustered Storage, and FCoE), Clustering (MPI, DAPL, RDS, sockets), and Management (SNMP, SMI-S)
• Communication protocol acceleration engines including: networking, storage, clustering, virtualization and RDMA with enhanced quality of service
• RDMA over Converged Ethernet (RoCE). The following ULPs can be used over RoCE: uDAPL, SDP, RDS, MPI
1.3 Mellanox OFED Package
1.3.1 ISO Image
Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images, one per supported Linux distribution and CPU architecture, that include source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that perfor
231. tion occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.
8.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines.
Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations. For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS.
Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated.
Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not deliv
232. tions -p and -d together, the first port in the direct route must be equal to the one specified in the "-p" option. Otherwise, an error is reported.
Note: When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error.
Moreover, the tool allows omitting the source node in LID-route addressing, in which case the local port on the machine running the tool is assumed to be the source.
9.5.1 SYNOPSIS
    ibdiagpath … <src-name, dst-name> … <sys-name> … <dev-index> … <port-num> … <<PM counter>=<Trash Limit>> …
OPTIONS
-n <[src-name,]dst-name>: Names of the source and destination ports (as defined in the topology file; source may be omitted -> local port is assumed to be the source)
-l <[src-lid,]dst-lid>: Source and destination LIDs (source may be omitted -> the local port is assumed to be the source)
… <topo-file>, <sys-name>, <dev-index>, <port-num>, -o <out-dir>, … -P <<PM counter>=<Trash Limit>>, -h (help), -V (version), --vars: …
Dire
233. trd_ib/lib/modules/ib
    …
    host1# cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib
Step 5. IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:
    host1# cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules
Step 6. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
    host1# cp /sbin/insmod /tmp/initrd_ib/sbin/
Step 7. If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
    host1# cp /sbin/ifconfig /tmp/initrd_ib/sbin
Step 8. If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step. To continue with this step, DHCP client v3.1.3 needs to be already installed
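Steps 5 to 7 each copy a single file into the initrd staging tree. As a rough sketch, such copies could go through a small wrapper that refuses to continue when the source file is missing, so a forgotten module is noticed before the initrd is rebuilt. The function name stage_file is hypothetical and not part of Mellanox OFED, and the staging directory in the usage comment is an assumption to be adjusted to your own setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mellanox OFED): copy one file into a
# directory of the initrd staging tree, creating the directory if needed.
# Returns non-zero and prints a message if the source file does not exist.
stage_file() {
    src=$1
    dest_dir=$2
    if [ ! -f "$src" ]; then
        echo "stage_file: missing $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}

# Example (paths are assumptions; adjust to your staging directory):
#   INITRD_DIR=/tmp/initrd_ib
#   stage_file /sbin/insmod "$INITRD_DIR/sbin" || exit 1
```

Checking the return value after each call keeps a missing module or tool from silently producing an initrd that cannot bring up the IB device.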
234. ual for details.
• The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.
2.4 Updating Firmware After Installation
In case you ran the mlnxofedinstall script with the '--without-fw-update' option and now you wish to (manually) update firmware on your adapter card(s), you need to perform the following steps:
Note: If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image" on page 200.
Note: The following steps are also appropriate in case you wish to burn newer firmware that you have downloaded from Mellanox Technologies' Web site (http://www.mellanox.com > Downloads > Firmware).
Step 1. Start mst.
    host1# mst start
Step 2. Identify your target InfiniBand device for firmware update.
Get the list of InfiniBand device names on your machine.
    host1# mst status
    MST modules:
        MST PCI module loaded
        MST PCI configuration module loaded
        MST Calibre (I2C) module is not loaded
    MST devices:
    /dev/mst/mt25418_pciconf0 - PCI configuration cycles access.
        bus:dev.fn=02:00.0 addr.reg=88 data.reg=92
        Chip revision is: A0
    /dev/mst/mt25418_pci_cr0 - PCI direct access.
        bus:dev.fn=02:00.0 bar=0xdef00000 size=0x100000
        Chip revision is: A0
    /dev/mst/mt25418_pci_msix0 - PCI direct access.
        bus:dev.fn=02:00.0 bar=0xdeefe000 size=0x2000
    PCI direct acce
235. ude:
• T11 and pre-T11 frame format
• Complete hardware offload of SCSI operations in pre-T11 format
• Hardware offload of FC-CRC calculations in pre-T11 format
• Zero copy FC stack in pre-T11 format
• VLANs and PFC (Priority flow control, that is PPP)
The FCoE feature is based on and interacts with the Open-FCoE project. The mlx4_fc module is designed to replace the original fcoe module and to allow using ConnectX hardware offloads. Mellanox OFED also includes the following open-fcoe.org modules:
• libfc: Used by the mlx4_fc module to handle FC logic, such as fabric login and logout, remote port login and logout, fc-ns transactions, etc.
• fcoe: Implements FCoE fully in software. Will load instead of mlx4_fc to support T11 frame format. Works on top of standard Ethernet NICs, including mlx4_en.
See http://www.open-fcoe.org for further information on the Open-FCoE project.
FCoE Basic Usage
After loading the driver, userspace operations should create/destroy vHBAs on required Ethernet interfaces. This can be done manually by issuing commands to the driver using simple sysfs operations. Alternatively, it can be handled automatically by the dcbxd daemon if the interface is connected to an FCoE switch supporting DCBX negotiation of the FCoE feature (e.g., Cisco Nexus).
Once a vHBA is instantiated on an Ethernet interface, it immediately attempt
236. ugin_options armgr --conf_file <ar-mgr-options-file-name>
2. Run Subnet Manager with the new options file:
    opensm -F <options-file-name>
See an example of an AR Manager options file with all the default values in "Example of Adaptive Routing Manager Options File" on page 157.
8.8.3.2 Disabling Adaptive Routing
There are two ways to disable Adaptive Routing Manager:
1. By disabling it explicitly in the Adaptive Routing configuration file.
2. By removing the 'armgr' option from the Subnet Manager options file.
Note: The Adaptive Routing mechanism is automatically disabled once the switch receives a setting of the usual linear routing table (LFT). Therefore, no action is required to clear the Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.
8.8.4 Querying Adaptive Routing Tables
When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid; thus the standard tools that query LFT (e.g. smpquery, dump_lfts.sh, and others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the 'smparquery' tool that is installed as a part of the Adaptive Routing Manager package. To see its usage details, run 'smparquery -h'.
8.8.5 Adaptive Routing Manager Options File
The default location of the AR Manager options file is /etc
237. uid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the 'fdisk -l' command.
srp_daemon
The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:
• Establish an SRP connection by itself (without the need to issue the echo command described in Section 4.5.2.2)
• Continue running in background, detecting new targets and establishing SRP connections with them (daemon mode)
• Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N> where <N> is a digit
• Enable High Availability operation (together with Device-Mapper Multipath)
• Have a configuration file that determines the targets to connect to
1. srp_daemon commands equivalent to ibsrpdm:
    "srp_daemon -a -o" is equivalent to "ibsrpdm"
    "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
Note: These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.
2. srp_daemon extensions to ibsrpdm
To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <po
238. updates to your network adapter hardware, it will ask you to reboot your machine.
Step 5. The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI:
    * soft memlock unlimited
    * hard memlock unlimited
These settings unlimit the amount of memory that can be pinned by a user space application. If desired, tune the value unlimited to a specific amount of RAM.
Step 6. For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 8, "OpenSM Subnet Manager".
Step 7. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as:
• HCA firmware version
• Kernel architecture
• Driver version
• Number of active HCA ports along with their states
• Node GUID
Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.
    host1# … hca_self_test.ofed
    ---- Performing InfiniBand HCA Self Test ----
    … PASS
    … x86_64
    Host Driver … 2.6.32.12-0.7-default
    Host Driver RPM Check … PASS
    HCA Firmware on HCA #0 … v2.9.1000
    HCA Firmware Ch
239. usters (> 100 nodes), it is recommended to use the OpenSM options described in Section 8.2.1, "opensm Syntax" on page 109.
A.5 TFTP Server
When you set the 'filename' parameter in your DHCP configuration file to a non-empty filename, the client will ask for this file to be passed through TFTP. For this reason, you need to install a TFTP server.
A.6 BIOS Configuration
The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add to the list of boot devices "MLNX FlexBoot <ver>" for a ConnectX device. The priority of this list can be modified through BIOS setup.
A.7 Operation
A.7.1 Prerequisites
• Make sure that your client is connected to the server(s).
• The FlexBoot image is already programmed on the adapter card (see Section A.2).
• For InfiniBand ports only: Start the Subnet Manager as described in Section A.4.
• The DHCP server should be configured and started (see Section 4.7.3.1, "IPoIB Configuration Based on DHCP" on page 78).
• Configure and start at least one of the services iSCSI Target (see Section A.10) and/or TFTP (see Section A.5).
A.7.2 Starting Boot
Boot the client machine and enter BIOS setup to configure MLNX FlexBoot to be the first on the boot device priority list (see Section A.6).
Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it s
240. witches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.
If MLNX FlexBoot/iPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by the Subnet Manager.
Note: In case sensing the port protocol fails, the port will be configured as an InfiniBand port.
For ConnectX:
    Mellanox ConnectX FlexBoot v3.3.400
    iPXE 1.0.0+ -- Open Source Network Boot Firmware
    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
    [Link:down, TX:0 TXE:0 RX:0 RXE:0]
    [Link status: The socket is not connected]
    Waiting for link-up on net0... ok
After configuring the IB/ETH port, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.
For ConnectX (InfiniBand):
    Mellanox ConnectX FlexBoot v3.3.400
    iPXE 1.0.0+ -- Open Source Network Boot Firmware
    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
    [Link:down, TX:0 TXE:0 RX:0 RXE:0]
    [Link status: The socket is not connected]
    Waiting for link-up on net0... ok
    DHCP (net0 …)... ok
    net0: 11.3.12.27/255.255.255.0
    Next server: 11.3.12.121
    Filename: pxelinux.0
    Root path: …
241. x Technologies
Mellanox Technologies, 350 Oakmead Parkway, Sunnyvale, CA 94085, U.S.A. www.mellanox.com Tel: (408) 970-3400 Fax: (408) 970-3403
Mellanox Technologies, Ltd., PO Box 586, Hermon Building, Yokneam 20692, Israel. Tel: +972-4-909-7200 Fax: +972-4-959-3245
Copyright 2011. Mellanox Technologies. All rights reserved.
Mellanox, BridgeX, ConnectX, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, PhyX, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd. FabricIT, MLNX-OS and SwitchX are trademarks of Mellanox Technologies, Ltd. All other marks and names mentioned herein may be trademarks of their respective companies.
Document Number: 2877
Table of Contents
Chapter 1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
1.2 Introduction to Mellanox VPI Adapters
1.3 Mellanox OFED Package
1.3.1 ISO Image
1.3.2 Software Components
1.3.3 …
1.3.4 Directory Structure
1.4 Architecture
1.4.1 …
1.4.2 …
242. ...example, a value of 0 will give a PKey with the value 0x8000.
Step 2. Create a child interface by running:
    host1$ echo <PKey> > /sys/class/net/<interface>/create_child
Example:
    host1$ echo 0 > /sys/class/net/ib0/create_child
This will create the interface ib0.8000.
Step 3. Verify the configuration of this interface by running:
    host1$ ifconfig <subinterface>.<subinterface PKey>
Using the example of Step 2:
    host1$ ifconfig ib0.8000
    ib0.8000  Link encap:UNSPEC  HWaddr 80-00-00-44-FE-80-00-00-00-00-00-00-00-00-00-00
              BROADCAST MULTICAST  MTU:2044  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:173
              RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, follow the manual configuration procedure described in Section 4.7.3.3.
Step 5. To be able to use this interface, the Subnet Manager must be configured so that the chosen PKey, which defines a broadcast address, is recognized (see Chapter 8, OpenSM Subnet Manager).
4.7.4.2 Removing a Subinterface
To remove a child interface (subinterface), run:
    host1$ echo <subinterface PKey> > /sys/class/net/<interface>/delete_child
Using the example of Step 2:
    host1$ echo ... > /sys/class/net/ib0/delete_c...
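The relationship between the value written in Step 2 and the resulting interface name can be sketched in Python. This is a minimal illustration, assuming the effective PKey is the chosen 16-bit value with its most significant bit set, which is consistent with the example above where 0 yields 0x8000 and the interface ib0.8000. The helper names are hypothetical and are not part of any Mellanox tool.

```python
PKEY_MSB = 0x8000  # most significant bit of a 16-bit PKey (assumed to be set by the driver)


def effective_pkey(value: int) -> int:
    """Return the 16-bit PKey that results from writing `value` to create_child."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("PKey must be a 16-bit unsigned value")
    return value | PKEY_MSB


def child_interface_name(parent: str, value: int) -> str:
    """Name of the child interface, e.g. 'ib0.8000' for parent 'ib0' and value 0."""
    return f"{parent}.{effective_pkey(value):04x}"


def create_child_path(parent: str) -> str:
    """sysfs file that Step 2 writes the PKey value to."""
    return f"/sys/class/net/{parent}/create_child"


if __name__ == "__main__":
    # Print the command from Step 2 and the interface it is expected to create.
    print(f"echo 0 > {create_child_path('ib0')}")
    print(child_interface_name("ib0", 0))
```

Computing the name up front is useful in scripts that create a child interface and then immediately pass its name to ifconfig, as in Step 3.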
243. ...y image to the EEPROM(s) attached to an InfiniScaleIII switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
Debug utilities: A set of debug utilities (e.g., itrace, mstdump, isw, and i2c).
For additional details, please refer to the MFT User's Manual, docs/.
1. OpenSM is disabled by default. See Chapter 8, OpenSM Subnet Manager, for details on enabling it.
1.5 Quality of Service
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 8, OpenSM Subnet Manager.
2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed. The chapter includes the following sections:
• Section 2.1, Hardware and Software Requirements
• Section 2.2, Downloading Mellanox OFED
• Section 2.3 ...