Mellanox OFED Linux User's Manual
Contents
1. [Mellanox Technologies, Rev 2.2-1.0.1, Installation] Preparing dapl; Preparing dapl-devel; Preparing dapl-devel-static; Preparing dapl-utils; Preparing perftest; Preparing mstflint; Preparing mft; Preparing srptools; Preparing rds-tools; Preparing rds-devel; Preparing ibutils2; Preparing ibutils; Preparing cc_mgr; Preparing dump_pr; Preparing ar_mgr; Preparing ibdump; Preparing infiniband-diags; Preparing infiniband-diags-compat; Preparing qperf; Preparing fca; Preparing mxm; Preparing openmpi; Preparing openmpi; Preparing bupc; Preparing infinipath-psm; Preparing infinipath-psm-devel; Preparing mvapich2; Preparing hcoll; Preparing libibprof; Preparing mlnxofed-docs; Preparing mpitests_mvapich2__2_0rc1; Preparing mpitests_openmpi__1_6_5; Preparing mpitests_openmpi__1_8. Device (06:00.0): 06:00.0 Network controller: Mellanox Technologies MT27500
2. [Screenshot: disk partitioning options, including Shrink Current System, Use Free Space and Create Custom Layout] Step 14. Click Next and proceed with the installation. Step 15. Select the Basic Server option. This is only one of the options that can be chosen, not a mandatory one. Step 16. Check the Customize Now checkbox. Step 17. Click Next. Step 18. Select Infiniband Support and iSCSI Storage Client. [Screenshot: package group selection with Infiniband Support and iSCSI Storage Client checked] Step 19. Click Next. Allow the installation to reach completion. SAN-Booting the Diskless Client with FlexBoot: When the installation process is completed the client w
3. Output File            Description
   ibdiagnet2.lst         Fabric links in LST format
   ibdiagnet2.sm          Subnet Manager
   ibdiagnet2.pm          Ports Counters
   ibdiagnet2.fdbs        Unicast FDBs
   ibdiagnet2.mcfdbs      Multicast FDBs
   ibdiagnet2.nodes_info  Information on nodes
   ibdiagnet2.db_csv      ibdiagnet internal database
   An ibdiagnet run performs the following stages: fabric discovery; duplicated GUIDs detection; detection of links in INIT state and unresponsive links; counters fetch; error counters check; routing checks; link width and speed checks; alias GUIDs check; Subnet Manager check; partition keys check; nodes information.
   Return Codes: 0 - Success; 1 - Failure (with description).
   9.4.2 ibdiagnet (of ibutils) - IB Net Diagnostic
   This version of ibdiagnet is included in the ibutils package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils2 package, you need to specify the full path: /opt/bin/ibdiagnet
   ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
   Synopsis: ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>]
4. 216 Mellanox Technologies m 2 2 1 0 1 IneloowiNGPEVACINMES o essere nos os 1 Oyoicloyoyrorollereuete Eni 9 55 oo Seen nas naonac 1 PiiterRawinbounde e E 1 Bose ROW OU C NE E SS DonamcscllorcOs ssposporpo esas dao 0 3 Query Nodelnfo by direct route gt smpquery D nodeinfo 0 Node info DR path slid 65535 dlid 65535 0 Bas Us a o esterases 1 A RE MU o as il Node TYPE ee ee e E E een ankare Channel Adapter NOMPO TE Sie E E E E ers d SYST EMG UA eer eR RU RE ROS 0x0002c9030000103b GUC Sawer rsa sean ETERNI EDI 0x0002c90300001038 PONG GUL e ole OR 0x0002c90300001039 PAGE Cap eo teo bea ed 128 DEV tna costes iia 0x634a Revi orons eeu aci n 0x000000a0 Local Portero eeen deer TTE 1 Encarna 0x0002c9 9 4 16 perfquery Queries InfiniBand ports performance and error counters Optionally it displays aggregated counters for all ports of a node It can also reset counters after reading them or simply reset them Synopsis peapeay Sa a I G seal ell eel EC em xen E ee pex AN et lt timeout_ms gt V lt lid guid gt port reset mask Output Files Table 35 lists the various flags of the command Table 35 perfquery Flags and Options Optional Default Flag po If Not Description Mandatory Specified h help Optional Print the help menu d ebug Optional Raise the IB debug level May be used several times for higher debug levels ddd or d d d G uid
5. Table 63: raw_ethernet_lat Flags and Options; Table 64: Additional raw_ethernet_lat Flags and Options; Table 65: raw_ethernet_lat Raw Ethernet Flags and Options.
   Document Revision History
   Table 1 - Document Revision History (columns: Release, Date, Description)
   Added the following sections:
   Section 2.3.7, "openibd Script", on page 39
   Section 4.7, "Ethernet VXLAN", on page 78
   Section 4.8.1, "Atomic Operations in mlx5 Driver", on page 79
   Section 4.14.7, "Running Network Diagnostic Tools on a Virtual Function", on page 107
   Section 4.14.7.2, "Granting SMP Capability to a Virtual Function", on page 108
   Section 4.14.7.3, "Installing MLNX_OFED with Network Diagnostics on a VM", on page 108
   Section 4.15, "Quantized Congestion Control", on page 108
   Section 4.15.1, "QCN Tool - mlnx_qcn", on page 109
   Section 4.15.2, "Setting QCN Configuration", on page 111
   Section 4.14.6.2.3, "Multi-GUID Support in InfiniBand", on page 102
   Section 4.14.6.3.4, "Mapping VFs to Ports using the mlnx_get_vfs.pl tool", on page 107
   Section 4.23, "pm_qos usage on ingress Packet Traffic", on page 121
   Section 4.24, "XOR RSS Hash Function", on page 121
   Section 9.4.4, "ibstat", on page 195
   Section 9.4.5, "ibtracert", on p
6. 4.22.2 Allocating Memory Window
   Allocating a memory window is done by calling the ibv_alloc_mw verb:
     type_mw = IBV_MW_TYPE_2 (or IBV_MW_TYPE_1)
     mw = ibv_alloc_mw(pd, type_mw);
   4.22.3 Binding Memory Windows
   After being allocated, a memory window should be bound to a registered memory region. The memory region should have been registered using the IBV_EXP_ACCESS_MW_BIND access flag.
   Binding a type 1 memory window is done via the ibv_exp_bind_mw verb:
     struct ibv_exp_mw_bind mw_bind = { .comp_mask = IBV_EXP_BIND_MW_RESERVED - 1 };
     ret = ibv_exp_bind_mw(qp, mw, &mw_bind);
   Binding a type 2B memory window is done via the ibv_exp_post_send verb and a specific Work Request (WR) with opcode IBV_EXP_WR_BIND_MW. Prior to binding, make sure to update the existing rkey:
     ibv_inc_rkey(mw->rkey);
   4.22.4 Invalidating Memory Window
   Before rebinding a type 2 memory window, it must be invalidated using the ibv_exp_post_send verb and a specific WR with opcode IBV_EXP_WR_LOCAL_INV.
   4.22.5 Deallocating Memory Window
   Deallocating a memory window is done using the ibv_dealloc_mw verb:
     ibv_dealloc_mw(mw);
   4.23 pm_qos usage on ingress Packet Traffic
   The pm_qos API is used by mlx4_en to enforce a minimum DMA latency requirement on the system when ingress traffic is detected. Additionally, it decreases packet loss on systems configured with abundant power state prof
7. 8.6.7 SL2VL Mapping and VL Arbitration
   8.6.8 Deployment Example
   8.7 QoS Configuration Examples
   8.7.1 Typical HPC Example: MPI and Lustre
   8.7.2 EDC SOA (2-tier): IPoIB and SRP
   8.7.3 EDC (3-tier): IPoIB, RDS, SRP
   8.8 Adaptive Routing
   8.8.1 Overview
   8.8.2 Installing the Adaptive Routing
   8.8.3 Running Subnet Manager with Adaptive Routing Manager
   8.8.4 Querying Adaptive Routing Tables
   8.8.5 Adaptive Routing Manager Options File
   8.9 Congestion Control
   8.9.1 Congestion Control Overview
   8.9.2 Running OpenSM with Congestion Control Manager
   8.9.3 Configuring Congestion Control Manager
   8.9.4 Configuring Congestion Control Manager Main Settings
   Chapter 9 InfiniBand Fabric Utilities
   9.1 Common Configuration, Interface and Addressing
   9.2 InfiniBand Interface Definition
   9.3
8. Traffic class: IPoIB; Service Level: 3; Policy: min 10% BW. [Figure labels: App A Server, App B Server]
   8.7 QoS Configuration Examples
   The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and its administration via OpenSM configuration files.
   8.7.1 Typical HPC Example: MPI and Lustre
   Assignment of QoS Levels:
     MPI: separate from I/O load; min BW of 70%
     Storage Control (Lustre MDS): low latency
     Storage Data (Lustre OST): min BW of 30%
   Administration:
     MPI is assigned an SL via the command line: host1# mpirun -sl 0
     OpenSM QoS policy file (in the following policy file example, replace OST* and MDS* with the real port GUIDs):
       qos-ulps
         default :0 # default SL (for MPI)
         any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
         any, target-port-guid MDS1,MDS2 :2 # SL for Lustre MDS
       end-qos-ulps
     OpenSM options file:
       qos_max_vls 8
       qos_high_limit 0
       qos_vlarb_high 2:1
       qos_vlarb_low 0:96,1:224
       qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
   8.7.2 EDC SOA (2-tier): IPoIB and SRP
   The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.
   QoS Levels: Application traffic: IPoIB (UD and CM) and SDP; Isolate
9. The line can be wrapped after ':' after a Partition Definition and between. A PartitionName does not need to be unique, but a PKey does need to be unique. If a PKey is repeated, the associated partition configurations will be merged and the first PartitionName will be used (see also next note). It is possible to split a partition configuration into more than one definition, but then the PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions).
   Examples:
     Default=0x7fff : ALL, SELF=full ;
     NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
     YetAnotherOne = 0x300 : SELF=full ;
     YetAnotherOne = 0x300 : ALL=limited ;
     ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
     # 0x123453, 0x123454 will be limited
     ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
     # 0x123456, 0x123457 will be limited
     ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
     ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
     ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
   The following rule is equivalent to how OpenSM used to run prior to the partition manager:
     Default=0x7fff,ipoib:ALL=full;
   8.5 Routing Algorithms
   OpenSM offers six routing engines: 1. Min Hop Algorithm: Based on th
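The PKey merging rule described in this excerpt (repeated PKeys are merged and the first PartitionName is kept) can be sketched in Python. This is only an illustration, not OpenSM's implementation: the function name merge_partitions and the (name, pkey, members) tuple layout are hypothetical, and letting a later definition override an earlier membership for the same port is an assumption the excerpt does not confirm.

```python
def merge_partitions(definitions):
    """Merge partition definitions that share a PKey.

    definitions: iterable of (name, pkey, members), where members maps a
    port GUID (or the keywords ALL / SELF) to "full" or "limited".
    This layout is hypothetical and chosen only for illustration.
    """
    merged = {}
    for name, pkey, members in definitions:
        # setdefault keeps the entry (and therefore the name) created by
        # the first definition seen for this PKey.
        entry = merged.setdefault(pkey, {"name": name, "members": {}})
        # Assumption: a later definition overrides earlier membership
        # for the same port; the excerpt does not specify this.
        entry["members"].update(members)
    return merged


defs = [
    ("YetAnotherOne", 0x300, {"SELF": "full"}),
    ("YetAnotherOne", 0x300, {"ALL": "limited"}),
    ("Default", 0x7FFF, {"ALL": "full"}),
]
result = merge_partitions(defs)
for pkey, entry in result.items():
    print(hex(pkey), entry["name"], entry["members"])
```

The two YetAnotherOne definitions collapse into a single entry for PKey 0x300 that carries both membership rules, which mirrors the split-definition case in the examples.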
10. On kernels 3.13 and higher, or kernels that ported the functionality of enabled access to the VLAN device egress map (through the vlan_dev_get_egress_qos_mask() call in if_vlan.h).
   4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
   Note: With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.
   4.5.5 Raw Ethernet QP Quality of Service Mapping
   Applications open a Raw Ethernet QP using VERBs directly. The following is the RoCE QoS mapping flow:
   1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP: it sets qp_attrs.ah_attrs.sl = up and calls modify_qp with IB_QP_AV set in the mask.
   2. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
   Note: When using Raw Ethernet QP mapping, the TOS/sk_prio to UP mapping is lost.
   Note: Performing the Raw Ethernet QP mapping forces the QP to transmit using the given UP. If packets with a VLAN tag are transmitted, the UP in the VLAN tag will be overwritten with the given UP.
   4.5.6 Map Priorities with tc_wrap.py/mlnx_qos
   A network flow that can be managed by QoS attributes is described by a User Priority (UP). A user's sk_prio is mapped to a UP, which in turn is mapped into a TC.
   Indicating the UP: When the user uses sk_prio, it is mapped into a UP by the 'tc' tool. This is done by
11. Synopsis: ibv_devices
   Examples:
   1. List the names of all available InfiniBand devices.
     > ibv_devices
     device    node GUID
     mthca0    0002c9000101d150
     mlx4_0    0000000000073895
   9.4.10 ibv_devinfo
   Queries InfiniBand devices and prints the information about them that is available for use from user space.
   Synopsis: ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]
   Output Files: Table 30 lists the various flags of the command.
   Table 30 - ibv_devinfo Flags and Options
     Flag                             Optional/Mandatory  Default (If Not Specified)  Description
     -d <device>, --ib-dev=<device>   Optional            First found device          Run the command for the provided IB device 'device'
     -i <port>, --ib-port=<port>      Optional            All device ports            Query the specified device port <port>
     -l, --list                       Optional            Inactive                    Only list the names of InfiniBand devices
     -v, --verbose                    Optional            Inactive                    Print all available information about the InfiniBand device(s)
   Examples:
   1. List the names of all available InfiniBand devices.
     > ibv_devinfo -l
     2 HCAs found: mthca0 mlx4_0
   2. Query the device mlx4_0 and print user-available information for its Port 2.
     > ibv_devinfo -d mlx4_0 -i 2
     hca_id: mlx4_0 fw_ver: 2.5.944 node_guid:
12. IPAPPEND 2 amp To install SLES11 SP3 over iSCSI over IPoIB the following parameters need to be added to the append statement above an insmod ib ipoib insmod libiscsi insmod rdma cm insmod ib iser For post installation boot booting the SLES 11 SP3 off the iSCSI storage using PXE services please provide the booting client a path to the initrd and linux kernel as provided inside SLES11SP3 kISO VPI pxeboot in the tgz above The below is an example of such label LABEL SLES11 3x64 iscsi boot MENU LABEL 2 SLES11 3 iSCSI boot kernel SLES11SP3 kISO VPI pxeboot linux append initrd SLES11SP3 kISO VPI pxeboot initrd net root iscsi 12 7 6 30 iqn 2013 10 galab com sqa030 prt9 TargetAd dress 12 7 6 30 TargetName iqn 2013 10 qalab com sqa030 prt9 TargetPort 3260 net delay 10 rootfstype ext3 rootdev dev sda2 The steps described in this document do not refer to an unattended installation with autoyast For official information on SLES unattented installation with autoyast please refer to https www suse com documentation sles11 book autoyast page documentation sles11 book autoyast data book autoyast html The following is known to work with Mellanox NIC append initrd SLES 11 SP3 DVD x86 64 GM DVD1 boot x86 64 loader initrd install nfs lt NFS IP Address gt lt path the the repository directory gt autoyast nfs lt NFS IP Address gt path to autoyast xml directory autoyast unattended xml biosdevname 0 IPAPPEND
13. PCI device 0x15b3 0x1003 mlx4 core SUBSYSTEM net ACTION add DRIVERS ATTR address 00 02 c9 a c3 50 ATTR dev_id 0x0 ATTR type 1 KERNEL eth NAME eth1 SUBSYSTEM net ACTION add DRIVERS ATTR address 00 02 c9 a c3 51 ATTR dev_id 0x0 ATTR type 1 KERNEL eth NAME eth2 SUBSYSTEM net ACTION add DRIVERS ATTR address 00 02 c9 e9 56 a1 ATTR dev_id 0x0 ATTR type 1 KERNEL eth NAME eth3 SUBSYSTEM net ACTION add DRIVERS ATTR address 00 02 c9 e9 56 a2 ATTR dev_id 0x0 ATTR type 1 KERNEL eth NAME eth4 Example for IPoIB interfaces SENSE o RXOIPTONESS Edu DES Ve ATR clei obe OQ AUR os 32 NAME ib0 SJUIE IS Sa inte RXCAPTIONES Ed IDEM Ve AE 0 O As 32 NAME ib1 44 Mellanox Technologies m 2 2 1 0 1 4 Driver Features 4 1 SCSI RDMA Protocol 4 1 1 Overview As described in Section 1 3 4 the SCSI RDMA Protocol SRP is designed to take full advantage of the protocol off load and RDMA features provided by the InfiniBand architecture SRP allows a large body of SCSI software to be readily used on InfiniBand architecture The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric The kSRP Target resides in an IO unit and provides storage services Section 4 1 2 describes the SRP Initiator included
14. Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.g. ifcfg-ib0). The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup). The bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence the only supported value is the default: 0 (or "none" in SLES11).
   For a persistent bonding IPoIB network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:
   In the bonding master configuration file (e.g. ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520
   Note: 65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 4.3.2, "IPoIB Mode Setting", on page 55) and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do not set the MTU at all, performance of the interface might decrease.
   In the bonding slave configuration file (e.g. ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
   In the bonding slave configuration file (e.g. ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key).
   For RHEL users: In /etc/modprobe.d/bond.conf add the following lines: alias bond0 bonding
   For SLES users: It is necessary to update the MANDATORY_DEVICES environment variable
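The MTU guidance for the bonding master in this excerpt can be written as a small Python helper. This is a sketch under stated assumptions: the function name bond_master_mtu and the string mode labels are hypothetical, and falling back to 2044 when the slaves are in mixed modes is an assumption derived from the statement that 65520 is valid only if all slaves operate in Connected mode.

```python
CONNECTED_MTU = 65520  # valid only when every IPoIB slave is in Connected mode
DATAGRAM_MTU = 2044    # value given for slaves working in datagram mode


def bond_master_mtu(slave_modes):
    """Pick the MTU for the bonding master from the IPoIB mode of each slave.

    slave_modes: list of "connected" / "datagram" strings (hypothetical labels).
    """
    if not slave_modes:
        raise ValueError("at least one IPoIB slave is required")
    modes = {mode.lower() for mode in slave_modes}
    unknown = modes - {"connected", "datagram"}
    if unknown:
        raise ValueError("unknown IPoIB mode(s): %s" % sorted(unknown))
    if modes == {"connected"}:
        return CONNECTED_MTU
    # Assumption: any datagram slave forces the smaller MTU.
    return DATAGRAM_MTU


print("MTU=%d" % bond_master_mtu(["connected", "connected"]))
print("MTU=%d" % bond_master_mtu(["connected", "datagram"]))
```

The printed MTU=... lines have the same form as the parameter that goes into the bonding master configuration file.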
15. -v, --verbose
     This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.
   -V
     This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.
   -vf <flags>
     This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:
       BIT   LOG LEVEL ENABLED
       0x01  ERROR (error messages)
       0x02  INFO (basic messages, low volume)
       0x04  VERBOSE (interesting stuff, moderate volume)
       0x08  DEBUG (diagnostic, high volume)
       0x10  FUNCS (function entry/exit, very high volume)
       0x20  FRAMES (dumps all SMP and GMP frames)
       0x40  ROUTING (dump FDB routing information)
       0x80  currently unused
     Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
   -h, --help
     Display this usage info, then exit.
   8.3.2 Running osmtest
   To run osmtest in the default mode, simply enter: host1# osmtest
   The default mode runs all the flows except for the Quality of Service flow (see Section 8.6). After inst
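The verbosity flags field listed in this excerpt is a plain bitmask, so it can be composed and decoded with a few lines of Python. The sketch below uses only the bit values from the table; the helper names decode_verbosity and encode_verbosity are hypothetical and are not part of osmtest.

```python
# Bit values taken from the log level table; 0x80 is currently unused.
LOG_LEVELS = [
    (0x01, "ERROR"),
    (0x02, "INFO"),
    (0x04, "VERBOSE"),
    (0x08, "DEBUG"),
    (0x10, "FUNCS"),
    (0x20, "FRAMES"),
    (0x40, "ROUTING"),
]


def decode_verbosity(flags):
    """Return the names of the log levels enabled by a flags value."""
    return [name for bit, name in LOG_LEVELS if flags & bit]


def encode_verbosity(*names):
    """Build a flags value from log level names."""
    lookup = {name: bit for bit, name in LOG_LEVELS}
    flags = 0
    for name in names:
        flags |= lookup[name.upper()]
    return flags


print(decode_verbosity(0x03))
print(hex(encode_verbosity("ERROR", "DEBUG")))
```

Decoding 0x3 yields the ERROR and INFO levels, which matches the stated default when no flags value is given.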
16. Communicate with rdma_cm module to exchange data; use regular QPs.
   The table below lists the additional flags of the command.
   Table 60 - Additional raw_ethernet_lat Flags and Options
     Flag                        Description
     --inline_recv=<size>        Max size of message to be sent in inline receive
     --output=<units>            Set verbosity output level: bandwidth, message_rate, latency_typical
     --pkey_index=<pkey index>   PKey index to use for QP
   Raw Ethernet Options
   The table below lists the Raw Ethernet flags of the command.
   Table 61 - raw_ethernet_lat Raw Ethernet Flags and Options
     Flag               Description
     -B, --source_mac   Source MAC address in the format XX:XX:XX:XX:XX:XX (default: take the MAC address from the GID)
     -E, --dest_mac     Destination MAC address in the format XX:XX:XX:XX:XX:XX (MUST be entered)
     -J, --dest_ip      Destination IP address in the format X.X.X.X (used to send packets with an IP header)
     -j, --source_ip    Source IP address in the format X.X.X.X (used to send packets with an IP header)
     -K, --dest_port    Destination port number (used to send packets with a UDP header by default, or use the --tcp flag to send a TCP header)
     -k, --source_port  Source port number (used to send packets with a UDP header by default, or use the --tcp flag to s
17. Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.
   Operation
   When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA). When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).
   Manual Activation of High Availability
   Initialization (execute after each boot of the driver):
   1. Execute modprobe dm-multipath
   2. Execute modprobe ib-srp
   3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
   4. Execute for each port and each HCA: srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>
   This step can be performed by execut
18. Provides a report of the fabric qualities.
   -u|--fat_tree
     Indicates that UpDown credit loop checking should be done against automatically determined roots.
   -o|--output_path <directory>
     Specifies the directory where the output files will be placed (default: /var/tmp/ibdiagnet2).
   --skip <stage>
     Skip the execution of the given stage. Applicable skip stages: all | dup_guids | dup_node_desc | lids | links | sm | pm | nodes_info | speed_width_check | pkey | aguid
   --skip_plugin <library name>
     Skip the load of the given library name. Applicable skip plugins: libibdiagnet_cable_diag_plugin | libibdiagnet_cable_diag_plugin-2.1.1
   --pc
     Reset all the fabric PM counters.
   -P|--counter <<PM>=<value>>
     If any of the provided PM counters is greater than its provided value, print it.
   --pm_pause_time <seconds>
     Specifies the seconds to wait between the first counters sample and the second counters sample. If the seconds given is 0, no second counters sample will be done (default: 1).
   --ber_test
     Provides a BER test for each port: calculates the BER for each port and checks that no BER value has exceeded the BER threshold (default threshold: 10^-12).
   --ber_use_data
     Indicates that the BER test will use the received data for calculation.
   --ber_thresh <value>
     Specifies the threshold value for the BER test. The reciprocal number of the BER should be provided. Example: for 10
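Because the BER threshold option described in this excerpt takes the reciprocal of the BER rather than the BER itself, a short Python sketch may help avoid off-by-sign mistakes in the exponent. The helper names are hypothetical, and requiring an integer reciprocal is a choice made for this sketch, not a documented ibdiagnet requirement. Fractions are used instead of floats so that values such as 1e-12 stay exact.

```python
from fractions import Fraction


def ber_thresh_value(target_ber):
    """Convert a target BER (e.g. "1e-12") into its reciprocal."""
    ber = Fraction(target_ber)  # Fraction parses strings such as "1e-12" exactly
    if ber <= 0:
        raise ValueError("BER must be positive")
    reciprocal = 1 / ber
    if reciprocal.denominator != 1:
        # Sketch-only restriction: accept BERs with an integer reciprocal.
        raise ValueError("reciprocal of the BER is not an integer")
    return reciprocal.numerator


def exceeds_threshold(measured_ber, thresh_value):
    """True if a measured BER is worse (larger) than the threshold 1/thresh_value."""
    return Fraction(measured_ber) > Fraction(1, thresh_value)


default_thresh = ber_thresh_value("1e-12")
print(default_thresh)
print(exceeds_threshold("1e-10", default_thresh))
```

For the default threshold of 10^-12 the value to pass is therefore 10^12, and a port measured at 10^-10 would be reported as exceeding it.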
19. Some HCAs allocate multiple (n) MSI-X vectors per HCA port. If the IRQ affinity masks of these interrupts have been configured such that each MSI-X interrupt is handled by a different CPU, then the comp_vector parameter can be used to spread the SRP completion workload over multiple CPUs.
   tl_retry_count: A number in the range 2..7 specifying the IB RC retry count.
   4.1.2.3 SRP Tools - ibsrpdm, srp_daemon and srpd Service Script
   To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
   - Detect targets on the fabric reachable by the Initiator (for Step 1)
   - Output target attributes in a format suitable for use in the above "echo" command (Step 2)
   - A service script, srpd, which may be started at stack startup
   The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages. Below, several usage scenarios for these utilities are presented.
   ibsrpdm
   ibsrpdm is used for the following tasks:
   1. Detecting reachable targets
   a. To detect all targets reachable by the SRP initiator via the default umad device (/sys/class/infiniband_mad/umad0), execute the following command: ibsrpdm
   This command will output information on each SRP Target detected in h
20. -d0 - Ignore other SM nodes
   -d1 - Force single threaded dispatching
   -d2 - Force log flushing after each log message
   -d3 - Disable multicast support
   -d10 - Put OpenSM in testability mode
   Without -d, no debug options are enabled.
   -h, --help, -?
     Display this usage info, then exit.
   8.2.2 Environment Variables
   The following environment variables control opensm behavior:
   OSM_TMP_DIR
     Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
   OSM_CACHE_DIR
     opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it: guid2lid, which stores the LID range assigned to each GUID.
   8.2.3 Signaling
   When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found. Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.
   8.2.4 Running opensm
   The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes. To run opensm in the default mode, simply enter: host1# opensm
   Note that opensm needs to be run o
21. export GASNET FCA ENABLE BARRIER 1 export GASNET FCA ENABLE BCAST 1 export GASNET FCA ENABLE REDUCE 1 Mellanox Technologies 131 oo Rev 2 2 1 0 1 HPC Features 5 5 2 2 Controlling FCA Offload in ScalableUPC using Environment Variables gt To enable FCA module under ScalableUPC export GASNET FCA ENABLE CMD LINE 1 gt To set FCA verbose level export GASNET FCA VERBOSE CMD LINE 10 gt To set the minimal number of processes threshold to activate FCA export GASNET FCA NP CMD LINE 1 ScalableUPC contains modules configuration file http modules sf net which can be found at opt mellanox bupc 2 2 etc bupc modulefile rl 5 5 3 Various Executable Examples The following are various executable examples Torun a ScalableUPC application without FCA support 9 upcrun np 128 fca enable 0 executable filename gt To run UPC applications with FCA enabled for any number of processes export GASNET FCA ENABLE CMD LINE 1 GASNET FCA NP CMD LINE 0 upcrun np 64 executable filename oo Torun UPC application on 128 processes verbose mode upcrun np 128 fca enable 1 fca np 10 fca verbose 5 executable filename Torun UPC application offload to FCA Barrier and Broadcast only upcrun np 128 fca ops barrier bt executable filename 132 Mellanox Technologies Jrev22 1 04 6 Working With VPI VPI allows ConnectX ports to be inde
22. gt proc scsi tgt groups Default devices add vdisk2 2 proc scsi tgt groups Default devices add vdisk3 3 proc scsi tgt groups Default devices modprobe ib srpt 250 Mellanox Technologies rev 2 2 1 0 1 echo add mgmt proc scsi tgt trace level echo add mgmt dbg gt proc scsi tgt trace level echo add out of mem gt proc scsi tgt trace level oce ke ke ek ke ke ke ke ke ke e ke ke ke e e KK x End srpt sh kkkkxkkkkkkkkkkkkkkkkkkkkkxkxkxk A 3 How to Unload Shutdown 1 Unload ib srpt modprobe r ib srpt 2 Unload scst and its dev handlers first modprobe r scst vdisk scst 3 Unload MLNX OFED kernel modules etc rc d openibd stop Mellanox Technologies 251 Rev 2 2 1 0 1 Appendix B mlx4 Module Parameters In order to set m1x4 parameters add the following line s to etc modprobe conf options mlx4 core paramete and or options mlx4 ib paramete and or options mlx4 en paramete r lt value gt r lt value gt r lt value gt The following sections list the available m1x4 parameters B 1 mlx4 ib Parameters sm guid assign dev assign str Enable SM alias GUID assignment if sm guid assign gt 0 Default 1 int Map device function numbers to IB device numbers exse 0000 000 00 00 2168 1L 6t8 UG cil doa o Hexadecimal digits for the device function e g 002b 1c 0b a and decimal for IB device numbers e g 1 Max
23. -u, --qp-timeout=<timeout>   QP timeout; the timeout value is 4 usec * 2^(timeout). Default: 14
   Table 55 - raw_ethernet_bw Flags and Options
     Flag                       Description
     -V, --version              Display version number
     -w, --limit_bw             Set verifier limit for bandwidth
     -x, --gid-index=<index>    Test uses GID with GID index (default: IB - no gid, ETH - 0)
     -y, --limit_msgrate        Set verifier limit for Msg Rate
     -z, --com_rdma_cm          Communicate with rdma_cm module to exchange data; use regular QPs
   Additional Options
   The table below lists the additional flags of the command.
   Table 56 - Additional raw_ethernet_bw Flags and Options
     Flag                        Description
     --inline_recv=<size>        Max size of message to be sent in inline receive
     --output=<units>            Set verbosity output level: bandwidth, message_rate, latency_typical
     --pkey_index=<pkey index>   PKey index to use for QP
     --report-both               Report RX & TX results separately on bidirectional BW tests
     --report_gbits              Report Max/Average BW of test in Gbit/sec (instead of MB/sec)
     --run_infinitely            Run test forever; print results every <duration> seconds
   Rate Limiter
   The table below lists the Rate Limiter flags of the command.
   Table 57 - raw_ethernet_bw Rate Limite
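The QP timeout option in this excerpt takes an exponent, not a duration, so the effective timeout grows as a power of two. The Python sketch below evaluates the formula from the table, 4 usec * 2^timeout; the function name qp_timeout_usec is hypothetical. The table rounds the base to 4 usec, while the InfiniBand specification defines it as 4.096 usec, so the base is left as a parameter.

```python
def qp_timeout_usec(timeout=14, base_usec=4):
    """Effective QP timeout in microseconds: base_usec * 2**timeout.

    timeout:   the exponent passed to the qp timeout option (default 14).
    base_usec: 4 as written in the flags table; pass 4.096 for the
               value defined by the InfiniBand specification.
    """
    if timeout < 0:
        raise ValueError("timeout exponent must be non-negative")
    return base_usec * (2 ** timeout)


default_usec = qp_timeout_usec()
print("default: %d usec (about %.1f ms)" % (default_usec, default_usec / 1000.0))
print("timeout=18: %d usec" % qp_timeout_usec(18))
```

With the default exponent of 14 and the rounded base, the timeout is 65536 usec, roughly 65.5 ms; each increment of the exponent doubles it.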
24. 1.2.4 Directory Structure
   1.3 Architecture
   1.3.1 mlx4 VPI Driver
   1.3.2 mlx5 Driver
   1.3.3 Mid-layer Core
   1.3.4 ULPs
   1.3.5 MPI
   1.3.6 InfiniBand Subnet Manager
   1.3.7 Diagnostic Utilities
   1.3.8 Mellanox Firmware Tools
   1.4 Quality of Service
   1.5 RDMA over Converged Ethernet (RoCE)
   Chapter 2 Installation
   2.1 Hardware and Software Requirements
   2.2 Downloading Mellanox OFED
   2.3 Installing Mellanox OFED
   2.3.1 Pre-installation Notes
   2.3.2 Installation Script
   2.3.3 Installation Procedure
   2.3.4 Installation Results
   2.3.5 Post-installation Notes
   2.3.6 Installation Logging
   2.3.7 openibd Script
   2.4 Updating Firmware After Installation
   2.5 Installing MLNX_OFED us
25. 8.2 opensm Description
   opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet. opensm also provides an experimental version of a performance manager.
   opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
   opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.
   By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to setup the subnet correc
26. For example:
        # ethtool -k eth0 | grep udp_tnl
        tx-udp_tnl-segmentation: on
    Make sure that the VXLAN tunnel is set over UDP port 4789, which is the ConnectX firmware default. If using a standard Linux bridge and not open vswitch, set the following in /etc/modprobe.d/vxlan.conf:
        options vxlan udp_port=4789
    4.7.3 Important Notes
    VXLAN tunneling adds 50 bytes (14-eth + 20-ip + 8-udp + 8-vxlan) to the VM Ethernet frame. Please verify that either the MTU of the NIC that sends the packets (e.g. the VM virtio-net NIC or the host-side veth device) or the MTU of the uplink takes the tunneling overhead into account. That is, either the MTU of the sending NIC has to be decremented by 50 bytes (e.g. 1450 instead of 1500), or the uplink NIC MTU has to be incremented by 50 bytes (e.g. 1550 instead of 1500).
    From upstream 3.15-rc1 and onward, it is possible to use an arbitrary UDP port for VXLAN. Note that this requires firmware version 2.31.2800 or higher. Additionally, you need to enable this kernel configuration option: CONFIG_MLX4_EN_VXLAN=y
    On upstream kernels 3.12/3.13, GRO with VXLAN is not supported.
    4.8 Atomic Operations
    4.8.1 Atomic Operations in mlx5 Driver
    Atomic Operations in Connect-IB (mlx5 driver) are fully supported on big-endian machines (e.g. PPC). Their support is limited on little-endian machines (e.g. x86). When using ibv_exp_query_device on little-endian machines with Connect-IB, the attr.exp_atomic_cap is set to IBV_EXP_ATOMIC_HCA_REPLY_BE
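The MTU arithmetic in the VXLAN notes of item 26 can be checked with a short Python sketch. The header sizes are the ones quoted in the text; the function names are illustrative only and are not part of any Mellanox tool.

```python
# Per-header overhead that VXLAN adds to a VM Ethernet frame (values from the text).
VXLAN_HEADERS = {"eth": 14, "ip": 20, "udp": 8, "vxlan": 8}
VXLAN_OVERHEAD = sum(VXLAN_HEADERS.values())  # 50 bytes in total


def sender_mtu_for_uplink(uplink_mtu: int) -> int:
    """Largest MTU the sending NIC may use so that tunneled frames fit the uplink."""
    return uplink_mtu - VXLAN_OVERHEAD


def uplink_mtu_for_sender(sender_mtu: int) -> int:
    """Smallest uplink MTU that carries a full-size tunneled frame from the sender."""
    return sender_mtu + VXLAN_OVERHEAD


if __name__ == "__main__":
    print("sender MTU for a 1500-byte uplink:", sender_mtu_for_uplink(1500))
    print("uplink MTU for a 1500-byte sender:", uplink_mtu_for_sender(1500))
```

Only one of the two adjustments is needed: either shrink the sender MTU or grow the uplink MTU.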
27. Inspect the current QCN statistics and counters for a certain port, sorted by priority. Set values of chosen QCN parameters.
    Usage: mlnx_qcn -i <interface> [options]
    Options:
      --version                        Show program's version number and exit
      -h, --help                       Show this help message and exit
      -i INTF, --interface=INTF        Interface name
      -g TYPE, --get_type=TYPE         Type of information to get: statistics/parameters
      --rpg_enable=RPG_ENABLE_LIST     Set value of rpg_enable according to priority; use spaces between values and -1 for unknown values
      --rppp_max_rps=RPPP_MAX_RPS_LIST Set value of rppp_max_rps according to priority; use spaces between values and -1 for unknown values
      --rpg_time_reset=RPG_TIME_RESET_LIST
                                       Set value of rpg_time_reset according to priority; use spaces between values and -1 for unknown values
      --rpg_byte_reset=RPG_BYTE_RESET_LIST
                                       Set value of rpg_byte_reset according to priority; use spaces between values and -1 for unknown values
      --rpg_threshold=RPG_THRESHOLD_LIST
                                       Set value of rpg_threshold according to priority; use spaces between values and -1 for unknown values
      --rpg_max_rate=RPG_MAX_RATE_LIST Set value of rpg_max_rate according to priority; use spaces between values and -1 for unknown values
      --rpg_ai_rate=RPG_AI_RATE_LIST   Set value of rpg_ai_rate according to priority; use spaces between values and -1 for unknown values
      --rpg_hai_rate=RPG_HAI
28. Output Files
    Table 32 lists the various flags of the command.
    Table 32 - ibportstate Flags and Options
      Flag               Optional/Mandatory   Description
      -h(help)           Optional             Print the help menu
      -d(ebug)           Optional             Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
      -e(rr_show)        Optional             Show send and receive errors (timeouts and others)
      -v(erbose)         Optional             Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
      -V(ersion)         Optional             Show version info
      -D(irect)          Optional             Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...)
      -G(uid)            Optional             Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
      -s <smlid>         Optional             Use <smlid> as the target LID for SM/SA queries
      -C <ca_name>       Optional             Use the specified channel adapter or router
      -P <ca_port>       Optional             Use the specified port
      -t <timeout_ms>    Optional             Override the default timeout for the solicited MADs [msec]
      <dest dr_path | lid | guid>  Optional   Destination's directed p
29. Step 22. Click Install.
    Make sure the open-iscsi RPM is selected for the installation under "Software". After the installation is completed, the system will reboot. Make sure you choose the SLES11 3x64 iscsi boot label from the boot menu (see Section E.3, "Configuring the PXE Server," on page 258).
    Step 23. Complete the post-installation configuration steps. It is recommended to download and install the latest version of MLNX_OFED_LINUX, available from http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers
    E.4.1 Using PXE Boot Services for Booting the SLES11 SP3 from the iSCSI Target
    Once the installation is completed, the system will reboot. At this stage, the client is expected to perform another PXE network boot with FlexBoot. Make sure you choose the SLES11 3x64 iSCSI boot label from the boot menu (see Section E.3, "Configuring the PXE Server," on page 258).
    E.5 Installing RHEL6.4
30. connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high pressure).
    The UPDN algorithm is based on the following main stages:
    1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.
       Note: The user can override the node list manually.
       Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
    2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
    3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid
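The ranking stage of UPDN described in item 30 can be illustrated with a minimal Python sketch. It is not OpenSM code: the topology dictionary, the switch names and both function names are hypothetical, and the path check is a simplified reading of the up/down rule (a path may not go up toward rank 0 after it has gone down) that ignores the GUID-based rules the excerpt only begins to mention.

```python
from collections import deque


def rank_switches(links, roots):
    """Assign rank 0 to the root switches and rank the rest by BFS distance."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def is_up_down_path(path, rank):
    """Return True if the path never turns back up after its first downward hop."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if rank[b] > rank[a]:      # moving away from the roots: a "down" hop
            gone_down = True
        elif gone_down:            # any non-down hop after going down is rejected
            return False
    return True


if __name__ == "__main__":
    # Hypothetical two-level fabric: two spine (root) switches, two leaf switches.
    links = {
        "spine1": ["leaf1", "leaf2"],
        "spine2": ["leaf1", "leaf2"],
        "leaf1": ["spine1", "spine2"],
        "leaf2": ["spine1", "spine2"],
    }
    rank = rank_switches(links, ["spine1", "spine2"])
    print(rank)
    print(is_up_down_path(["leaf1", "spine1", "leaf2"], rank))   # up, then down
    print(is_up_down_path(["spine1", "leaf1", "spine2"], rank))  # down, then up
```

Rejecting the down-then-up path is what removes the cycle through the two spines in this toy fabric.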
31. - Calculates checksum on received packets
    - Supports net device TSO through ConnectX LSO capability to defragment large datagrams to MTU quantas
    - Dual operation mode: datagram and connected
    - Large MTU support through connected mode
    IPoIB also supports the following software-based enhancements:
    - Giant Receive Offload
    - NAPI
    - Ethtool support
    4.3.2 IPoIB Mode Setting
    IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Datagram mode, except for the Connect-IB adapter card, which uses IPoIB with Connected mode as default. For better scalability and performance, we recommend using the Datagram mode. However, the mode can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting SET_IPOIB_CM=yes.
    The SET_IPOIB_CM parameter is set to "auto" by default, which enables Connected mode for the Connect-IB card and Datagram mode for all other ConnectX cards. After changing the mode, you need to restart the driver by running:
        /etc/init.d/openibd restart
    To check the current mode used for outgoing connections, enter:
        cat /sys/class/net/ib<n>/mode
    4.3.3 IPoIB Configuration
    Unless you have run the installation script mlnxofedinstall with the flag '-n', IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address an
32. encapsulates the InfiniBand transport and the GRH headers in Ethernet packets bearing a dedicated ether type (0x8915). Thus, any VERB application that works in an InfiniBand fabric can work in an Ethernet fabric as well. RoCE is enabled only for drivers that support VPI (currently, only mlx4).
    When working with RDMA applications over the Ethernet link layer, the following points should be noted:
    - The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API but only the actions, such as joining a multicast group, that need to be taken when using the API.
    - Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is displayed as zero when querying the port.
    - With RoCE, the alternate path is not set for RC QP and therefore APM is not supported.
    - Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with the relevant values before establishing a connection. Hence, it is recommended to work with RDMA-CM to establish a connection, as it takes care of filling the path record structure.
    - The GID table for each port is populated with N+1 entries, where N is the number of IP addresses that are assigned to all network devices associated with the port, including VLAN devices, alias devices and bonding masters. The only exception to this ru
33. group DEVGROUP } [ { up | down } ] [ vf NUM [ mac LLADDR ] [ vlan VLANID [ qos VLAN-QOS ] ] ... ]
    use:
        ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]
    where:
    - NUM = 0..max-vf-num
    - vlan_id = 0..4095 (4095 means "set VGT")
    - qos = 0..7
    For example:
    - ip link set dev eth2 vf 2 vlan <vlan_id> qos 3 sets VST mode for VF #2 belonging to PF eth2, with qos = 3
    - ip link set dev eth2 vf 2 vlan 4095 sets the mode for VF #2 back to VGT
    4.14.6.3.2 Additional Ethernet VF Configuration Options
    Guest MAC configuration
    By default, guest MAC addresses are configured to be all zeroes. In the mlnx_ofed guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, the guest MACs should be configured before the guest driver comes up. The guest MAC may be configured by using:
        ip link set dev <PF device> vf <NUM> mac <LLADDR>
    For legacy guests, which do not generate random MACs, the administrator should always configure their MAC addresses via ip link, as above.
    Spoof checking
    Spoof checking is currently available only on upstream kernels newer than 3.1.
        ip link set dev <PF device> vf <NUM> spoofchk [on | off]
    4.14.6.3.3 Mapping VFs to Ports
    To view the VFs mapping to ports, use the ip link tool v2.6.34~3 and above:
        ip link
    The output is as follows:
        61: p1p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noo
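The value ranges given in item 33 (vlan_id 0..4095, qos 0..7) lend themselves to a small validation helper. The Python sketch below only builds the `ip link set ... vf ... vlan ...` argument list and does not run it; the helper name and its return shape are illustrative assumptions, and the upper bound on the VF number is not checked because the excerpt gives it only as max-vf-num.

```python
VGT_VLAN_ID = 4095  # per the text, vlan 4095 means "set VGT"


def vf_vlan_args(pf_device, vf, vlan_id, qos=None):
    """Build the argument list for 'ip link set dev <PF> vf <NUM> vlan <id> [qos <q>]'."""
    if vf < 0:
        raise ValueError("vf must be non-negative")
    if not 0 <= vlan_id <= 4095:
        raise ValueError("vlan_id must be in 0..4095")
    args = ["ip", "link", "set", "dev", pf_device,
            "vf", str(vf), "vlan", str(vlan_id)]
    if qos is not None:
        if not 0 <= qos <= 7:
            raise ValueError("qos must be in 0..7")
        args += ["qos", str(qos)]
    return args


if __name__ == "__main__":
    # VST mode with an example VLAN ID (100 is a made-up value) and qos 3.
    print(" ".join(vf_vlan_args("eth2", 2, 100, qos=3)))
    # Back to VGT.
    print(" ".join(vf_vlan_args("eth2", 2, VGT_VLAN_ID)))
```

The list form can be handed to `subprocess.run` on a host where the command is meant to be executed.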
34. host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
    To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'.
    - To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
    - Executing srp_daemon over a port without the -a option will only display the targets that are reachable via the port and to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
    - It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. (See Section 4.1.2.5 for more details.)
    - srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
    - A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 4.1.2.4.
    4.1.2.4 Automatic Discovery and Connection to Targets
    - Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
    - To connect to all the existing Targets in the fabric, run "srp_daemon -e -o". This utility will scan the fabric once, connect to every Ta
35. <username>@host1
    Step 2. Check that the public and private keys have been generated.
        host1$ cd /home/<username>/.ssh/
        host1$ ls
        host1$ ls -la
        total 40
        drwx------ 2 root root 4096 Mar 5 04:57 .
        drwxr-x--- 13 root root 4096 Mar 4 18:27 ..
        -rw------- 1 root root 1675 Mar 5 04:57 id_rsa
        -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub
    Step 3. Check the public key.
        host1$ cat id_rsa.pub
        ssh-rsa AAAAB3NzaC1yc2EAAAABIWAAAQEA1ZVY8VBHOh90kZN70A1ibU074RXm4zHeczyVxpYHaDPyDmqezbYMKrCIVz d10bH ZkCOrpLYviU0oUHd3 f vNT Ms0gcGg08PysUft 12FyYjira2Plxyg6mkHLGGqVut fEMmABZ3wNCUg6J2X 3G uiuSWXeubZmbXcMrP wAIWByfH8ajwo6A5WioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorX NCOZBe4kTnUqm63nQ2z1qVMdL9FrCmalxIOu94SQJAjwONevaMz FKEHe7YHg6YrNfXunfdbEurzB524TpPcrod ZlfCQ <username>@host1
    Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine.
        host1$ cat id_rsa.pub | xargs ssh host2 \
            "echo >>/home/<username>/.ssh/authorized_keys2"
        <username>@host2's password:    // Enter password
        host1$
    For a local machine, simply add the key to authorized_keys2:
        host1$ cat id_rsa.pub >> authorized_keys2
    Step 5. Test.
        host1$ ssh host2 uname
        Linux
    5.2.3 MPI Selector - Which MPI Runs
    Mellanox OFED contains a simple mechanism for system administrators and end-users to select which MPI im
36. nication
    IB Cluster/Fabric/Subnet: A set of IB devices connected by IB cables.
    In-Band: A term assigned to administration activities traversing the IB connectivity only.
    Local Identifier (LID): An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
    Local Device/Node: The IB Host Channel Adapter (HCA) card installed on the system/machine running IBDIAG tools.
    Table 3 - Glossary (Sheet 2 of 2)
    Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
    Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
    Multicast Forwarding Tables: A table that exists in every switch, providing the list of ports to forward received multicast packets. The table is organized by MLID.
    Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
    Standby Subnet Manager: A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
    Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying
37. -p, --port=<port>              Listen on/connect to port <port> (default 18515)
    -r, --rx-depth=<dep>           Rx queue size (default 512). If using srq, rx-depth controls max-wr size of the srq
    -R, --rdma_cm                  Connect QPs with rdma_cm and run test on those QPs
    -s, --size=<size>              Size of message to exchange (default 2)
    -S, --sl=<sl>                  SL (default 0)
    -T, --tos=<tos value>          Set <tos value> to RDMA-CM QPs. Available only with the -R flag. Values 0-256 (default off)
    -u, --qp-timeout=<timeout>     QP timeout; timeout value is 4 usec * 2^(timeout), default 14
    -U, --report-unsorted          (Implies -H) Print out unsorted results (default sorted)
    -V, --version                  Display version number
    -x, --gid-index=<index>        Test uses GID with GID index (default: IB - no gid, ETH - 0)
    -z, --com_rdma_cm              Communicate with rdma_cm module to exchange data - use regular QPs
    Additional Options
    The table below lists the additional flags of the command.
    Table 46 - Additional ib_send_lat Flags and Options
    --inline_recv=<size>           Max size of message to be sent in inline receive
    --output=<units>               Set verbosity output level: bandwidth, message_rate, latency_typical
    --pkey_index=<pkey index>      PKey ind
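The QP timeout rule quoted in item 37 (timeout value is 4 usec * 2^timeout, default 14) is easy to evaluate. The Python sketch below uses the 4 usec base exactly as stated in the flag description; the function name is illustrative.

```python
def qp_timeout_usec(exponent: int) -> int:
    """QP timeout in microseconds for a given --qp-timeout exponent: 4 usec * 2^exponent."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return 4 * 2 ** exponent


if __name__ == "__main__":
    default = qp_timeout_usec(14)
    print(f"default timeout: {default} usec (about {default / 1000:.1f} ms)")
```

Each increment of the exponent doubles the timeout, so small changes to the flag have a large effect.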
38. Addressing 187
    9.4 Diagnostic Utilities 187
    9.4.1 ibdiagnet (of ibutils2) - IB Net Diagnostic 187
    9.4.2 ibdiagnet (of ibutils) - IB Net Diagnostic 191
    9.4.3 ibdiagpath - IB Diagnostic Path 193
    9.4.4 ibstat 195
    9.4.5 ibtracert 196
    9.4.6 ibqueryerrors 197
    9.4.7 saquery 199
    9.4.8 smpdump 201
    9.4.9 ibv_devices 203
    9.4.10 ibv_devinfo 203
    9.4.11 ibdev2netdev 204
    9.4.12 ibstatus 205
    9.4.13 ibportstate 207
    9.4.14 ibroute 210
    9.4.15 smpquery 214
    9.4.16 perfquery 217
    9.4.17 mstflint 220
    9.4.18 ibv_asyncwatch 224
    9.4.19 ibdump 224
    9.5 Performance Utilities 225
    9.5.1 ib_read_bw 225
    9.5.2 ib_read_lat 227
    9.5.3 ib_send_bw 229
    9.5.4 ib_send_lat 232
    9.5.5 ib_write_bw 234
    9.5.6 ib_writ
39. [-wt] [-pm] [-pc] [-P <<PM>=<Value>>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-skip <ibdiag_check/s>] [-load_db <db_file>]
    Options
      -c <count>            Min number of packets to be sent across each link (default = 10)
      -v                    Enable verbose mode
      -r                    Provides a report of the fabric qualities
      -t <topo-file>        Specifies the topology file name
      -s <sys-name>         Specifies the local system name. Meaningful only if a topology file is specified
      -i <dev-index>        Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
      -p <port-num>         Specifies the local device's port num used to connect to the IB fabric
      -o <out-dir>          Specifies the directory where the output files will be placed (default = /tmp)
      -lw <1x|4x|12x>       Specifies the expected link width
      -ls <2.5|5|10>        Specifies the expected link speed
      -pm                   Dump all the fabric links' pm Counters into ibdiagnet.pm
      -pc                   Reset all the fabric links' pmCounters
      -P <PM=<Trash>>       If any of the provided pm is greater than its provided value, print it to screen
      -skip <skip-option(s)> Skip the executions of the selected checks. Skip options (one or more can be specified): dup_guids zero_guids pm logical_state part ipoib all
      -wt <file-name>
      -load_db <db_file>
      -h|--help
      -V|--version
      --vars
    Output Files
40. 9.4.15 smpquery
    Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.
    Synopsis
        smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]
    Output Files
    Table 34 lists the various flags of the command.
    Table 34 - smpquery Flags and Options
      Flag               Optional/Mandatory   Description
      -h(help)           Optional             Print the help menu
      -d(ebug)           Optional             Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
      -e(rr_show)        Optional             Show send and receive errors (timeouts and others)
      -v(erbose)         Optional             Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
      -D(irect)          Optional             Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...)
      -G(uid)            Optional             Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
      -s <smlid>         Optional             Use <smlid> as the target LID for SM/SA queries
      -V(ersion)         Optional             Sh
41. 0000:0000:0007:3895
        sys_image_guid: 0000:0000:0007:3898
        vendor_id: 0x02c9
        vendor_part_id: 25418
        hw_ver: 0x40
        board_id: MT_04A0140005
        phys_port_cnt: 2
        port:
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 1
            port_lid: 1
            port_lmc: 0x00
    9.4.11 ibdev2netdev
    ibdev2netdev enables association between IB devices and ports and the associated net device. Additionally, it reports the state of the net device link.
    Synopsis
        ibdev2netdev [-v] [-h]
    Options
      -v    Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state
      -h    Print help messages
    Example
        sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
        mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> eth5 (Down)
        mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> ib0 (Down)
        mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 2 (DOWN) ==> ib1 (Down)
        mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 1 (DOWN) ==> eth2 (Down)
        mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN) ==> eth3 (Down)
        sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev
        mlx4_0 port 1 ==> eth5 (Down)
        mlx4_0 port 1 ==> ib0 (Down)
        mlx4_0 port 2 ==> ib1 (Down)
        mlx4_1 port 1 ==> eth2 (Down)
        mlx4_1 port 2 ==> et
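Scripts often need the device-to-netdev association from item 41 in structured form. The Python sketch below parses the non-verbose ibdev2netdev output, assuming each line has the shape `<device> port <n> ==> <netdev> (<state>)`; that line shape is inferred from the example in this excerpt and may differ between releases, and the function and field names are hypothetical.

```python
import re

# Assumed line shape: "<device> port <n> ==> <netdev> (<state>)".
LINE_RE = re.compile(
    r"^(?P<device>\S+)\s+port\s+(?P<port>\d+)\s+==>\s+(?P<netdev>\S+)\s+\((?P<state>\w+)\)$"
)


def parse_ibdev2netdev(text):
    """Return one dict per recognized line; unrecognized lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entry = match.groupdict()
            entry["port"] = int(entry["port"])
            entries.append(entry)
    return entries


if __name__ == "__main__":
    sample = "mlx4_0 port 1 ==> eth5 (Down)\nmlx4_0 port 2 ==> ib1 (Down)\n"
    for entry in parse_ibdev2netdev(sample):
        print(entry)
```

Verbose (-v) output carries extra fields before the port number and would need a different pattern.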
42. 1000 RX bytes:22440 (21.9 KiB) TX bytes:19232 (18.7 KiB) Interrupt:10 Base address:0xa000
    4.14.4 Assigning a Virtual Function to a Virtual Machine
    This section describes a mechanism for adding an SR-IOV VF to a Virtual Machine.
    4.14.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
    Step 1. Run the virt-manager.
    Step 2. Double-click on the virtual machine and open its Properties.
    Step 3. Go to Details -> Add hardware -> PCI host device.
    [Screenshot: virt-manager "Add new virtual hardware" dialog]
    Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).
    Step 5. If the Virtual Machine is up, reboot it; otherwise start it.
    Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
        lspci | grep Mellanox
        00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MA
43. 12, then the value needs to be 1000000000000 or 0xe8d4a51000 (10^12). If the threshold given is 0, then all BER values for all ports will be reported.
      --extended_speeds <dev-type>      Collect and test port extended speeds counters. dev-type: (sw | all)
      --pm_per_lane                     List all counters per lane (when available)
      --ls <2.5|5|10|14|25|FDR10>       Specifies the expected link speed
      --lw <1x|4x|8x|12x>               Specifies the expected link width
      -w|--write_topo_file <file name>  Write out a topology file for the discovered topology
      -t|--topo_file <file>             Specifies the topology file name
      --out_ibnl_dir <directory>        The topology file custom system definitions (ibnl) directory
      --screen_num_errs <num>           Specifies the threshold for printing errors to screen (default = 5)
      --smp_window <num>                Max smp MADs on wire (default = 8)
      --gmp_window <num>                Max gmp MADs on wire (default = 128)
      --max_hops <max-hops>             Specifies the maximum hops for the discovery process (default = 64)
      -V|--version                      Prints the version of the tool
      -h|--help                         Prints help information (without plugins help, if exists)
      -H|--deep_help                    Prints deep help information (including plugins help)
    Output Files
    Table 22 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.
    Table 22 - ibdiagnet (of ibutils2) Output Files
44. E.4 Installing SLES11 SP3 on a Remote Storage over iSCSI
    Step 1. Reboot the diskless client and perform a PXE boot with FlexBoot. This is not an iSCSI boot but a regular PXE-initiated network deployment of SLES11 SP3. In the DHCP server configuration, the PXELINUX (pxelinux.0) and a SLES11 SP3 distribution media will be provided for network installation. The client's HDD was removed beforehand, therefore the installer will ask to locate an HDD. The built-in iSCSI discovery will be used to connect to the iSCSI target LUN partition.
    Step 2. Reboot the client and invoke PXE boot with the Mellanox boot agent.
    [Screenshot: BIOS Boot Manager with the MLNX FlexBoot entry]
    Step 3. Select the "Install SLES11.3" boot option from the menu (see the pxelinux.cfg example above). After about 30 seconds, the SLES installer will issue the notification below, due to the PXELINUX boot label we used above.
    [Screenshot: Linuxrc v3.3.91 (Kernel 3.0.76-0.11-default) notification]
    Step 4. Click OK
45. 27: ibdiagnet (of ibutils) Output Files 192
    Table 28: ibdiagpath Output Files 195
    Table 29: ibstat Flags and Options 195
    Table 30: ibtracert Flags and Options 196
    Table 31: ibqueryerrors Flags and Options 197
    Table 32: saquery Flags and Options 199
    Table 33: smpdump Flags and Options 202
    Table 34: ibv_devinfo Flags and Options 203
    Table 35: ibstatus Flags and Options 205
    Table 36: ibportstate Flags and Options 207
    Table 37: ibportstate Flags and Options 211
    Table 38: smpquery Flags and Options 214
    Table 39: perfquery Flags and Options 217
    Table 40: mstflint Switches 220
    Table 41: mstflint Commands 222
    Table 42: ib_read_bw Flags and Options 226
    Table 43: Additional ib_read_bw Flags and Options 227
    Table 44: ib_read_lat Flags and Options 228
    Table 45: Additional ib_read_lat Flags and
46. [Figure: 2D grid of switches, x axis 0 to 5, y axis 0 to 4]
    Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops, as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example.
    8.5.7.3 Torus Topology Discovery
    The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus. Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of interswitch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations.
    Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2Qo
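The cube of eight switch locations mentioned in item 46 can be enumerated with a few lines of Python. This is only a sketch of the coordinate bookkeeping, not of the torus-2QoS discovery algorithm itself; wrapping coordinates modulo the radix of each dimension is an assumption made here because the fabric is a torus, and the function name is illustrative.

```python
from itertools import product


def cube_corners(x, y, z, radix):
    """Eight switch locations bounded by (x, y, z) and (x+1, y+1, z+1), wrapped per dimension."""
    rx, ry, rz = radix
    return [
        ((x + dx) % rx, (y + dy) % ry, (z + dz) % rz)
        for dx, dy, dz in product((0, 1), repeat=3)
    ]


if __name__ == "__main__":
    # Hypothetical 4x4x4 torus; the cube anchored at x=3 wraps around to x=0.
    for corner in cube_corners(3, 0, 0, (4, 4, 4)):
        print(corner)
```

Each face of this cube is a candidate 4-loop of interswitch links in the sense used by the text.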
47. Table 51 - ib_atomic_bw Flags and Options
      -A, --atomic_type=<type>     Type of atomic operation from {CMP_AND_SWAP, FETCH_AND_ADD} (default FETCH_AND_ADD)
      -b, --bidirectional          Measure bidirectional bandwidth (default unidirectional)
      -c, --connection=<RC/XRC/DC> Connection type RC/XRC/DC (default RC)
      -d, --ib-dev=<dev>           Use IB device <dev> (default first device found)
      -D, --duration               Run test for a customized period of seconds
      -e, --events                 Sleep on CQ events (default poll)
      -f, --margin                 Measure results within margins (default 2 sec)
      -F, --CPU-freq               Do not fail even if cpufreq_ondemand module is loaded
      -h, --help                   Show this help screen
      -i, --ib-port=<port>         Use port <port> of IB device (default 1)
      -l, --post_list=<list size>  Post list of WQEs of <list size> size (instead of single post)
      -m, --mtu=<mtu>              MTU size: 256 - 4096 (default port mtu)
      -n, --iters=<iters>          Number of exchanges (at least 5, default 1000)
      -N, --no peak-bw             Cancel peak-bw calculation (default with peak)
      -o, --outs=<num>             Num of outstanding read/atom (default max of device)
      -O, --dualport               Run test in dual-port mode
      -p, --port=<port>            Listen on/connect to port <port> (default 18515)
      -q, --qp=<num of qp's>       Num of qp's (default 1)
      -Q, --cq-mod                 Generate Cqe only after <cq-mod> completion
48. 8.6.6.3 RDS
    Similar to SDP, RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes the default Service ID 0x00000000010648CA. The following two match rules are equivalent:
        rds : <SL>
        any, service-id 0x00000000010648CA : <SL>
    8.6.6.4 SRP
    The Service ID for SRP varies from storage vendor to vendor, thus SRP query is matched by the target IB port GUID. The following two match rules are equivalent:
        srp, target-port-guid 0x1234 : <SL>
        any, target-port-guid 0x1234 : <SL>
    Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port guid only) should be placed at the end of the qos-ulps match rules.
    8.6.6.5 MPI
    The SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section.
    8.6.7 SL2VL Mapping and VL Arbitration
    The OpenSM cached options file has a set of QoS-related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
    - Max VLs: the maximum number of VLs that will
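The RDS Service ID layout in item 48 (0x000000000106PPPP, with PPPP holding the port number) can be reproduced with a short Python sketch. The constant and function names are illustrative; only the bit layout and the default port 0x48CA come from the text.

```python
RDS_SERVICE_ID_BASE = 0x0000000001060000  # 0x000000000106PPPP with PPPP cleared
RDS_DEFAULT_PORT = 0x48CA


def rds_service_id(port: int = RDS_DEFAULT_PORT) -> int:
    """Combine the RDS base Service ID with a 16-bit TCP/IP port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return RDS_SERVICE_ID_BASE | port


def format_service_id(service_id: int) -> str:
    """Render a Service ID as 16 hex digits, as written in the match rules."""
    return f"0x{service_id:016X}"


if __name__ == "__main__":
    print(format_service_id(rds_service_id()))
```

With the default port this yields the Service ID used in the "any, service-id" form of the match rule.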
49. Displaying the VPD (Vital Product Data) of an HCA board
    - flint
      This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file.
    - spark
      This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
    - Debug utilities
      A set of debug utilities (e.g., itrace, mstdump, isw, and i2c).
    For additional details, please refer to the MFT User's Manual, docs/.
    Footnote 1: OpenSM is disabled by default. See Chapter 8, "OpenSM - Subnet Manager", for details on enabling it.
    1.4 Quality of Service
    Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB and Eth network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager".
    1.5 RDMA over Converged Ethernet (RoCE)
    RoCE allows InfiniBand (IB) transport applications to work over an Ethernet network. RoCE
50. Example 2: working with scst_vdisk FILEIO mode (using the md0 device and the file 10G-file)
        a. modprobe scst
        b. modprobe scst_vdisk
        c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
        d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
        e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
        f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    2. Run:
        For all distributions except SLES 11:
            > modprobe ib_srpt
        For SLES 11:
            > modprobe -i ib_srpt
    For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:
        ib_srpt: no symbol version for scst_unregister
        ib_srpt: Unknown symbol scst_unregister
        ib_srpt: no symbol version for scst_register
        ib_srpt: Unknown symbol scst_register
        ib_srpt: no symbol version for scst_unregister_target_template
        ib_srpt: Unknown symbol scst_unregister_target_template
    B. On Initiator Machines
    On Initiator machines, manually perform the following steps:
    1. Run:
        modprobe ib_srp
    2. Run:
        ibsrpdm -c -d /dev/infiniband/umadX (to discover a new SRP target)
        umad0: port 1 of the first HCA
        umad1: port 2 of the first HCA
        umad2: port 1 of the second HCA
    3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
    4. fdisk -l (will show the newly discovered scsi disks)
    Example: Assume that you use port 1 of first HCA in
51. Flags and Options
Table 49 - ib_write_lat Flags and Options
Flag | Description
-a, --all | Run sizes from 2 till 2^23
-c, --connection=<RC/XRC/UC/DC> | Connection type RC/XRC/UC/DC (default RC)
-C, --report-cycles | Report times in cpu cycle units (default microseconds)
-d, --ib-dev=<dev> | Use IB device <dev> (default first device found)
-D, --duration | Run test for a customized period of seconds
-f, --margin | Measure results within margins (default 2sec)
-h, --help | Show this help screen
-H, --report-histogram | Print out all results (default print summary only)
-i, --ib-port=<port> | Use port <port> of IB device (default 1)
-I, --inline_size=<size> | Max size of message to be sent in inline
-m, --mtu=<mtu> | MTU size: 256 - 4096 (default port mtu)
-n, --iters=<iters> | Number of exchanges (at least 5, default 1000)
-p, --port=<port> | Listen on/connect to port <port> (default 18515)
-R, --rdma_cm | Connect QPs with rdma_cm and run test on those QPs
-s, --size=<size> | Size of message to exchange (default 2)
-S, --sl=<sl> | SL (default 0)
-T, --tos=<tos value> | Set <tos value> to RDMA-CM QPs. Available only with the -R flag. Values 0-256 (default off)
-U, --report-unsorted |
-u, --qp-timeout=<timeout> | QP timeou
52. Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c900:0101:d151
base lid: 0x0
sm lid: 0x0
state: 2: INIT
phys state: 5: LinkUp
rate: 10 Gb/sec (4X)
Infiniband device 'mthca0' port 2 status:
default gid: fe80:0000:0000:0000:0002:c900:0101:d152
base lid: 0x0
sm lid: 0x0
state: 2: INIT
phys state: 5: LinkUp
rate: 10 Gb/sec (4X)
2. List the status of specific ports of specific devices:
> ibstatus mthca0:1 mlx4_0:2
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c900:0101:d151
base lid: 0x0
sm lid: 0x0
state: 2: INIT
phys state: 5: LinkUp
rate: 10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0000:0000:0007:3897
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
9.4.13 ibportstate
Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port. If the queried port is a switch port, then ibportstate can be used to:
- disable, enable or reset the port
- validate the port's link width and speed against the peer port
Synopsis
ibportstate [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-D(irect)] [-G(uid)] [-s <smlid>] [-V(ersion)] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] <dest dr_path|lid|guid> <portnum> [<op> [<value>]]
53. Key
This option specifies the SM's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in a host byte order. It is fixed now, but you may need this option to interoperate with an old OpenSM running on a little endian machine.
--reassign_lids, -r
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.
--routing_engine, -R <engine name>
This option chooses the routing engine(s) to use instead of the default Min Hop algorithm. Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. If all configured routing engines fail, OpenSM will always attempt to route with Min Hop unless 'no_fallback' is included in the list of routing engines. Supported engines: updn, file, ftree, lash, dor, torus-2QoS.
--do_mesh_analysis
This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing.
--lash_start_vl <vl number>
Sets the starting VL to use for the lash routing algorithm. Defaults to 0.
sm s
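The fallback order described above for the routing engine option can be sketched in a few lines. This is only an illustration under stated assumptions: the `engines` mapping and its callables are hypothetical stand-ins for real routing attempts, and the spellings `minhop` and `no_fallback` are assumed here for the sketch.

```python
from typing import Callable, Dict, Optional


def route_with_fallback(engine_list: str,
                        engines: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the name of the first engine that succeeds, or None.

    engine_list: comma-separated engine names, tried in the given order.
    engines:     hypothetical mapping of engine name -> callable that
                 returns True when routing with that engine succeeds.
    """
    names = [name.strip() for name in engine_list.split(",") if name.strip()]
    allow_fallback = "no_fallback" not in names

    # Try the configured engines in the order they were listed.
    for name in names:
        if name == "no_fallback":
            continue
        attempt = engines.get(name)
        if attempt is not None and attempt():
            return name

    # All configured engines failed: fall back to Min Hop unless disabled.
    minhop = engines.get("minhop")
    if allow_fallback and minhop is not None and minhop():
        return "minhop"
    return None
```

For example, with a hypothetical setup in which `ftree` fails and `updn` succeeds, the list `ftree,updn` selects `updn`, while `ftree,no_fallback` yields no routing at all.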
54. [smpquery sample output: a port attribute listing whose field names did not survive extraction. Legible entries include LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps, LinkSpeedActive: 5.0 Gbps, and ClientReregister: 0.]
2. Query SwitchInfo by GUID:
> smpquery -G switchinfo 0x000b8cffff004016
# Switch info: Lid 3
[SwitchInfo field listing: names not recoverable from the extraction.]
55. Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
Table 35 - perfquery Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-a | Optional | | Apply query to all ports
-l | Optional | | Loop ports
-r | Optional | | Reset the counters after reading them
-C <ca_name> | Optional | | Use the specified channel adapter or router
-P <ca_port> | Optional | | Use the specified port
-R | Optional | | Reset the counters
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
-V(ersion) | Optional | | Show version info
<lid|guid> [[port] [reset_mask]] | Optional | | LID or GUID
Examples:
perfquery -r 32 1 # read performance counters and reset
perfquery -e -r 32 1 # read extended performance counters and reset
perfquery -R 0x20 1 # reset performance counters of port 1 only
perfquery -e -R 0x20 1 # reset extended performance counters of port 1 only
perfquery -R -a 32 # reset performance counters of all ports
perfquery -R 32 2 0x0fff # reset only error counters of port 2
perfquery -R 32 2 0xf000 # reset only non-error counters of port 2
1. Read local port's performance counters:
> perfquery
# Port counters: Lid 6 port 1
PortSelect: 1
CounterSelect: 0x1000
SymbolErrors: 0
LinkRe
56. Options
Table 46: ib_send_bw Flags and Options
Table 47: Additional ib_send_bw Flags and Options
Table 48: Additional Rate Limiter Flags and Options
Table 49: ib_send_lat Flags and Options
Table 50: Additional ib_send_lat Flags and Options
Table 51: ib_write_bw Flags and Options
Table 52: Additional ib_write_bw Flags and Options
Table 53: ib_write_lat Flags and Options
Table 54: Additional ib_write_lat Flags and Options
Table 55: ib_atomic_bw Flags and Options
Table 56: Additional ib_atomic_bw Flags and Options
Table 57: ib_atomic_lat Flags and Options
Table 58: Additional ib_atomic_lat Flags and Options
Table 59: raw_ethernet_bw Flags and Options
Table 60: Additional raw_ethernet_bw Flags and Options
Table 61: raw_ethernet_bw Rate Limiter Flags and Options
Table 62: raw_ethernet_bw Raw Ethernet Flags and Options
57. RATE_LIST: Set value of rpg_hai_rate according to priority, use spaces between values and -1 for unknown values.
--rpg_gd=RPG_GD_LIST: Set value of rpg_gd according to priority, use spaces between values and -1 for unknown values.
--rpg_min_dec_fac=RPG_MIN_DEC_FAC_LIST: Set value of rpg_min_dec_fac according to priority, use spaces between values and -1 for unknown values.
--rpg_min_rate=RPG_MIN_RATE_LIST: Set value of rpg_min_rate according to priority, use spaces between values and -1 for unknown values.
--cndd_state_machine=CNDD_STATE_MACHINE_LIST: Set value of cndd_state_machine according to priority, use spaces between values and -1 for unknown values.
To get the current QCN configuration sorted by priority:
mlnx_qcn -i eth2 -g parameters
To show QCN's statistics sorted by priority:
mlnx_qcn -i eth2 -g statistics
Example output when running mlnx_qcn -i eth2 -g parameters:
priority 0:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
priority 1:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
r
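The per-priority list convention above (one value per priority, separated by spaces, with -1 meaning "leave unknown") can be illustrated with a small helper. This is a minimal sketch: the function name is hypothetical, and the count of eight priorities (0 to 7) is an assumption based on the usual number of Ethernet priorities, not something stated in the text.

```python
from typing import Dict


def build_priority_list(changes: Dict[int, int], num_priorities: int = 8) -> str:
    """Build a space-separated per-priority value list.

    changes:        priority -> new value, for the priorities to be set.
    num_priorities: assumed number of priorities (0..7 by default).
    Priorities that are not mentioned get -1 (unknown value).
    """
    for priority in changes:
        if not 0 <= priority < num_priorities:
            raise ValueError(f"priority {priority} is out of range")
    return " ".join(str(changes.get(p, -1)) for p in range(num_priorities))


if __name__ == "__main__":
    # Set a value for priority 0 and priority 3 only.
    print(build_priority_list({0: 10, 3: 50}))
```

The resulting string is what would be passed as one of the `*_LIST` arguments; how it is quoted on the command line is left to the shell.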
58. SLES11SP3 kISO VPI pxeboot install initrd
SLES11SP3 kISO VPI pxeboot install linux
- Kernel and initrd for the boot after the installation:
SLES11SP3 kISO VPI pxeboot initrd
SLES11SP3 kISO VPI pxeboot linux
- kISO installation medium that can be used to boot from, instead of booting the installer program over the network. If choosing this method:
- Boot the client into the below SLES11 SP3 iso and proceed with the installation until the client fully boots up the installer program.
- Discover and connect to a remote iSCSI storage.
- During the installation process, you will be asked to insert the original installation medium to continue with the installation.
SLES11SP3 kISO VPI sles11 sp3 x86_64 mlnx ofed 2 1 1 0 6 iso
If the iso method above is not used, two different PXE server configurations are required (PXELINUX booting labels), one for each phase discussed herein (booting the installer and post-installation boot). For booting the installer program off the TFTP server, please provide the client a path to the initrd and linux kernel as provided inside SLES11SP3 kISO VPI pxeboot install in the tgz above.
The below is an example of such a label:
LABEL SLES11 3x64 manual installl
MENU LABEL 1 Install SLES11 3
kernel SLES11SP3 kISO VPI pxeboot install linux
append initrd SLES11SP3 kISO VPI pxeboot install initrd install nfs 12.7.6.30 pxerepo SLES 11 3 x86_64 DVD1 device p4p2
59. [Tail of a perfquery counter listing: counter names not recoverable from the extraction; every legible value is 0.]
9.4.17 mstflint
Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.
If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.
To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.
Synopsis
mstflint [switches...] <command> [parameters...]
Output Files
Table 36 lists the various switches of the utility, and Table 37 lists its commands.
Table 36 - mstflint Switches (Sheet 1 of 3)
Switch | Affected/Relevant Commands | Description
-h | | Print the help menu
-hh | | Print an extended help menu
-d[evice] <device> | All | Specify the device to which the Flash is connected
-guid <GUID> | burn, sg | GUID base value. 4 GUIDs are automatically assigned to the following values: guid -> node GUID; guid+1 -> port1; guid+2 -> port2; guid+3 -> system image GUID. Note: Port2 guid will be assign
60. Stamping
4.6.1 Ethernet Time Stamping Service
Note: Time Stamping is currently supported in ConnectX-3/ConnectX-3 Pro adapter cards only.
Time stamping is the process of keeping track of the creation of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed on the PCI, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.
4.6.1.1 Enabling Time Stamping
Time stamping is off by default and should be enabled before use.
To enable time stamping for a socket:
- Call setsockopt() with SO_TIMESTAMPING and with the following flags:
SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software
SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp as generated by the hardware
SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or fails, then do it in software
SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base
SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in software
SOF_TIMESTAMPING_TX/RX: determine ho
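As a rough sketch of how the flags listed above combine into the single value passed to setsockopt(), the following defines the bit values by hand and ORs a hardware-oriented subset together. The bit positions are taken from the Linux header linux/net_tstamp.h; the option number 37 for SO_TIMESTAMPING is an assumption that holds on common architectures such as x86, and the helper names are hypothetical.

```python
import socket

# Bit values as defined in the Linux header linux/net_tstamp.h.
SOF_TIMESTAMPING_TX_HARDWARE = 1 << 0
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SOF_TIMESTAMPING_RX_HARDWARE = 1 << 2
SOF_TIMESTAMPING_RX_SOFTWARE = 1 << 3
SOF_TIMESTAMPING_SOFTWARE = 1 << 4
SOF_TIMESTAMPING_SYS_HARDWARE = 1 << 5
SOF_TIMESTAMPING_RAW_HARDWARE = 1 << 6

# Assumption: socket option number on common architectures (e.g. x86).
SO_TIMESTAMPING = 37


def hardware_timestamp_flags() -> int:
    """Request hardware TX/RX time stamps, reported as raw hardware time."""
    return (SOF_TIMESTAMPING_TX_HARDWARE
            | SOF_TIMESTAMPING_RX_HARDWARE
            | SOF_TIMESTAMPING_RAW_HARDWARE)


def enable_timestamping(sock: socket.socket, flags: int) -> None:
    """Apply the flag mask to an existing socket (Linux only)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, flags)


# Usage on a Linux host (not executed here):
#   s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   enable_timestamping(s, hardware_timestamp_flags())
```

Software fallbacks can be requested by additionally ORing in the corresponding `*_SOFTWARE` flags.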
61. The values are:
max_errors = 0: zero tolerance, abort configuration on first error
error_window = 0: mechanism disabled, no error checking
Values: 0 - 48K. The default is 5.
8.9.4.1 Congestion Control Manager Options File
Table 18 - Congestion Control Manager General Options File
Option | Description | Values
enable | Enables/disables the Congestion Control mechanism on the fabric nodes. | Values: <TRUE | FALSE>. Default: True
num_hosts | Indicates the number of nodes. The CC table values are calculated based on this number. | Values: [0-48K]. Default: 0 (base on the CCT calculation on the current subnet size)
Table 19 - Congestion Control Manager Switch Options File
Option | Description | Values
threshold | Indicates how aggressive the congestion marking should be. | [0-0xf]. 0: no packet marking. 0xf: very aggressive. Default: 0xf
marking_rate | The mean number of packets between marking eligible packets with a FECN. | Values: [0-0xffff]. Default: 0xa
packet_size | Any packet less than this size [bytes] will not be marked with FECN. | Values: [0-0x3fc0]. Default: 0x200
Table 20 - Congestion Control Manager CA Options File
Option | Description | Values
port_control | Specifies the Congestion Control attribute for this port. | Values: 0 - QP based congestion control; 1 - SL/Port based congestion control. Default: 0
Table
62. ULPs and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments.
Sharing the same physical port(s) among multiple vHCAs is achieved as follows:
- Each vHCA port presents its own virtual GID table. For further details, please refer to Section 4.14.6.2.3.
- Each vHCA port presents its own virtual PKey table. The virtual PKey table (presented to a VF) is a mapping of selected indexes of the physical PKey table. The host admin can control which PKey indexes are mapped to which virtual indexes using a sysfs interface. The physical PKey table may contain both full and partial memberships of the same PKey to allow different membership types in different virtual tables.
- Each vHCA port has its own virtual port state. A vHCA port is up if the following conditions apply:
  - The physical port is up.
  - The virtual GID table contains the GIDs requested by the host admin.
  - The SM has acknowledged the requested GIDs since the last time that the physical port went up.
- Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask.
To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs 'iov' sub-tree has been added under the PF InfiniBand device.
4.14.6.2.1 SR-IOV sysfs Administration Interfaces on the Hypervisor
Administration of GUIDs and PKeys is done via the sysfs interface in the Hypervisor (Dom0). This interface is under sys clas
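The three conditions that decide whether a vHCA port is up can be written as a single predicate. The sketch below is only a model of that rule for illustration: the class and field names are hypothetical and do not correspond to any driver data structure.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class VirtualPort:
    """Hypothetical model of a vHCA port, for illustration only."""
    physical_port_up: bool
    requested_gids: Set[str] = field(default_factory=set)     # asked for by the host admin
    virtual_gid_table: Set[str] = field(default_factory=set)  # currently presented to the VF
    sm_acked_gids: Set[str] = field(default_factory=set)      # acked since the last physical port-up

    def is_up(self) -> bool:
        # All three conditions from the text must hold at the same time.
        return (self.physical_port_up
                and self.requested_gids <= self.virtual_gid_table
                and self.requested_gids <= self.sm_acked_gids)

    def physical_port_went_up(self) -> None:
        # SM acknowledgements only count since the last physical port-up.
        self.physical_port_up = True
        self.sm_acked_gids.clear()
```

Note how a physical link flap clears the acknowledgements in this model, so the virtual port stays down until the SM acknowledges the requested GIDs again.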
63. [First entry illegible]
8.2 opensm Description
8.2.1 opensm Syntax
8.2.2 Environment Variables
8.2.3 [title illegible]
8.2.4 Running opensm
8.3 osmtest Description
8.3.1 Syntax
8.3.2 Running osmtest
8.4 Partitions
8.4.1 File Format
8.5 Routing Algorithms
8.5.1 Effect of Topology Changes
8.5.2 Min Hop Algorithm
8.5.3 UPDN Algorithm
8.5.4 Fat-tree Routing Algorithm
8.5.5 LASH Routing Algorithm
8.5.6 DOR Routing Algorithm
8.5.7 Torus-2QoS Routing Algorithm
8.6 Quality of Service Management in OpenSM
8.6.1 Overview
8.6.2 Advanced QoS Policy File
8.6.3 Simple QoS Policy Definition
8.6.4 Policy File Syntax Guidelines
8.6.5 Examples of Advanced Policy File
8.6.6 Simple QoS Policy Details and Examples
64. above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root switch and every other marked node is a non-root switch:
[Figure: ASCII diagram of the spanning tree on a grid with x = 0..4 and y = 0..3; the drawing did not survive extraction.]
For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop. In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.
Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.
In the presence of link
65. dev_loss_tmo
4.1.2.1.2 SRP Remote Ports Parameters
Several SRP remote ports parameters are modifiable online on an existing connection.
To modify dev_loss_tmo to 600 seconds:
echo 600 > /sys/class/srp_remote_ports/port-xxx/dev_loss_tmo
To modify fast_io_fail_tmo to 15 seconds:
echo 15 > /sys/class/srp_remote_ports/port-xxx/fast_io_fail_tmo
To modify reconnect_delay to 20 seconds:
echo 20 > /sys/class/srp_remote_ports/port-xxx/reconnect_delay
4.1.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically.
- Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
- To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:
echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mlx[hca number]-[port number]/add_target
See Section 4.1.2.3 for instructions on how the parameters in this echo command may be obtained.
Notes:
- Execution of the above "echo" command may take some time.
- The SM must be running while the command executes.
- It is possible to include additional parameters in the e
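To make the shape of the add_target string easier to see, the sketch below assembles it from its parts. It is a minimal illustration that assumes the comma-separated key=value layout and the /sys/class/infiniband_srp/srp-<device>-<port>/add_target path; the function names and the sample GUID/GID values are hypothetical, and nothing is written to sysfs.

```python
def build_srp_target_string(id_ext: str, ioc_guid: str, dgid: str,
                            service_id: str, pkey: str = "ffff") -> str:
    """Assemble the comma-separated key=value string for add_target."""
    fields = [
        ("id_ext", id_ext),
        ("ioc_guid", ioc_guid),
        ("dgid", dgid),
        ("pkey", pkey),
        ("service_id", service_id),
    ]
    return ",".join(f"{key}={value}" for key, value in fields)


def add_target_path(device: str, port: int) -> str:
    """Return the assumed sysfs file for a given HCA device and port."""
    return f"/sys/class/infiniband_srp/srp-{device}-{port}/add_target"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    target = build_srp_target_string(
        id_ext="0002c90300a1b2c0",
        ioc_guid="0002c90300a1b2c0",
        dgid="fe800000000000000002c90300a1b2c1",
        service_id="0002c90300a1b2c0",
    )
    print(f"echo -n {target} > {add_target_path('mlx4_0', 1)}")
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without an SRP target.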
66. an example of AR Manager options file with all the default values in "Example of Adaptive Routing Manager Options File".
8.8.3.2 Disabling Adaptive Routing
There are two ways to disable the Adaptive Routing Manager:
1. By disabling it explicitly in the Adaptive Routing configuration file.
2. By removing the 'armgr' option from the Subnet Manager options file.
Note: The Adaptive Routing mechanism is automatically disabled once the switch receives the setting of the usual linear routing table (LFT). Therefore, no action is required to clear the Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.
8.8.4 Querying Adaptive Routing Tables
When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid, thus the standard tools that query LFT (e.g., smpquery, dump_lfts.sh, and others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the 'smparquery' tool that is installed as a part of the Adaptive Routing Manager package. To see its usage details, run 'smparquery -h'.
8.8.5 Adaptive Routing Manager Options File
The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following:
1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the 'event_plugin_options' option in the file Options
67. and manipulating subnet management data.
Subnet Manager (SM): One of several entities involved in the configuration and control of an IB fabric.
Unicast Linear Forwarding Tables (LFT): A table that exists in every switch, providing the port through which packets should be sent to each LID.
Virtual Protocol Interconnect (VPI): A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).
Related Documentation
Table 4 - Reference Documents
Document Name | Description
InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 | The InfiniBand Architecture Specification that is provided by IBTA.
IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002), Document # PDF: SS94996 | Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gb/s Operation.
Firmware Release Notes for Mellanox adapter devices | See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.
MFT User's Manual | Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.
MFT Release Notes | Release N
68. be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 4.1, "SCSI RDMA Protocol", and Appendix A, "SRP Target Driver".
uDAPL
User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative. This release of the uDAPL reference implementation package for both DAT 1.2 and 2.0 specification is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack. For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org
1.3.5 MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:
- Open MPI: an open source MPI-2 implementation by the
69. buffers and letting the network interface card split them into separate packets.
Large Receive Offload (LRO) increases inbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. LRO is available in kernel versions 3.1 for untagged traffic.
Note: LRO will be done whenever possible. Otherwise GRO will be done. Generic Receive Offload (GRO) is available throughout all kernels.
ethtool -c eth<x> | Queries interrupt coalescing settings.
ethtool -C eth<x> adaptive-rx on|off | Enables/disables adaptive interrupt moderation. By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.
ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N] | Sets the values for packet rate limits and for moderation time high and low values. Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.
Table 6 - ethtool Supported Options
Options | Description
ethtool -C eth<x>
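The effect of the four adaptive moderation parameters above can be illustrated with a small function. The two clamping cases follow the text; the linear interpolation between the limits is an assumption made for this sketch, since the text does not say what happens in between, and the function name and sample values are hypothetical.

```python
def moderation_time_usecs(packet_rate: int,
                          pkt_rate_low: int, pkt_rate_high: int,
                          rx_usecs_low: int, rx_usecs_high: int) -> int:
    """Pick an interrupt moderation time for the observed packet rate."""
    if pkt_rate_high <= pkt_rate_low:
        raise ValueError("pkt_rate_high must be greater than pkt_rate_low")

    # Above the upper limit: moderation time is set to its highest value.
    if packet_rate >= pkt_rate_high:
        return rx_usecs_high
    # Below the lower limit: moderation time is set to its lowest value.
    if packet_rate <= pkt_rate_low:
        return rx_usecs_low

    # Assumption: scale linearly between the two limits.
    span = pkt_rate_high - pkt_rate_low
    return rx_usecs_low + (packet_rate - pkt_rate_low) * (rx_usecs_high - rx_usecs_low) // span


if __name__ == "__main__":
    for rate in (50_000, 250_000, 500_000):
        print(rate, moderation_time_usecs(rate, 100_000, 400_000, 0, 128))
```

The point of the sketch is the trade-off: high packet rates tolerate longer moderation (fewer interrupts), while low rates keep moderation short to preserve latency.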
70. c to stop
9.5 Performance Utilities
The performance utilities described in this chapter are intended to be used as a performance micro-benchmark.
9.5.1 ib_read_bw
ib_read_bw calculates the BW of RDMA read between a pair of machines. One acts as a server and the other as a client. The client RDMA reads the server memory and calculates the BW by sampling the CPU each time it receives a successful completion. The test supports features such as Bidirectional (in which both sides RDMA read from each other's memory at the same time), change of mtu size, tx size, number of iterations, message size, and more. Read is available only in RC connection mode (as specified in the IB spec).
Synopsis
Server: ib_read_bw [options]
Client: ib_read_bw [options] [hostname]
Options
The table below lists the various flags of the command.
Table 38 - ib_read_bw Flags and Options
Flag | Description
-a, --all | Run sizes from 2 till 2^23
-b, --bidirectional | Measure bidirectional bandwidth (default unidirectional)
-c, --connection=<RC/XRC/DC> | Connection type RC/XRC/DC (default RC)
-d, --ib-dev=<dev> | Use IB device <dev> (default first device found)
-D, --duration | Run test for a customized period of seconds
-e, --events | Sleep on CQ events (default poll)
-f, --margin | Measure results within margins (default 2sec)
-F, --CPU
71. cases where not all leafs (CAs) are present, any Constant Bisectional Bandwidth (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks.
If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:
- Tree rank should be between two and eight (inclusively).
- Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all.
- Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches.
- Switches of the same rank should have the same number of ports in each UP-going port group.
- Switches of the same rank should have the same number of ports in each DOWN-going port group.
- All the CAs have to be at the same tree level (rank).
If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules:
- Tree rank should be between two and eight (inclusively).
- All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks.
Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree.
Note that although the fat-tree algorithm supports t
72. data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached. It can be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the original outgoing packet data, including all headers prepended down to and including the link layer, the scm_timestamping control message, and a sock_extended_err control message with ee_errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.
Note: When time-stamping is enabled, VLAN stripping is disabled. For more info, please refer to Documentation/networking/timestamping.txt in kernel.org.
4.6.1.3 Querying Time Stamping Capabilities via ethtool
To display Time Stamping capabilities via ethtool:
- Show Time Stamping capabilities:
ethtool -T eth<x>
Example:
ethtool -T eth0
Time stamping parameters for p2p1:
Capabilities:
hardware-transmit (SOF_TIMESTAMPING_TX_HARDWARE)
software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
hardware-receive (SOF_TIMESTAMPING_RX_HARDWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
hardware-raw
73. ddns-update-style none;
ddns-updates off;
allow bootp;
always-broadcast off;
always-reply-rfc1048 off;
boot-unknown-clients on;
option client-system-architecture code 93 = unsigned integer 16;
option vendor-encapsulated-options code 43 = string;
option vendor-class-identifier code 60 = string;
class "PXEClient" {
  match if substring(option vendor-class-identifier, 0, 9) = "PXEClient";
  option vendor-class-identifier "PXEClient";
  option vendor-encapsulated-options 06:01:08;
  option dhcp-parameter-request-list = concat(option dhcp-parameter-request-list, 43);
}
subnet 12.7.0.0 netmask 255.255.0.0 {
  option dhcp-server-identifier 12.7.6.30;
  option domain-name "pxe030.mtl.com";
  option domain-name-servers 12.7.6.30;
  default-lease-time 86400; # 1 day
  max-lease-time 86400;
  option ntp-servers 12.7.6.30;
}
host sqa070 {
  fixed-address 12.7.6.70;
  hardware ethernet 00:02:c9:32:e8:80;
  next-server 12.7.6.30;
  if option client-system-architecture = 00:00 {
    filename "pxelinux.0";
  }
}
E.6.3 pxelinux.cfg/default
[root@sqa030 ~]# cat /var/lib/tftpboot/pxelinux.cfg/default
LABEL rh6 4x64 instl manual
MENU LABEL Manual Installation RHEL6.4
KERNEL RHEL6.4-x86_64-DVD1/images/pxeboot/vmlinuz
APPEND initrd=RHEL6.4-x86_64-DVD1/images/pxeboot/initrd.img
E.6.4 DHCP Configuration for iSCSI Boot with FlexBoot PXE SAN Boot
Modify the following host declaration in your DHCP configur
74. e
4.4.2 QoS Architecture
4.4.3 Supported Policy
4.4.4 CMA Features
4.4.5 OpenSM Features
4.5 Quality of Service Ethernet
4.5.1 Quality of Service Overview
4.5.2 Mapping Traffic to Traffic Classes
4.5.3 Plain Ethernet Quality of Service Mapping
4.5.4 RoCE Quality of Service Mapping
4.5.5 Raw Ethernet QP Quality of Service Mapping
4.5.6 Map Priorities with tc_wrap.py/mlnx_qos
4.5.7 Quality of Service Properties
4.5.8 Quality of Service Tools
4.6 Ethernet Time Stamping
4.6.1 Ethernet Time Stamping Service
4.6.2 RoCE Time Stamping
4.7 Ethernet VXLAN
4.7.1 Prerequisites
4.7.2 Enabling VXLAN
4.7.3 Important Notes
4.8 Atomic Operations
4.8.1 Atomic Operations in mlx5 Driver
4.8.2 Enhanced Atomic Operations
4.9 Ethernet Tunneling Over IPoIB Driver (eIPoIB)
75. If it is a triplet x,y,z (applies only if all ports are configured as Ethernet), the driver creates:
    - x single-port VFs on physical port 1
    - y single-port VFs on physical port 2 (applies only if such a port exists)
    - z n-port VFs (where n is the number of physical ports on the device)
    This applies to all ConnectX HCAs on the host.
    If its format is a string: the string specifies the num_vfs parameter separately per installed HCA. The string format is "bb:dd.f-v,bb:dd.f-v,...", where bb:dd.f is the bus:device.function of the PF of the HCA, and v is the number of VFs to enable for that HCA, which is either a single value or a triplet, as described above.
    For example:
    - num_vfs=5: The driver will enable 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
    - num_vfs=00:04.0-5,00:07.0-8: The driver will enable 5 VFs on the HCA positioned in BDF 00:04.0 and 8 on the one in 00:07.0.
    - num_vfs=1,2,3: The driver will enable 1 VF on physical port 1, 2 VFs on physical port 2 and 3 dual-port VFs (applies only to dual-port HCAs when all ports are Ethernet ports).
    - num_vfs=00:04.0-5;6;7,00:07.0-8;9;10: The driver will enable:
      HCA positioned in BDF 00:04.0: 5 single VFs on port 1, 6 single VFs on port 2, 7 dual-port VFs
      HCA positioned in BDF 00:07.0: 8 single VFs on port 1, 9 single VFs on port 2, 10 dual-port VFs
      Applies when all ports are configured as Ethernet in dual-port HCAs.
    Notes: PFs not included in the above list
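The num_vfs formats described above can be illustrated with a small parser. This is only a sketch and not part of the driver: the function names are hypothetical, and it assumes the separators are ":" and "." inside the BDF, "-" between the BDF and the VF count, ";" between triplet members in the per-HCA form, and "," between triplet members in the global form.

```python
def parse_vf_count(text):
    """Return an int for a single value, or a 3-tuple for an x,y,z triplet."""
    # Assumption: triplets use ';' in the per-HCA form and ',' in the global form.
    parts = [int(p) for p in text.replace(";", ",").split(",")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 3:
        return tuple(parts)
    raise ValueError("expected a single value or a triplet: %r" % text)


def parse_num_vfs(value):
    """Map each PF (or '*' for all HCAs on the host) to its VF count."""
    if "-" not in value:
        # Global form: applies to all ConnectX HCAs on the host.
        return {"*": parse_vf_count(value)}
    result = {}
    for entry in value.split(","):
        bdf, _, count = entry.partition("-")
        result[bdf] = parse_vf_count(count)
    return result


if __name__ == "__main__":
    for example in ("5", "1,2,3", "00:04.0-5,00:07.0-8",
                    "00:04.0-5;6;7,00:07.0-8;9;10"):
        print(example, "->", parse_num_vfs(example))
```

A triplet is returned as (port 1 VFs, port 2 VFs, multi-port VFs), matching the order given in the description.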
76. errors: Shows send and receive errors
    usage (-u): Usage message
    Table 26 - ibtracert Flags and Options
    Flag: Description
    --Guid (-G): Uses GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
    --sm_port (-s) <smlid>: Uses 'smlid' as the target lid for SM/SA queries
    --help (-h): Shows the usage message
    --verbose (-v/-vv/-v -v -v): Increases the application verbosity level
    --version (-V): Shows the version info
    --Ca (-C) <ca_name>: Uses the specified ca_name
    --Port (-P) <ca_port>: Uses the specified ca_port
    --timeout (-t) <timeout_ms>: Overrides the default timeout for the solicited mads
    Examples
    Unicast examples:
    ibtracert 4 16 # show path between lids 4 and 16
    ibtracert -n 4 16 # same, but using simple output format
    ibtracert -G 0x8f1040396522d 0x002c9000100d051 # use guid addresses
    Multicast example:
    ibtracert -m 0xc000 4 16 # show multicast path of mlid 0xc000 between lids 4 and 16
    9.4.6 ibqueryerrors
    The default behavior is to report the port error counters which exceed a threshold for each port in the fabric. The default threshold is zero (0). Error fields can also be suppressed entirely. In addition to reporting errors on every port, ibqueryerrors can report the port transmit and receive data as well as report full link information to the remote port if available.
    Synopsis: ibq
77. > use RDMA device <dev> (default first device found). The relevant devices can be listed by running the 'ibv_devinfo' command.
    -i, --ib-port=<port>: use port <port> of IB device (default 1)
    -w, --write=<file>: dump file name (default "sniffer.pcap"). "-" stands for stdout, which enables piping to tcpdump or tshark.
    -o, --output=<file>: alias for the '-w' option. Do not use; kept for backward compatibility.
    -b, --max-burst=<log2 burst>: log2 of the maximal burst size that can be captured with no packet loss. Each entry takes ~MTU bytes of memory (default 12, i.e. 4096 entries).
    -s, --silent: do not print progress indication.
    --mem-mode <size>: when specified, packets are written to file only after the capture is stopped. It is faster than the default mode (less chance for packet loss) but takes more memory. In this mode, ibdump stops after <size> bytes are captured.
    --decap: Decapsulate port mirroring headers. Should be used when capturing RSPAN traffic.
    -h, --help: Display this help screen.
    -v, --version: Print version information.
    Examples
    1. Run ibdump:
    ibdump
    IB device: mx
    IB port: 1
    Dump file: sniffer.pcap
    Sniffer WQEs (max burst size): 4096
    Initiating resources ...
    searching for IB devices in host
    Port active_mtu=2048
    MR was registered with addr=0x60d850, lkey=0x28042601, rkey=0x28042601, flags=0x1
    QP was created, QP number: 0x4004a
    Ready to capture (Press
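The memory cost of the max-burst setting follows directly from the description above: the option is a log2 value, and each entry takes roughly MTU bytes. A minimal sketch of that arithmetic, with a hypothetical function name and assuming the 2048-byte active MTU shown in the sample output:

```python
def burst_buffer(log2_burst=12, mtu_bytes=2048):
    """Return (entries, approximate bytes) for a given max-burst setting."""
    entries = 2 ** log2_burst          # the option is given as log2 of the burst size
    return entries, entries * mtu_bytes  # each entry takes about MTU bytes


if __name__ == "__main__":
    entries, size = burst_buffer()
    print("%d entries, about %.1f MiB" % (entries, size / 2.0 ** 20))
```

With the default of 12 this gives 4096 entries, which agrees with the "max burst size" line in the sample run.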
78. http://scst.sourceforge.net/downloads.html
    b. Untar scst-1.0.1.1:
       $ tar zxvf scst-1.0.1.1.tar.gz
       $ cd scst-1.0.1.1
    c. Install scst-1.0.1.1 as follows:
       $ make && make install
    A.2 How to Run
    A. On an SRP Target machine:
    1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).
    Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0 (it is not required to have the lun numbers in ascending order, except that the first lun must always be 0).
    Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the ib_srpt module but does not load scst nor its dev_handlers.
    The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.
    Example 1: Working with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)
    a. modprobe scst
    b. modprobe scst_vdisk
    c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    f. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    g. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    h. echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
79. in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
    It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
    Restarting openibd does not keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart
    4.4 Quality of Service InfiniBand
    4.4.1 Quality of Service Overview
    Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
    Figure 2: I/O Consolidation Over InfiniBand [diagram: Administrator, QoS, Servers, IB-Ethernet Gateway, IB-Fibre Channel Gateway, Block Storage]
    QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager".
    The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resourc
80. information.
    9.5.10 raw_ethernet_lat
    raw_ethernet_lat calculates the latency of sending a packet in message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which a packet is sent only after one is received. Each side samples the CPU each time it receives a packet in order to calculate the latency. Using the "-a" flag provides results for all message sizes.
    Synopsis
    Server: raw_ethernet_lat [options] --dest_mac <MAC address of interface on client>
    Client: raw_ethernet_lat [options] --dest_mac <MAC address of interface on server>
    Options
    The table below lists the various flags of the command.
    Table 59 - raw_ethernet_lat Flags and Options
    Flag: Description
    -a, --all: Run sizes from 2 till 2^23
    -c, --connection=<RC/XRC/UC/UD/DC>: Connection type RC/XRC/UC/UD/DC (default RC)
    -C, --report-cycles: Report times in cpu cycle units (default microseconds)
    -d, --ib-dev=<dev>: Use IB device <dev> (default first device found)
    -D, --duration: Run test for a customized period of seconds
    -e, --events: Sleep on CQ events (default poll)
    -f, --margin: Measure results within margins (default 2 sec)
    -g, --mcg=<num_of_qps>: Send messages to multicast group with <num_of_qps> qps attached to it
    -h, --help: Show this help screen
    -H, --report-histogram
81. left with "No authentication".
    [Screenshot: iSCSI Discovery Details dialog; Target IP Address 12.7.6.30, iSCSI Initiator Name, discovery authentication set to "No authentication", buttons Cancel / Start Discovery]
    Step 8. Check the relevant Node Name to log in. If multiple Node Names are found as a result of the discovery, select the one that is relevant to you.
    [Screenshot: iSCSI Discovered Nodes dialog; columns Node Name and Interface, buttons Cancel / Login]
    Step 9. Click Login. A successful login is mandatory to proceed. A failure at this stage is probably a result of a target or network configuration error, and recovery troubleshooting is out of the scope of this document.
    [Screenshot: iSCSI Login Results dialog; "Successfully logged in and attached the following nodes"]
    Step 10. Make sure a new storage LUN appears in the "Other SAN Devices" tab. A successful LUN discovery is mandatory to proceed. A failure at this stage is probably a
82. mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1 mapped to the default pkey.
    Step d. On Host2 do the following:
      cd /sys/class/infiniband/mlx4_0/iov
      echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
      echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
      echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
      echo 2 > 0000:03:00.2/ports/1/pkey_idx/0
    Step e. Once the VMs are running, you can check the VM's virtualized PKey table by doing (on the vm):
      cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]
    Step 3. Start up the VMs (and bind VFs to them).
    Step 4. Configure IP addresses for ib0 on the host and on the guests.
    4.14.6.3 Ethernet Virtual Function Configuration when Running SR-IOV
    4.14.6.3.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
    When running ETH ports on VGT, the ports may be configured to simply pass through packets as is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging). In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag inserted, and incoming packets will have the VLAN tag removed. Any VLAN-tagged packets sent by the VF are silently dropped. The default behavior is VGT.
    The feature may be controlled on the Hypervisor from userspace via iproute2/netlink: ip link set { dev DEVICE
83. no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.
    In the case of file-based routing, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to the real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).
    8.5.2 Min Hop Algorithm
    The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'.
    The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
      -i <equalize-ignore-guids-file>
      --ignore-guids <equalize-ignore-guids-file>
    This option provides the means to define a set of ports (by guids) that will be ignored by the link load equalization algorithm.
    LMC awareness routes based on a (remote) system or switch basis.
    8.5.3 UPDN Algorithm
    The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts
84. on a Remote Storage over iSCSI
    Step 1. Reboot the diskless client and perform a PXE boot with FlexBoot. This is not an iSCSI boot, but rather a regular PXE-initiated network deployment of RHEL6.4. In the DHCP server configuration, PXELINUX (pxelinux.0) and a RHEL 6.4 distribution media will be provided for network installation. For further information, please refer to Section E.6.2, "DHCP Configuration for PXELINUX with FlexBoot", and Section E.6.3, "pxelinux.cfg/default".
    The client's HDD was removed beforehand, therefore the RHEL installer (a.k.a. Anaconda) will ask to locate a HDD. Anaconda's built-in iSCSI discovery will be used to connect to the iSCSI target LUN partition. For further information, please refer to: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Installation_Guide/index.html#ISCSI_disks
    Step 2. Select the Network Interface (the same interface which was used by FlexBoot during the PXE boot stage) for the installation process once prompted.
    Step 3. Select the type of Installation Media access. In this example we use NFS, which also requires us to enter the NFS server name and the directory path to the installation media on the NFS.
    Step 4. Select "Specialized Storage Devices".
    [Screenshot: "What type of devices will your installation involve?" dialog; options Basic Storage Devices / Specialized Storage Devices]
85. operating system, an application, a service, or a driver is performing. The counter data helps determine system bottlenecks and fine-tune the system and application performance. The operating system, network, and devices provide counter data that an application can consume to provide users with a graphical view of how well the system is performing.
    The counter index is a QP attribute given in the QP context. Multiple QPs may be associated with the same counter set. If multiple QPs share the same counter, its value represents the cumulative total.
    ConnectX-3 supports 127 different counters, which are allocated as follows:
    - 4 counters reserved for PF: 2 counters for each port
    - 2 counters reserved for VF: 1 counter for each port
    - All other counters, if they exist, are allocated by demand
    RoCE counters are available only through sysfs, located under:
    - /sys/class/infiniband/mlx4_*/ports/*/counters/
    - /sys/class/infiniband/mlx4_*/ports/*/counters_ext/
    The Physical Function can also read Virtual Functions' port counters through sysfs, located under:
    - /sys/class/net/eth*/vf*_statistics/
    To display the network device Ethernet statistics, you can run: ethtool -S <devname>
    Table 7 - Port IN Counters
    Counter: Description
    rx_packets: Total packets successfully received.
    rx_bytes: Total bytes in successfully received packets.
    rx_multicast_packets: Total multicast packets successfully received.
    rx_broadcast_packets: Total bro
86. port mtu)
    -n, --iters=<iters>: Number of exchanges (at least 5, default 1000)
    -o, --outs=<num>: Num of outstanding read/atom (default max of device)
    -p, --port=<port>: Listen on/connect to port <port> (default 18515)
    -R, --rdma_cm: Connect QPs with rdma_cm and run test on those QPs
    -S, --sl=<sl>: SL (default 0)
    -T, --tos=<tos value>: Set <tos_value> to RDMA-CM QPs. Available only with the -R flag. Values 0-256 (default off)
    -u, --qp-timeout=<timeout>: QP timeout; the timeout value is 4 usec * 2^(timeout), default 14
    -U, --report-unsorted: (implies -H) print out unsorted results (default sorted)
    -V, --version: Display version number
    -x, --gid-index=<index>: Test uses GID with GID index (Default: IB - no gid, ETH - 0)
    -z, --com_rdma_cm: Communicate with rdma_cm module to exchange data; use regular QPs
    Additional Options
    The table below lists the additional flags of the command.
    Table 54 - Additional ib_atomic_lat Flags and Options
    Flag: Description
    --inline_recv=<size>: Max size of message to be sent in inline receive
    --output=<units>: Set verbosity output level: bandwidth, message_rate, latency_typical
    --pkey_index=<pkey index>: PKey index to use for QP
    9.5.9 raw_ethernet_bw
    raw_et
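The QP timeout formula quoted for the -u flag (4 usec * 2^timeout) can be checked with a few lines of Python. This is a sketch for illustration only; the function name is hypothetical, and the 4 usec base is the rounded figure from the table (the InfiniBand specification defines the base as 4.096 usec, which can be passed in explicitly).

```python
def qp_timeout_usec(timeout=14, base_usec=4.0):
    """QP timeout in microseconds: base * 2^timeout."""
    return base_usec * (2 ** timeout)


if __name__ == "__main__":
    for t in (10, 14, 18):
        print("timeout=%d -> %.1f ms" % (t, qp_timeout_usec(t) / 1000.0))
```

With the default value of 14, the table's formula gives 65536 usec, i.e. roughly 65.5 ms.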
87. result of a target or network configuration error, and recovery troubleshooting is out of the scope of this document.
    [Screenshot: drive selection screen with tabs Basic Devices, Firmware RAID, Multipath Devices, Other SAN Devices; 1 device(s) (2048 MB) selected out of 1 device(s) (2048 MB) total]
    Step 11. Click Next.
    Step 12. Select Fresh Installation and proceed with the Installation.
    [Screenshot: "At least one existing installation has been detected on your system" dialog; options Fresh Installation / Upgrade an Existing Installation]
    Step 13. Select the Use All Space option.
    [Screenshot: installation type selection; Use All Space
88. settings are going to be hardcoded into this module and any modification done to them will require you to recompile the module after changing sanbootnchap ipxe root sqa030 cd ipxe src root sqa030 src vim sanbootnchap ipxe ipxe dhcp dhcp exit set username joe set password secret echo root path isset root path exit echo Booting from iSCSI tgt sanboot no describe root path Compile the undionly kpxe module For more information refer to http www ipxe org howto chainloading root8sqa030 src make bin undionly kpxe EMBED sanbootnchap ipxe Copy the bin undionly kpxe to your TFTP root directory E G var lib tftpboot Edit the client s host declaration in the DHCP configuration file for chain loading undionly kpxe The outcome of this procedure is to have FlexBoot download undionly kpxe to the client s RAM and then have undionly kpxe authenticate iSCSI and login with the iSCSI target host sqa070 next server 12 7 6 30 if option client system architecture 00 00 filename undionly kpxe fixed address 12 7 60 70 hardware ethernet 00 02 c9 32 e8 80 if exists user class and option user class iPXE QoirLen oota Vues o1 7 0 3052 331mmg201 3 10 qalab com sqa030 prt9 filename 274 Mellanox Technologies m 2 2 1 0 1 For CHAP users All the CHAP authentication lines mentioned as comments in the iSCSI target and initiator configuration examples in sect
89. sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,0
    VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15, or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.
    Note that the same VL may be listed multiple times in the High or Low priority arbitration tables, and, further, it can be listed in both tables.
    The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes.
    A high_limit value of 255 indicates that the byte limit is unbounded. If the 255 value is used, the low-priority VLs may be starved.
    A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.
    Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to achieve eff
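The credit arithmetic in the paragraphs above (64-byte credits, and a high limit counted in 4K-byte units) can be sketched as follows. The function and constant names are hypothetical, chosen only for illustration.

```python
CREDIT_BYTES = 64            # one VL arbitration weight unit (credit)
HIGH_LIMIT_UNIT_BYTES = 4096  # high_limit is counted in 4K-byte units


def credits_for_packet(packet_bytes):
    """Number of 64-byte credits needed to send one packet (rounded up)."""
    return -(-packet_bytes // CREDIT_BYTES)


def high_limit_bytes(high_limit):
    """Byte budget of the high-priority table before low priority gets a turn.

    Returns None for 255 (unbounded) and 0 for a high_limit of 0, which means
    only a single high-priority packet may be sent.
    """
    if high_limit == 255:
        return None
    return high_limit * HIGH_LIMIT_UNIT_BYTES


if __name__ == "__main__":
    print("4KB MTU packet:", credits_for_packet(4096), "credits")
    print("high_limit=4:", high_limit_bytes(4), "bytes")
```

This reproduces the figure in the text: a single packet at 4KB MTU needs 64 credits, so a weight smaller than 64 cannot cover even one full-size packet per arbitration turn.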
90. statistics_cycle seconds. Default: 0. When the value is set to 0, no statistics are collected.
    9 InfiniBand Fabric Utilities
    This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.
    9.1 Common Configuration, Interface and Addressing
    Topology File (Optional)
    An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.
    For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).
    To specify a topology file to a diagnostic tool, use one of the following two options:
    1. On the command line, specify the file name using the option '-t <topology file name>'
    2. Define the environment variable IBDIAG_TOPO_FILE
    To spec
91. string that would be passed to the plugin(s)
    event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
    2. Run Subnet Manager with the new options file: opensm -F <options-file-name>
    The AR Manager options file contains two types of parameters:
    1. General options: options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric.
    2. Per-switch options: options which describe specific switch behavior.
    Note the following:
    - The Adaptive Routing configuration file is case sensitive.
    - You can specify options for a nonexistent switch GUID. These options will be ignored until a switch with a matching GUID is added to the fabric.
    - The Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is executed at every heavy sweep of the Subnet Manager.
    - If the AR Manager fails to parse the options file, default settings for all the options will be used.
    8.8.5.1 General AR Manager Options
    Table 16 - Adaptive Routing Manager Options File
    Option File: Description (Values)
    ENABLE <true|false>: (Default: true) Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as a device that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was updated to support the AR, the AR Ma
92. supported devices 32) (string)
    1. In the current version, this parameter uses a decimal number to describe the InfiniBand device, and not a hexadecimal number as in previous versions, in order to make the mapping of device function numbers to InfiniBand device numbers uniform with other module parameters (e.g. num_vfs and probe_vf). For example, to map mlx4_15 to device function number 04:00.0 in the current version, we use "options mlx4_ib dev_assign_str=04:00.0-15", as opposed to the previous version, in which we used "options mlx4_ib dev_assign_str=04:00.0-f".
    B.2 mlx4_core Parameters
    set_4k_mtu: (Obsolete) attempt to set 4K MTU to all ConnectX ports (int)
    debug_level: Enable debug tracing if > 0 (int)
    msi_x: 0 - don't use MSI-X, 1 - use MSI-X, >1 - limit number of MSI-X irqs to msi_x (non-SRIOV only) (int)
    enable_sys_tune: Tune the cpu's for better performance (default 0) (int)
    block_loopback: Block multicast loopback packets if > 0 (default: 1) (int)
    num_vfs: Either a single value (e.g. '5') to define a uniform num_vfs value for all device functions, or a string to map device function numbers to their num_vfs values (e.g. '0000:04:00.0-5,002b:1c:0b.a-15'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the num_vfs value (e.g. 15). (string)
    probe_vf: Either a single value (e.g. '3') to indicate that the Hypervisor driver itself should activate this number of V
93. the tc_wrap.py tool, which gets a list of <= 16 comma-separated UPs and maps the sk_prio to the specified UP. For example, tc_wrap.py -i eth0 -u 1,5 maps sk_prio 0 of the eth0 device to UP 1 and sk_prio 1 to UP 5.
    - Setting set_egress_map in VLAN maps the skb_priority of the VLAN to a vlan_qos. The vlan_qos represents a UP for the VLAN device.
    - In RoCE, rdma_set_option with RDMA_OPTION_ID_TOS could be used to set the UP.
    - When creating QPs, the sl field in the ibv_modify_qp command represents the UP.
    Indicating the TC
    After mapping the skb_priority to a UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that, mlnx_qos should be used. mlnx_qos gets a list of a mapping between UPs and TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1.
    4.5.7 Quality of Service Properties
    The different QoS properties that can be assigned to a TC are:
    - Strict Priority (see "Strict Priority")
    - Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
    - Rate Limit (see "Rate Limit")
    4.5.7.1 Strict Priority
    When setting a TC's transmission algorithm to be 'strict', this TC has absolute (strict) priority over other TC strict priorities coming before it (as determined by the TC number: TC 7 is highest prior
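The UP-to-TC list passed to mlnx_qos in the example above is positional: entry i is the TC assigned to UP i. The sketch below only interprets such a list and does not call mlnx_qos; the function names are hypothetical, and it assumes exactly eight UPs (0-7), as in the example.

```python
def up_to_tc_map(spec):
    """Turn a comma-separated TC list into {UP: TC}; position i is UP i."""
    tcs = [int(x) for x in spec.split(",")]
    if len(tcs) != 8:
        raise ValueError("expected 8 entries, one per UP (0-7)")
    return dict(enumerate(tcs))


def ups_by_tc(spec):
    """Group UPs by the traffic class they are mapped to."""
    groups = {}
    for up, tc in up_to_tc_map(spec).items():
        groups.setdefault(tc, []).append(up)
    return groups


if __name__ == "__main__":
    print(ups_by_tc("0,0,0,0,1,1,1,1"))
```

For the list 0,0,0,0,1,1,1,1 this yields UPs 0-3 on TC0 and UPs 4-7 on TC1, as stated in the text.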
94. the hardware for time is done via the ibv_exp_query_values verb. For example:
      ret = ibv_exp_query_values(context, IBV_EXP_VALUES_HW_CLOCK, &queried_values);
      if (!ret && queried_values.comp_mask & IBV_EXP_VALUES_HW_CLOCK)
          queried_time = queried_values.hwclock;
    To query the time in nanosecond resolution, use the IBV_EXP_VALUES_HW_CLOCK_NS flag along with the hwclock_ns field:
      ret = ibv_exp_query_values(context, IBV_EXP_VALUES_HW_CLOCK_NS, &queried_values);
      if (!ret && queried_values.comp_mask & IBV_EXP_VALUES_HW_CLOCK_NS)
          queried_time_ns = queried_values.hwclock_ns;
    Querying the Hardware Time is available only on physical functions / native machines.
    4.7 Ethernet VXLAN
    4.7.1 Prerequisites
    - HCA: ConnectX-3 Pro
    - Firmware 2.31.5020 or higher
    - RHEL7, Ubuntu 14.04 or upstream kernel 3.12.10 (or higher)
    - DMFS enabled
    4.7.2 Enabling VXLAN
    To enable the VXLAN offloads support, load the mlx4_core driver with Device-Managed Flow Steering (DMFS) enabled. The best approach is to create an /etc/modprobe.d/mlx4.conf file with the following contents:
      # enable DMFS -> enable VXLAN offloads on CX3-pro
      options mlx4_core log_num_mgm_entry_size=-1
    To verify that VXLAN is enabled, check that the Ethernet net-device created by the mlx4_en driver advertises the NETIF_F_GSO_UDP_TUNNEL feature, which can be seen with: ethtool -k $DEV | grep udp
95. the system (i.e. mthca0).
      [root@lab104]# ibsrpdm -c -d /dev/infiniband/umad0
      id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
      [root@lab104]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
    OR
    You can edit /etc/infiniband/openib.conf to load the SRP driver and the SRP High Availability (HA) daemon automatically, that is, set "SRP_LOAD=yes" and "SRPHA_ENABLE=yes".
    To set up and use the HA feature, you need the dm-multipath driver and multipath tool.
    Please refer to OFED-1.x SRP's user manual for more detailed instructions on how to enable/use the HA feature.
    The following is an example of an SRP Target setup file:
      ********************* srpt.sh *********************
      #!/bin/sh
      modprobe scst scst_threads=1
      modprobe scst_vdisk scst_vdisk_ID=100
      echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
      echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
      echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
      echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
      echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
      echo "add vdisk1 1
96. torus (i.e. in the absence of failed fabric switches), torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus.
    Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]:
    [Figure: 2D 6x5 torus grid, x = 0..5, y = 0..4; row y = 1 contains switches m, S, n, T, o, p; switches r and D lie on rows y = 2 and y = 3]
    For a pristine fabric, the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring.
    One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabr
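The "long way around" behaviour on a single 1D ring can be illustrated with a short sketch. This is not the torus-2QoS implementation and ignores SL/VL assignment entirely; it only walks a ring in both directions and keeps the shortest path that avoids failed links. The function name and the link representation are assumptions made for the example.

```python
def ring_path(ring, src, dst, failed_links=()):
    """Shortest surviving path from src to dst on a 1D ring, or None.

    ring is the list of switches in ring order (the last wraps to the first);
    failed_links is an iterable of (a, b) pairs of adjacent switches.
    """
    failed = {frozenset(link) for link in failed_links}
    n = len(ring)
    start = ring.index(src)
    candidates = []
    for step in (1, -1):               # try both directions around the ring
        path, i, usable = [src], start, True
        while ring[i] != dst:
            j = (i + step) % n
            if frozenset((ring[i], ring[j])) in failed:
                usable = False         # this direction is cut by a failed link
                break
            path.append(ring[j])
            i = j
        if usable:
            candidates.append(path)
    return min(candidates, key=len) if candidates else None


if __name__ == "__main__":
    ring = ["m", "S", "n", "T", "o", "p"]
    print(ring_path(ring, "S", "T"))
    print(ring_path(ring, "S", "T", [("S", "n")]))
    print(ring_path(ring, "S", "T", [("n", "T"), ("T", "o")]))
```

On the ring m-S-n-T-o-p-m from the text, a failed S-n link turns the S to T segment into S-m-p-o-T, and failing both n-T and T-o leaves T unreachable, which corresponds to the case the text says torus-2QoS reports and refuses to route.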
97. using the new API. The new verbs to be used are:
    - ibv_open_xrcd / ibv_close_xrcd
    - ibv_create_srq_ex
    - ibv_get_srq_num
    - ibv_create_qp_ex
    - ibv_open_qp
    Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their appropriate man pages.
    4.13 Flow Steering
    Flow Steering is applicable to the mlx4 driver only.
    Flow steering is a new model which steers network flows, based on flow specifications, to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP, and a priority. Flow steering rules may be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses a different terminology from the flow attribute (ibv_exp_flow_attr), defined by a combination of specifications (struct ibv_exp_flow_spec).
    4.13.1 Enable/Disable Flow Steering
    Flow Steering is disabled by default, and regular L2 steering is performed instead (B0 Steering). When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the flow steering table for the guest/master.
    To enable Flow Steering:
    Step 1. Open the /etc/modprobe.d/mlnx.conf file.
    Step 2. S
98. values. At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
    Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.
    8.5.3.1 UPDN Algorithm Usage
    Activation through OpenSM:
    - Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
    - Use '-a <root_guid_file>' for adding an UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.
    Notes on the guid list file:
    1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
    2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
    8.5.4 Fat-tree Routing Algorithm
    The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K
99. 0 0000 admin assigned
Once sm_guid_assign is set to 0, the driver works in admin-assigned mode, which results in a valid value only for GUID entry 0 of the PF. All other entries under the port will be 0, and the corresponding admin_guids will be ffffffffffffffff.
Port Up/Down
When moving to multi-guid mode, the port is assumed to be up when the base GUID index per VF/PF (entry 0) has a valid value. Setting other GUID entries does not affect the port status. However, any change in a GUID will cause a GUID change event for its VF/PF, even if it is not the base one.
Persistency Support
Once an admin request is rejected by the SM, a retry mechanism is set. The retry time is set to 1 second, and for each retry it is multiplied by 2 until reaching the maximum value of 60 seconds. Additionally, when looking for the next record to be updated, the record having the lowest time to be executed is chosen. Any value reset via the admin_guid interface is executed immediately and resets that entry's timer.
4.14.6.2.4 Partitioning IPoIB Communication using PKeys
PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a virtual pkey index other than zero. The below describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 o
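The retry timing described under Persistency Support above (start at 1 second, double on each retry, cap at 60 seconds) can be sketched as a short loop. `retry_delays` is a hypothetical helper written for illustration; it only prints the schedule and does not talk to the driver.

```shell
# Print the first N retry delays in seconds:
# start at 1, double after each retry, never exceed 60.
retry_delays() {
    count=$1
    delay=1
    out=""
    i=0
    while [ "$i" -lt "$count" ]; do
        out="${out:+$out }$delay"
        delay=$((delay * 2))
        if [ "$delay" -gt 60 ]; then
            delay=60
        fi
        i=$((i + 1))
    done
    echo "$out"
}

retry_delays 8   # prints: 1 2 4 8 16 32 60 60
```

The cap is reached on the seventh attempt, after which every retry waits the full 60 seconds.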
100. 20 Congestion Control Manager CA Options File
Option File | Description | Values
ca_control_map | An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified. | Values: 0xffff
ccti_increase | Sets the CC Table Index (CCTI) increase. | Default: 1
trigger_threshold | Sets the trigger threshold. | Default: 2
ccti_min | Sets the CC Table Index (CCTI) minimum. | Default: 0
cct | Sets all the CC table entries to a specified value. The first entry will remain 0, whereas the last value will be set to the rest of the table. | Values: <comma-separated value list>. Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
ccti_timer | Sets for all SLs the given ccti timer. | Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
Table 21 Congestion Control Manager CC MGR Options File
Option File | Description | Values
max_errors, error_window | When the number of errors (send/receive errors or timeouts) exceeds max_errors in less than error_window seconds, the CC MGR will abort and will allow OpenSM to proceed. | Values: max_errors = 0: zero tolerance, abort configuration on first error. error_window = 0: mechanism disabled, no error checking. Default: 5
cc_statistics_cycle | Enables CC MGR to collect statistics from all nodes every cc
101. Step 5. Click on the Configure iSCSI Disks button.
Step 6. Choose the Connected Targets tab.
Step 7. Click Add.
Step 8. Enter the IP address of the iSCSI storage target.
Step 9. Click Next.
Step 10. Select the relevant target from the table. In ou
102. % mpirun -mca mtl_mxm_np 0 <other mpirun parameters>
5.3.3 Tuning MXM Settings
The default MXM settings are already optimized. To check the available MXM parameters and their default values, run the /opt/mellanox/mxm/bin/mxm_dump_config utility, which is part of the MXM RPM.
MXM parameters can be modified using one of the following methods:
• Modifying the default MXM parameter value as part of the mpirun command:
% mpirun -x MXM_UD_RX_MAX_BUFFERS=128000 <...>
• Modifying the default MXM parameter value from the shell:
% export MXM_UD_RX_MAX_BUFFERS=128000
% mpirun <...>
5.3.4 Configuring Multi-Rail Support
Multi-Rail support enables the user to use more than one of the active ports on the card, making better use of the resources. It provides a combined throughput among the used ports.
To configure dual rail support, specify the list of ports you would like to use to enable multi-rail support:
-x MXM_RDMA_PORTS=cardName:portNum
or
-x MXM_IB_PORTS=cardName:portNum
5.3.5 Configuring MXM over the Ethernet Fabric
To configure MXM over the Ethernet fabric:
Step 1. Make sure the Ethernet port is active:
ibv_devinfo
ibv_devinfo displays the list of cards and ports in the system. Make sure in the ibv_devinfo output that the desired port has Ethernet in the Link layer field and that its state is PORT_ACTIVE.
Step 2. Specify the ports y
103. 902fffff00a MT47396 Infiniscale III Mellanox Technologies
Lid Out Destination Port Info
0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies
0x0006 007 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1
0x0007 021 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1
3 valid lids dumped
4. Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016:
ibroute -G 0x000b8cffff004016
Unicast lids [0x0-0x8] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale III Mellanox Technologies):
Lid Out Destination Port Info
0x0002 023 Switch portguid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies
0x0003 000 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies
0x0006 023 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1
0x0007 020 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1
0x0008 024 Channel Adapter portguid 0x0002c902002582cd sw136 HCA 1
5 valid lids dumped
5. Dump all non-empty mlids of switch with Lid 3:
ibroute -M 3
Multicast mlids [0xc000-0xc3ff] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale III Mellanox Technologies):
MLid (per-port columns not recoverable): 0xc000 0xc001 0xc002 0xc003 0xc020 0xc021 0xc022 0xc023 0xc024 0xc040 0xc041 0xc042
12 valid mlids dumped
104. A Al bin burn
Step 4. Reboot your machine after the firmware burning is completed.
2.5 Installing MLNX_OFED using YUM
2.5.1 Setting up MLNX_OFED YUM Repository
Step 1. Log into the installation machine as root.
Step 2. Mount the ISO image on your machine:
mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt
You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 3. Download and install the Mellanox Technologies GPG-KEY. The key can be downloaded via the following link: http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
--2014-04-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Resolving www.mellanox.com... 72.3.194.0
Connecting to www.mellanox.com|72.3.194.0|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1354 (1.3K) [text/plain]
Saving to: RPM-GPG-KEY-Mellanox
2014-04-20 13:52:30 (247 MB/s) - RPM-GPG-KEY-Mellanox saved [1354/1354]
Step 4. Install the key:
sudo rpm --import RPM-GPG-KEY-Mellanox
warning: rpmts_HdrFromFdno: Header V3 DSA/SHA1 Signature, key ID 6224c050: NOKEY
Retrieving key from file:///repos/MLNX_OFED/<MLNX_OFED file>/RPM-GPG-KEY-Mellanox
Importing GPG key 0x6224C050:
Userid: Mellanox Technologies (Mellanox Technologies - Signing Key v2) <support@mel
105. C address for every virtual function is configured randomly; therefore, it is not necessary to add it.
4.14.5 Uninstalling SR-IOV Driver
To uninstall the SR-IOV driver, perform the following:
Step 1. For Hypervisors, detach all the Virtual Functions (VF) from all the Virtual Machines (VM), or stop the Virtual Machines that use the Virtual Functions. Be aware that stopping the driver while VMs are still using the VFs will cause the machine to hang.
Step 2. Run the script below. Be aware that uninstalling the driver deletes all of the driver's files but does not unload the driver.
root@swl022 # /usr/sbin/ofed_uninstall.sh
This program will uninstall all OFED packages on your machine.
Do you want to continue? [y/N]: y
Running /usr/sbin/vendor_pre_uninstall.sh
Removing OFED Software installations
Running /bin/rpm -e --allmatches kernel-ib kernel-ib-devel libibverbs libibverbs-devel libibverbs-devel-static libibverbs-utils libmlx4 libmlx4-devel libibcm libibcm-devel libibumad libibumad-devel libibumad-static libibmad libibmad-devel libibmad-static librdmacm librdmacm-utils librdmacm-devel ibacm opensm-libs opensm-devel perftest compat-dapl compat-dapl-devel dapl dapl-devel dapl-devel-static dapl-utils srptools infiniband-diags-guest ofed-scripts opensm-devel
warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
Running /tmp/2818-ofed_vendor_post_uninstall.sh
Step 3. Restart the server.
4.14.6 Configuring Pke
106. COVER Si aaa syran slet NOS 0 E NS o RD OUT 0 REVEBROES elec 0 RGWREMOCARIN SIONS MM 0 RE S ooscoscapbtancaos 0 AMEN TSE AROS docto ETT 0 FE CORSETS TECOS j eie eo 00000000 0 Eye onssnante DOCS 000 o0 0 simile eme y Tat OT 000000000 0 218 Mellanox Technologies rev 2 2 1 0 1 CR UY SAS RO IMMUNE 0 SMA OS au ado un 0 ao roma dao 55178210 RD tc ass otis nee treet 55174680 A GG Sista marae dS Lc Nc ies e 766366 ROVPJGES ey teres n te a DOM Eo 766315 2 Read performance counters from LID 2 all ports gt smpquery a 2 Port counters Lid 2 port 255 Poroeleot to A frakten DOS Counters Ee Pee 0x0100 SYMONE PLORS a E E A 65535 RC CO Ne I IER TO 259 DomkDownedseeenetd oa de as 16 REO e eet dM Te ees 657 RevRemotePhysErrors 0 REIR 70 ES dan Uan OO ao Ao 488 ME SLINE ONES Ro oo oo 92 69 0 ECOS TREN sos emn eda o6 0 Toren ise estoy BITS OG SPP Ue 0 HXCBUO SAI DS AEA 0 Vii EODD edet US 0 Moa US E de GA 129840354 Revalbace E CIE EE 129529906 AAA M a eee S 1803332 AS T Nue FU Ao M ER 1799018 3 Read then reset performance counters from LID 2 port 1 gt perfquery r21 Port counters Lid 2 port 1 POTES OSC tern ei i COUNSELS SLC Ct ee a AA A e 0x0100 SIT OMISIT SIEHE MS 0 TNKRECOVEES Seta ee 0 O a 0 NN ne t ato da e 0 RevRemotePhysErrors 0 GW SINS EMCO PT UE 0 MIME DAS CANS ee M AU E ER S ES 3 MUCOSA B eroe ouo tenere 0 ROCOSAS S roo aoro a orn e 0 Ia INE BIOS
107. Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false).
Enable workarounds for Topspin/Cisco SRP target bugs.
Time between successive reconnect attempts. Time between successive reconnect attempts of the SRP initiator to a disconnected target until the dev_loss_tmo timer expires (if enabled); after that, the SCSI target will be removed.
Number of seconds between the observation of a transport layer error and failing all I/O. Increasing this timeout allows more tolerance to transport errors; however, doing so increases the total failover time in case of a serious transport failure. Note: the fast_io_fail_tmo value must be smaller than the value of reconnect_delay.
Maximum number of seconds that the SRP transport should insulate transport layer errors. After this time has been exceeded, the SCSI target is removed. Normally it is advised to set this to -1 (disabled), which will never remove the scsi_host. In deployments where different SRP targets are connected and disconnected frequently, it may be required to enable this timeout in order to clean old scsi_hosts representing targets that no longer exist.
Constraints between parameters:
• dev_loss_tmo, fast_io_fail_tmo, and reconnect_delay cannot all be disabled or negative values.
• reconnect_delay must be a positive number.
• fast_io_fail_tmo must be smaller than the SCSI block device timeout.
• fast_io_fail_tmo must be smaller than
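The complete constraints listed above can be checked mechanically before the values are applied. The sketch below covers only the three constraints that are fully legible in this excerpt (the last one is cut off and is left out). It assumes a negative number stands for "disabled", and `check_srp_timeouts` together with the sample values are hypothetical, not part of the SRP tools.

```shell
# Return 0 and print "ok" when the given timeouts satisfy the listed constraints.
# Arguments: reconnect_delay fast_io_fail_tmo dev_loss_tmo scsi_block_timeout
# Assumption: a negative number means the timeout is disabled.
check_srp_timeouts() {
    reconnect_delay=$1
    fast_io_fail_tmo=$2
    dev_loss_tmo=$3
    scsi_timeout=$4

    # The three timeouts cannot all be disabled or negative.
    if [ "$reconnect_delay" -lt 0 ] && [ "$fast_io_fail_tmo" -lt 0 ] && [ "$dev_loss_tmo" -lt 0 ]; then
        echo "all three timeouts are disabled"
        return 1
    fi
    # reconnect_delay must be a positive number.
    if [ "$reconnect_delay" -le 0 ]; then
        echo "reconnect_delay must be positive"
        return 1
    fi
    # fast_io_fail_tmo must be smaller than the SCSI block device timeout.
    if [ "$fast_io_fail_tmo" -ge "$scsi_timeout" ]; then
        echo "fast_io_fail_tmo must be smaller than the SCSI block device timeout"
        return 1
    fi
    echo "ok"
}

# Hypothetical values: reconnect every 10 s, fail I/O after 15 s,
# dev_loss_tmo disabled, SCSI block device timeout of 30 s.
check_srp_timeouts 10 15 -1 30
```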
108. E false
8.9 Congestion Control
8.9.1 Congestion Control Overview
Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation.
The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).
8.9.2 Running OpenSM with Congestion Control Manager
Congestion Control (CC) Manager can be enabled/disabled through the SM options file. To do so, perform the following:
1. Create the file. Run:
opensm -c <options-file-name>
2. Find the event_plugin_name option in the file and add ccmgr to it:
# Event plugin name(s)
event_plugin_name ccmgr
3. Run the SM with the new options file:
opensm -F <options-file-name>
Once Congestion Control is enabled on the fabric nodes, you will need to actively turn it off to disable it completely. Running the SM without the CC Manager is not sufficient, as the hardware st
109. Family [ConnectX-3]
Link Width: 8x
PCI Link Speed: 5Gb/s
Installation finished successfully.
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...
Device #1:
Device Type: ConnectX3Pro
Part Number: MCX354A-FCC_Ax
Description: ConnectX-3 Pro VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6
PSID: MT_1090111019
PCI Device Name: 0000:05:00.0
Versions: Current / Available
FW: 2.30.8000 / 2.31.5000
PXE: 3.4.0224 / 3.4.0224
Status: Update required
Found 1 device(s) requiring firmware update...
Device #1: Updating FW ... Done
A restart is needed for updates to take effect.
Log File: /tmp/MLNX_OFED_LINUX-2.2-0.0.9.10694.logs/fw_update.log
To load the new driver, run:
/etc/init.d/openibd restart
In case your machine has the latest firmware, no firmware update will occur, and the installation script will print, at the end of the installation, a message similar to the following:
Device #1:
Device Type: ConnectX3Pro
Part Number: MCX354A-FCC_Ax
Description: ConnectX-3 Pro VPI adapter card; dual-port QSFP; FDR IB (56G
110. Fs for each HCA on the host, or a string to map device function numbers to their probe_vf values (e.g. '0000:04:00.0-3,002b:1c:0b.a-13'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the probe_vf value (e.g. 13). (string)
Parameter names: log_num_mgm_entry_size, high_rate_steer, fast_drop, enable_64b_cqe_eqe, log_num_mac, log_num_vlan, log_mtts_per_seg, port_type_array, log_num_qp, log_num_srq, log_rdmarc_per_qp, log_num_cq, log_num_mcg, log_num_mpt, log_num_mtt, enable_qos, internal_err_reset
B.3 mlx4_en Parameters: inline_thold, udp_rss, pfctx, pfcrx
log_num_mgm_entry_size: log mgm size, that defines the number of QPs per MCG; for example, 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12. To activate device-managed flow steering when available, set to -1. (int)
high_rate_steer: Enable steering mode for higher packet rate (default off). (int)
fast_drop: Enable fast packet drop when no receive WQEs are posted. (int)
enable_64b_cqe_eqe: Enable 64-byte CQEs/EQEs when the FW supports this, if non-zero (default: 1). (int)
log_num_mac: Log2 max number of MACs per ETH port (1-7). (int)
log_num_vlan: (Obsolete) Log2 max number of VLANs per ETH port (0-7). (int)
log_mtts_per_seg: Log2 number of MTT entries per segment (0-7) (default: 0). (int)
port_type_array: Either pair of values (e.g. '1,2') to define uniform port1/port2 types configuration for all devices functions, or a string to map device function numbers to their pair of por
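The log_num_mgm_entry_size example above (10 gives 248) is consistent with an entry of 2^n bytes that holds four QPs per 16-byte unit, minus two units of header. That formula is an assumption taken from the upstream mlx4 driver source rather than something this manual states, and `qps_per_mcg` is a hypothetical helper.

```shell
# Number of QPs that fit in one multicast group entry of 2^log_size bytes.
# Assumed formula (from the upstream mlx4 driver): 4 * (entry_size / 16 - 2)
qps_per_mcg() {
    log_size=$1
    echo $(( 4 * ((1 << log_size) / 16 - 2) ))
}

qps_per_mcg 10   # prints 248, matching the example in the parameter description
```

Under the same assumption, the two ends of the documented range, 7 and 12, give 24 and 1016 QPs per group.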
111. IB subinterface>/create_child
Example:
host1$ echo 1 > /sys/class/net/ib0/create_child
This will create the interface ib0.8001.
Step 3. Verify the configuration of this interface by running:
host1$ ifconfig <subinterface>.<subinterface PKey>
Using the example of Step 2:
host1$ ifconfig ib0.8001
ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, follow the manual configuration procedure described in Section 4.3.3.3.
Step 5. To be able to use this interface, the Subnet Manager must be configured so that the chosen PKey, which defines a broadcast address, is recognized (see Chapter 8, "OpenSM Subnet Manager").
4.3.4.2 Removing a Subinterface
To remove a child interface (subinterface), run:
echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child
Using the example of Step 2:
echo 0x8001 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g. 0x8000 in the example abov
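The naming in the example above (creating a child with PKey 1 yields ib0.8001, and deletion uses the PKey value with the most significant bit set) can be reproduced with a little arithmetic. This is a sketch that assumes a 16-bit PKey; `child_ifname` is a hypothetical helper, not part of Mellanox OFED.

```shell
# Print the child interface name for a parent interface and a PKey.
# The most significant bit (0x8000) of the 16-bit PKey is always set in the name.
child_ifname() {
    parent=$1
    pkey=$2
    printf '%s.%04x\n' "$parent" "$(( $pkey | 0x8000 ))"
}

child_ifname ib0 1        # prints ib0.8001
child_ifname ib0 0x8001   # prints ib0.8001
```

Both spellings of the PKey map to the same interface name, which is why the delete example uses 0x8001.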
112. IDs (Tool option '-l'): In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
Using port names defined in the topology file (Tool option '-n'): This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the '-l' option.
9.4 Diagnostic Utilities
The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric.
9.4.1 ibdiagnet (of ibutils2) - IB Net Diagnostic
This version of ibdiagnet is included in the ibutils2 package, and it is run by default after installing Mellanox OFED. To use this ibdiagnet version, run: ibdiagnet. Please see ibutils2_release_notes.txt for additional information and known issues.
ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
Synopsis: i dev
113. LAN: N/A
In the example above, the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no VLAN tag for that interface.
4.9.3 VLAN Configuration Over an eIPoIB Interface
The eIPoIB driver supports VLAN Switch Tagging (VST) mode, which enables the virtual machine interface to have no VLAN tag over it, thus allowing VLAN tagging to be handled by the Hypervisor.
To attach a Virtual Machine interface to a specific isolated tag:
Step 1. Verify that the VLAN tag to be used has the same pkey value that is already configured on that ib port:
cat /sys/class/infiniband/mlx4_0/ports/<ib port>/pkeys/*
Step 2. Create a VLAN interface in the Hypervisor, over the eIPoIB interface:
vconfig add <eIPoIB interface> <vlan tag>
Step 3. Attach the new VLAN interface to the same bridge that the virtual machine interface is already attached to:
brctl addif <br-name> <interface-name>
For example, to create the VLAN tag 3 with pkey 0x8003 over that port in the eIPoIB interface eth4, run:
vconfig add eth4 3
brctl addif br2 eth4.3
4.9.4 Setting Performance Tuning
• Use 4K MTU over OpenSM. For further information, please refer to Section 8.4.1, "File Format", on page 148.
Default=0xffff, ipoib, mtu=5 : ALL=full;
• Use MTU for 4K (4092 Bytes). In UD mode, the maximum MTU value is 4092 Bytes. Make sure that all interfaces includi
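The mtu=5 setting and the 4092-byte figure above are related. In the InfiniBand MTU encoding, codes 1 to 5 stand for 256, 512, 1024, 2048 and 4096 bytes, and IPoIB in UD mode loses 4 bytes to its encapsulation header. The helper names below are hypothetical and the snippet is only a worked calculation, not a configuration tool.

```shell
# Convert an InfiniBand MTU code (1..5) to bytes: 1 -> 256, ..., 5 -> 4096.
ib_mtu_bytes() {
    echo $(( 128 << $1 ))
}

# IPoIB UD MTU: the IB MTU minus the 4-byte IPoIB encapsulation header.
ipoib_ud_mtu() {
    echo $(( (128 << $1) - 4 ))
}

ipoib_ud_mtu 5   # prints 4092, the 4K value mentioned above
ipoib_ud_mtu 4   # prints 2044, the usual value with a 2K IB MTU
```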
114. LP and passes this ID and optional QoS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.
PathRecord and MultiPathRecord Enhancement for QoS
As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided. A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for existing access layers and ULPs by not introducing a new set of PR/MPR attributes.
4.4.3 Supported Policy
The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:
I. Port Group
A set of CAs, Routers or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.
II. Fabric Setup
Defines how the SL2VL and VLArb tables should be set up. In OFED, this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).
III. QoS Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and, optionally, Max MTU, Max Rate, Packet Lifetime and Path Bits. Path Bits are not implemented in OFED.
IV. M
115. LT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Level, it can also be explicitly referred to by any match rule.
IV. QoS Matching Rules (denoted by qos-match-rules)
Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, such that the first match takes precedence. Each rule has the name of the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule.
Queries can be matched by:
• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for the destination port)
• PKey
• QoS class
• Service ID
To match a certain matching rule, a PR/MPR query has to match ALL of the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only a Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.
8.6.3 Simple QoS Policy Definition
Simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the adv
116. M
Preparing mpi-selector
Installing user level RPMs:
Preparing ofed-scripts
Preparing libibverbs
Preparing libibverbs-devel
Preparing libibverbs-devel-static
Preparing libibverbs-utils
Preparing libmlx4
Preparing libmlx4-devel
Preparing libmlx5
Preparing libmlx5-devel
Preparing libibcm
Preparing libibcm-devel
Preparing libibumad
Preparing libibumad-devel
Preparing libibumad-static
Preparing libibmad
Preparing libibmad-devel
Preparing libibmad-static
Preparing ibsim
Preparing ibacm
Preparing librdmacm
Preparing librdmacm-utils
Preparing librdmacm-devel
Preparing opensm-libs
Preparing opensm
Preparing opensm-devel
Preparing opensm-static
117. Mellanox Technologies: Connect. Accelerate. Outperform.
Mellanox OFED for Linux User Manual
Rev 2.2-1.0.1
www.mellanox.com
NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY
118. Mforge net dag 2599
repolist: 8,351
2.5.2 Installing MLNX_OFED using the YUM Tool
After setting up the YUM repository for the MLNX_OFED package, perform the following:
Step 1. View the available package groups by invoking:
yum grouplist | grep MLNX_OFED
MLNX_OFED ALL
MLNX_OFED BASIC
MLNX_OFED GUEST
MLNX_OFED HYPERVISOR
MLNX_OFED VMA
MLNX_OFED VMA-ETH
MLNX_OFED VMA-VPI
Where:
MLNX_OFED ALL: Installs all available packages in MLNX_OFED.
MLNX_OFED BASIC: Installs basic packages required for running Mellanox cards.
MLNX_OFED GUEST: Installs packages required by guest OS.
MLNX_OFED HYPERVISOR: Installs packages required by hypervisor OS.
MLNX_OFED VMA: Installs packages required by VMA.
MLNX_OFED VMA-ETH: Installs packages required by VMA to work over Ethernet.
MLNX_OFED VMA-VPI: Installs packages required by VMA to support VPI.
Step 2. Install the desired group:
yum groupinstall 'MLNX_OFED ALL'
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Setting up Group Process
Resolving Dependencies
Running transaction check
Package ar_mgr.x86_64 0:1.0-0.11.g22fff4a will be installed
rds-devel.x86_64 6mlnx-1
rds-tools.x86_64 6mlnx-1
srptools.x86_64 0:0.0.4mlnx3-OFED.2.0.2.6.7.11.ge863cb7
Complete!
2.5.3 Upd
119. OF SUCH DAMAGE.
Mellanox Technologies, 350 Oakmead Parkway Suite 100, Sunnyvale, CA 94085, U.S.A. www.mellanox.com. Tel: (408) 970-3400. Fax: (408) 970-3403.
Mellanox Technologies, Ltd., Beit Mellanox, PO Box 586, Yokneam 20692, Israel. www.mellanox.com. Tel: +972 (0)74 723 7200. Fax: +972 (0)4 959 3245.
Copyright 2014 Mellanox Technologies. All Rights Reserved.
Mellanox, Mellanox logo, BridgeX, ConnectX, Connect-IB, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, MetroX, MLNX-OS, PhyX, ScalableHPC, SwitchX, UFM, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd. ExtendX, FabricIT, Mellanox Open Ethernet, Mellanox Virtual Modular Switch, MetroDX, TestX, Unbreakable-Link are trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners.
Document Number: 2877
Table of Contents
Table of Contents
List of Figures
List of Tables
Chapter 1 Mellanox OFED Overview 20
1.1 Introduction to Mellanox OFED 20
1.2 Mellanox OFED Package 20
1.2.1 ISO Image 20
1.2.2 Software Components 20
1.2.3 Firmware 21
120. Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University
Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.
1.3.6 InfiniBand Subnet Manager
All InfiniBand-compliant ULPs require a proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 8, "OpenSM Subnet Manager".
1.3.7 Diagnostic Utilities
Mellanox OFED includes the following two diagnostic packages for use by network and data center managers:
• ibutils: Mellanox Technologies diagnostic utilities
• infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools
1.3.8 Mellanox Firmware Tools
The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for:
• Generating a standard or customized Mellanox firmware image
• Querying for firmware information
• Burning a firmware image to a single InfiniBand node
MFT includes the following tools:
• mlxburn: provides the following functions:
  • Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format
  • Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device
  • Querying the firmware version loaded on an HCA board
121. PI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
• OpenSM: InfiniBand Subnet Manager
• Utilities: diagnostic tools, performance tests, and firmware tools (MFT)
• Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
• Documentation
1.2.3 Firmware
The ISO image includes the following firmware items:
• Firmware images (.mlx format) for ConnectX-3/ConnectX-3 Pro/Connect-IB network adapters
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX-3 HCA devices
1.2.4 Directory Structure
The ISO image of MLNX_OFED_LINUX contains the following files and directories:
• mlnxofedinstall: the MLNX_OFED_LINUX installation script
• ofed_uninstall.sh: the MLNX_OFED_LINUX un-installation script
• <RPMS folders>: directory of binary RPMs for a specific CPU architecture
• firmware/: directory of the Mellanox IB HCA firmware images (including Boot-over-IB)
• src/: directory of the OFED source tarball
• mlnx_add_kernel_support.sh: script required to rebuild MLNX_OFED_LINUX for a customized kernel version on a supported Linux distribution
• docs/: directory of Mellanox OFED related documentation
1.3 Architecture
Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protoco
122. Print out all results (default print summary only)
-i, --ib-port=<port> | Use port <port> of IB device (default 1)
-I, --inline_size=<size> | Max size of message to be sent in inline
-m, --mtu=<mtu> | MTU size: 256 - 4096 (default port mtu)
-M, --MGID=<multicast_gid> | In multicast, uses <multicast_gid> as the group MGID
-n, --iters=<iters> | Number of exchanges (at least 5, default 1000)
-p, --port=<port> | Listen on/connect to port <port> (default 18515)
Table 59 raw_ethernet_lat Flags and Options
Flag | Description
-r, --rx-depth=<dep> | Rx queue size (default 512). If using srq, rx-depth controls max-wr size of the srq
-R, --rdma_cm | Connect QPs with rdma_cm and run test on those QPs
-s, --size=<size> | Size of message to exchange (default 2)
-S, --sl=<sl> | SL (default 0)
-T, --tos=<tos value> | Set <tos value> to RDMA-CM QPs. Available only with the -R flag. Values 0-256 (default off)
-u, --qp-timeout=<timeout> | QP timeout; timeout value is 4 usec * 2^(timeout), default 14
-U, --report-unsorted | (implies -H) Print out unsorted results (default sorted)
-V, --version | Display version number
-x, --gid-index=<index> | Test uses GID with GID index (Default: IB - no gid, ETH - 0)
-z, --com_rdma_cm Additional Options
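The QP timeout formula in the table above (4 usec multiplied by 2 to the power of the timeout value) is easy to evaluate for a given setting. The sketch below follows the 4 usec figure exactly as the table states it; `qp_timeout_usec` is a hypothetical helper and not a perftest option.

```shell
# QP timeout in microseconds for a given timeout exponent: 4 usec * 2^timeout.
qp_timeout_usec() {
    echo $(( 4 * (1 << $1) ))
}

qp_timeout_usec 14   # prints 65536, i.e. about 65.5 ms for the default value
```

Each increment of the timeout value doubles the wait, so small changes to this flag have a large effect.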
123. S will detect and report when it has insufficient configuration for a torus with radix 4 dimensions.
In the event the torus is significantly degraded, i.e. there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e. the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.
8.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines. Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations. For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS confi
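The statement above that only the high-order bit of an SL value is honored for unicast QoS can be made concrete. An SL is a 4-bit value (0 to 15), so the high-order bit is bit 3. The helper below is a hypothetical illustration of that mapping and does not query OpenSM.

```shell
# Map a 4-bit SL value (0..15) to the QoS level torus-2QoS would honor
# for unicast traffic: only the high-order bit (bit 3) of the SL counts.
torus_qos_level() {
    echo $(( ($1 >> 3) & 1 ))
}

torus_qos_level 3   # prints 0: SLs 0-7 all fall in the first level
torus_qos_level 9   # prints 1: SLs 8-15 all fall in the second level
```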
124. Section 4.14.2 "Setting Up SR-IOV" on page 91
Section 5.3.1 "Compiling OpenMPI with MXM" on page 127
Section 5.3.2 "Enabling MXM in OpenMPI" on page 128
Section 5.3.4 "Configuring Multi-Rail Support" on page 129
Section 4.9.4 "Setting Performance Tuning" on page 84
Section 8.4.1 "File Format" on page 148
Appendix B "mlx4 Module Parameters" page 252
Section 4.1.2.2 "Manually Establishing an SRP Connection" on page 47
Section 4.1.2.3 "SRP Tools - ibsrpdm, srp_daemon and srpd Service Script" on page 49
Section 4.1.2.4 "Automatic Discovery and Connection to Targets" on page 51
Section 4.1.2.5 "Multiple Connections from Initiator InfiniBand Port to the Target" on page 52
Section 4.1.2.6 "High Availability (HA)" on page 52
Section 4.1.2.7 "Shutting Down SRP" on page 53
Section 4.17 "Ethtool" on page 111
Table 1 - Document Revision History
Release  Date  Description
2.0-3.0.0  October 2013  Removed section "Command Line Interface (CLI)". Updated the following sections: Appendix, page 254
August 2013  Updated the following sections:
Section 1.3.4 "ULPs" on page 24
Section 4.13 "Flow Steering" on page 87 and its subsections
Section 1.3.3 "Mid-layer Core" on page 24
Section 4.9 "Ethernet Tunneling Over IPoIB Driver (eIPoIB)" on page 80
Section 8 2 1 o
125. Server: ib_send_bw [options]
Client: ib_send_bw [options] <hostname>
Options
The table below lists the various flags of the command.
Table 42 - ib_send_bw Flags and Options
Flag  Description
-a, --all  Run sizes from 2 till 2^23
-b, --bidirectional  Measure bidirectional bandwidth (default: unidirectional)
-c, --connection=<RC/XRC/UC/UD/DC>  Connection type RC/XRC/UC/UD/DC (default RC)
-d, --ib-dev=<dev>  Use IB device <dev> (default: first device found)
-D, --duration  Run test for a customized period of seconds
-e, --events  Sleep on CQ events (default: poll)
-f, --margin  Measure results within margins (default 2 sec)
-F, --CPU-freq  Do not fail even if cpufreq_ondemand module is loaded
-g, --mcg=<num_of_qps>  Send messages to multicast group with <num_of_qps> QPs attached to it
-h, --help  Show this help screen
-i, --ib-port=<port>  Use port <port> of IB device (default 1)
-I, --inline_size=<size>  Max size of message to be sent inline
-l, --post_list=<list size>  Post list of WQEs of <list size> size (instead of single post)
-m, --mtu=<mtu>  MTU size: 256 - 4096 (default: port MTU)
-M, --MGID=<multicast_gid>  In multicast, uses <multicast_gid> as the group MGID
-n, --iters=<iters>  Number of exchanges (at least 5, defa
126. Service (ToS).
2. The ToS is translated into a Socket Priority (sk_prio).
3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).
4. The UP is mapped to TC by the network/system administrator.
5. TCs hold the actual QoS parameters.
QoS can be applied on the following types of traffic. However, the general QoS flow may vary among them:
Plain Ethernet - Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver
RoCE - Applications use the RDMA API to transmit using QPs
Raw Ethernet QP - Applications use the VERBs API to transmit using a Raw Ethernet QP
4.5.3 Plain Ethernet Quality of Service Mapping
Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver. The following is the Plain Ethernet QoS mapping flow:
1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
2. ToS is translated into the sk_prio using a fixed translation:
TOS 0 <=> sk_prio 0
TOS 8 <=> sk_prio 2
TOS 24 <=> sk_prio 4
TOS 16 <=> sk_prio 6
3. The Socket Priority is mapped to the UP:
If the underlying device is a VLAN device, egress_map is used, controlled by the vconfig command. This is a per-VLAN mapping.
If the underlying device is not a VLAN device, the tc command is used. In this case, even though tc manual
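The fixed ToS-to-sk_prio translation in step 2 of the Plain Ethernet flow can be pictured as a lookup table. The following is a minimal sketch in Python, not the kernel's implementation: the names `TOS_TO_SK_PRIO`, `sk_prio_for_tos` and `open_socket_with_tos` are hypothetical, and only the four ToS values quoted in the text are covered.

```python
import socket

# Fixed ToS -> sk_prio pairs quoted in the text (other ToS values are not covered).
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def sk_prio_for_tos(tos):
    """Return the socket priority for one of the ToS values listed above."""
    return TOS_TO_SK_PRIO[tos]


def open_socket_with_tos(tos):
    """Create a UDP socket and set its ToS, as an application does in step 1."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return sock
```

An application would call `open_socket_with_tos(16)` once; under the table above, its traffic would then carry sk_prio 6 before the UP mapping of step 3 is applied.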
127. [Figure: spine switch above switches N1, N2 and N3, with links going down to compute nodes]
To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a max_reverse_hops of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them. Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.
8.5.4.2 Activation through OpenSM
Use the '-R ftree' option to activate the fat-tree algorithm. LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.
8.5.5 LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology agnostic deadlock-free routing within communication networks. When computing the routing funct
128. The BFS tracks link direction (up or down) and avoids steps that will perform up after a down step was used.
2. Once MinHop matrices exist, each switch is visited and for each target LID a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned ports is selected. If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop
b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC group
c. If none, prefer those which go through another NodeGuid
d. Fall back to the number of paths method (if all go to the same node)
8.5.1 Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in the fabric switches unless the -r (--reassign_lids) option is specified.
-r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID. If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or
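The port-selection step described in item 2 (equal MinHop, then the least-loaded port) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not OpenSM code: the function names and data layout are hypothetical, ties between equally loaded ports are broken by port number (an assumption not stated in the text), and the extra LMC > 0 checks (a through d) are not modeled.

```python
def pick_port(min_hops, port_load):
    """Pick the egress port for one target LID.

    min_hops:  dict port -> hop count to the target LID through that port.
    port_load: dict port -> number of target LIDs already routed through it
               (the per-port counter from the text); updated in place.
    """
    best = min(min_hops.values())
    candidates = [port for port, hops in min_hops.items() if hops == best]
    # Least-loaded candidate wins; the port-number tie-break is an assumption.
    chosen = min(candidates, key=lambda port: (port_load.get(port, 0), port))
    port_load[chosen] = port_load.get(chosen, 0) + 1
    return chosen


def assign_ports(lid_min_hops):
    """Assign a port to every target LID on one switch, in dict order."""
    port_load = {}
    return {lid: pick_port(hops, port_load) for lid, hops in lid_min_hops.items()}
```

For example, with two LIDs reachable in two hops through ports 1 and 2, the first LID takes port 1 and the second takes port 2, so the load is spread across the equal-cost ports.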
129. U freq  Do not fail even if cpufreq_ondemand module is loaded
-h, --help  Show this help screen
-i, --ib-port=<port>  Use port <port> of IB device (default 1)
-l, --post_list=<list size>  Post list of WQEs of <list size> size (instead of single post)
-m, --mtu=<mtu>  MTU size: 256 - 4096 (default: port MTU)
-n, --iters=<iters>  Number of exchanges (at least 5, default 1000)
-N, --no peak-bw  Cancel peak-bw calculation (default: with peak)
-o, --outs=<num>  Number of outstanding read/atom (default: max of device)
-O, --dualport  Run test in dual-port mode
-p, --port=<port>  Listen on/connect to port <port> (default 18515)
-q, --qp=<num of qp's>  Number of QPs (default 1)
-Q, --cq-mod  Generate CQE only after <cq-mod> completions
Table 38 - ib_read_bw Flags and Options
Flag  Description
-R, --rdma_cm  Connect QPs with rdma_cm and run test on those QPs
-s, --size=<size>  Size of message to exchange (default 65536)
-S, --sl=<sl>  SL (default 0)
-t, --tx-depth=<dep>  Size of tx queue (default 128)
-T, --tos=<tos value>  Set <tos_value> to RDMA-CM QPs; available only with the -R flag; values 0 - 256 (default: off)
-u, --qp-timeout=<timeout>  QP timeout; the timeout value is 4 usec * 2^(timeout) (default 14)
-V, --version  Display version number
-w, --limit_bw  Set ver
130. XM MellanoX Messaging (MXM) library provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
Multiple transport support including RC, XRC and UD
Proper management of HCA resources and memory structures
Efficient memory registration
One-sided communication semantics
Connection management
Receive side tag matching
Intra-node shared memory communication
These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.
5.1.4 Running SHMEM with Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low level drivers to allocate contiguous memory for it as part of ibv_reg_mr.
To activate MLNX_OFED 2.0 and the contiguous pages allocator with SHMEM:
Run the following argument to enable compound pages with SHMEM:
/opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5
If using compound pages is not possible, then the user will fall back to the regular hugepages mechanism.
To force use of the compound pages allocator:
Run the followin
131. E.10 DHCP Configuration for PXELINUX with FlexBoot
E.11 pxelinux.cfg/default
E.12 DHCP Configuration for iSCSI Boot with FlexBoot (PXE SAN Boot)
E.13 SAN Booting with FlexBoot in CHAP Environment
List of Figures
Figure 1 Mellanox OFED Stack for ConnectX Family Adapter Cards
Figure 2 I/O Consolidation Over InfiniBand
Figure 3 An Example of a Virtual Network
Figure 4 QoS Manager
Figure 5 Example QoS Deployment on InfiniBand Subnet
List of Tables
Table 1 Document Revision History
Table 2 Abbreviations and Acronyms
Table 3 Glossary
Table 4 Reference Documents
Table 5 Software and Hardware Requirements
Table 6 mlnxofedinstall Return Codes
Table 7 Butter Valid
Table 8 Parameters Used to Control Error Cases / Contiguity
Table 9 Flo
132. adcast packets successfully received
rx_errors  Number of receive packets that contained errors preventing them from being deliverable to a higher-layer protocol
rx_dropped  Number of receive packets which were chosen to be discarded even though no errors had been detected to prevent their being deliverable to a higher-layer protocol
rx_length_errors  Number of received frames that were dropped due to an error in frame length
rx_over_errors  Number of received frames that were dropped due to overflow
rx_crc_errors  Number of received frames with a bad CRC that are not runts, jabbers, or alignment errors
Table 7 - Port IN Counters
Counter  Description
rx_jabbers  Number of received frames with a length greater than MTU octets and a bad CRC
rx_in_range_length_error  Number of received frames with a length/type field value in the (decimal) range [1500:46] (42 is also counted for VLAN-tagged frames)
rx_out_range_length_error  Number of received frames with a length/type field value in the (decimal) range [1535:1501]
rx_lt_64_bytes_packets  Number of received 64-or-less-octet frames
rx_127_bytes_packets  Number of received 65-to-127-octet frames
rx_255_bytes_packets  Number of received 128-to-255-octet frames
rx_511_bytes_packets  Number of received 256-to-511-octet frames
rx_1023_bytes_packe
133. age 196
Section 9.4.6 "ibqueryerrors" on page 197
Section 9.4.7 "saquery" on page 199
Section 9.4.8 "smpdump" on page 201
Section 9.5.1 "ib_read_bw" on page 225
Section 9.5.2 "ib_read_lat" on page 227
Section 9.5.3 "ib_send_bw" on page 229
Section 9.5.4 "ib_send_lat" on page 232
Section 9.5.5 "ib_write_bw" on page 234
Section 9.5.6 "ib_write_lat" on page 236
Section 9.5.7 "ib_atomic_bw" on page 237
Section 9.5.8 "ib_atomic_lat" on page 239
Section 9.5.9 "raw_ethernet_bw" on page 241
Section 9.5.10 "raw_ethernet_lat" on page 245
Table 1 - Document Revision History
Release  Date  Description
2.2-1.0.1  April 30, 2014  Updated the following sections:
Section 2.3.1 "Pre-installation Notes" on page 29
Section 2.3.2 "Installation Script" on page 30
Section "Options" on page 31
Section 2.3.3 "Installation Procedure" on page 33
Section 2.3.6 "Installation Logging" on page 39
Section 2.5.1 "Setting up MLNX_OFED YUM Repository" on page 41
Section 4.5.4 "RoCE Quality of Service Mapping" on page 66
Section 4.6.2.2 "Creating Time Stamping Completion Queue" on page 77
Section 4.6.2.3 "Polling a Completion Queue" on page 77
Section 4.6.2.4 "Querying the Hardware Time" on page 78
Section 4.10 "Contiguous Pages" on page 84
Section 4.14.2 "Setting Up SR-IOV" on page 91
Section 4.14.6.2 Virt
134. ailover port order defines the order in which the ports would be chosen for routing pot gue 7 10 11 9 12 25 29 26 29 27 90
8.6 Quality of Service Management in OpenSM
8.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the '-Q' or '--qos' flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows:
The request is matched against the defined matching rules such that the QoS Level definition is found
Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level
Figure 4: QoS Manager [diagram: the administrator edits the QoS Policy Config File (for example a qos-ulps section mapping default, ipoib, sdp and srp traffic to service levels), which the QoS Manager in OpenSM applies to an InfiniBand subnet with OFED-based nodes]
There are two ways to define QoS policy:
Advanced - the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to enforce various QoS constraints on the requested PR/MPR
Simple - the simple policy file syntax enables the administrator to match PR/MPR reque
135. al: 12.7.6.30,3260] (multiple)
Login to [iface: default, target: iqn.2013-10.qalab.com:sqa030.prt9, portal: 12.7.6.30,3260] successful.
A successful LUN login at this stage is mandatory for proceeding with the process of SCSI boot. A failure at this stage is probably a result of an erroneous target or network configuration, and troubleshooting that is out of the scope of this document.
Step 6. Verify the remote partition appears to the initiator as a local HDD.
[root@sqa070 ~]# fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000518f2
Device Boot      Start    End     Blocks     Id  System
/dev/sda1        1        131     1048576    83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2        131      2681    20480000   83  Linux
/dev/sda3        2681     2936    2048000    82  Linux swap / Solaris
Disk /dev/sdb: 21.5 GB, 21478670336 bytes
64 heads, 32 sectors/track, 20483 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
E.6.2 DHCP Configuration for PXELINUX with FlexBoot
The following DHCP configuration is presented as is, and may not work in all environments.
authoritative
136. all */
HWTSTAMP_FILTER_NONE,
/* time stamp any incoming packet */
HWTSTAMP_FILTER_ALL,
/* return value: time stamp all packets requested plus some others */
HWTSTAMP_FILTER_SOME,
/* PTP v1, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
/* PTP v1, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
/* PTP v1, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
/* PTP v2, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
/* PTP v2, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
/* PTP v2, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
/* 802.AS1, Ethernet, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
/* 802.AS1, Ethernet, Sync packet */
HWTSTAMP_FILTER_PTP_V2_L2_SYNC,
/* 802.AS1, Ethernet, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ,
/* PTP v2/802.AS1, any layer, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_EVENT,
/* PTP v2/802.AS1, any layer, Sync packet */
HWTSTAMP_FILTER_PTP_V2_SYNC,
/* PTP v2/802.AS1, any layer, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_DELAY_REQ,
};
Note: for receive side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.
4.6.1.2 Getting Time Stamping
Once time stamping is enabled, the time stamp is placed in the socket Ancillary
137. all IB devices
ibstat -l           # list all IB devices
ibstat -p           # show port guids
ibstat mthca0 2     # show status of port 2 of 'mthca0'
ibtracert
ibtracert uses SMPs to trace the path from a source GID/LID to a destination GID/LID. Each hop along the path is displayed until the destination is reached or a hop does not respond. By using the -m option, multicast path tracing can be performed between source and destination nodes.
Synopsis
ibtracert [-d(ebug)] [-v(erbose)] [-D(irect)] [-L(id)] [-e(rrors)] [-u(sage)] [-G(uids)] [-f(orce)] [-n(o_info)] [-m mlid] [-s smlid] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [--node-name-map <node-name-map>] [-h(elp)] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]
Options
The table below lists the various flags of the command. Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.
Table 26 - ibtracert Flags and Options
Flag  Description
--force, -f  Force
-n, --no_info  Simple format; do not show additional information
--mlid, -m <mlid>  Shows the multicast trace of the specified mlid
--node-name-map <node-name-map>  Specifies a node name map. The node name map file maps GUIDs to more user-friendly names
--debug (-d -ddd -d -d -d)  Raises the IB debugging level
--Lid, -L  Uses LID address argument
138. alling opensm, and if the InfiniBand fabric is stable, it is recommended to run the following command in order to generate the inventory file:
host1# osmtest -f c
Immediately afterwards, run the following command to test opensm:
host1# osmtest -f a
Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.
8.4 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the '--Pconfig' or '-P' flags. The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed. The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end-ports are assigned partial membership.
8.4.1 File Format
Notes: Line content following a '#' character is a comment and is ignored by the parser.
General File Format
<Partition Definition>:<PortGUIDs list> ;
Partition Definition:
[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where
PartitionName  string, will be used with logging. When omitted, an empty string will be used.
PKey  P_Key value fo
139. anced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion: its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and the QoS Level has only one constraint: Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below.
8.6.4 Policy File Syntax Guidelines
Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
Comments are started with the pound sign (#) and terminated by EOL.
Any keyword should be the first non-blank in the line, unless it's a comment.
Keywords that denote section/subsection start have matching closing keywords.
Having a QoS Level named "DEFAULT" is a must; it is applied to PR/MPR requests that didn't match any of the matching rules.
Any section/subsection of the policy file is optional.
8.6.5 Examples of Advanced Policy File
As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here's an example of the shortest policy file:
qos-levels
    qos-level
        name: DEFAULT
        sl: 0
    end-qos-level
end-qos-levels
Port groups section is missing because there are no m
140. and /etc/infiniband/info
2.3.4 Installation Results
Software
Most of the MLNX_OFED packages are installed under the "/usr" directory, except for the following packages, which are installed under the "/opt" directory: openshmem, bupc, fca and ibutils.
The kernel modules are installed under:
/lib/modules/`uname -r`/updates on SLES and Fedora distributions
/lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat-like distributions
/lib/modules/`uname -r`/updates/dkms/ on Ubuntu
Firmware
The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
a. You run the installation script in default mode; that is, without the option '--without-fw-update'.
b. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.
If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.
In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
-I- Querying device
-E- Can't auto detect fw configuration file
2.3.5 Post-installation Notes
Most of the Mellanox OFED components can be configured
141. ap is as for standard IB Atomic operations.
Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:
bit_adder(ci, b1, b2, *co)
{
    value = ci + b1 + b2
    *co = !!(value & 2)
    return value & 1
}
#define MASK_IS_SET(mask, attr) (!!((mask)&(attr)))
bit_position = 1
carry = 0
atomic_response = 0
for i = 0 to 63
{
    if (i != 0)
        bit_position = bit_position << 1
    bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position), MASK_IS_SET(compare_add, bit_position), &new_carry)
    if (bit_add_res)
        atomic_response |= bit_position
    carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
}
return atomic_response
Ethernet Tunneling Over IPoIB Driver (eIPoIB)
The eth_ipoib driver provides a standard Ethernet interface to be used as a Physical Interface (PIF) into the Hypervisor virtual network, and serves one or more Virtual Interfaces (VIF). This driver supports L2 Switching (Direct Bridging) as well as other L3 Switching modes (e.g. NAT). This document explains the configuration and driver behavior when configured in Bridging mode.
In virtualization environme
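The masked fetch-and-add pseudocode in the MFetchAdd description can be made concrete with a short Python sketch. It is only an illustration of the bit-serial add with carries cut at field boundaries, assuming (as in the pseudocode) that a set mask bit marks the most significant bit of a field; the function name `masked_fetch_add` and its arguments are hypothetical, and the function returns the summed value computed by the loop rather than modeling the atomic memory access itself.

```python
def masked_fetch_add(value, addend, boundary_mask, width=64):
    """Add `addend` to `value` independently within each field.

    A set bit in `boundary_mask` marks the top bit of a field: the carry
    out of that bit is discarded instead of propagating to the next field.
    """
    result = 0
    carry = 0
    for i in range(width):
        bit = 1 << i
        total = carry + ((value >> i) & 1) + ((addend >> i) & 1)
        if total & 1:
            result |= bit
        new_carry = total >> 1
        # Carry crosses into the next bit only if this bit is not a boundary.
        carry = 0 if (boundary_mask & bit) else new_carry
    return result
```

With two 8-bit fields (boundary mask 0x8080), adding 0x0101 to 0x01FF wraps the low field to 0x00 without disturbing the high field, which becomes 0x02; with an empty mask the same call behaves like an ordinary 64-bit add.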
142. ases:
1. Without High Availability
When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.
2. After Manual Activation of High Availability
If you manually activated SRP High Availability, perform the following steps:
a. Unmount all SRP partitions that were mounted.
b. Stop the srpd service (kill the SRP daemon instances).
c. Make sure there are no multipath instances running. If there are multipath instances, wait for them to end or kill them.
d. Run: multipath -F
3. After Automatic Activation of High Availability
If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
4.2 iSCSI Extensions for RDMA (iSER)
4.2.1 Overview
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies.
4.2.2 iSER Initiator
Setting up the iSER target is out of the scope of this manual. For guidelines on how to do so, please refer to the relevant target documentation (e.g. stgt, clitarget). The iSER initiator is controlled through the iSCSI interface available from the iscsi-initiator-utils pac
143. atch rules, which means that port groups are not referred to anywhere, and there is no need to define them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.
The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the following example. They explain different keywords and their meaning.
port-groups
    port-group # using port GUIDs
        name: Storage
        # "use" is just a description that is used for logging
        # Other than that, it is just a comment
        use: SRP Targets
        port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
        port-guid: 0x1000000000FFFF
    end-port-group
    port-group
        name: Virtual Servers
        # The syntax of the port name is as follows:
        # "node_description/Pnum".
        # node_description is compared to the NodeDescription of the node,
        # and "Pnum" is a port number on that node.
        port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
    end-port-group
    # using partitions defined in the partition policy
    port-group
        name: Partitions
        partition: Part1
        pkey: 0x1234
    end-port-group
    # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
    # or ALL (for all the nodes in the subnet)
    port-group
        name: CAs and SM
        node-type: CA, SELF
    end
144. atching Rules
A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply. The matching expressions are defined for the following fields:
SRC and DST to lists of port groups
Service-ID to a list of Service-ID values or ranges
QoS-Class to a list of QoS-Class values or ranges
4.4.4 CMA Features
The CMA interface supports Service-ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS-Class. The CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR.
4.4.4.1 IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.
4.4.4.2 SRP
The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service-ID in the PR/MPR by itself and uses that information in setting up the QP. The SRP Service-ID is defined by the SRP target I/O Controller (it also complies with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
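The ordered, first-match evaluation of matching rules described above can be sketched in a few lines of Python. This is a simplified illustration, not OpenSM's matcher: the dictionary layout, field names and QoS level names are hypothetical, a request is assumed to carry a port-group name directly instead of a port that is looked up in the groups, and an unmatched request is assumed to receive a level named DEFAULT.

```python
def rule_matches(rule, request):
    """A rule applies only if every match expression it defines accepts the request."""
    return all(request.get(field) in allowed
               for field, allowed in rule["match"].items())


def find_qos_level(rules, request, default="DEFAULT"):
    """Return the QoS level of the first rule that matches, in list order."""
    for rule in rules:
        if rule_matches(rule, request):
            return rule["qos_level"]
    return default


# Hypothetical rule list: a Service-ID range first, then source group plus QoS-Class.
RULES = [
    {"match": {"service_id": range(0x10, 0x20)}, "qos_level": "storage"},
    {"match": {"source": {"CAs"}, "qos_class": {7}}, "qos_level": "gold"},
]
```

Because evaluation stops at the first match, a request that satisfies both rules above is assigned "storage"; reordering the list changes the outcome, which is why rule order matters in a policy file.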
145. ath, LID, or GUID
<portnum>  Optional  Destination's port number
<op> [<value>]  Optional (query)  Define the allowed port operations: enable, disable, reset, speed, and query
In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:
1. The first ACTIVE port that is found.
2. If not found, the first port that is UP (physical link state is LinkUp).
Examples
1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID, 3 in this case) to obtain additional link information using ibportstate.
> ibstatus mlx4_0:1
Infiniband device 'mlx4_0' port 1 status:
    default gid:  fe80:0000:0000:0000:0000:0000:9289:3895
    base lid:     0x3
    sm lid:       0x3
    state:        2: INIT
    phys state:   5: LinkUp
    rate:         20 Gb/sec (4X DDR)
> ibportstate -C mlx4_0 3 1 query
PortInfo:
# Port info: Lid 3 port 1
LinkState:..............Initialize
PhysLinkState:..........LinkUp
LinkWidthSupported:.....1X or 4X
LinkWidthEnabled:.......1X or 4X
LinkWidthActive:........4X
LinkSpeedSupported:.....2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:.......2.5 Gbps or 5.0 Gbps
LinkSpeedActive:........5.0 Gbps
2. Query the status of two channel adapters using directed paths.
> ibportstate -C mlx4_0 -D 0 1
PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkState:..............Initialize
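The default port-selection criteria listed before the examples (first ACTIVE port, otherwise first port whose physical link state is LinkUp) amount to a two-pass search. A minimal Python sketch follows, assuming a hypothetical list of port records in discovery order; the keys `state` and `phys_state` are illustrative and are not taken from the utility's source.

```python
def choose_port(ports):
    """Pick a default port from records listed in discovery order.

    Each record is a dict with 'state' (e.g. 'ACTIVE', 'INIT', 'DOWN')
    and 'phys_state' (e.g. 'LinkUp', 'Polling').
    """
    # Criterion 1: the first ACTIVE port that is found.
    for port in ports:
        if port["state"] == "ACTIVE":
            return port
    # Criterion 2: otherwise, the first port whose physical link is up.
    for port in ports:
        if port["phys_state"] == "LinkUp":
            return port
    return None
```

In the first example above, the port is in the INIT state with physical state LinkUp, so it would be selected only by the second criterion.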
146. ating Firmware After Installation
Installing MLNX_OFED using the YUM tool does not automatically update the firmware. To update the firmware to the version included in the MLNX_OFED package, you can either:
Run the mlnxofedinstall script with the "--fw-update-only" flag, or
Update the firmware to the latest version available on the Mellanox Technologies Web site, as described in Section 2.4 "Updating Firmware After Installation" on page 40
2.6 Uninstalling Mellanox OFED
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.
2.7 Uninstalling Mellanox OFED using the YUM Tool
If MLNX_OFED was installed using the yum tool, then it can be uninstalled as follows:
yum groupremove '<group name>' (1)
(1) The '<group name>' must be the same group name that was previously used to install MLNX_OFED.
3 Configuration Files
For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt at the following location:
docs/readme_and_user_manual/MLNX_OFED_configuration_files.txt
3.1 Persistent Naming for Network Interfaces
To avoid network interface renaming after boot or driver restart, use the "/etc/udev/rules.d/70-persistent-net.rules" file.
Example for Ethernet interfaces
147. ation file to allow PXE SAN Boot:
host sqa070 {
    filename "";
    option root-path "iscsi:12.7.6.30::::iqn.2013-10.qalab.com:sqa030.prt9";
    fixed-address 12.7.60.70;
    hardware ethernet 00:02:c9:32:e8:80;
}
E.7 SAN Booting with FlexBoot in CHAP Environment
For successful SAN booting with FlexBoot in a CHAP environment, the FlexBoot in use must support passing CHAP credentials. Please note that FlexBoot v3.4.149 does not support such an operation. Hence, use FlexBoot to chain-load UNDI software that supports passing CHAP credentials. This can be achieved by using the UNDI software module from iPXE.org called undionly.kpxe. A Linux host is required to create undionly.kpxe.
Step 1. Install the below prerequisite software to support the necessary UNDI compilation.
[root@sqa030 ~]# yum install -y gcc binutils make perl
Step 2. Download the UNDI sources from iPXE.org. For more information, visit http://www.ipxe.org/download
[root@sqa030 ~]# git clone git://git.ipxe.org/ipxe.git
Step 3. Edit a command file named sanbootnchap.ipxe (the name is given as an example, whereas the .ipxe file extension is mandatory) with the following lines. Make sure to enter your own values for username and password per your CHAP configuration. For reasons of simplicity and coherence with this document's examples, we gave our CHAP the username joe and the password secret. Note that these CHAP
148. b/s and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6
PSID: MT_1090111019
PCI Device Name: 0000:05:00.0
Versions: Current Available
FW 2 SL 000 2c ME OO
PXE 3 4 01272141 3 4 0224
Status: Up to date
In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
Device #1:
Device: 0000:05:00.0
Part Number:
Description:
PSID: MT_0DB0110010
Versions: Current Available
FW 2.5.1000 N/A
Status: No matching image found
Step 4. Reboot the machine if the installation script performed firmware updates to your network adapter hardware. Otherwise, restart the driver by running: "/etc/init.d/openibd restart"
Note: The script adds the following lines to /etc/security/limits.conf for the user-space components such as MPI:
* soft memlock unlimited
* hard memlock unlimited
These settings set the amount of memory that can be pinned by a user-space application to unlimited. If desired, tune the value unlimited to a specific amount of RAM.
Note: For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 8, "OpenSM - Subnet Manager".
Step 5. (InfiniBand only) Run the hca se
149. be on the subnet. High limit: the limit of the High Priority component of the VL Arbitration table (IBA 7.6.9). VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template. SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15. Note that VL15 used here means "drop this SL". There are separate QoS configuration parameter sets for various target types: CAs, routers, switch external ports, and switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets: qos_ca_: QoS configuration parameter set for CAs. qos_rtr_: parameter set for routers. qos_sw0_: parameter set for switches' port 0. qos_swe_: parameter set for switches' external ports. Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization): qos_ca_max_vls 15 qos_ca_high_limit 0 qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 qos_swe_max_vls 15 qos_swe_high_limit 0 qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 qos_swe
150. bleUPC. For further information, please refer to the FCA User Manual. MXM support enabled by default. 5.5.2 FCA Runtime Parameters. The following parameters can be passed to upcrun in order to change FCA support behavior. Table 15: Runtime Parameters. -fca_enable <0|1>: Disables/enables FCA support at runtime (default: disable). -fca_np <value>: Enables FCA support for collective operations if the number of processes in the job is greater than the fca_np value (default: 64). -fca_verbose <level>: Sets the verbosity level for the FCA modules. -fca_ops <+/->[op_list]: op_list is a comma-separated list of collective operations. -fca_ops <+/->[op_list] enables/disables only the specified operations; -fca_ops <+/-> enables/disables all operations. By default, all operations are enabled. Allowed operation names are: barrier (br), bcast (bt), reduce (rc), allgather (ag). Each operation can also be enabled/disabled via an environment variable: GASNET_FCA_ENABLE_BARRIER, GASNET_FCA_ENABLE_BCAST, GASNET_FCA_ENABLE_REDUCE. Note: All the operations are enabled by default. 5.5.2.1 Enabling FCA Operations through Environment Variables in ScalableUPC. This method can be used to control UPC FCA offload from the environment using a job scheduler srun utility. The valid values are: 1 (enable), 0 (disable). To enable a specific operation with shell environment variables in ScalableUPC
151. by running applications is not required when the fabric is routed with torus-2QoS. Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover, provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details. Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If no_fallback was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case, all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored. In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath. To veri
152. cho command: max_cmd_per_lun: Default: 62. max_sect (short for max_sectors): sets the request size of a command. io_class: Default: 0x100, as in rev 16A of the specification (in rev 10 the default was 0xff00). tl_retry_count: a number in the range 2..7 specifying the IB RC retry count. Default: 2. comp_vector: a number in the range 0..n-1 specifying the MSI-X completion vector. Some HCAs allocate multiple (n) MSI-X vectors per HCA port. If the IRQ affinity masks of these interrupts have been configured such that each MSI-X interrupt is handled by a different CPU, then the comp_vector parameter can be used to spread the SRP completion workload over multiple CPUs. cmd_sg_entries: a number in the range 1..255 that specifies the maximum number of data buffer descriptors stored in the SRP_CMD information unit itself. With allow_ext_sg=0, the parameter cmd_sg_entries defines the maximum S/G list length for a single SRP_CMD, and commands whose S/G list length exceeds this limit after S/G list collapsing will fail. initiator_ext: Please refer to Section 9 (Multiple Connections). To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods: Execute "fdisk -l". This command lists all devices; the new devices are included in this listing. Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices. 4.1.2.2.1 SRP sysfs Parameter
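The ranges quoted above for tl_retry_count, comp_vector and cmd_sg_entries can be checked before a login string is written to sysfs. The following Python sketch is illustrative only: the function name check_srp_params and the dictionary layout are assumptions, and n (the number of MSI-X vectors per HCA port) has to be supplied by the caller.

```python
def check_srp_params(params, num_comp_vectors):
    """Return a list of range problems found in a dict of SRP parameters.

    params maps parameter names to integers; num_comp_vectors is the
    number of MSI-X completion vectors (n) exposed per HCA port.
    """
    # Ranges as stated in the parameter descriptions above.
    ranges = {
        "tl_retry_count": (2, 7),
        "cmd_sg_entries": (1, 255),
        "comp_vector": (0, num_comp_vectors - 1),
    }
    problems = []
    for name, (low, high) in ranges.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name}={params[name]} outside {low}..{high}")
    return problems
```

An empty list means every supplied parameter is within its documented range; parameters that are not supplied are left to the driver defaults.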
153. 4.16.1 CORE-Direct Overview; 4.17 Ethtool; 4.18 Dynamically Connected Transport Service; 4.19 PeerDirect; 4.20 Inline-Receive; 4.20.1 Querying Inline-Receive Capability; 4.20.2 Activating Inline-Receive; 4.21 Ethernet Performance Counters; 4.22 Memory Window; 4.22.1 Query Capabilities; 4.22.2 Allocating Memory Window; 4.22.3 Binding Memory Windows; 4.22.4 Invalidating Memory Window; 4.22.5 Deallocating Memory Window; 4.23 pm_qos usage on ingress Packet Traffic; 4.24 XOR RSS Hash Function; Chapter 5 HPC Features; 5.1 Shared Memory Access; 5.1.1 Mellanox ScalableSHMEM; 5.1.2 Running SHMEM with FCA; 5.1.3 Running ScalableSHMEM with MXM; 5.1.4 Running SHMEM with Contiguous Pages
154. cifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information. -s (Stress): This option runs the specified stress test instead of the normal test suite. Stress test options are as follows (OPT: Description): -s1: Single-MAD response SA queries. -s2: Multi-MAD (RMPP) response SA queries. -s3: Multi-MAD (RMPP) Path Record SA queries. Without -s, stress testing is not performed. -M (Multicast Mode): This option specifies the length of the Multicast test (OPT: Description): -M1: Short Multicast Flow (default), single mode. -M2: Short Multicast Flow, multiple mode. -M3: Long Multicast Flow, single mode. -M4: Long Multicast Flow, multiple mode. Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC. Multiple mode: could be run with other apps using MC with OpenSM. Without -M, default flow testing is performed. -t <timeout>: This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds. -l <log file>: This option defines the log to be the given file. By default, the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout. -v
155. clock (SOF_TIMESTAMPING_RAW_HARDWARE) PTP Hardware Clock: 1 Hardware Transmit Timestamp Modes: off (HWTSTAMP_TX_OFF) on (HWTSTAMP_TX_ON) Hardware Receive Filter Modes: none (HWTSTAMP_FILTER_NONE) all (HWTSTAMP_FILTER_ALL) For more details on PTP Hardware Clock, please refer to: https://www.kernel.org/doc/Documentation/ptp/ptp.txt 4.6.2 RoCE Time Stamping. RoCE Time Stamping is currently at beta level. Please be aware that everything listed here is subject to change. RoCE Time Stamping allows you to stamp packets when they are sent to the wire or received from the wire. The time stamp is given in raw hardware cycles, but can easily be converted into hardware-referenced, nanosecond-based time. Additionally, it enables you to query the hardware for the hardware time, and thus to stamp other application events and compare times. 4.6.2.1 Query Capabilities. Time stamping is available if and only if the hardware reports that it is capable of it. To verify whether RoCE Time Stamping is available, run ibv_exp_query_device. For example: struct ibv_exp_device_attr attr; ibv_exp_query_device(context, &attr); if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_TIMESTAMP_MASK) { if (attr.timestamp_mask) { /* Time stamping is supported with mask attr.timestamp_mask */ } } if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_HCA_CORE_CLOCK) { if (attr.hca_core_cloc
156. configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command. Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, there is a need for a different initiator_ext value on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path. If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example: id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200 Notes: 1. It is recommended to use the -n flag for all srp_daemon invocations. 2. ibsrpdm does not have a corresponding option. 3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRP_DAEMON_ENABLE to yes). 4.1.2.6 High Availability (HA) Overview. High Availability works using the Device-Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices
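In the srp_daemon example above, the initiator_ext value equals the target port GUID (the low 64 bits of dgid) with its bytes reversed. The Python sketch below reproduces that observation. It is inferred from this single example rather than from a stated rule, and the function name initiator_ext_from_dgid is hypothetical.

```python
def initiator_ext_from_dgid(dgid_hex):
    """Derive initiator_ext from a 32-hex-digit dgid string.

    Assumption (taken from the example above): initiator_ext is the
    port GUID, i.e. the low 64 bits of the dgid, in reversed byte order.
    """
    if len(dgid_hex) != 32:
        raise ValueError("dgid must be 32 hexadecimal digits")
    port_guid = bytes.fromhex(dgid_hex[16:])  # low 64 bits of the GID
    return port_guid[::-1].hex()              # reverse the byte order


if __name__ == "__main__":
    print(initiator_ext_from_dgid("fe800000000000000002c90200402bed"))
```

Running the script on the dgid from the example prints ed2b400002c90200, which matches the initiator_ext shown there.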
157. d: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0) 03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0) 03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0) Where: "03:00" represents the Physical Function; "03:00.X" represents the Virtual Function connected to the Physical Function. 4.14.3 Enabling SR-IOV and Para-Virtualization on the Same Setup. To enable SR-IOV and Para-Virtualization on the same setup: Step 1: Create a bridge. vim /etc/sysconfig/network-scripts/ifcfg-bridge0 DEVICE=bridge0 TYPE=Bridge IPADDR=12.195.15.1 NETMASK=255.255.0.0 BOOTPROTO=static ONBOOT=yes NM_CONTROLLED=no DELAY=0 Step 2: Change the related interface (in the example below, the bridge is created over eth5). DEVICE=eth5 BOOTPROTO=none STARTMODE=on HWADDR=00:02:c9:2e:66:52 TYPE=Ethernet NM_CONTROLLED=no ONBOOT=yes BRIDGE=bridge0 Step 3: Restart the service network. Step 4: Attach a virtual NIC to the VM. ifconfig -a eth6 Link encap:Ethernet HWaddr 52:54:00:E7:77:99 imet acei 13 195 155 Beasts 199 299 239 Melo S9 o5 99 0 c0 inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:481 errors:0 dropped:0 overruns:0 frame:0 TX packets:450 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen
158. d 1519 to 1522 octet frames. tx_1548_bytes_packets: Number of transmitted 1523 to 1548 octet frames. tx_gt_1548_bytes_packets: Number of transmitted 1549 or greater octet frames. Table 9: Port VLAN Priority Tagging (where <i> is in the range 0..7). Counter: Description. rx_prio_<i>_packets: Total packets successfully received with priority i. rx_prio_<i>_bytes: Total bytes in successfully received packets with priority i. rx_novlan_packets: Total packets successfully received with no VLAN priority. rx_novlan_bytes: Total bytes in successfully received packets with no VLAN priority. tx_prio_<i>_packets: Total packets successfully transmitted with priority i. tx_prio_<i>_bytes: Total bytes in successfully transmitted packets with priority i. tx_novlan_packets: Total packets successfully transmitted with no VLAN priority. tx_novlan_bytes: Total bytes in successfully transmitted packets with no VLAN priority. Table 10: Port Pause (where <i> is in the range 0..7). Counter: Description. rx_pause_prio_<i>: The total number of PAUSE frames received from the far-end port. rx_pause_duration_prio_<i>: The total time in microseconds that the far-end port was requested to pause transmission of packets. rx_pause_transition_prio_<i>: The number of receiver transitions from XON state (paused) to XOFF state (non-pau
159. d Description. b[urn]: Burn Flash. q[uery]: Query miscellaneous Flash/firmware characteristics. v[erify]: Verify the entire Flash. bb: Burn Block. Burn the given image as is, without running any checks. sg: Set GUIDs. ri <out-file>: Read the firmware image on the Flash into the specified file. dc <out-file>: Dump Configuration. Print a firmware configuration file for the given image to the specified output file. e[rase] <addr>: Erase sector. rw <addr>: Read one DWORD from Flash. ww <addr> <data>: Write one DWORD to Flash. wwne <addr>: Write one DWORD to Flash without sector erase. wbne <addr> <size> <data>: Write a data block to Flash without sector erase. rb <addr> <size> [out-file]: Read a data block from Flash. swreset: SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method. Possible command return values are: 0 - successful completion; 1 - error has occurred; 7 - the burn command was aborted because firmware is current. Examples: 1. Find Mellanox Technologies's ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE: > /sbin/lspci -d 15b3:634a 04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0) In th
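The mstflint return values listed above can be handled in a wrapper script. The following Python sketch assumes only the three documented values (0, 1 and 7); the names MSTFLINT_RETURN_CODES and describe_return_code are illustrative and not part of mstflint.

```python
# Return values as documented in the list above.
MSTFLINT_RETURN_CODES = {
    0: "successful completion",
    1: "error has occurred",
    7: "burn aborted: firmware is current",
}


def describe_return_code(code):
    """Translate an mstflint exit status into a readable message."""
    return MSTFLINT_RETURN_CODES.get(code, f"undocumented return value {code}")


def burn_needs_attention(code):
    """Treat 'firmware is current' (7) as benign, like success (0)."""
    return code not in (0, 7)
```

Treating 7 as benign is a design choice for unattended update scripts: an already current firmware image is not a failure, while any other nonzero status should stop the script.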
160. d a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on. An IPoIB configuration can be based on DHCP (Section 4.3.3.1) or on a static configuration (Section 4.3.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3.3). 4.3.3.1 IPoIB Configuration Based on DHCP. Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line: For RedHat: BOOTPROTO=dhcp. For SLES: BOOTPROTO='dhcp'. If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under: /etc/sysconfig/network-scripts/ on a RedHat machine; /etc/sysconfig/network/ on a SuSE machine. A patch for DHCP may be required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory. Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to as
161. d from storage Min BW of 50 e SRP Min BW 50 Bottleneck at storage nodes Administration e OpenSM QoS policy file In the following policy file example replace SRPT with the real SRP Target port GUIDs ade Mellanox Technologies 175 Rev 2 2 1 0 1 OpenSM Subnet Manager qos ulps default 0 ipoib gil sdp st srp target port guid SRPT1 SRPT2 SRPT3 2 end qos ulps e OpenSM options file qos max vls 8 qos high limit 0 qos vlarb high 1 32 2 32 qos vlarb low 0 1 qe Ibl 5352 Sy 05 By Oy 115 19 Loy US Loy 9 13 Ld US 8 7 3 EDC 3 tier IPoIB RDS SRP The following is an example of QoS configuration for an enterprise data center EDC with IPoIB carrying all application traffic RDS for database traffic and SRP used for storage QoS Levels Management traffic ssh e IPoIB management VLAN partition A Min BW 10 Application traffic e IPoIB application VLAN partition B Isolated from storage and database Min BW of 30 Database Cluster traffic RDS Min BW of 30 e SRP Min BW 30 Bottleneck at storage nodes Administration e OpenSM QoS policy file In the following policy file example replace SRPT with the real SRP Initiator port GUIDs Ad 176 Mellanox Technologies m 2 2 1 0 1 qos ulps default ipoib pkey 0x8001 ipoib pkey 0x8002 rds Srp target port guid SRPT1 SRPT2 SRPT3 end qos ulps m GC BO qp c5 OpenSM options f
162. d to the hypervisor. If its format is a string: the string specifies the probe_vf parameter separately per installed HCA. The string format is: "bb:dd.f-v,bb:dd.f-v,...", where bb:dd.f = bus:device.function of the PF of the HCA, and v = number of VFs to use in the PF driver for that HCA, which is either a single value or a triplet, as described above. For example: probe_vf=5: The PF driver will activate 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host. probe_vf=00:04.0-5,00:07.0-8: The PF driver will activate 5 VFs on the HCA positioned in BDF 00:04.0 and 8 on the one in 00:07.0. probe_vf=1,2,3: The PF driver will activate 1 VF on physical port 1, 2 VFs on physical port 2, and 3 dual-port VFs (applies only to dual-port HCAs when all ports are Ethernet ports). This applies to all ConnectX HCAs in the host. probe_vf=00:04.0-5;6;7,00:07.0-8;9;10: The PF driver will activate: on the HCA positioned in BDF 00:04.0, 5 single VFs on port 1, 6 single VFs on port 2, and 7 dual-port VFs; on the HCA positioned in BDF 00:07.0, 8 single VFs on port 1, 9 single VFs on port 2, and 10 dual-port VFs. Applies when all ports are configured as Ethernet in dual-port HCAs. Notes: PFs not included in the above list will not activate any of their VFs in the PF driver. Triplets and single-port VFs are only valid when all po
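The probe_vf value formats described above can be summarized with a small parser. This Python sketch is an illustration, not driver code. It assumes the separators are "-" between BDF and VF count, "," between HCAs (or between triplet members in the global form) and ";" between triplet members in the per-HCA form; the punctuation was lost in this copy of the text, so treat those separators as an assumption. The name parse_probe_vf and the "*" key for "all HCAs" are made up.

```python
def parse_probe_vf(value):
    """Parse a probe_vf value into {bdf_or_'*': tuple_of_vf_counts}.

    A tuple of length 1 is a plain VF count; a tuple of length 3 is
    (single VFs on port 1, single VFs on port 2, dual-port VFs).
    """
    def counts(text, sep):
        parsed = tuple(int(part) for part in text.split(sep))
        if len(parsed) not in (1, 3):
            raise ValueError(f"expected a single value or a triplet: {text!r}")
        return parsed

    value = value.strip()
    if "-" not in value:
        # Global form: applies to all ConnectX HCAs on the host.
        return {"*": counts(value, ",")}
    result = {}
    for item in value.split(","):
        bdf, _, vfs = item.partition("-")
        result[bdf] = counts(vfs, ";")
    return result
```

For example, parse_probe_vf("00:04.0-5;6;7,00:07.0-8;9;10") yields one triplet per BDF, mirroring the last example in the text.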
163. e Verifying IPoIB Functionality. To verify your configuration and your IPoIB functionality, perform the following steps: Step 1: Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175 and IB node 2 is at 11.4.3.176: host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0 host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0 Step 2: Enter the ping command from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command: host1# ping -c 5 11.4.3.176 PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data. 64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms 64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms 64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms 64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms 64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms --- 11.4.3.176 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 3999ms rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2 4.3.6 Bonding IPoIB. To create an interface configuration script for the ibX and bondX interfaces, you should use the standard syntax (depending on your OS). Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver
164. e Target iqn.2013-10.qalab.com:sqa030.prt9 Lun 0 Path=/dev/cciss/c0d0p9,Type=fileio,IOMode=wb MaxConnections 1 # Number of connections/session. We only support 1 InitialR2T Yes # Wait first for R2T ImmediateData Yes # Data can accompany command MaxRecvDataSegmentLength 8192 # Max data per PDU to receive MaxXmitDataSegmentLength 8192 # Max data per PDU to transmit MaxBurstLength 262144 # Max data per sequence (R2T) FirstBurstLength 65536 # Max unsolicited data sequence DefaultTime2Wait 2 # Secs wait for ini to log out. Not used DefaultTime2Retain 20 # Secs keep cmnds after log out. Not used MaxOutstandingR2T 1 # Max outstanding R2Ts per cmnd DataPDUInOrder Yes # Data in PDUs is ordered. We only support ordered DataSequenceInOrder Yes # PDUs in sequence are ordered. We only support ordered ErrorRecoveryLevel 0 # We only support level 0 HeaderDigest None # PDU header checksum algo list: None or CRC32C. If only one is set, then the initiator must agree to it or the connection will fail DataDigest None # PDU data checksum algo list. Same as above MaxSessions 0 # Maximum number of sessions to this target. 0 = unlimited NOPInterval 0 # Send a NOP-In ping each after that many seconds if the conn is otherwise idle. 0 = off NOPTimeout 0 # Wait that many seconds for a response on a NOP-In ping. If 0 or > NOPInterval, NOPInterval is used # Various target parameters Wthreads 8 # Number of IO threads QueuedCommands 32 # Number of queued commands The local Hard Disk
165. e field that was returned from ibv_exp_reg_mr as the mr_handle. Supplies the desired access mode for that MR. Supplies the address field, which can be either NULL or any hint as the required output. The address and its length are returned as part of the ibv_mr struct. To achieve high performance, it is highly recommended to supply an address that is aligned in the same way as the original memory region address. Generally, it may be an alignment to a 4M address. For further information on how to use the ibv_exp_reg_shared_mr verb, please refer to the ibv_exp_reg_shared_mr man page and/or to the ibv_shared_mr sample program, which demonstrates a basic usage of this verb. Further information on the ibv_shared_mr sample program can be found in the ibv_shared_mr man page. 4.12 XRC - eXtended Reliable Connected Transport Service for InfiniBand. XRC allows significant savings in the number of QPs and the associated memory resources required to establish all-to-all process connectivity in large clusters. It significantly improves the scalability of the solution for large clusters of multicore end-nodes by reducing the required resources. For further details, please refer to "Annex A14 Supplement to InfiniBand Architecture Specification Volume 1.2.1". A new API can be used by user-space applications to work with the XRC transport. The legacy API is currently supported in both binary and source modes; however, it is deprecated. Thus we recommend
166. e InfiniBand or Ethernet based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet). For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet. 6.2.1 Enabling Auto Sensing. Upon driver start-up: 1. Sense the adapter card's port type: if a valid cable or module is connected (QSFP, SFP+, or SFP with EEPROM in the cable/module), set the port type to the sensed link type (IB/Ethernet); otherwise, set the port type as default (Ethernet). During driver run time: sense a link every 3 seconds if no link is sensed/detected; if sensed, set the port type as sensed. 7 Performance. For further information on Linux performance, please refer to the Performance Tuning Guide for Mellanox Network Adapters. 8 OpenSM - Subnet Manager. 8.1 Overview. OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration
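The auto-sensing rules in Section 6.2.1 above amount to a small decision procedure. The Python sketch below restates them for illustration; it does not talk to the driver, and the function names and the "ib"/"eth" labels are assumptions.

```python
DEFAULT_PORT_TYPE = "eth"      # default port type when nothing can be sensed
SENSE_INTERVAL_SECONDS = 3     # run-time sensing period while no link is up
LINK_TYPES = ("ib", "eth")


def port_type_at_startup(valid_module_connected, sensed_type):
    """Driver start-up: use the sensed type only with a valid cable/module."""
    if valid_module_connected and sensed_type in LINK_TYPES:
        return sensed_type
    return DEFAULT_PORT_TYPE


def port_type_at_runtime(current_type, link_up, sensed_type):
    """One periodic sensing step, taken only while no link is detected."""
    if link_up:
        return current_type        # link present: nothing to sense
    if sensed_type in LINK_TYPES:
        return sensed_type         # link partner sensed: follow it
    return current_type            # nothing sensed: keep the current type
```

A caller would invoke port_type_at_runtime once every SENSE_INTERVAL_SECONDS for each port that has no link.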
167. 9.5.7 ib_atomic_bw; 9.5.8 ib_atomic_lat; 9.5.9 raw_ethernet_bw; 9.5.10 raw_ethernet_lat; Appendix A: SRP Target Driver; A.1 Prerequisites and Installation; A.2 How to Run; A.3 How to Unload/Shutdown; Appendix B: mlx4 Module Parameters; B.1 mlx4_ib Parameters; B.2 mlx4_core Parameters; B.3 mlx4_en Parameters; Appendix C: mlx5 Module Parameters; Appendix D: Lustre Compilation over MLNX_OFED; Appendix E: Using FlexBoot for Booting Various OSs from an iSCSI Target; E.1 Configuring the iSCSI Target Machine; E.2 Configuring the DHCP Server; E.3 Configuring the PXE Server; E.4 Installing SLES11 SP3 on a Remote Storage over iSCSI; E.5 Using PXE Boot Services for Booting the SLES11 SP3 from the iSCSI Target; E.6 Installing RHEL6.4 on a Remote Storage over iSCSI; E.7 SAN Booting the Diskless Client with FlexBoot; E.8 Sanity Checks
168. e and application skew on application scalability. The relevant verbs to be used for CORE-Direct: ibv_exp_create_qp, ibv_exp_modify_cq, ibv_exp_query_device, ibv_exp_post_task. Sample programs for reference: ibv_task_pingpong, ibv_cc_pingpong. 4.17 Ethtool. ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices. It can be used to: get identification and diagnostic information; get extended device statistics; control speed, duplex, autonegotiation and flow control for Ethernet devices; control checksum offload and other hardware offload features; control DMA ring sizes and interrupt moderation. The following are the ethtool supported options: Table 6: ethtool Supported Options (Options: Description). ethtool -i eth<x>: Checks driver and device information. For example: #> ethtool -i eth2 driver: mlx4_en (MT_0DD0120009_CX3) version: 2.1.6 (Aug 2013) firmware-version: 2.30.3000 bus-info: 0000:1a:00.0 ethtool -k eth<x>: Queries the stateless offload status. ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off]: Sets the stateless offload status. TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) increase outbound throughput by reducing CPU overhead. It works by queuing up large
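The "ethtool -i" output shown above is a list of "key: value" lines, which makes it easy to consume from a script. The Python sketch below parses such text into a dictionary. The function name parse_ethtool_info is illustrative, and the sample text is copied from the example above rather than captured from a live system.

```python
SAMPLE_ETHTOOL_I = """\
driver: mlx4_en (MT_0DD0120009_CX3)
version: 2.1.6 (Aug 2013)
firmware-version: 2.30.3000
bus-info: 0000:1a:00.0
"""


def parse_ethtool_info(text):
    """Parse 'ethtool -i' style output into a {key: value} dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only: values such as the PCI
        # bus-info contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    print(parse_ethtool_info(SAMPLE_ETHTOOL_I)["firmware-version"])
```

On a live system the text would typically come from running ethtool through subprocess; that step is left out here so the sketch runs without the hardware.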
169. e example above, 15b3 is Mellanox Technologies's vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn. The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3. 2. Verify the ConnectX firmware using its ID (using the results of the example above): > mstflint -d 04:00.0 v ConnectX failsafe image. Start address: 0x80000. Chunk size 0x80000: NOTE: The addresses below are contiguous logical addresses. Physical addresses on flash may be different, based on the image start address and chunk size. 0x00000038 0x000010db 0x0010a4 BOOT2 OK 0x000010dc 0x00004947 0x00386c BOOT2 OK 0x00004948 0x000052c7 0x000980 Configuration OK 0x000052c8 0x0000530b 0x000044 GUID OK 0x0000530c 0x0000542f 0x000124 Image Info OK 0x00005430 0x0000634f 0x000f20 DDR OK 0x00006350 0x0000f29b 0x008f4c DDR OK 0x0000f29c 0x0004749b 0x038200 DDR OK 0x0004749c 0x0005913f 0x011ca4 DDR OK 0x00059140 0x0007a123 0x020fe4 DDR OK 0x0007a124 0x0007bdff 0x001cdc DDR OK 0x0007be00 0x0007eb97 0x002d98 DDR OK 0x0007eb98 0x0007f0af 0x000518 Configuration OK 0x0007f0b0 0x0007f0fb 0x00004c Jump addresses OK 0x0007f0fc 0x0007f2a7 0x0001ac FW Config
170. e minimum hops to each node, where the path length is optimized. 2. UPDN Algorithm: Based on the minimum hops to each node, but constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree and a deadlock may occur due to a loop in the subnet. 3. Fat-tree Routing Algorithm: This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules. 4. LASH Routing Algorithm: Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node. 5. DOR Routing Algorithm: Based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh. 6. Torus-2QoS Routing Algorithm: Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Addi
171. e radix-1 for that dimension. By default, the common switch in a torus seed is taken as the origin of the coordinate system used to describe switch location. The position parameter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed. next_seed: If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification. For maximum resiliency, no seed specification should share a switch with any other seed specification. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS can grant path SL values that are constant, regardless of which seed was used to initiate topology discovery. portgroup_max_ports max_ports: This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails. port_order p1 p2 p3 ...: This keyword specifies the order in which CA ports on a destination switch are visited when computing
172. ease one ibX.Y slave to serve the PIF itself. In the VIFs list of ethX, you will notice that ibX.1 is always created to serve applications running from the Hypervisor on top of the ethX interface directly. For InfiniBand applications that require native IPoIB interfaces (e.g. CMA), the original IPoIB interfaces ibX can still be used. For example, CMA and ethX drivers can co-exist and make use of IPoIB ports: CMA can use ib0, while the eth0.ipoib interface will use ibX.Y interfaces. To see the list of eIPoIB interfaces: cat /sys/class/net/eth_ipoib_interfaces For example: cat /sys/class/net/eth_ipoib_interfaces eth4 over IB port: ib0 eth5 over IB port: ib1 The example above shows two eIPoIB interfaces, where eth4 runs traffic over ib0 and eth5 runs traffic over ib1. Figure 3: An Example of a Virtual Network. The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual Machines. To display the services provided to the Virtual Machine interfaces: cat /sys/class/net/eth0/eth/vifs Example: cat /sys/class/net/eth0/eth/vifs SLAVE=ib0.2 MAC=52:54:00:60:55:88 V
173. eate a shared MR: The application sends a request via the ibv_exp_reg_mr API to create a shared MR. The application supplies the allowed sharing access to that MR. If the MR was created successfully, a unique MR ID is returned as part of the struct ibv_mr, which can be used by other applications to register with that MR. The underlying physical pages must not be Least Recently Used (LRU) or Anonymous. To disable that, you need to turn on the IBV_EXP_ACCESS_ALLOCATE_MR bit as part of the sharing bits.
     Usage:
     - Turn on, via ibv_exp_reg_mr, one or more of the sharing access bits. The sharing bits are part of the ibv_exp_reg_mr man page.
     - Turn on the IBV_EXP_ACCESS_ALLOCATE_MR bit.
     Step 2: Request to register to a shared MR. A new verb called ibv_exp_reg_shared_mr is added to enable sharing an MR. To use this verb, the application supplies the MR ID that it wants to register for and the desired access mode to that MR. The desired access is validated against its given permissions and, upon successful creation, the physical pages of the original MR are shared by the new MR. Once the MR is shared, it can be used even if the original MR was destroyed. The request to share the MR can be repeated multiple times, and an arbitrary number of Memory Regions can potentially share the same physical memory locations.
     Usage: Uses the handl
174. The mlx4_ipoib driver, when it attaches its QP to its configured GIDs. Fragmented UDP traffic cannot be steered. It is treated as 'other' protocol by hardware (from the first packet) and is not considered UDP traffic. We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes.
     4.14 Single Root IO Virtualization (SR-IOV)
     Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. Mellanox ConnectX-3 adapter cards are capable of exposing up to 126 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals those of the Physical Function. SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual machines direct hardware access to network resources, hence increasing their performance. In this chapter we will demonstrate setup and configuration of SR-IOV in a Red Hat Linux environment using the Mellanox ConnectX VPI adapter cards family.
     4.14.1 System Requirements
     To set up an SR-IOV
175. ective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64. Below is an example of SL2VL and VL Arbitration configuration on subnet:
     qos_ca_max_vls 15
     qos_ca_high_limit 6
     qos_ca_vlarb_high 0:4
     qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
     qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
     qos_swe_max_vls 15
     qos_swe_high_limit 6
     qos_swe_vlarb_high 0:4
     qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
     qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
     In this example, there are 8 VLs configured on subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.
     8.6.8 Deployment Example
     Figure 5 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.
     Figure 5: Example QoS Deployment on InfiniBand Subnet [figure not reproduced; recoverable labels: Traffic class SDP, Service level 2, Policy min 20% BW; Traffic class Partition A, Service level 0, Policy min 40%; Traffic class SRP, Service level 1, Policy min 30% BW]
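The arithmetic in the VL arbitration example of item 175 can be checked with a short Python sketch. It parses a `vl:weight` list, flags weights that are not multiples of 64 (the guideline given for 4KB MTU), and computes the high-priority burst limit. The function names and the comma/colon input syntax are assumptions made for this illustration; this is not OpenSM code.

```python
def parse_vlarb(spec):
    """Parse 'vl:weight,vl:weight,...' into a list of (vl, weight) pairs."""
    pairs = []
    for entry in spec.split(","):
        vl, weight = entry.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def bad_weights(pairs, step=64):
    """Return the (vl, weight) pairs whose weight is not a multiple of step."""
    return [(vl, w) for vl, w in pairs if w % step != 0]


def high_limit_bytes(high_limit, unit=4096):
    """Burst limit of the high-priority table: high_limit units of 4KB."""
    return high_limit * unit


low = parse_vlarb("0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64")
print(bad_weights(low))                  # []
print(high_limit_bytes(6))               # 24576 bytes, i.e. 24KB
print([vl for vl, w in low if w == 0])   # [0, 4]
```

In the low-priority table, VL0 and VL4 both carry weight 0: VL0 because it is served from the high-priority table, VL4 because it is effectively turned off.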
176. ed even for a single port HCA; the HCA ignores this value.
     Table 36: mstflint Switches (Sheet 2 of 3). Each entry lists the switch, the affected/relevant commands, and the description.
     -guids <GUIDs> (burn, sg): 4 GUIDs must be specified here. The specified GUIDs are assigned the following values, respectively: node, port1, port2 and system image GUID. Note: Port2 guid must be specified even for a single port HCA; the HCA ignores this value. It can be set to 0x0.
     -mac <MAC> (burn, sg): MAC address base value. Two MACs are automatically assigned to the following values: mac -> port1, mac+1 -> port2. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
     -macs <MACs> (burn, sg): Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
     -blank_guids (burn): Burn the image with blank GUIDs and MACs (where applicable). These values can be set later using the sg command (see Table 37 below).
     -clear_semaphore (no commands allowed): Force clear the Flash semaphore on the device. No command is allowed when this switch is used. Warning: May result in system instability or Flash corruption if the device or another application is currently using the Flash.
     -i(mage) <image> (burn, verify): Binary image file.
     qq b
177. eep interval in seconds.
     --prefix_routes_file <path to file>: This option specifies the prefix routes file. Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. Default file is /etc/opensm/prefix-routes.conf.
     --consolidate_ipv6_snm_req: Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.
     --consolidate_ipv4_mask: Use mask for IPv4 multicast groups multiplexing per MGID scope and P_Key.
     --pid_file <path to file>: Specifies the file that contains the process ID of the opensm daemon. The default is /var/run/opensm.pid.
     --max_seq_redisc: Specifies the maximum number of failed discovery loops done by the SM before completing the whole heavy sweep cycle.
     --mc_secondary_root_guid <GUID in hex>: This option defines the guid of the multicast secondary root switch.
     --mc_primary_root_guid <GUID in hex>: This option defines the guid of the multicast primary root switch.
     --guid_routing_order_no_scatter: Don't use scatter for ports defined in the guid_routing_order file.
     --sa_pr_full_world_queries_allowed: This option allows OpenSM to respond to full World Path Record queries (path record for each pair of ports in a fabric).
     --enable_crashd: This option causes OpenSM to run a Crash Daemon child process that allows a backtrace dump in case of fatal terminating signals.
     --log_prefix <prefix text>: Pre
178. efines the number of resources supported. The parameter name for selecting the profile is prof_sel. The supported values for profiles are:
     - 0 for medium resources, medium performance
     - 1 for low resources
     - 2 for high performance
     (int, default)
     Appendix D: Lustre Compilation over MLNX_OFED
     This procedure applies to RHEL/SLES OSs supported by Lustre. For further information, please refer to the Lustre Release Notes.
     To compile Lustre version 2.4.0 and higher:
     ./configure --with-o2ib=/usr/src/ofa_kernel/default/
     make rpms
     To compile older Lustre versions:
     EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" ./configure --with-o2ib=/usr/src/ofa_kernel/default/
     EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" make rpms
     Appendix E: Using FlexBoot for Booting Various OSs from an iSCSI Target
     Below are instructions on how to provision a diskless system (the client) with a fresh SLES11 SP3 and RHEL6.4 installation to a remote storage (i.e. a LUN/partition on an iSCSI target), and then SAN-Booting (iSCSI boot) the client using the Mellanox PXE boot agent, FlexBoot. The iSCSI configuration in this document is very basic (no CHAP authentication, no multipath I/O) and de
179. ellanox/fca folder. For further information on configuration instructions, please refer to the FCA User Manual.
     5.5 ScalableUPC
     Unified Parallel C (UPC) is an extension of the C programming language designed for high-performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared, partitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor. In order to express parallelism, UPC extends ISO C 99 with the following constructs:
     - An explicitly parallel execution model
     - A shared address space
     - Synchronization primitives and a memory consistency model
     - Memory management primitives
     The UPC language evolved from experiences with three other earlier languages that proposed parallel extensions to ISO C 99: AC, Split-C, and Parallel C Preprocessor (PCP). UPC is not a superset of these three languages, but rather an attempt to distill the best characteristics of each. UPC combines the programmability advantages of the shared memory programming paradigm and the control over data layout and performance of the message pass
180. en 64 hops are required for traversing the local port to the Source port and then to the Destination port.
     4 - Unable to traverse the LFT data from source to destination
     5 - Failed to use Topology File
     6 - Failed to load required Package
     9.4.4 ibstat
     ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.
     Synopsis: ibstat [-d(ebug)] [-l(ist_of_cas)] [-s(hort)] [-p(ort_list)] [-V(ersion)] [-h] <ca_name> [portnum]
     Options: The table below lists the various flags of the command. Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.
     Table 25: ibstat Flags and Options
     -l, --list_of_cas: List all IB devices
     -s, --short: Short output
     -p, --port_list: Show port list
     ca_name: InfiniBand device name
     portnum: Port number of InfiniBand device
     --debug, -d (-ddd or -d -d -d): Raise the IB debugging level
     --help, -h: Show the usage message
     --verbose, -v (-vv or -v -v -v): Increase the application verbosity level
     --version, -V: Show the version info
     --usage, -u: Usage message
     Examples:
     ibstat: display status of all ports on
181. end TCP Header
     -Z, --server: Choose server side for the current machine (--server/--client must be selected)
     -P, --client: Choose client side for the current machine (--server/--client must be selected)
     -v, --mac_fwd: Run mac forwarding test
     --tcp: Send TCP packets. Must include IP and Ports information.
     Appendix A: SRP Target Driver
     The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many IO modes on real or virtual devices in the back end:
     1. scst_vdisk (fileio and blockio modes). This allows turning software RAID volumes, LVM volumes, IDE disks, block devices and normal files into SRP LUNs.
     2. NULLIO mode allows measuring the performance without sending IOs to real devices.
     A.1 Prerequisites and Installation
     1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target. On distribution default kernels you can run scst_vdisk blockio mode to obtain good performance.
     2. Download and install the SCST driver. The supported version is 1.0.1.1.
     a. Download scst-1.0.1.1.tar.gz from
182. end_bw Flags and Options
     --inline_recv=<size>: Max size of message to be sent in inline receive
     --output=<units>: Set verbosity output level: bandwidth, message_rate, latency_typical
     --pkey_index=<pkey index>: PKey index to use for QP
     --report-both: Report RX & TX results separately on bidirectional BW tests
     --report_gbits: Report Max/Average BW of test in Gbit/sec (instead of MB/sec)
     --run_infinitely: Run test forever, print results every <duration> seconds
     Rate Limiter
     The table below lists the Rate Limiter flags of the command.
     Table 44: Additional Rate Limiter Flags and Options
     --burst_size=<size>: Set the number of messages to send in a burst when using the rate limiter
     --rate_limit=<rate>: Set the maximum rate of sent packets
     --rate_units=<units> [Mgp]: Set the units for the rate limit to MBps (M), Gbps (g) or pps (p)
     9.5.4 ib_send_lat
     ib_send_lat calculates the latency of sending a packet of a given message size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which a side sends a packet only after it receives one. Each side samples the CPU clock each time it receives a packet in order to calculate the latency. Using the -a flag provides results for all message sizes.
     Synops
183. ensitive.
     Usage: The application calls the ibv_exp_reg_mr API, which turns on the IBV_EXP_ACCESS_ALLOCATE_MR bit and sets the input address to NULL. Upon success, the address field of the struct ibv_mr will hold the address of the allocated memory block. This block will be freed implicitly when ibv_dereg_mr is called. The following are environment variables that can be used to control error cases/contiguity:
     Table 4: Parameters Used to Control Error Cases/Contiguity
     MLX_MR_ALLOC_TYPE: Configures the allocator type.
     - ALL (Default): Uses all possible allocators and selects the most efficient allocator.
     - ANON: Enables the usage of anonymous pages and disables the allocator.
     - CONTIG: Forces the usage of the contiguous pages allocator. If contiguous pages are not available, the allocation fails.
     MLX_MR_MAX_LOG2_CONTIG_BSIZE: Sets the maximum contiguous block size order. Values: 12-23. Default: 23.
     MLX_MR_MIN_LOG2_CONTIG_BSIZE: Sets the minimum contiguous block size order. Values: 12-23. Default: 12.
     4.11 Shared Memory Region
     Shared Memory Region is only applicable to the mlx4 driver.
     Shared Memory Region (MR) enables sharing an MR among applications by implementing the "Register Shared MR" verb, which is part of the IB spec. Sharing an MR involves the following steps:
     Step 1: Request to cr
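To make the "block size order" values in Table 4 concrete, the following Python sketch reads the two environment variables, validates them against the documented 12-23 range, and converts them to sizes. It assumes that the order is a base-2 logarithm of a size in bytes (the variable names suggest log2, but the unit is not stated in this excerpt); the helper name is hypothetical.

```python
import os


def contig_block_bounds(env=os.environ):
    """Return (min_bytes, max_bytes) implied by the MLX_MR_*_LOG2_CONTIG_BSIZE
    variables, falling back to the documented defaults (12 and 23)."""
    lo = int(env.get("MLX_MR_MIN_LOG2_CONTIG_BSIZE", 12))
    hi = int(env.get("MLX_MR_MAX_LOG2_CONTIG_BSIZE", 23))
    for order in (lo, hi):
        if not 12 <= order <= 23:
            raise ValueError(f"order {order} is outside the documented range 12-23")
    if lo > hi:
        raise ValueError("minimum order exceeds maximum order")
    # Assumption: order n means a block of 2**n bytes.
    return 1 << lo, 1 << hi


print(contig_block_bounds({}))  # (4096, 8388608)
```

Under that assumption the defaults span 4KB to 8MB.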
184. ent successfully
     vport<i>_tx_multicast_packets: Multicast packets sent successfully
     vport<i>_tx_multicast_bytes: Multicast packet bytes sent successfully
     vport<i>_tx_broadcast_packets: Broadcast packets sent successfully
     vport<i>_tx_broadcast_bytes: Broadcast packet bytes sent successfully
     vport<i>_tx_errors: Packets dropped due to transmit errors
     Table 12: SW Statistics
     rx_lro_aggregated: Number of packets aggregated
     rx_lro_flushed: Number of LRO flushes to the stack
     rx_lro_no_desc: Number of times LRO description was not found
     rx_alloc_failed: Number of times preparing a receive descriptor failed
     rx_csum_good: Number of packets received with good checksum
     rx_csum_none: Number of packets received with no checksum indication
     tx_chksum_offload: Number of packets transmitted with checksum offload
     tx_queue_stopped: Number of times transmit queue suspended
     tx_wake_queue: Number of times transmit queue resumed
     tx_timeout: Number of transmitter timeouts
     tx_tso_packets: Number of packets that were aggregated
     Table 13: Per Ring SW Statistics (where <i> is the ring, per configuration)
     rx<i>_packets: Total packets successfully received on ring i
     rx<i>_bytes: Total bytes in successfully received packets on ring i
     tx<
185. environment, the following is required:
     - MLNX_OFED Driver
     - A server/blade with an SR-IOV-capable motherboard BIOS
     - Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
     - Mellanox ConnectX VPI Adapter Card family with SR-IOV capability
     4.14.2 Setting Up SR-IOV
     Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.
     Step 1: Enable "SR-IOV" in the system BIOS. [BIOS setup screenshot not reproduced]
     Step 2: Enable "Intel Virtualization Technology". [BIOS setup screenshot not reproduced]
     Step 3: Install a hypervisor that supports SR-IOV.
     Step 4: Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel. For example, to Intel systems, add:
     default=0
     timeout=5
     splashimage=(hd0,0)/grub/splash.xpm.gz
     hiddenmenu
     title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
     root (hd0,0)
     kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on (1)
     initrd /initrd-2.6.32-36.x86-64.img
     (1) Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file o
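The footnote in item 185 asks that intel_iommu=on be present on the kernel line. A small Python sketch of such a check is shown below; it scans grub.conf-style text and reports kernel lines that lack the parameter. The helper name is hypothetical, and the sample text is taken from the example entry above.

```python
def kernel_lines_missing_iommu(grub_conf_text, param="intel_iommu=on"):
    """Return the 'kernel ...' lines that do not carry the given parameter."""
    missing = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        if words and words[0] == "kernel" and param not in words:
            missing.append(line.strip())
    return missing


sample = """\
title Red Hat Enterprise Linux Server
    root (hd0,0)
    kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
    initrd /initrd-2.6.32-36.x86-64.img
"""
print(kernel_lines_missing_iommu(sample))  # []
```

To check a real system, the text of /boot/grub/grub.conf would be read from disk and passed in the same way.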
186. er, print results every <duration> seconds
     9.5.8 ib_atomic_lat
     Calculates the latency of an RDMA Atomic transaction of a given message size between a pair of machines. One acts as a server and the other as a client. The client sends an RDMA atomic operation and samples the CPU clock when it receives a successful completion, in order to calculate latency.
     Synopsis:
     Server: ib_atomic_lat [options]
     Client: ib_atomic_lat [options] hostname
     Options: The table below lists the various flags of the command.
     Table 53: ib_atomic_lat Flags and Options
     -A, --atomic_type=<type>: Type of atomic operation from {CMP_AND_SWAP, FETCH_AND_ADD} (default FETCH_AND_ADD)
     -c, --connection=<RC/XRC/DC>: Connection type RC/XRC/DC (default RC)
     -C, --report-cycles: Report times in CPU cycle units (default microseconds)
     -d, --ib-dev=<dev>: Use IB device <dev> (default first device found)
     -D, --duration: Run test for a customized period of seconds
     -e, --events: Sleep on CQ events (default poll)
     -f, --margin: Measure results within margins (default 2sec)
     -h, --help: Show this help screen
     -H, --report-histogram: Print out all results (default print summary only)
     -i, --ib-port=<port>: Use port <port> of IB device (default 1)
     -m, --mtu=<mtu>: MTU size: 256 - 4096 (default
187. er, as it may get timeouts on the AR-related queries to these switches.
     8.8.2 Installing the Adaptive Routing
     Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of the Mellanox OFED installation.
     8.8.3 Running Subnet Manager with Adaptive Routing Manager
     Adaptive Routing (AR) Manager can be enabled/disabled through the SM options file.
     8.8.3.1 Enabling Adaptive Routing
     To enable Adaptive Routing, perform the following:
     1. Create the Subnet Manager options file. Run: opensm -c <options-file-name>
     2. Add 'armgr' to the 'event_plugin_name' option in the file:
     # Event plugin name(s)
     event_plugin_name armgr
     3. Run Subnet Manager with the new options file: opensm -F <options-file-name>
     Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To provide an alternative location, please perform the following:
     1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the 'event_plugin_options' option in the file:
     # Options string that would be passed to the plugin(s)
     event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
     2. Run Subnet Manager with the new options file: opensm -F <options-file-name>
     See
188. ernel v2.6.28.
     MLX4 Driver Support: The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure. The following are the flow-specific parameters:
     Table 5: Flow Specific Parameters (columns: ether, tcp4/udp4, ip4)
     Mandatory: dst (ether); src-ip/dst-ip
     Optional: vlan (ether); src-ip, dst-ip, src-port, dst-port, vlan (tcp4/udp4); src-ip, dst-ip, vlan (ip4)
     RFS: RFS is an in-kernel logic responsible for load balancing between CPUs by attaching flows to CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic by implementing ndo_rx_flow_steer, which in turn calls the underlying flow steering mechanism with the RFS domain. Enabling RFS requires enabling the 'ntuple' flag via ethtool. For example, to enable ntuple for eth0, run: ethtool -K eth0 ntuple on
     RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steering support. RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.
     All of the rest: The lowest priority domain serves the following users: The mlx4 Ethernet driver attaches its unicast and multicast MAC addresses to its QP using L2 flow specifications.
189. es. The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
     - Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner.
     - Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served.
     - Packets carry class of service marking in the range 0 to 15 in their header SL field.
     - Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table VL = SL-to-VL-MAP(in-port, out-port, SL).
     - The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.
     DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.
     4.4.2 QoS Architecture
     QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the "chronology approach" to describe how the overall system works.
     1. The network manager (human) provides a set of rules (policy) that define how the network is being con
190. es and aliases:
     ClassPortInfo (CPI)
     NodeRecord (NR) [lid]
     PortInfoRecord (PIR) [[lid]/[port]/[options]]
     SL2VLTableRecord (SL2VL) [[lid]/[in_port]/[out_port]]
     PKeyTableRecord (PKTR) [[lid]/[port]/[block]]
     VLArbitrationTableRecord (VLAR) [[lid]/[port]/[block]]
     InformInfoRecord (IIR)
     LinkRecord (LR) [[from_lid]/[from_port]] [[to_lid]/[to_port]]
     ServiceRecord (SR)
     PathRecord (PR)
     MCMemberRecord (MCMR)
     LFTRecord (LFTR) [[lid]/[block]]
     MFTRecord (MFTR) [[mlid]/[position]/[block]]
     GUIDInfoRecord (GIR) [[lid]/[block]]
     -d: Enables debugging
     -h: Shows help
     9.4.8 smpdump
     smpdump is a general purpose SMP utility which gets SM attributes from a specified SMA. The result is dumped in hex by default.
     Synopsis: smpdump [-s(ring)] [-D(irect)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [-h(elp)] <dlid|dr_path> <attr> [mod]
     Options:
     Table 29: smpdump Flags and Options
     attr: IBA attribute ID for SM attribute
     mod: IBA modifier for SM attribute
     Debugging flags (NOTE: Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.)
     -d: Raises the IB debugging level. Can be used several times (-ddd or -d -d -d).
     -e: Shows send and receive errors (timeou
191. es, set the following parameter: enable. The values are: <TRUE | FALSE>. The default is: True.
     - CC manager configures the CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive the CC mechanism is in its response to congestion. To manually modify CC manager behavior by providing it with an arbitrary fabric size, set the following parameter: num_hosts. The values are: [0-48K]. The default is: 0 (based on the CCT calculation on the current subnet size).
     - The smaller the value of the parameter, the faster HCAs will respond to the congestion and throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter: marking_rate. The values are: 0-0x [upper bound not recoverable]. The default is: 0xa.
     - You can set the minimal packet size that can be marked with FECN. Any packet smaller than this size (in bytes) will not be marked with FECN. To do so, set the following parameter: packet_size. The values are: [0-0x3fc0]. The default is: 0x200.
     - When the number of send/receive errors or timeouts exceeds 'max_errors' in less than 'error_window' seconds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters: max_errors, error_window
192. et the parameter log_num_mgm_entry_size to -1 by writing the option: mlx4_core log_num_mgm_entry_size=-1
     Step 3: Restart the driver.
     To disable Flow Steering:
     Step 1: Open the /etc/modprobe.d/mlnx.conf file.
     Step 2: Remove the options mlx4_core log_num_mgm_entry_size=-1.
     Step 3: Restart the driver.
     4.13.2 Flow Domains and Priorities
     Flow steering defines the concept of domain and priority. Each domain represents a user agent that can attach a flow. The domains are prioritized. A higher priority domain will always supersede a lower priority domain when their flow specifications overlap. Setting a lower priority value will result in higher priority. In addition to the domain, there is priority within each of the domains. Each domain can have at most 2^12 priorities, in accordance with its needs. The following are the domains in descending order of priority:
     User Verbs: allows a user application QP to be attached to a specified flow when using the ibv_exp_create_flow and ibv_exp_destroy_flow verbs.
     ibv_exp_create_flow
     struct ibv_exp_flow *ibv_exp_create_flow(struct ibv_qp *qp, struct ibv_exp_flow_attr *flow)
     Input parameters:
     - struct ibv_qp: the attached QP.
     - struct ibv_exp_flow_attr: attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers. The optional headers are detected by settin
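The precedence rule described in item 192 (a higher-priority domain supersedes a lower one; within a domain, a lower priority value wins) can be modeled in a few lines of Python. This is only a toy model of the ordering, not the driver's data structures: the `FlowRule` type, the integer `domain_rank` encoding, and the target names are all hypothetical.

```python
from dataclasses import dataclass

MAX_PRIORITIES = 2 ** 12  # at most 2^12 priorities per domain


@dataclass(frozen=True)
class FlowRule:
    domain_rank: int  # 0 = highest-priority domain (hypothetical encoding)
    priority: int     # lower value = higher priority within the domain
    target: str       # label of the QP that would receive the traffic

    def __post_init__(self):
        if not 0 <= self.priority < MAX_PRIORITIES:
            raise ValueError("priority must be in the range 0 .. 2^12 - 1")


def winning_rule(overlapping):
    """Pick the rule that takes effect among rules whose flow specs overlap."""
    return min(overlapping, key=lambda r: (r.domain_rank, r.priority))


rules = [
    FlowRule(domain_rank=3, priority=0, target="qp_default"),
    FlowRule(domain_rank=0, priority=7, target="qp_b"),
    FlowRule(domain_rank=0, priority=2, target="qp_a"),
]
print(winning_rule(rules).target)  # qp_a
```

The rule in domain rank 3 loses despite having priority value 0, because domain precedence is compared first.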
193. ether, the first port in the direct route must be equal to the one specified in the "-p" option. Otherwise, an error is reported.
     When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error. Moreover, the tool allows omitting the source node in LID route addressing, in which case the local port on the machine running the tool is assumed to be the source.
     Synopsis: ibdiagpath {-n <[src-name,]dst-name> | -l <[src-lid,]dst-lid> | -d <p1,p2,p3,...>} [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]
     Options:
     -n <[src-name,]dst-name>: Names of the source and destination ports (as defined in the topology file; source may be omitted -> local port is assumed to be the source)
     -l <[src-lid,]dst-lid>: Source and destination LIDs (source may be omitted -> the local port is assumed to be the source)
     -d <p1,p2,p3,...>: Directed route from the local node (which is the source) and the destination node
     -c <count>: The minimal number of packets to be sent across each link (default = 100)
     -v: Enable ve
194. ex to use for QP
     9.5.5 ib_write_bw
     ib_write_bw calculates the BW of RDMA write between a pair of machines. One acts as a server and the other as a client. The client RDMA-writes to the server memory and calculates the BW by sampling the CPU each time it receives a successful completion. The test supports features such as Bidirectional (in which both sides RDMA-write to each other at the same time), change of MTU size, tx size, number of iterations, message size and more. Using the -a flag provides results for all message sizes.
     Synopsis:
     Server: ib_write_bw [options]
     Client: ib_write_bw [options] hostname
     Options: The table below lists the various flags of the command.
     Table 47: ib_write_bw Flags and Options
     -a, --all: Run sizes from 2 till 2^23
     -b, --bidirectional: Measure bidirectional bandwidth (default unidirectional)
     -c, --connection=<RC/XRC/UC/DC>: Connection type RC/XRC/UC/DC (default RC)
     -d, --ib-dev=<dev>: Use IB device <dev> (default first device found)
     -D, --duration: Run test for a customized period of seconds
     -f, --margin: Measure results within margins (default 2sec)
     -F, --CPU-freq: Do not fail even if the cpufreq_ondemand module is loaded
     -h, --help: Show this help screen
     -i, --ib-port=<port>: Use port <port> of IB device (default 1)
     -I, --inline_size=<size>: Max size of message to be
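For orientation, the range named for the -a flag in item 194 ("sizes from 2 till 2^23") can be enumerated with a short Python sketch. It assumes the sizes step through successive powers of two, which the excerpt does not state explicitly; the function name is hypothetical.

```python
def all_message_sizes(max_exp=23):
    """Message sizes in bytes from 2 up to 2**max_exp, assuming power-of-two steps."""
    return [2 ** e for e in range(1, max_exp + 1)]


sizes = all_message_sizes()
print(len(sizes), sizes[0], sizes[-1])  # 23 2 8388608
```

Under that assumption a full -a run covers 23 message sizes, the largest being 8MB.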
195. extension). Show the new configuration: [ibportstate command line and PortInfo output garbled in extraction; recoverable values: link width entries "1X or 4X", "1X or 4X" and "4X"; LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps; LinkSpeedEnabled: 5.0 Gbps (IBA extension); a final speed entry of 5.0 Gbps]
     ibroute
     Uses SMPs to display the forwarding tables (unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT)) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.
     Synopsis: ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]
     Output Files: Table 33 lists the various flags of the command.
     Table 33: ibportstate Flags and Options (columns: Flag, Optional/Mandatory, Default If Not Specified, Description)
     -h(help): Optional. Print the help menu.
     -d(ebug): Optional. Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
     -a(ll): Optional. Show all LIDs in range, including invalid entries.
     -v(erbose): Optional. Increase verbosity level. May be used several time
196. f it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order before the rest of the LASH algorithm runs.
     8.5.6 DOR Routing Algorithm
     The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension. Use the '-R dor' option to activate the DOR algorithm.
     8.5.7 Torus-2QoS Routing Algorithm
     Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The tor
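The idea of choosing among shortest paths by an ordering of dimensions, as described for DOR in item 196, can be illustrated on a plain coordinate mesh. The Python sketch below is a deliberately simplified model: it works on switch coordinates rather than ports, walks forward from source to destination rather than growing paths back from the destination, and ignores wraparound links. It is not OpenSM's implementation, and the function name is hypothetical.

```python
def dor_path(src, dst):
    """Coordinates visited when routing from src to dst on a mesh, fully
    correcting the lowest dimension first, then the next, and so on."""
    cur = list(src)
    path = [tuple(cur)]
    for dim in range(len(cur)):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path


print(dor_path((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every route finishes dimension 0 before touching dimension 1, two routes never wait on each other in a cycle across dimensions, which is the ordering property the text relies on to avoid deadlock.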
197. configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application, ULP, or service uses.
2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class, Service-ID, and PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is set up. IPoIB uses the SL, MTU, RATE, and Packet Lifetime available on the multicast group that forms the broadcast group of this partition.
4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service-ID and use it while obtaining a PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE, and Lifetime.
6. ULPs and programs (e.g. SDP) use CMA to establish an RC connection and provide the CMA with the target IP and port number. ULPs might also provide a QoS-Class. The CMA then creates a Service-ID for the U
198. prefix to syslog messages from OpenSM.
--verbose, -v
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.
-V
This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.
-D <flags>
This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:
  BIT    LOG LEVEL ENABLED
  0x01   ERROR (error messages)
  0x02   INFO (basic messages, low volume)
  0x04   VERBOSE (interesting stuff, moderate volume)
  0x08   DEBUG (diagnostic, high volume)
  0x10   FUNCS (function entry/exit, very high volume)
  0x20   FRAMES (dumps all SMP and GMP frames)
  0x40   ROUTING (dump FDB routing information)
  0x80   currently unused
Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
--debug, -d <number>
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:
  OPT    Description
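The -D bit field described above can be decoded mechanically. The following is a minimal sketch, not part of OpenSM: the helper names decode_log_flags and encode_log_flags are hypothetical, and only the bit values listed in the table are taken from the text.

```python
# Log-level bits exactly as listed for the -D option (0x80 is unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags):
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in LOG_LEVELS.items() if flags & bit]


def encode_log_flags(*names):
    """Build a -D flags value from log-level names (hypothetical helper)."""
    by_name = {name: bit for bit, name in LOG_LEVELS.items()}
    flags = 0
    for name in names:
        flags |= by_name[name]  # raises KeyError for an unknown level name
    return flags


if __name__ == "__main__":
    # The documented default (no -D given) is ERROR + INFO.
    print(decode_log_flags(0x03))
    print(hex(encode_log_flags("ERROR", "INFO", "DEBUG")))
```

For example, a value built from ERROR, INFO, and DEBUG would be passed to OpenSM as '-D 0x0b'.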
199. for Msg Rate
  -z, --com_rdma_cm    Communicate with rdma_cm module to exchange data, use regular QPs
Additional Options: The table below lists the additional flags of the command.
Table 48 - Additional ib_write_bw Flags and Options (columns: Flag | Description)
  --output=<units>    Set verbosity output level: bandwidth, message_rate, latency_typical
  --pkey_index=<pkey index>    PKey index to use for QP
  --report-both    Report RX & TX results separately on Bidirectional BW tests
  --report_gbits    Report Max/Average BW of test in Gbit/sec (instead of MB/sec)
  --run_infinitely    Run test forever, print results every <duration> seconds
9.5.6 ib_write_lat
ib_write_lat calculates the latency of an RDMA write operation of message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which one side RDMA-writes to the other side's memory only after the other side has written to its memory. Each side samples the CPU clock each time it writes to the other side's memory, in order to calculate latency.
Synopsis:
  Server: ib_write_lat [options]
  Client: ib_write_lat [options] [hostname]
Options: The table below lists the various flags of the command.
Table 49 - ib_write_lat
200. verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.
8.5.7.6 Torus-2QoS Configuration File Syntax
The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.
  [torus|mesh] x_radix[m|M|t|T] y_radix[m|M|t|T] z_radix[m|M|t|T]
Either torus or mesh must be the first keyword in the configuration; it sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology.
Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).
  xp_link sw0_GUID sw1_GUID
  yp_link sw0_GUID sw1_GUID
  zp_link sw0_GUID sw1_GUID
  xm_link sw0_GUID sw1_GUID
  ym_link sw0_GUID sw1_GUID
  zm_link sw0 GUID
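The radix-suffix rule for the topology line can be illustrated with a small parser. This is a sketch under stated assumptions, not the OpenSM parser: the function name parse_topology and the (radix, looped) tuple layout are hypothetical, and error handling is reduced to ValueError. It only demonstrates why "mesh 3T 4 5" and "torus 3 4M 5M" describe the same topology.

```python
def parse_topology(line):
    """Parse a '[torus|mesh] x_radix y_radix z_radix' line.

    Returns a list of (radix, looped) tuples for the x, y, and z dimensions.
    A dimension without a suffix inherits the keyword: looped for 'torus',
    open for 'mesh'. Suffix m/M forces open, t/T forces looped.
    """
    tokens = line.split()
    if len(tokens) < 4 or tokens[0] not in ("torus", "mesh"):
        raise ValueError("expected: [torus|mesh] x_radix y_radix z_radix")
    default_looped = tokens[0] == "torus"
    dims = []
    # Tokens after the three radix fields are ignored, as the text specifies.
    for tok in tokens[1:4]:
        if tok[-1] in "mM":
            radix, looped = int(tok[:-1]), False
        elif tok[-1] in "tT":
            radix, looped = int(tok[:-1]), True
        else:
            radix, looped = int(tok), default_looped
        dims.append((radix, looped))
    return dims


if __name__ == "__main__":
    a = parse_topology("mesh 3T 4 5")
    b = parse_topology("torus 3 4M 5M")
    print(a, a == b)
```

Both lines resolve to a looped x dimension of radix 3 and open y and z dimensions of radix 4 and 5.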
201. g command:
  % /opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5 -x MR_FORCE_CONTIG_PAGES=1
For further information on Contiguous Pages, please refer to Section 4.10, "Contiguous Pages", on page 84.
5.1.5 Running ScalableSHMEM Application
The ScalableSHMEM framework contains the shmemrun utility, which launches the executable from a service node to compute nodes. This utility accepts the same command line parameters as mpirun from the OpenMPI package. For further information, please refer to the OpenMPI MCA parameters documentation at http://www.open-mpi.org/faq/?category=running. Run "shmemrun --help" to obtain the ScalableSHMEM job launcher runtime parameters. ScalableSHMEM contains support for the environment module system (http://modules.sf.net). The modules configuration file can be found at: /opt/mellanox/openshmem/2.2/etc/shmem_modulefile
5.2 Message Passing Interface
5.2.1 Overview
Mellanox OFED for Linux includes the following Message Passing Interface (MPI) implementations over InfiniBand:
  Open MPI 1.4.6 & 1.6.1: an open source MPI-2 implementation by the Open MPI Project
  OSU MVAPICH2 1.7: an MPI-1 implementation by Ohio State University
These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Tab
202. g steps are also appropriate in case you wish to burn newer firmware that you have downloaded from the Mellanox Technologies Web site (http://www.mellanox.com > Support > Firmware Download).
Step 1. Start mst.
  host1# mst start
Step 2. Identify your target InfiniBand device for firmware update.
1. Get the list of InfiniBand device names on your machine.
  host1# mst status
  MST modules:
    MST PCI module loaded
    MST PCI configuration module loaded
    MST Calibre (I2C) module is not loaded
  MST devices:
    /dev/mst/mt25418_pciconf0 - PCI configuration cycles access. bus:dev.fn=02:00.0 addr.reg=88 data.reg=92 Chip revision is: A0
    /dev/mst/mt25418_pci_cr0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdef00000 size 0x100000 Chip revision is: A0
    /dev/mst/mt25418_pci_msix0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdeefe000 size 0x2000
    /dev/mst/mt25418_pci_uar0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdc800000 size 0x800000
2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.
Step 3. Burn firmware.
Burn a firmware image from a .mlx file using the mlxburn utility that is already installed on your machine. The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2.
  > flint -d /dev/mst/mt25418_pci_cr0 -i fw-25408-2_1_8000-MCX353A-FC
203. g the size and num_of_specs fields, struct ibv_exp_flow_attr can be followed by the optional flow headers structs:
  struct ibv_exp_flow_spec_ib
  struct ibv_exp_flow_spec_eth
  struct ibv_exp_flow_spec_ipv4
  struct ibv_exp_flow_spec_tcp_udp
For further information, please refer to the ibv_exp_create_flow man page. Be advised that as of MLNX_OFED v2.0-3.0.0, the parameters (both the value and the mask) should be set in big-endian format.
Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter. The supported masks are:
  All-one mask: include the parameter value in the attached rule. Note: since the VLAN ID in the Ethernet header is 12 bits long, the following parameter should be used: flow_spec_eth.mask.vlan_tag = htons(0x0fff).
  All-zero mask: ignore the parameter value in the attached rule.
When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule specifications. The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the Ethernet link type, since InfiniBand link type packets always include a QP number. For further information, please refer to the relevant man pages.
ibv_exp_destroy_flow
  int ibv_exp_destroy_flow(struct ibv_exp_flow *flow_id)
Input parameters: ibv_exp_destroy_flow requires struct ibv_exp_flow, which is the return value of ibv_exp_create_flow in case of success.
Output parameters: Retur
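The big-endian requirement for the VLAN mask can be made concrete by looking at the bytes. The sketch below is an illustration only and does not call the verbs API: it uses Python's struct module to show the network-order byte layout that htons(0x0fff) yields for the 12-bit VLAN ID mask, and the helper name vlan_mask_bytes is hypothetical.

```python
import struct

VLAN_ID_BITS = 12                        # VLAN ID width in the Ethernet header
VLAN_ID_MASK = (1 << VLAN_ID_BITS) - 1   # 0x0fff, the all-one mask for the ID


def vlan_mask_bytes(mask=VLAN_ID_MASK):
    """Return the 16-bit mask in network (big-endian) byte order.

    '!H' packs an unsigned 16-bit value big-endian, which is the in-memory
    layout htons() produces regardless of host endianness.
    """
    return struct.pack("!H", mask)


if __name__ == "__main__":
    print(hex(VLAN_ID_MASK), vlan_mask_bytes().hex())
```

The upper four bits of the 16-bit tag field stay zero in the mask, so they are ignored during matching.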
204. weighting factor per port for customizing the least weight hops for the routing.
--port_search_ordering_file, -O <path to file>
This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR). Moreover, this option provides the means to define non-default routing port order.
--dimn_ports_file, -O <path to file> (DEPRECATED)
This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR).
--honor_guid2lid, -x
This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE.
--const_multicast
This option forces OpenSM to conserve previously built multicast trees.
--log_file, -f <log file name>
This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output, use -f stdout.
--log_limit, -L <size in MB>
This option defines the maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.
--erase_log_file, -e
This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.
--Pconfig, -P <partition config file>
This option defines the optional partition configuration fi
205. configuration, only SL values 0 and 8 should be used with torus-2QoS.
Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated.
Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the range 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.
8.5.7.5 Operational Considerations
Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement.
If a change in fabric topology causes changes in the path SL values required to route without credit loops, in general all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing
206. h a specific target port GUID
  end-qos-ulps
Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query didn't match any other rule. All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section. Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.
8.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
  ipoib : <SL>
  ipoib, pkey 0x7fff : <SL>
  any, pkey 0x7fff : <SL>
8.6.6.2 SDP
An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:
  sdp : <SL>
  any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
8
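The SDP Service ID layout described in 8.6.6.2 is simple enough to compute by hand. As a minimal sketch (the function names sdp_service_id and is_sdp_service_id are hypothetical and not part of OpenSM), the following derives the Service ID for a TCP/IP port and checks membership in the range used by the equivalent "any, service-id" rule.

```python
SDP_SERVICE_ID_BASE = 0x0000000000010000  # 0x000000000001PPPP with PPPP = 0
SDP_SERVICE_ID_LAST = 0x000000000001FFFF


def sdp_service_id(port):
    """Return the SDP Service ID for a remote TCP/IP port (0..65535)."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return SDP_SERVICE_ID_BASE | port


def is_sdp_service_id(service_id):
    """True if the Service ID falls in the range matched by the 'sdp' rule."""
    return SDP_SERVICE_ID_BASE <= service_id <= SDP_SERVICE_ID_LAST


if __name__ == "__main__":
    sid = sdp_service_id(30000)
    print("0x%016x" % sid, is_sdp_service_id(sid))
```

Port 30000 is 0x7530, so a rule such as "sdp, port-num 30000" corresponds to Service ID 0x0000000000017530.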
207. h3 Down
9.4.12 ibstatus
Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width, and port rate.
Synopsis: ibstatus [-h] [<device name>[:<port>]]
Output Files: Table 31 lists the various flags of the command.
Table 31 - ibstatus Flags and Options (columns: Flag | Optional / Mandatory | Default If Not Specified | Description)
  -h | Optional | | Print the help menu
  <device> | Optional | All devices | Print information for the specified device. May specify more than one device
  <port> | Optional, but requires specifying a device name | All ports of the specified device | Print information for the specified port only (of the specified device)
Examples
1. List the status of all available InfiniBand devices and their ports.
  > ibstatus
  Infiniband device 'mlx4_0' port 1 status:
    default gid: fe80:0000:0000:0000:0000:0000:0007:3896
    base lid: 0x3
    sm lid: 0x3
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 20 Gb/sec (4X DDR)
  Infiniband device 'mlx4_0' port 2 status:
    default gid: fe80:0000:0000:0000:0000:0000:0007:3897
    base lid: 0x1
    sm lid: 0x1
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 20 Gb/sec (4X DDR)
208. the Mellanox OFED Release Notes file.
  Required Disk Space: 1GB for Installation
  Device ID: For the latest list of device IDs, please visit the Mellanox website.
  Operating System: Linux operating system. For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
  Installer Privileges: The installation requires administrator privileges on the target machine.
2.2 Downloading Mellanox OFED
Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display. The following example shows a system with an installed Mellanox HCA:
  # lspci -v | grep Mellanox
  06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
  Subsystem: Mellanox Technologies Device 0024
Step 2. Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label><CPU arch>.iso. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 3. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page.
  host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
2.3 Installing Mellanox OFED
The installation script, mlnxofedinstall, performs t
209. he following:
  Discovers the currently installed kernel
  Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
  Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
  Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware
2.3.1 Pre-installation Notes
The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages. Pre-existing configuration files will be saved with the extension ".conf.rpmsave".
If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).
If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory.
On Redhat and SLES distributions with an errata kernel installed, there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.
1. The fir
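The integrity check in Step 3 of Section 2.2 can also be scripted, for instance when the same ISO is distributed to many cluster nodes. This is a minimal sketch using Python's standard hashlib rather than the md5sum utility itself; the function names and the idea of passing the expected digest as a string are assumptions for illustration, and the expected value must still be copied from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iso_matches(path, expected_md5):
    """Compare a file's MD5 with the value published on the download page."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for a large ISO image; the printed digest has the same format as the first column of md5sum output.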
210. the logfile will be truncated and restarted upon reaching this limit. Default: 5. This option cannot be changed on-the-fly.
8.8.5.1.1 Per-switch AR Options
A user can provide per-switch configuration options with the following syntax:
  SWITCH <GUID> <switch option 1> <switch option 2>
The following are the per-switch options:
Table 17 - Adaptive Routing Manager Per-Switch Options File (columns: Option File | Description | Values)
  ENABLE <true|false> | Allows you to enable/disable the AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. This option can be changed on-the-fly. | Default: true
  AGEING_TIME <usec> | Applicable to bounded AR mode only. Specifies how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value). In the per-switch options file, this option refers to the particular switch only. This option can be changed on-the-fly. | Default: 30
8.8.5.1.2 Example of Adaptive Routing Manager Options File
(Punctuation was lost in extraction; the tokens are shown as they survive, and the ERROR_WINDOW value is garbled in the source.)
  ENABLE true
  LOG_FILE /tmp/ar_mgr.log
  LOG_SIZE 100
  MAX_ERRORS 10
  ERROR_WINDOW 5
  SWITCH 0x12345 ENABLE true AGEING_TIME 77
  SWITCH 0x0002c902004050 8 AGEING_TIME 44
  SWITCH 0xabcde ENABL
211. raw_ethernet_bw calculates the BW of SEND between a pair of machines. One acts as a server and the other as a client. The server receives packets from the client, and they both calculate the throughput of the operation. The test supports features such as Bidirectional (in which they both send and receive at the same time), change of MTU size, tx size, number of iterations, message size, and more. Using the "-a" flag provides results for all message sizes.
  Server: raw_ethernet_bw [options]
  Client: raw_ethernet_bw [options] --dest_mac [MAC address of interface on server]
Options: The table below lists the various flags of the command.
Table 55 - raw_ethernet_bw Flags and Options (columns: Flag | Description)
  -a, --all | Run sizes from 2 till 2^23
  -b, --bidirectional | Measure bidirectional bandwidth (default unidirectional)
  -c, --connection=<RC/XRC/UC/UD/DC> | Connection type RC/XRC/UC/UD/DC (default RC)
  -d, --ib-dev=<dev> | Use IB device <dev> (default first device found)
  -D, --duration | Run test for a customized period of seconds
  -e, --events | Sleep on CQ events (default poll)
  -f, --margin | Measure results within margins (default 2sec)
  -F, --CPU-freq | Do not fail even if cpufreq_ondemand module is loaded
  -g, --mcg=<num_of_qps> | Send mes
212. without the interactive menus and prompts. It is suitable for scripting.
5.2.4 Compiling MPI Applications
Compiling MVAPICH Applications
Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html. To review the default configuration of the installation, check the default configuration file: /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf
Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps.
5.3 MellanoX Messaging
MellanoX Messaging (MXM) provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
  Multiple transport support, including RC and UD
  Proper management of HCA resources and memory structures
  Efficient memory registration
  One-sided communication semantics
  Connection management
  Receive-side tag matching
  Intra-node shared memory communication
These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries. The latest MXM software can be downloaded from the Mellanox website. MLNX_OFED v2.0 or later comes with a pre-installed version of MXM v2.x and Ope
213. tx<i>_packets: Total packets successfully transmitted on ring i
tx<i>_bytes: Total bytes in successfully transmitted packets on ring i
4.22 Memory Window
Memory Window allows the application to have more flexible control over remote access to its memory. It is available only on physical functions / native machines. The two types of Memory Windows supported are type 1 and type 2B. Memory Windows are intended for situations where the application wants to:
  grant and revoke remote access rights to a registered region in a dynamic fashion, with less of a performance penalty
  grant different remote access rights to different remote agents and/or grant those rights over different ranges within a registered region
For further information, please refer to the InfiniBand specification document.
Note: The Memory Windows API cannot co-work with peer memory clients (PeerDirect).
4.22.1 Query Capabilities
Memory Windows are available if and only if the hardware supports them. To verify whether Memory Windows are available, run ibv_exp_query_device. For example:
  struct ibv_exp_device_attr device_attr = {.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1};
  ibv_exp_query_device(context, &device_attr);
  if (device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MEM_WINDOW ||
      device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MW_TYPE_2B) {
      /* Memory window is supported */
214. application with a specific PKey in the PR/MPR query
  Any ULP/application with a specific target IB port GUID in the PR/MPR query
Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as no referring to a nonexisting port group, having a default QoS-Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file. The shortest policy file in this case would be as follows:
  qos-ulps
    default : 0 # default SL
  end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all. Below is an example of a simple QoS policy with all the possible keywords:
  qos-ulps
    default : 0 # default SL
    sdp, port-num 30000 : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
    sdp, port-num 10000-20000 : 0
    sdp : 1 # default SL for any other application running on top of SDP
    rds : 2 # SL for RDS traffic
    ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with pkey 0x0001
    ipoib : 4 # default IPoIB partition, pkey=0x7FFF
    any, service-id 0x6234 : 6 # match any PR/MPR query with a specific Service ID
    any, pkey 0x0ABC : 6 # match any PR/MPR query with a specific PKey
    srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on a specified IB port GUID
    any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query wit
215. [-i|--device <dev-name>] [-p|--port <port-num>] [-g|--guid <GUID in hex>] [--vlr <file>] [-r|--routing] [-u|--fat_tree] [-o|--output_path <directory>] [--skip <stage>] [--skip_plugin <library name>] [--pc] [-P|--counter <<PM>=<value>>] [--pm_pause_time <seconds>] [--ber_test] [--ber_use_data] [--ber_thresh <value>] [--extended_speeds <dev-type>] [--pm_per_lane] [--ls <2.5|5|10|14|25|FDR10>] [--lw <1x|4x|8x|12x>] [-w|--write_topo_file <file name>] [-t|--topo_file <file>] [--out_ibnl_dir <directory>] [--screen_num_errs <num>] [--smp_window <num>] [--gmp_window <num>] [--max_hops <max-hops>] [-V|--version] [-h|--help] [-H|--deep_help]
Options
  -i|--device <dev-name>: Specifies the name of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).
  -p|--port <port-num>: Specifies the local device's port number used to connect to the IB fabric.
  -g|--guid <GUID in hex>: Specifies the local port GUID value of the port used to connect to the IB fabric. If the GUID given is 0, then ibdiagnet displays a list of possible port GUIDs and waits for user input.
  --vlr <file>: Specifies the opensm-path-records.dump file path (src-dst to SL mapping generated by the SM plugin). ibdiagnet will use this mapping for MADs sending and credit loop check (if the -r option is selected).
  -r|--routing
216. ics. Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it. In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed sw
217. ields to the INI if they are missing.
3. Set the total_vfs parameter to the desired number if you need to change the number of total VFs.
4. Reburn the firmware using the mlxburn tool if the fields above were added to the INI or the total_vfs parameter was modified. If mlxburn is not installed, please download it from the Mellanox website: http://www.mellanox.com > products > Firmware tools
  mlxburn -fw ./fw-ConnectX3-rel.mlx -dev /dev/mst/mt4099_pci_cr0 -conf ./MCX341A-XCG_Ax.ini
Step 7. Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist; otherwise, delete its contents.
Step 8. Insert an "options" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical function driver (probe_vf). For example:
  options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1
Footnotes:
1. If SR-IOV is supported, to enable SR-IOV (if it is not enabled), it is sufficient to set "sriov_en = true" in the INI.
2. If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.com
Table (columns: Parameter | Recommended Value)
  num_vfs | If absent or zero: no VFs will be available. If its value is a single number in the range of 0-63: the driver will enable the num_vfs VFs on the HCA, and this will be applied to all ConnectX HCAs on the host
218. Table 51 - ib_atomic_bw Flags and Options (columns: Flag | Description)
  -R, --rdma_cm | Connect QPs with rdma_cm and run test on those QPs
  -S, --sl=<sl> | SL (default 0)
  -t, --tx-depth=<dep> | Size of tx queue (default 128)
  -T, --tos=<tos value> | Set <tos_value> to RDMA-CM QPs. Available only with the -R flag. Values 0-256 (default off)
  -u, --qp-timeout=<timeout> | QP timeout; the timeout value is 4 usec * 2^(timeout), default 14
  -V, --version | Display version number
  -w, --limit_bw | Set verifier limit for bandwidth
  -x, --gid-index=<index> | Test uses GID with GID index (Default: IB - no gid, ETH - 0)
  -y, --limit_msgrate | Set verifier limit for Msg Rate
  -z, --com_rdma_cm | Communicate with rdma_cm module to exchange data, use regular QPs
Additional Options: The table below lists the additional flags of the command.
Table 52 - Additional ib_atomic_bw Flags and Options (columns: Flag | Description)
  --inline_recv=<size> | Max size of message to be sent in inline receive
  --output=<units> | Set verbosity output level: bandwidth, message_rate, latency_typical
  --pkey_index=<pkey index> | PKey index to use for QP
  --report-both | Report RX & TX results separately on Bidirectional BW tests
  --report_gbits | Report Max/Average BW of test in Gbit/sec (instead of MB/sec)
  --run_infinitely | Run test forev
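The QP timeout flag takes an exponent rather than a duration, which is easy to misread. The sketch below simply evaluates the formula quoted in the table, 4 usec * 2^timeout; the function name qp_timeout_usec is hypothetical, and the 4 usec base is taken from the text as written (treat the result as an approximation of what the hardware applies).

```python
def qp_timeout_usec(exponent, base_usec=4.0):
    """Evaluate the documented QP timeout formula: base * 2**exponent."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return base_usec * (2 ** exponent)


if __name__ == "__main__":
    for exp in (10, 14, 18):
        usec = qp_timeout_usec(exp)
        print("timeout=%d -> %.0f usec (%.1f ms)" % (exp, usec, usec / 1000.0))
```

With the default exponent of 14, the formula gives 65536 usec, roughly 65.5 ms; each increment of the flag doubles the timeout.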
219. verifier limit for bandwidth
  -x, --gid-index=<index> | Test uses GID with GID index (Default: IB - no gid, ETH - 0)
  -y, --limit_msgrate | Set verifier limit for Msg Rate
  -z, --com_rdma_cm | Communicate with rdma_cm module to exchange data, use regular QPs
Additional Options: The table below lists the additional flags of the command.
Table 39 - Additional ib_read_bw Flags and Options (columns: Flag | Description)
  --inline_recv=<size> | Max size of message to be sent in inline receive
  --output=<units> | Set verbosity output level: bandwidth, message_rate, latency_typical
  --pkey_index=<pkey index> | PKey index to use for QP
  --report-both | Report RX & TX results separately on Bidirectional BW tests
  --report_gbits | Report Max/Average BW of test in Gbit/sec (instead of MB/sec)
  --run_infinitely | Run test forever, print results every <duration> seconds
9.5.2 ib_read_lat
ib_read_lat calculates the latency of an RDMA read operation of message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which one side RDMA-reads the memory of the other side only after the other side has read its memory. Each side samples the CPU clock each time it reads the other side's memory, in order to calculate latency. Read is available on
220. specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option "-s <local system name>"
2. Define the environment variable IBDIAG_SYS_NAME
9.2 InfiniBand Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option "-p <local port number>" (see below)
2. Define the environment variable IBDIAG_PORT_NUM
In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: "-i <index of local device>"
2. Define the environment variable IBDIAG_DEV_IDX
9.3 Addressing
Note: This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies.
The following addressing modes can be used to define the IB ports:
  Using a Directed Route to the destination (tool option '-d'): This option defines a directed route of output port numbers from the local port to the destination.
  Using port L
221. igh performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having Reliable type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in a smaller memory footprint, less overhead to set up connections, and higher on-chip cache utilization, and hence increased performance. DCT is supported only in mlx5 and is at beta level.
4.19 PeerDirect
PeerDirect uses an API between IB CORE and peer memory clients (e.g. GPU cards) to provide access to an HCA to read/write peer memory for data buffers. As a result, it allows RDMA-based (over InfiniBand/RoCE) applications to use peer device computing power and the RDMA interconnect at the same time, without copying the data between the P2P devices. For example, PeerDirect is used for GPUDirect RDMA. A detailed description of that API exists under the MLNX OFED installation; please see docs/readme_and_user_manual/PEER_MEMORY_API.txt.
4.20 Inline Receive
When Inline Receive is active, the HCA may write received data into the receive WQE or CQE. Using Inline Receive saves a PCIe read transaction since the HCA does not need to read the scatter list; it therefore improves performance for short receive messages. On poll CQ, the driver copies the received data from the WQE/CQE to the user's buffers. Therefore, apart from quer
222. ile
    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:96,3:96,4:96
    qos_vlarb_low 0:1
    qe SL 0511 2p Sy db 5655 Ip Ldn 1397 19 15 TES dH dE US
Partition configuration file:
    Default=0x7fff, ipoib : ALL=full;
    PartA=0x8001, sl=1, ipoib : ALL=full;
8.8 Adaptive Routing
8.8.1 Overview
Adaptive Routing is at beta stage.
Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
    • Free AR: No constraints on output port selection.
    • Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches.
Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select an output port out of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. It especially suits 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis.
If some switches do not support AR, they will slow down the AR Manag
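The link aggregation idea in the Adaptive Routing overview above can be modeled in a few lines. This is only an illustrative sketch, not the switch or AR Manager implementation: the names `build_port_groups` and `ArSelector`, the per-port load table, and the burst identifier are all hypothetical. Ports that lead to the same remote switch form one group; in the Free AR model the least-loaded port is picked on every call, while in the Bounded AR model the port chosen at the start of a burst is kept for that burst.

```python
from collections import defaultdict


def build_port_groups(links):
    """Group local ports by the remote switch they lead to.

    links: iterable of (local_port, remote_switch) pairs (hypothetical input).
    """
    groups = defaultdict(list)
    for port, remote in links:
        groups[remote].append(port)
    return dict(groups)


class ArSelector:
    """Toy model of Free AR vs. Bounded AR output port selection."""

    def __init__(self, groups, bounded):
        self.groups = groups
        self.bounded = bounded
        self.burst_port = {}  # (remote, burst_id) -> port pinned for that burst

    def select(self, remote, load, burst_id=None):
        key = (remote, burst_id)
        # Bounded AR: never change the output port within one burst.
        if self.bounded and key in self.burst_port:
            return self.burst_port[key]
        # Pick the least-loaded port of the group; ties go to the lowest port.
        port = min(self.groups[remote], key=lambda p: (load.get(p, 0), p))
        if self.bounded:
            self.burst_port[key] = port
        return port


groups = build_port_groups([(1, "sw2"), (2, "sw2"), (3, "sw3")])
free_ar = ArSelector(groups, bounded=False)
bounded_ar = ArSelector(groups, bounded=True)
```

In this model, Free AR may move traffic to another port as soon as the load table changes, which is what can reorder packets; Bounded AR only re-evaluates the load when a new burst starts.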
223. ile, one to a line.
    --io_guid_file, -G <path to file>
        Set the I/O nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
    --port-shifting
        Attempt to shift port routes around to remove alignment problems in routing tables.
    --scatter-ports <random seed>
        Randomize best port chosen for a route.
    --max_reverse_hops, -H <hop count>
        Set the max number of hops the wrong way around an I/O node is allowed to do (connectivity for I/O nodes on top switches).
    --ids_guid_file <path to file>
        Name of the map file with the set of IDs which will be used by the Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).
    --guid_routing_order_file, -X <path to file>
        Set the order port guids will be routed for the MinHop and Up/Down routing algorithms to the guids provided in the given file (one to a line).
    --torus_config <path to file>
        This option defines the file name for the extra configuration info needed for the torus-2QoS routing engine. The default name is /etc/opensm/torus-2QoS.conf.
    --once, -o
        This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.
    --sweep, -s <interval>
        This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.
    timeou
224. ile when Flow Control is disabled. The pm_qos API contains the following functions:
    • pm_qos_add_request
    • pm_qos_update_request
    • pm_qos_remove_request
The pm_qos feature is both global and static: once a request is issued, it is enforced on all CPUs and does not change over time. MLNX_OFED v2.2-1.0.0 provides an option to trigger a request when required and to remove it when no longer required. It is disabled by default and can be set/unset through the ethtool priv-flags. For further information on how to enable/disable this feature, please refer to Table 6, "ethtool Supported Options," on page 112.
4.24 XOR RSS Hash Function
The device has the ability to use XOR as the RSS distribution function, instead of the default Toeplitz function. The XOR function can be better distributed among the driver's receive queues in a small number of streams, where it distributes each TCP/UDP stream to a different queue. MLNX_OFED v2.2-1.0.0 provides an option to change the working RSS hash function from Toeplitz to XOR (and vice versa) through ethtool priv-flags. For further information, please refer to Table 6, "ethtool Supported Options," on page 112.
5 HPC Features
5.1 Shared Memory Access
The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Inte
225. ilities
Write out the discovered topology into the given file. This flag is useful if you later want to check for changes from the current state of the fabric. A directory named ibdiag_ibnl is also created by this option, and holds the IBNL files required to load this topology. To use these files, you will need to set the environment variable named IBDM_IBNL_PATH to that directory. The directory is located in /tmp or in the output directory provided by the -o flag.
-load_db <file name>: Load subnet data from the given .db file, and skip the subnet discovery stage. Note: Some of the checks require actual subnet discovery, and therefore would not run when load_db is specified. These checks are: duplicated/zero guids, link state, SMs status.
Prints the help page information.
Prints the version of the tool.
Prints the tool's environment variables and their values.
Table 23 - ibdiagnet (of ibutils) Output Files
    Output File         Description
    ibdiagnet.log       A dump of all the application reports generated according to the provided flags
    ibdiagnet.lst       List of all the nodes, ports and links in the fabric
    ibdiagnet.fdbs      A dump of the unicast forwarding tables of the fabric switches
    ibdiagnet.mcfdbs    A dump of the multicast forwarding tables of the fabric switches
    ibdiagnet.masks     In case of duplicate port/node Guids, these files include the map between masked Guid and real Guids
    ibdiagnet.sm        List of all the SM (state and pri
226. ill ask to reboot. At that point, the DHCP server configuration for that client needs to be changed so that, when it PXE boots again, it gets the root-path, IQN and LUN information from the DHCP server. For further information, please refer to section "DHCP Configuration for iSCSI Boot with FlexBoot PXE SAN Boot".
    • Restart your DHCP service after changing the DHCP configuration file.
    • Reboot the system.
The expected result is for the diskless PXE client to boot the newly installed RHEL6.4 from the iSCSI storage and become an operational environment accessible from any remote PC via ssh over the 10GbE IP network.
Sanity Checks
iSCSI Login
From a remote PC (called sqa070 below) with a 10GE network connection to the iSCSI target, configure an iSCSI Initiator service and verify the correct target configuration by logging into the target. For CHAP configuration refer to
Step 1. Install the initiator.
    [root@sqa070 ~]# yum install -y iscsi-initiator-utils
Step 2. Configure the Initiator.
    [root@sqa070 ~]# vim /etc/iscsi/iscsid.conf
    node.startup = automatic
    Optional for CHAP authentication, uncomment the following lines:
    discovery.sendtargets.auth.authmethod = CHAP
    discovery.sendtargets.auth.username = joe
    discovery.sendtargets.auth.password = secret
    node.session.auth.authmethod = CHAP
    node.session.auth.username = jack
    node.session.auth.password = 12char
227. ill continues to function in accordance to the previous CC configuration. For further information on how to turn OFF CC, please refer to Section 8.9.3, "Configuring Congestion Control Manager," on page 182.
Configuring Congestion Control Manager
Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and CC Manager behavior by modifying some of the options. To do so, perform the following:
1. Find the event_plugin_options option in the SM options file, and add the following: conf_file <cc-mgr-options-file-name>
    # Options string that would be passed to the plugin(s)
    event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>
2. Run the SM with the new options file:
    opensm -F <options-file-name>
To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration. For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" on page 182. For further details on the list of CC Manager options, please refer to the IB spec.
8.9.4 Configuring Congestion Control Manager Main Settings
To fine-tune the CC mechanism and CC Manager behavior, and set the CC manager main settings, perform the following:
    • To enable/disable the Congestion Control mechanism on the fabric nod
228. implementing the MPI collectives communications.
Mellanox OFED Package
ISO Image
Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images or as a tarball, one per supported Linux distribution and CPU architecture, that includes source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that performs the necessary steps to accomplish the following:
    • Discover the currently installed kernel
    • Uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor's commercial stack
    • Install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
    • Identify the currently installed InfiniBand HCAs and perform the required firmware updates
Software Components
MLNX_OFED_LINUX contains the following software components:
    • Mellanox Host Channel Adapter Drivers
        • mlx5, mlx4 (VPI), which is split into multiple modules:
            • mlx4_core (low-level helper)
            • mlx4_ib (IB)
            • mlx5_ib
            • mlx5_core
            • mlx4_en (Ethernet)
    • Mid-layer core
        • Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
    • Upper Layer Protocols (ULPs)
        • IPoIB, RDS, SRP Initiator and SRP
      NOTE: RDS was not tested by Mellanox Technologies.
    • MPI
        • Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces
        • OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces
        • M
229. in Mellanox OFED for Linux. This package, however, does not include an SRP Target.
4.1.2 SRP Initiator
This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document T10/1524-D, available from http://www.t10.org. The SRP Initiator supports:
    • Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
    • Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
    • Basic functionality, task management and limited error handling
4.1.2.1 Loading SRP Initiator
To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes". For the changes to take effect, run: /etc/init.d/openibd restart
When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).
4.1.2.1.1 SRP Module Parameters
When loading the SRP module, the following parameters can be set (viewable by the "modinfo ib_srp" command):
    cmd_sg_entries, allow_ext_sg, topspin_workarounds, reconnect_delay, fast_io_fail_tmo, dev_loss_tmo
    cmd_sg_entries: Default number of gather/scatter entries in the SRP command (default is 12, max 255
230. ing YUM
    2.5.1 Setting up MLNX_OFED YUM Repository
    2.5.2 Installing MLNX_OFED using the YUM Tool
    2.5.3 Updating Firmware After Installation
    2.6 Uninstalling Mellanox OFED
    2.7 Uninstalling Mellanox OFED using the YUM Tool
Chapter 3 Configuration Files
    3.1 Persistent Naming for Network Interfaces
Chapter 4 Driver Features
    4.1 SCSI RDMA Protocol
        4.1.1 Overview
        4.1.2 SRP Initiator
    4.2 iSCSI Extensions for RDMA (iSER)
        4.2.1 Overview
        4.2.2 iSER Initiator
    4.3 IP over InfiniBand
        4.3.1 Introduction
        4.3.2 IPoIB Mode Setting
        4.3.3 IPoIB Configuration
        4.3.4 Subinterfaces
        4.3.5 Verifying IPoIB Functionality
        4.3.6 Bonding IPoIB
    4.4 Quality of Service InfiniBand
        4.4.1 Quality of Service Overview
231. ing programming paradigm. Mellanox ScalableUPC is based on the Berkeley UPC package (see http://upc.lbl.gov) and contains the following enhancements:
    • GasNet library used within UPC, integrated with Mellanox FCA, which offloads collective operations from UPC. For further information on FCA, please refer to the Mellanox website.
    • GasNet library contains an MXM conduit, which offloads all P2P operations from UPC, as well as some synchronization routines. For further information on MXM, please refer to the Mellanox website.
Mellanox OFED 1.8 includes ScalableUPC 2.1, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as a source RPM as well, and can be downloaded from the Mellanox website.
5.5.1 Installing ScalableUPC
Mellanox ScalableUPC is installed as part of the MLNX_OFED package.
Mellanox OFED 1.8.5 includes ScalableUPC Rev 2.2, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8.5, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as a source RPM as well, and can be downloaded from the Mellanox website.
Please note, the binary distribution of ScalableUPC is compiled with the following defaults:
    • FCA support. FCA is disabled at runtime by default and must be configured prior to using it from the Scala
232. ing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log. Now it is possible to access the SRP LUNs on /dev/mapper/.
It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.
It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the 'blacklist' section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed.
Automatic Activation of High Availability
    • Set the value of SRP_DAEMON_ENABLE in /etc/infiniband/openib.conf to "yes". For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart
    • Start the srpd service, run: service srpd start
    • From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/.
      It is possible that regular (not SRP) LUNs may also be present; the SRP LUNs may be identified by their name.
    • It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log.
4.1.2.7 Shutting Down SRP
SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three c
233. iniband VFs in SR-IOV setting can have more than a single GUID to be used for their purposes. In total, there are 128 GUIDs per port, where the PF occupies 2 entries and the remaining GUIDs are divided equally between all the VFs. If there is a remainder, those GUIDs are given to the VFs with the lowest IDs. For example, if there are 5 VFs, the division will be as follows: the PF will have 2, slave 1 will have 26, and the last 4 slaves will have 25 each (2 + 1*26 + 4*25 = 128).
To find the mapping between a VF's GUID entry and its physical one, use the sysfs mechanism gid_idx directory. The example below shows the mapping between entry 0 of <PCI_FUNCTION_1> and its physical one on port number 1:
    cat /sys/class/infiniband/mlx4_0/iov/<PCI_FUNCTION_1>/ports/1/gid_idx/0
Initial GUID values depend on the mlx4_ib module parameter 'sm_guid_assign' as follows:
    Mode Type      Description
    sm assigned    Asks SM for values for GUID entry 0 per VF. Other entries will have value 0 in the port GUID table and ffffffffffffffff under their matching admin_guids entry. This is the default mode (value 1).
For example, as the GUID entry is not the base GID index of the HYP, upon startup it will have the below values under the sysfs entries:
    cat /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids/1
    ffffffffffffffff
    cat /sys/class/infiniband/mlx4_0/iov/ports/1/gids/1
    fe80:0000:0000:0000:0000:0000:000
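The GUID division rule in the SR-IOV excerpt above (128 entries per port, 2 for the PF, the rest split equally with any remainder going to the lowest VF IDs) can be checked with a short calculation. This is a minimal sketch; the function name `split_guids` and its parameter names are illustrative only and are not part of the driver.

```python
def split_guids(num_vfs, total=128, pf_entries=2):
    """Return the number of GUID entries each VF receives, lowest VF ID first."""
    if num_vfs <= 0:
        raise ValueError("num_vfs must be positive")
    remaining = total - pf_entries           # entries left after the PF's share
    base, extra = divmod(remaining, num_vfs)
    # The first `extra` VFs (lowest IDs) each get one additional entry.
    return [base + 1 if vf < extra else base for vf in range(num_vfs)]


per_vf = split_guids(5)
print(per_vf, 2 + sum(per_vf))
```

For 5 VFs this reproduces the division given in the text: 26 entries for the first VF and 25 for each of the other four, which together with the PF's 2 entries adds up to 128.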
234. interfaces were created:
    cat /sys/class/net/eth_ipoib_interfaces
For example, on a system with a dual-port HCA, the following two interfaces might be created, eth4 and eth5:
    cat /sys/class/net/eth_ipoib_interfaces
    eth4 over IB port: ib0
    eth5 over IB port: ib1
These interfaces can be used to configure the network for the guest. For example, if the guest has a VIF that is connected to the Virtual Bridge br0, then enslave the eIPoIB interface to br0 by running:
    brctl addif br0 ethX
In a RHEL KVM environment, there are other methods to create/configure your virtual network (e.g. macvtap). For additional information, please refer to the Red Hat User Manual.
The IPoIB daemon (ipoibd) detects the new virtual interface that is attached to the same bridge as the eIPoIB interface and creates new IPoIB instances for it in order to send/receive data. As a result, a number of IPoIB interfaces (ibX.Y) are shown as being created/destroyed, and are enslaved to the corresponding ethX interface to serve any active VIF in the system according to the set configuration. This process is done automatically by the ipoibd service.
To see the list of IPoIB interfaces enslaved under an eth_ipoib interface:
    cat /sys/class/net/ethX/eth/vifs
For example:
    cat /sys/class/net/eth5/eth/vifs
    SLAVE=ib0.1 MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
    SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
    SLAVE=ib0.3 MAC=52:54:00:60:55:89 VLAN=N/A
Each ethX interface has at l
235. ion. LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock.
Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between switch and HCA.
In more detail, the algorithm works as follows:
1. LASH determines the shortest path between all pairs of source/destination switches. Note that LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.
2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.
3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.
Note that the implementation of LASH in
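Step 2 of the description above (assign a route to a layer only if the layer's channel dependency graph stays acyclic) can be illustrated with a small sketch. This is not the OpenSM implementation: it uses a simple first-fit search over existing layers, skips the balancing pass of step 3, and the function names and the path representation (a list of switch IDs) are assumptions made for the example.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in {node: set_of_successors}."""
    state = {}  # node -> "active" (on the DFS stack) or "done"

    def visit(node):
        state[node] = "active"
        for nxt in graph.get(node, ()):
            if state.get(nxt) == "active":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


def channel_dependencies(path):
    """A path a-b-c uses channels (a,b) and (b,c); (a,b) depends on (b,c)."""
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))


def add_dependencies(cdg, deps):
    for held, wanted in deps:
        cdg.setdefault(held, set()).add(wanted)
        cdg.setdefault(wanted, set())


def assign_layers(paths):
    """Return, for each path, the index of the layer (SL) it was placed in."""
    layers = []      # one channel dependency graph per layer
    assignment = []
    for path in paths:
        deps = channel_dependencies(path)
        for index, cdg in enumerate(layers):
            trial = {ch: set(succ) for ch, succ in cdg.items()}
            add_dependencies(trial, deps)
            if not has_cycle(trial):      # safe: no deadlock in this layer
                layers[index] = trial
                assignment.append(index)
                break
        else:                             # every layer would deadlock
            fresh = {}
            add_dependencies(fresh, deps)
            layers.append(fresh)
            assignment.append(len(layers) - 1)
    return assignment


# Four clockwise two-hop routes on a 4-switch ring.
ring_paths = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
ring_layers = assign_layers(ring_paths)
```

On the ring example, the first three routes fit into layer 0, while the fourth would close a dependency cycle around the ring and is therefore placed in a new layer.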
236. ions (Section E.1, "Configuring the iSCSI Target Machine," on page 256 and Section E.6.1, "iSCSI Login," on page 270) must be un-commented beforehand, and during RHEL installation follow these steps:
    • iSCSI Discovery Details dialog: To use iSCSI disks, you must provide the address of your iSCSI target and the iSCSI initiator name you've configured for your host. Fields: Target IP Address (12.7.6.30), iSCSI Initiator Name, kind of iSCSI discovery authentication to perform (CHAP pair), CHAP Username (joe), CHAP Password. Buttons: Cancel, Start Discovery.
    • iSCSI Discovered Nodes dialog: Check the nodes you wish to log into. Node Name: iqn.2013-10.qalab.com:sqa030.prt9, Interface: p2p1. Buttons: Cancel, Login.
    • iSCSI Nodes Login dialog: kind of iSCSI login authentication to perform (CHAP pair), CHAP Username (joe), CHAP Password.
    • iSCSI Login Results dialog: Successfully logged in and attached the following nodes: iqn.2013-10.qalab.com:sqa030.prt9 via p2p1
237. is
    Server: ib_send_lat [options]
    Client: ib_send_lat [options] hostname
Options
The table below lists the various flags of the command.
Table 45 - ib_send_lat Flags and Options
    Flag                                  Description
    -a, --all                             Run sizes from 2 to 2^23
    -c, --connection=<RC/XRC/UC/UD/DC>    Connection type RC/XRC/UC/UD/DC (default: RC)
    -C, --report-cycles                   Report times in CPU cycle units (default: microseconds)
    -d, --ib-dev=<dev>                    Use IB device <dev> (default: first device found)
    -D, --duration                        Run test for a customized period of seconds
    -e, --events                          Sleep on CQ events (default: poll)
    -f, --margin                          Measure results within margins (default: 2 sec)
    -g, --mcg=<num_of_qps>                Send messages to multicast group with <num_of_qps> QPs attached to it
    -h, --help                            Show this help screen
    -H, --report-histogram                Print out all results (default: print summary only)
    -i, --ib-port=<port>                  Use port <port> of IB device (default: 1)
    -I, --inline_size=<size>              Max size of message to be sent in inline
    -m, --mtu=<mtu>                       MTU size: 256 - 4096 (default: port MTU)
    -M, --MGID=<multicast_gid>            In multicast, uses <multicast_gid> as the group MGID
    -n, --iters=<iters>                   Number of exchanges (at least 5, default: 1000)
    p
238. itch. For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.
Extending this argument shows that, in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition that they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:
[Figure: 6x6 2D torus, x and y axes numbered 0 to 5, with the labeled switches referenced below]
Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set. As a further example, consider a case that to
239. ith FCA
The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager and MPI nodes without requiring additional hardware. The FCA manager creates a topology-based collective tree, and orchestrates an efficient collective operation using the switch-based CPUs on the MPI/ScalableSHMEM nodes. FCA accelerates MPI/ScalableSHMEM collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.
FCA is disabled by default and must be configured prior to using it from the ScalableSHMEM.
To enable FCA by default in the ScalableSHMEM:
1. Edit the /opt/mellanox/openshmem/2.2/etc/openmpi-mca-params.conf file.
2. Set the scoll_fca_enable parameter to 1:
    scoll_fca_enable=1
3. Set the scoll_fca_np parameter to 0:
    scoll_fca_np=0
To enable FCA in the shmemrun command line, add the following:
    -mca scoll_fca_enable 1 -mca scoll_fca_enable_np 0
To disable FCA:
    -mca scoll_fca_enable 0 -mca coll_fca_enable 0
For more details on FCA installation and configuration, please refer to the FCA User Manual found on the Mellanox website.
5.1.3 Running ScalableSHMEM with M
240. ity (TC 0 is lowest). It also has an absolute priority over non-strict TCs (ETS). This property needs to be used with care, as it may easily cause starvation of other TCs. A higher strict priority TC is always given the first chance to transmit. Only if the highest strict priority TC has nothing more to transmit will the next highest TC be considered. Non-strict priority TCs will be considered last to transmit. This property is extremely useful for low-latency, low-bandwidth traffic: traffic that needs to get immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.
Minimal Bandwidth Guarantee (ETS)
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy. If, for instance, TC0 is set to an 80% guarantee and TC1 to 20% (the TCs' sum must be 100), then the BW left after servicing all strict priority TCs will be split according to this ratio. Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.
Rate Limit
Rate limit defines a maximum bandwidth allowed for a TC. Please note that a 10% deviation from the requested values is considered acceptable.
Quality of Service Tools
mlnx_qos
mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicate
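The interaction between strict priority and the ETS minimal guarantee described above can be worked through numerically. The following is a simplified model, not the hardware scheduler: the function name `ets_split`, the notion of a per-TC "demand", and the bandwidth units are assumptions made for illustration. Strict TCs are served first in priority order; whatever is left is divided by the ETS percentages, and any share a TC does not need is handed to the TCs that still have traffic.

```python
def ets_split(link_bw, strict_demands, ets_weights, ets_demands):
    """Split link_bw between strict TCs and ETS TCs.

    strict_demands: demands of strict TCs, highest priority first.
    ets_weights:    {tc: guaranteed percentage}, summing to 100.
    ets_demands:    {tc: offered load}; float("inf") means "always busy".
    """
    remaining = link_bw
    strict_alloc = []
    for demand in strict_demands:          # strict TCs transmit first
        granted = min(demand, remaining)
        strict_alloc.append(granted)
        remaining -= granted

    alloc = {tc: 0.0 for tc in ets_weights}
    active = {tc for tc in ets_weights if ets_demands[tc] > 0}
    while remaining > 1e-9 and active:
        total_weight = sum(ets_weights[tc] for tc in active)
        budget = remaining
        satisfied = set()
        for tc in active:
            share = budget * ets_weights[tc] / total_weight
            need = ets_demands[tc] - alloc[tc]
            granted = min(share, need)
            alloc[tc] += granted
            remaining -= granted
            if need <= share:
                satisfied.add(tc)          # its unused share is redistributed
        if not satisfied:
            break                          # every active TC used its full share
        active -= satisfied
    return strict_alloc, alloc


busy = float("inf")
strict, ets = ets_split(10.0, [2.0], {0: 80, 1: 20}, {0: busy, 1: busy})
```

With a link of 10 units and one strict TC using 2, the remaining 8 units are split 6.4 / 1.6 between TC0 and TC1. If TC1 offers no traffic, TC0 receives everything that is left, which matches the "no maximum enforcement" remark in the text.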
241. k reporting the device's clock is supported, attr.hca_core_clock is the frequency in MHZ.
4.6.2.2 Creating Time Stamping Completion Queue
To get time stamps, a suitable extended Completion Queue (CQ) must be created via a special call to the ibv_exp_create_cq verb:
    cq_init_attr.flags = IBV_EXP_CQ_TIMESTAMP;
    cq_init_attr.comp_mask = IBV_EXP_CQ_INIT_ATTR_FLAGS;
    cq = ibv_exp_create_cq(context, cqe, node, NULL, 0, &cq_init_attr);
This CQ cannot report SL or SLID information. The values of the sl and sl_id fields in struct ibv_exp_wc are invalid. Only the fields indicated by the exp_wc_flags field in struct ibv_exp_wc contain a valid and usable value.
When using Time Stamping, several fields of struct ibv_exp_wc are not available, resulting in RoCE UD / RoCE traffic with VLANs failure.
4.6.2.3 Polling a Completion Queue
Polling a CQ for time stamps is done via the ibv_exp_poll_cq verb:
    ret = ibv_exp_poll_cq(cq, 1, &wc_ex, sizeof(wc_ex));
    if (ret > 0) {
        /* CQ returned a wc */
        if (wc_ex.exp_wc_flags & IBV_EXP_WC_WITH_TIMESTAMP) {
            /* This wc contains a timestamp */
            timestamp = wc_ex.timestamp;
            /* Timestamp is given in raw hardware time */
        }
    }
CQs that are opened with the ibv_exp_create_cq verb should always be polled with the ibv_exp_poll_cq verb.
4.6.2.4 Querying the Hardware Time
Querying
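Since completion timestamps are reported in raw hardware time, they have to be scaled by the device clock frequency before they can be read as wall-clock durations. The arithmetic can be sketched as follows, assuming (as the excerpt above states) that the core clock value is a frequency in MHz; the helper names and the sample values are hypothetical, and counter wrap-around is not handled.

```python
def ticks_to_ns(ticks, core_clock_mhz):
    """Convert raw hardware ticks to nanoseconds.

    At f MHz the clock advances f ticks per microsecond,
    so one tick lasts 1000 / f nanoseconds.
    """
    return ticks * 1000.0 / core_clock_mhz


def completion_gap_ns(first_ticks, second_ticks, core_clock_mhz):
    """Time between two completions, given their raw timestamps."""
    return ticks_to_ns(second_ticks - first_ticks, core_clock_mhz)


# Hypothetical values: a 200 MHz clock and two raw timestamps 400 ticks apart.
gap = completion_gap_ns(1000, 1400, 200)
```

Only differences between timestamps taken from the same device are meaningful in this model; the raw value itself has no defined relation to system time.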
242. kage. Make sure iSCSI is enabled and properly configured on your system before proceeding with iSER. Additionally, make sure you have RDMA connectivity between the initiator and the target:
    rping -s [-vVd] [-S size] [-C count] [-a addr] [-p port]
Target settings such as timeouts and retries are set the same as for any other iSCSI targets.
If targets are set to auto-connect on boot, and targets are unreachable, it may take a long time to continue the boot process if timeouts and max retries are set too high.
Example for discovering and connecting targets over iSER:
    iscsiadm -m discovery -o new -o old -t st -I iser -p <ip:port> -l
iSER also supports RoCE without any additional configuration required. To bond the RoCE interfaces, set the fail_over_mac option in the bonding driver (see Section 4.3.6, "Bonding IPoIB," on page 61).
4.3 IP over InfiniBand
4.3.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following capabilities:
    • VLAN simulation over an InfiniBand network via child interfaces
    • High Availability via Bonding
    • Varies MTU values: up to 4k in Datagram mode, up to 64k in Connected mode
    • Uses any ConnectX IB ports (one or two)
    • Inserts IP/UDP/TCP checksum on outgoing packets
243. l <sl number>
        Sets the SL to use to communicate with the SM/SA. Defaults to 0.
    --connect_roots, -z
        This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases, this can violate a "pure" deadlock-free algorithm, so use it carefully.
    --ucast_cache, -A
        This option enables the unicast routing cache to prevent routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. in case of host reboot. This option becomes very handy when the cluster size is thousands of nodes.
    --lid_matrix_file, -M <file name>
        This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.
    --lfts_file, -U <file name>
        This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.
    --sadb_file, -S <file name>
        This option specifies the name of the SA DB dump file from where the SA database will be loaded.
    --root_guid_file, -a <path to file>
        Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
    --cn_guid_file, -u <path to file>
        Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given f
244. lanox.com>
    From: /repos/MLNX_OFED/<MLNX_OFED file>/RPM-GPG-KEY-Mellanox
    Is this ok [y/N]:
Step 5. Check that the key was successfully imported.
    rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
    gpg-pubkey-a9e4b643-520791ba gpg(Mellanox Technologies <support@mellanox.com>)
Step 6. Install the createrepo rpm package.
    yum install createrepo
Step 7. Create the MLNX_OFED YUM repository using the mlnx_create_yum_repo.sh script located in the downloaded MLNX_OFED package.
    mlnx_create_yum_repo.sh --mlnx_ofed /mnt --target /repos
    Creating MLNX_OFED_LINUX YUM Repository under /repos
    See log file /tmp/mlnx_yum.24250.log
    comps file was not provided, going to build it
    Copying RPMS
    Building YUM Repository
    Creating YUM Repository settings file at: /tmp/mlnx_ofed.repo
    Done. Copy /tmp/mlnx_ofed.repo to /etc/yum.repos.d/ to use MLNX_OFED YUM Repository.
Step 8. Copy the created YUM repository configuration file to /etc/yum.repos.d/.
    cp /tmp/mlnx_ofed.repo /etc/yum.repos.d/
Step 9. Check that the repository was successfully added.
    yum repolist
    Loaded plugins: product-id, security, subscription-manager
    This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
    repo id      repo name               status
    mlnx_ofed    MLNX_OFED Repository    108
    rpmforge     RHEL 6Server RP
245. le The default name is /etc/opensm/partitions.conf.

--no_part_enforce, -N (DEPRECATED)
This option disables partition enforcement on switch external ports.

--part_enforce, -Z [both | in | out | off]
This option indicates the partition enforcement type (for switches). Enforcement type can be outbound only (out), inbound only (in), both, or disabled (off). Default is both.

--allow_both_pkeys, -W
This option indicates whether both full and limited membership on the same partition can be configured in the PKeyTable. Default is not to allow both pkeys.

--qos, -Q
This option enables QoS setup.

--qos_policy_file, -Y <file name>
This option defines the optional QoS policy file. The default name is /etc/opensm/qos-policy.conf.

--congestion_control (EXPERIMENTAL)
This option enables congestion control configuration.

--cc_key <key> (EXPERIMENTAL)
This option configures the CCkey to use when configuring congestion control.

--stay_on_fatal, -y
This option will cause the SM not to exit on fatal initialization issues: if the SM discovers duplicated guids or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.

--daemon, -B
Run in daemon mode. OpenSM will run in the background.

--inactive, -i
Start the SM in inactive rather than normal init SM state.

--perfmgr
Start with PerfMgr enabled.

--perfmgr_sweep_time_s <sec>
PerfMgr sw
246. le 14 lists some useful MPI links.

Table 14 - Useful MPI Links
MPI Standard     http://www-unix.mcs.anl.gov/mpi
Open MPI         http://www.open-mpi.org
MVAPICH 2 MPI    http://mvapich.cse.ohio-state.edu/
MPI Forum        http://www.mpi-forum.org

This chapter includes the following sections:
- Section 5.2.2, "Prerequisites for Running MPI"
- Section 5.2.3, "MPI Selector - Which MPI Runs"
- Section 5.2.4, "Compiling MPI Applications"

5.2.2 Prerequisites for Running MPI
For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic (i.e., password-less) login onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging in and running commands on remote computers and/or servers.

5.2.2.1 SSH Configuration
The following steps describe how to configure password-less access over SSH:

Step 1. Generate an ssh key on the initiator machine (host1).
host1$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/<username>/.ssh/id_rsa.
Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
The key fingerprint is:
38 1b 29 df 4 08 00 4a 0e 50 0 05 44 e7 9 05
247. le is a bonding master of a slave in a DOWN state. In that case, a matching GID to the IP address of the master will not be present in the GID table of the slave's port.

The first entry in the GID table (at index 0) for each port is always present and equal to the link-local IPv6 address of the net device that is associated with the port. Note that even if the link-local IPv6 address is not set, index 0 is still populated.

GID format can be of 2 types: IPv4 and IPv6. An IPv4 GID is an IPv4-mapped IPv6 address (1), while an IPv6 GID is the IPv6 address itself.

(1) For the IPv4 address A.B.C.D, the corresponding IPv4-mapped IPv6 address is ::ffff:A.B.C.D

2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.

2.1 Hardware and Software Requirements

Table 1 - Software and Hardware Requirements
Requirements: Platforms
Description: A server platform with an adapter card based on one of the following Mellanox Technologies InfiniBand HCA devices:
- MT27508 ConnectX-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
- MT4103 ConnectX-3 Pro (VPI, IB, EN) (firmware: fw-ConnectX3Pro)
- MT4113 Connect-IB (IB) (firmware: fw-Connect-IB)
For the list of supported architecture platforms, please refer to t
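The IPv4 GID format noted in item 247 can be checked with a few lines of Python. This is only a minimal sketch using the standard `ipaddress` module; the helper name `ipv4_gid` and the sample address are illustrative assumptions, not part of MLNX_OFED.

```python
import ipaddress


def ipv4_gid(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 address (::ffff:A.B.C.D) for A.B.C.D.

    Hypothetical helper for illustration only.
    """
    v4 = ipaddress.IPv4Address(ipv4)
    # Layout: 80 zero bits, 16 one bits (0xffff), then the 32-bit IPv4 address.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))


gid = ipv4_gid("192.168.1.10")   # sample address, chosen arbitrarily
print(gid.exploded)              # full 128-bit GID-style notation
print(gid.ipv4_mapped)           # recovers the original IPv4 address
```

The exploded form makes it easy to compare against a GID table entry by eye, since the last 32 bits are the IPv4 address in hexadecimal.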
248. le only with -R flag; values 0-256 (default off)

Table 40 - ib_read_lat Flags and Options
Flag                           Description
-u, --qp-timeout=<timeout>     QP timeout; timeout value is 4 usec * 2^(timeout) (default 14)
-U, --report-unsorted          (Implies -H) Print out unsorted results (default sorted)
-V, --version                  Display version number
-x, --gid-index=<index>        Test uses GID with GID index (default: IB - no gid, ETH - 0)
-z, --com_rdma_cm              Communicate with rdma_cm module to exchange data; use regular QPs

Additional Options
The table below lists the additional flags of the command.

Table 41 - Additional ib_read_lat Flags and Options
Flag                           Description
--inline_recv=<size>           Max size of message to be sent in inline receive
--output=<units>               Set verbosity output level: bandwidth, message_rate, latency_typical
--pkey_index=<pkey index>      PKey index to use for QP

9.5.3 ib_send_bw
ib_send_bw calculates the BW of SEND between a pair of machines. One acts as a server and the other as a client. The server receives packets from the client and they both calculate the throughput of the operation. The test supports features such as bidirectional mode, in which both sides send and receive at the same time, change of MTU size, tx size, number of iterations, message size and more. Using the -a flag provides results for all message sizes.

Synopsis
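The QP timeout flag in item 248 takes an exponent rather than a duration. As a rough sketch, taking the "4 usec * 2^timeout" wording of the flag description at face value, the resulting timeout can be computed as follows; the function name is a hypothetical helper.

```python
def qp_timeout_usec(timeout: int = 14) -> int:
    """QP timeout in microseconds for a given exponent, per the flag text.

    Assumes the 4 usec base stated in the table; 14 is the documented default.
    """
    if timeout < 0:
        raise ValueError("timeout exponent must be non-negative")
    return 4 * 2 ** timeout


print(qp_timeout_usec())     # default exponent 14 -> 65536 usec
print(qp_timeout_usec(10))   # exponent 10 -> 4096 usec
```

Each increment of the exponent doubles the timeout, so the default of 14 corresponds to roughly 65 milliseconds.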
249. lf_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as:
- HCA firmware version
- Kernel architecture
- Driver version
- Number of active HCA ports along with their states
- Node GUID

Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.

hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-2.2-1.0.0 (OFED-2.2-1.0.0): 3.0.76-0.11-default
Host Driver RPM Check .................. PASS
Firmware on CA #0 VPI .................. v2.31.5000
Firmware Check on CA #0 (VPI) .......... PASS
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 1
Port State of Port #1 on CA #0 (VPI) ... UP 4X FDR (InfiniBand)
Port State of Port #2 on CA #0 (VPI) ... UP 4X FDR (InfiniBand)
Error Counter Check on CA #0 (VPI) ..... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (VPI) ............... 00:02:c9:03:00:30:0e:60
------------------ DONE ---------------------

After the installer completes, information about the Mellanox OFED installation, such as prefix, kernel version, and installation parameters, can be retrieved by running the com
250. limit for TCs in Gbps LIST is a comma seperated Gbps limit for each TC Example 1 8 8 will limit TCO to 1Gbps and TC1 TC2 to 8 Gbps each i INTF interface INTF Interface name a Show all interface s TCs Mellanox Technologies 69 J Rev 2 2 1 0 1 Driver Features Get Current Configuration 70 Mellanox Technologies m 2 2 1 0 1 Set ratelimit 3Gbps for tc0 4Gbps for tc1 and 2Gbps for tc2 tc 0 ratelimit 3 Gbps tsa strict up 0 Skprio 0 Skprio 1 Skprio 2 tos 8 Skprio 3 Skprio 4 tos 24 Skprio 5 Skprio 6 tos 16 Skprio 7 Skprio 8 Skprio 9 Skprio 10 Skprio 11 Skprio 12 Skprio 13 Skprio 14 Skprio 15 mos 1 ups 2 woe 3 up 4 up 5 up 6 mos 7 Configure QoS map UP 0 7 to tc0 1 2 3 to tc1 and 4 5 6 to tc 2 set tc0 tc1 as ets and tc2 as strict divide ets 30 for tc0 and 70 for tc1 nii Gos gt erns s ets ets strict p 0 1 der AA 230770 tc 0 ratelimit 3 Gbps tsa ets bw 30 uos Skprio Skprio Skprio Skprio Skprio Skprio Skprio tos 8 tos 24 toss 16 skprio skprio AO OD I LO E PERA A s skprio pa um Skprio Skprio Skprio fase ENT Paes od skprio skprio Ee on mos 7 tc 1 ratelimit 4 Gbps tsa ets bw 70 Mellanox Technologies 71 Rev 2 2 1 0 1 Driver Features weg d weg 2 tog 3 uds 2 mares 2 Coes sas siria up 4 mos y up 6 4 5 8 2 tcand
251. ll physical ports on the network as follows:

pkey idx    pkey value
0           0xFFFF
1           0xB000
2           0xB030

(The most significant bit indicates if a PKey is a full PKey.)

The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.

Step 2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.

Step a. Check the PCI ID for the Physical Function and the Virtual Functions.
lspci | grep Mel

Step b. Assume that on Host1 the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0". On Host1 do the following:
cd /sys/class/infiniband/mlx4_0/iov
0000:02:00.0 0000:02:00.1 0000:02:00.2 ... (1)

(1) 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical function. 0000:02:00.X contain the virt-to-phys mapping tables for the virtual functions.

Do not touch the Dom0 mapping table (under <nnnn>:<nn>:00.0). Modify only tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that vm1 uses VF 0000:02:00.1 and vm2 uses VF 0000:02:00.2.

Step c. Configure the virtual-to-physical PKey mapping for the VMs.
echo 0 > 0000:02:00.1/ports/1/pkey_idx/1
echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
echo 2 > 0000:02:00.2/ports/1/pkey_idx/0

vm1 pkey index 0 will be mapped to physical pkey index 1, and vm2 pkey index 0 will be
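The effect of the four echo commands in item 251 can be modeled in a few lines. This is a minimal sketch, not driver code: the dictionaries mirror the physical PKey table and the per-VF index mappings given in the text, and `guest_pkey` is a hypothetical helper name.

```python
# Physical PKey table configured by OpenSM (index -> PKey value), from the text.
PHYSICAL_PKEYS = {0: 0xFFFF, 1: 0xB000, 2: 0xB030}

# Virtual-to-physical index maps as written by the echo commands:
#   echo <physical idx> > <VF>/ports/1/pkey_idx/<virtual idx>
VF_MAPS = {
    "0000:02:00.1": {0: 1, 1: 0},  # vm1
    "0000:02:00.2": {0: 2, 1: 0},  # vm2
}


def guest_pkey(vf: str, virtual_idx: int) -> int:
    """PKey value a guest sees at a given virtual PKey index."""
    return PHYSICAL_PKEYS[VF_MAPS[vf][virtual_idx]]


for vf in VF_MAPS:
    print(vf, [hex(guest_pkey(vf, idx)) for idx in (0, 1)])
```

Because IPoIB QPs use the PKey at index 0, this shows why vm1 and vm2 land on different partitions while the default PKey moves to index 1 in both guests.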
252. 4.4.5 OpenSM Features
The QoS related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 8) can be split into two main parts:

I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.

II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first the request is matched against the defined match rules such that the target QoS Level definition is found. Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.

4.5 Quality of Service Ethernet

4.5.1 Quality of Service Overview
Quality of Service (QoS) is a mechanism of assigning a priority to a network flow (socket, rdma_cm connection) and managing its guarantees, limitations and its priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a 2/3 stage process. The TC is assigned the QoS attributes and the different flows behave accordingly.

4.5.2 Mapping Traffic to Traffic Classes
Mapping traffic to TCs consists of several actions which are user controllable, some controlled by the application itself and others by the system/network administrators. The following is the general mapping traffic to Traffic Classes flow:
1. The application sets the required Type of
253. llowing kernel modules:

mlx5_core
Acts as a library of common functions (e.g. initializing the device after reset) required by the Connect-IB adapter card.

mlx5_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

libmlx5
libmlx5 is the provider library that implements hardware specific user-space functionality. If there is no compatibility between the firmware and the driver, the driver will not load and a message will be printed in the dmesg.

The following are the libmlx5 environment variables:
- MLX5_FREEZE_ON_ERROR_CQE: Causes the process to hang in a loop when a completion with error occurs that is not "flushed with error" or "retry exceeded". Otherwise disabled.
- MLX5_POST_SEND_PREFER_BF: Configures every work request that can use blue flame to use blue flame. Otherwise, blue flame depends on the size of the message and the inline indication in the packet.
- MLX5_SHUT_UP_BF: Disables the blue flame feature. Otherwise, do not disable.
- MLX5_SINGLE_THREADED: All spinlocks are disabled. Otherwise, spinlocks are enabled. Used by applications that are single threaded and would like to save the overhead of taking spinlocks.
- MLX5_CQE_SIZE: 64 - completion queue entry size is 64 bytes (default); 128 - completion queue entry size is 128 bytes.
- MLX5_SCATTER_TO_CQE: Small buffers are scattered to the comple
254. ls (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the versatility of markets that Mellanox OFED applies to.

Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards

The following sub-sections briefly describe the various components of the Mellanox OFED stack.

1.3.1 mlx4 VPI Driver
mlx4 is the low level driver implementation for the ConnectX family adapters designed by Mellanox Technologies. ConnectX family adapters can operate as an InfiniBand adapter or as an Ethernet NIC. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into the following modules:

mlx4_core
Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.

mlx4_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

mlx4_en
A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet specific functions and plugs into the netdev mid-layer.

1.3.2 mlx5 Driver
mlx5 is the low level driver implementation for the Connect-IB adapters designed by Mellanox Technologies. Connect-IB operates as an InfiniBand adapter. The mlx5 driver is comprised of the fo
255. ly in RC connection mode (as specified in IB spec).

Synopsis
Server: ib_read_lat [options]
Client: ib_read_lat [options] <hostname>

Options
The table below lists the various flags of the command.

Table 40 - ib_read_lat Flags and Options
Flag                             Description
-a, --all                        Run sizes from 2 till 2^23
-c, --connection=<RC/XRC/DC>     Connection type RC/XRC/DC (default RC)
-C, --report-cycles              Report times in cpu cycle units (default microseconds)
-d, --ib-dev=<dev>               Use IB device <dev> (default first device found)
-D, --duration                   Run test for a customized period of seconds
-e, --events                     Sleep on CQ events (default poll)
-f, --margin                     Measure results within margins (default 2sec)
-h, --help                       Show this help screen
-H, --report-histogram           Print out all results (default print summary only)
-i, --ib-port=<port>             Use port <port> of IB device (default 1)
-m, --mtu=<mtu>                  MTU size: 256 - 4096 (default port mtu)
-n, --iters=<iters>              Number of exchanges (at least 5, default 1000)
-o, --outs=<num>                 Num of outstanding read/atom (default max of device)
-p, --port=<port>                Listen on/connect to port <port> (default 18515)
-R, --rdma_cm                    Connect QPs with rdma_cm and run test on those QPs
-s, --size=<size>                Size of message to exchange (default 2)
-S, --sl=<sl>                    SL (default 0)
-T, --tos=<tos value>            Set <tos value> to RDMA-CM QPs. availib
256. me for the Guid specified
-c                       Gets the SA's class port info
-s                       Returns the PortInfoRecords with isSM or isSMdisabled capability mask bit on
-g                       Gets multicast group info
-m                       Gets multicast member info. If a group is specified, limit the output to the group specified and print one line containing only the GUID and node description for each entry. Example: saquery -m 0xc000
-x                       Gets LinkRecord info
--src-to-dst             Gets a PathRecord for <src:dst> where src and dst are either node names or LIDs
--sgid-to-dgid           Gets a PathRecord for sgid to dgid where both GIDs are in an IPv6 format acceptable to inet_pton(3)
-C <ca_name>             Uses the specified ca_name
-P <ca_port>             Uses the specified ca_port
--smkey <val>            Uses SM_Key value for the query. Will be used only with "trusted" queries. If a non-numeric value (like 'x') is specified, then saquery will prompt for a value
-t, --timeout <msec>     Specifies SA query response timeout in milliseconds. Default is 100 milliseconds. You may want to use this option if IB_TIMEOUT is indicated

Table 28 - saquery Flags and Options
Flags                              Description
--node-name-map <node-name-map>    Specifies a node name map. The node name map file maps GUIDs to more user friendly names. See ibnetdiscover(8) for node name map file format. Only used with the -O and -U options

Supported query nam
257. mlnx ofed install

4.15 Quantized Congestion Control
Congestion control is used to reduce packet drops in lossy environments and to mitigate congestion spreading and resulting victim flows in lossless environments.

The Quantized Congestion Notification (QCN) IEEE standard (802.1Qau) provides congestion control for long-lived flows in limited bandwidth-delay product Ethernet networks. It is part of the IEEE Data Center Bridging (DCB) protocol suite, which also includes ETS, PFC, and DCBX. QCN is conducted at L2 and is targeted for hardware implementations. QCN applies to all Ethernet packets and all transports, and both the host and switch behavior is detailed in the standard.

The QCN user interface allows the user to configure QCN activity. QCN configuration and retrieval of information is done by the mlnx_qcn tool. The command interface provides the user with a set of changeable attributes, and with information regarding QCN's counters and statistics. All parameters and statistics are defined per port and priority. The QCN command interface is available if and only if the hardware supports it.

4.15.1 QCN Tool - mlnx_qcn
mlnx_qcn is a tool used to configure QCN attributes of the local host. It communicates directly with the driver and thus does not require setting up a DCBX daemon on the system.

The mlnx_qcn tool enables the user to:
- Inspect the current QCN configurations for a certain port sorted by priority
258. monstrates basic PXE SAN Boot capability. In this chapter, IET is used as the iSCSI Target.

In the example below:
- sqa030 (a Linux host) performs the role of the PXE server and an iSCSI target, and sqa070 (a host) performs the role of the PXE client and a sanity check environment.
- sqa030: The OS is configured with IP 12.7.6.30 and the DHCP IP configuration for the diskless client is 12.7.6.70.
- sqa030 has a 21.5GB disk partition on its local HDD called /dev/cciss/c0d0p9 which will be used as the storage volume to host the client's OS. It also serves as DHCP, TFTP and NFS server for PXE clients.

Both servers have a Mellanox ConnectX-3 based 10GE NIC equipped with PXE boot capabilities via Expansion ROM called FlexBoot (http://www.mellanox.com/page/products_dyn?&product_family=34&mtag=flexboot).

E.1 Configuring the iSCSI Target Machine
To configure the iSCSI target:

Step 1. Download the IET target software from: http://sourceforge.net/projects/iscsitarget/files/iscsitarget/1.4.20.2/

Step 2. Install the iSCSI target and additional required software on the target server.
[root@sqa030 ~]# yum install kernel-devel openssl-devel gcc rpm-build
[root@sqa030 tmp]# tar xzvf iscsitarget-1.4.20.2.tar.gz
[root@sqa030 tmp]# cd iscsitarget-1.4.20.2
[root@sqa030 iscsitarget-1.4.20.2]# make && make install

Step 3. Create the IQN in the ietd configuration fil
259. mware will not be updated if you run the install script with the '--without-fw-update' option.

Usage:
mlnx_add_kernel_support.sh -m|--mlnx_ofed <path to MLNX_OFED directory> [--make-iso|--make-tgz]
[--make-iso]                                       Create MLNX_OFED ISO image.
[--make-tgz]                                       Create MLNX_OFED tarball. (Default)
[-t|--tmpdir <local work dir>]
[--kmp]                                            Enable KMP format if supported.
[-k|--kernel] <kernel version>                     Kernel version to use.
[-s|--kernel-sources] <path to the kernel sources> Path to kernel headers.
[-v|--verbose]
[-n|--name]                                        Name of the package to be created.
[-y|--yes]                                         Answer "yes" to all questions.
[--force]                                          Force removing packages that depend on MLNX_OFED.

Example
The following command will create a MLNX_OFED_LINUX tarball (TGZ) for RedHat 6.3 under the /tmp directory.
./MLNX_OFED_LINUX-2.2-1.0.0-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-2.2-1.0.0-rhel6.3-x86_64/ --make-tgz
Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.3 under /tmp directory.
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
See log file /tmp/mlnx_ofed_iso.21642.log
Building OFED RPMs. Please wait...
Removing OFED RPMs...
Created /tmp/MLNX_OFED_LINUX-2.2-1.0.0-rhel6.3-x86_64-ext.tgz

The mlnx_add_kernel_support.sh script can be executed directly from the mlnxofedinstall script. For further info
260. n CA-to-CA paths are not reported adi the output reports

Error Codes
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package

9.4.3 ibdiagpath - IB Diagnostic Path
ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device specific health queries for the different devices along the path.

The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted as it may change.

ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d tog
261. [Screenshot: installer storage device selection, "Specialized Storage Devices" option]

Step 5. Click on the Add Advanced Target button.

[Screenshot: device selection list, "0 device(s) (0 MB) selected out of 0 device(s) (0 MB) total". Tip: Selecting a drive on this screen does not necessarily mean it will be wiped by the installation process. Also, note that post-installation you may mount drives you did not select here by modifying your /etc/fstab file.]

Step 6. In the Advanced Storage Options window, perform the following:
Step a. Select the Add iSCSI target option.
Step b. Check the Bind targets to network interfaces checkbox.
Step c. Click the Add drive button.

[Screenshot: "Advanced Storage Options" dialog]

Step 7. Enter the IP address of the iSCSI target. Optionally, you may choose to enter a customized Initiator Name and select the necessary CHAP authentication of choice. Please refer to Section E.5.1, "SAN Booting the Diskless Client with FlexBoot", for further information. In the example below, the iSCSI Initiator Name is left with the default value given by the installer and iSCSI discovery authentication is
262. n at least one machine in an IB subnet.

By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

If a fatal, non-recoverable error occurs, opensm exits.

8.2.4.1 Running OpenSM As Daemon
OpenSM can also run as a daemon. To run OpenSM in this mode, enter:
host1# /etc/init.d/opensmd start

8.3 osmtest Description
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory with all the object fields, and match it to a pre-saved one. See Section 8.3.2.

osmtest has the following test flows:
- Multicast Compliancy test
- Event Forwarding test
- Service Record registration test
- RMPP stress test
- Small SA Queries stress test

8.3.1 Syntax
osmtest [OPTIONS]
where OPTIONS are:

-f, --flow
This option directs osmtest to run a specific flow:
Flow    Description
c       create an inve
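The "SUBNET UP" check described for the OpenSM log files in item 262 is easy to script. The following is a minimal sketch under stated assumptions: the helper name `subnet_is_up` is hypothetical, and the sample log lines are invented for illustration and do not reproduce real opensm output.

```python
def subnet_is_up(log_text: str) -> bool:
    """Return True if any line of the given log text contains 'SUBNET UP'."""
    return any("SUBNET UP" in line for line in log_text.splitlines())


# Invented sample text, for illustration only.
sample_ok = "entering standby\nSUBNET UP\n"
sample_bad = "entering standby\nsome error reported\n"

print(subnet_is_up(sample_ok))
print(subnet_is_up(sample_bad))

# On a host running opensm, the same check could be applied to the log file
# named in the text (reading it typically requires root privileges):
#   with open("/var/log/opensm.log") as f:
#       print(subnet_is_up(f.read()))
```

A check like this only confirms that the message appeared at some point; it does not replace reviewing the errors that opensm.log reports.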
263. nMPI compiled with MXM v2.x.

5.3.1 Compiling OpenMPI with MXM

Step 1. Install MXM from RPM.
rpm -ihv mxm-x.y.z-1.x86_64.rpm
MXM will be installed automatically in the /opt/mellanox/mxm folder.

Step 2. Enter the OpenMPI source directory and run:
cd $OMPI_HOME
./configure --with-mxm=/opt/mellanox/mxm <other configure parameters>
make all && make install

Note: MLNX_OFED v2.0 or later comes with a pre-installed version of MXM v2.x and OpenMPI compiled with MXM v2.x. To check the version of MXM installed on your host, run:
rpm -qi mxm

To upgrade MLNX_OFED v2.0 or later with a newer MXM:

Step 1. Remove MXM v1.1.
rpm -e mxm

Step 2. Remove the pre-compiled OpenMPI.
rpm -e mlnx-openmpi_gcc

Step 3. Install the new MXM and compile the OpenMPI with it.

To run OpenMPI without MXM, run:
mpirun -mca mtl ^mxm <...>

When upgrading to MXM v0.52, OpenMPI compiled with the previous versions of the MXM should be recompiled with MXM v0.52.

5.3.2 Enabling MXM in OpenMPI
MXM v0.52 is automatically selected by OpenMPI (up to v1.6) when the Number of Processes (NP) is higher than or equal to 128. To enable MXM for any NP, use the following OpenMPI parameter:
-mca mtl_mxm_np <number>

From OpenMPI v1.7, MXM is selected when the number of processes is higher than or equal to 0, i.e. by default.

To activate MXM for any NP, run
264. nager will need to be restarted by restarting Subnet Manager to allow it to configure the AR on this switch. This option can be changed on-the-fly.

AR_MODE <bounded|free>
Adaptive Routing Mode:
free: no constraints on output port selection.
bounded: the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
This option can be changed on-the-fly.
Default: bounded

AGEING_TIME <usec>
Applicable to bounded AR mode only. Specifies how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value).
This option can be changed on-the-fly.
Default: 30

MAX_ERRORS <N>, ERROR_WINDOW <N>
When the number of send/receive errors or timeouts exceeds MAX_ERRORS in less than ERROR_WINDOW seconds, the AR Manager will abort, returning control back to the Subnet Manager.
This option can be changed on-the-fly.
Values for both options: [0-0xffff]
MAX_ERRORS = 0: zero tolerance, abort configuration on first error
ERROR_WINDOW = 0: mechanism disabled, no error checking
Default: 5

LOG_FILE <full path>
AR Manager log file.
This option can be changed on-the-fly.
Default: /var/log/armgr.log

LOG_SIZE <size in MB>
This option defines the maximal AR Manager log file size in MB.
0: unlimited log file size
T
265. ng the following command:
dhclient -cf <client conf file> <IB network interface name>

Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:
The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier irit 8 00 500 8 00 5 00 OO 8 012 Be tre 090 00 8 02 ee e 013 e 0 602 is se
}

Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf:
The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 2208 0 0 SSS 8104 tte 0 8 000 6 010 6 00 81010 2 0 0 8 010 8 00 80 2 6 9 6 8 0 278 00 8 273 SA
}

In order to use the configuration file, run:
host1# dhclient -cf dhclient.conf ib1

4.3.3.2 Static IPoIB Configuration
If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the '-n' option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
- A static IPoIB configuration
- An IPoIB configuration based on an Ethernet configuration

See your Linux distribution documentation for additional information about configuring IP addresses.

The following code lines are an excerpt from a sample IPoIB configuration file: Sta
266. ng the guest interface and its virtual bridge have the same MTU value (MTU 4092 Bytes). For further information on MTU settings, please refer to the Hypervisor User Manual.

Tune the TCP/IP stack using sysctl (dom0/domu):
/sbin/sysctl_perf_tuning

Other performance tuning for the KVM environment, such as vCPU pinning and NUMA tuning, may apply. For further information, please refer to the Hypervisor User Manual.

Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over physically contiguous pages. It enables a user application to ask low level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers to the Contiguous Pages. To activate, set the below environment variables with values of PREFER_CONTIG or CONTIG:
- For QP: MLX_QP_ALLOC_TYPE
- For CQ: MLX_CQ_ALLOC_TYPE

The following are all the possible values that can be allocated to the buffer:

Table 3 - Buffer Values
Possible Value    Description
ANON              Use current pages (ANON small ones)
HUGE              Force huge pages
CONTIG            Force contiguous pages
PREFER_CONTIG     Try contiguous, fallback to ANON small pages (Default)
PREFER_HUGE       Try huge, fallback to ANON small pages
ALL               Try huge, fallback to contiguous; if failed, fallback to ANON small pages

Values are NOT case s
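As an illustration of the Contiguous Pages settings in item 266, the two environment variables can be prepared for a child process as below. This is a sketch, not a Mellanox-provided interface: the helper `contig_env` is hypothetical, the allowed values are copied from Table 3, and values are matched exactly as spelled in the table.

```python
import os

# Values listed in Table 3 (Buffer Values).
ALLOWED_VALUES = {"ANON", "HUGE", "CONTIG", "PREFER_CONTIG", "PREFER_HUGE", "ALL"}


def contig_env(qp="PREFER_CONTIG", cq="PREFER_CONTIG", base=None):
    """Return a copy of the environment with the QP/CQ allocation types set.

    Hypothetical helper: it only builds the environment mapping and does not
    verify that the installed driver honors these variables.
    """
    env = dict(os.environ if base is None else base)
    for name, value in (("MLX_QP_ALLOC_TYPE", qp), ("MLX_CQ_ALLOC_TYPE", cq)):
        if value not in ALLOWED_VALUES:
            raise ValueError(f"{name}: unsupported value {value!r}")
        env[name] = value
    return env


env = contig_env(qp="CONTIG", base={})
print(env)

# The mapping could then be passed to an application launch, for example:
#   subprocess.run(["./my_rdma_app"], env=contig_env())   # app name is made up
```

Building the environment separately keeps the parent shell untouched and makes it easy to compare runs with different allocation types.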
267. nly with Host2 vm2. In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2 will not be able to communicate with each other, nor with Dom0.

This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm-1s will have the same PKey and both vm-2s will have the same PKey (different from the vm-1's), and the Dom0's will have the default PKey (different from the vm's PKeys at index 0).

OpenSM must be used to configure the physical PKey tables on both hosts.
- The physical PKey table on both hosts (Dom0) will be configured by OpenSM to be:
  index 0 = 0xffff
  index 1 = 0xb000
  index 2 = 0xb030
- The vm1's virt-to-physical PKey mapping will be:
  pkey_idx 0 = 1
  pkey_idx 1 = 0
- The vm2's virt-to-phys PKey mapping will be:
  pkey_idx 0 = 2
  pkey_idx 1 = 0

so that the default PKey will reside on the vms at index 1 instead of at index 0.

The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs will all use different PKeys.

To partition IPoIB communication using PKeys:

Step 1. Create a file "/etc/opensm/partitions.conf" on the host on which OpenSM runs, containing the lines:
Default=0x7fff,ipoib : ALL=full ;
Pkey1=0x3000,ipoib : ALL=full;
Pkey3=0x3030,ipoib : ALL=full;
This will cause OpenSM to configure the physical Port PKey tables on a
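The relation between the partitions.conf values in item 267 (0x7fff, 0x3000, 0x3030) and the physical table entries (0xffff, 0xb000, 0xb030) follows from the full-membership bit, which is the most significant bit of the 16-bit PKey. A minimal sketch, with hypothetical helper names:

```python
FULL_MEMBER_BIT = 0x8000  # most significant bit of the 16-bit PKey


def full_member_pkey(base: int) -> int:
    """PKey value with the full-membership bit set, for a 15-bit base PKey."""
    return (base & 0x7FFF) | FULL_MEMBER_BIT


def is_full_member(pkey: int) -> bool:
    """True if the PKey carries full membership."""
    return bool(pkey & FULL_MEMBER_BIT)


# Base values from the partitions.conf lines, all declared with ALL=full.
for name, base in (("Default", 0x7FFF), ("Pkey1", 0x3000), ("Pkey3", 0x3030)):
    print(name, hex(base), "->", hex(full_member_pkey(base)))
```

This reproduces the three physical table entries given in the text and explains why a partition configured as 0x3000 shows up as 0xb000 on the ports.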
268. ns 0 on success, or the value of errno on failure. For further information, please refer to the ibv_exp_destroy_flow man page.

Ethtool
The ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool manpage for all the ways to specify a flow.

Examples:
- ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2
  All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).
- ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
  All packets that contain the above source IP address and destination port are to be steered into rx-ring 2. When the destination MAC is not given, the user's destination MAC is filled automatically.
- ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 vlan 45 m 0xf000 loc 5 action 2
  All packets that contain the above destination MAC address and specific VLAN are steered into ring 2. Please pay attention to the VLAN's mask 0xf000. It is required in order to add such a rule.
- ethtool -u eth5
  Shows all of ethtool's steering rules.

When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table.

Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in k
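The VLAN mask 0xf000 in the ethtool example of item 268 can be read as "match the 12-bit VLAN ID, ignore the upper four bits of the tag". The sketch below assumes ethtool's convention that bits set in a user-supplied mask are ignored during matching; the helper names are hypothetical, and this models the matching logic only, not the driver.

```python
def make_tci(vid: int, pcp: int = 0, dei: int = 0) -> int:
    """Build a 16-bit VLAN TCI: 3-bit priority, 1-bit DEI, 12-bit VLAN ID."""
    return ((pcp & 0x7) << 13) | ((dei & 0x1) << 12) | (vid & 0x0FFF)


def vlan_rule_matches(tci: int, rule_vlan: int = 45, ignore_mask: int = 0xF000) -> bool:
    """Compare only the bits NOT set in ignore_mask (assumed ethtool semantics)."""
    care = ~ignore_mask & 0xFFFF
    return (tci & care) == (rule_vlan & care)


print(vlan_rule_matches(make_tci(45)))          # VLAN 45, priority 0
print(vlan_rule_matches(make_tci(45, pcp=5)))   # VLAN 45, priority 5
print(vlan_rule_matches(make_tci(46)))          # a different VLAN ID
```

Under that assumption, the rule steers every frame tagged with VLAN 45 regardless of its priority bits, which is why the mask accompanies the `vlan 45` keyword.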
269. nt, a virtual machine can be exposed to the physical network by performing the following steps: Step 1. Create a virtual bridge. Step 2. Attach the para-virtualized interface created by the eth_ipoib driver to the bridge. Step 3. Attach the Ethernet interface in the Virtual Machine to that bridge. The diagram below describes the topology that was created after these steps. [Figure: Virtual Interface(s) (vifX), Virtual Bridge(s) (vbrX, a.k.a. vSwitch), Bridge Uplink(s) (eIPoIB), IPoIB Uplink, InfiniBand Fabric] The diagram shows how the traffic from the Virtual Machine goes to the virtual bridge in the Hypervisor and from the bridge to the eIPoIB interface. The eIPoIB interface is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath. 4.9.1 Enabling the eIPoIB Driver: Once the mlnx_ofed driver installation is completed, perform the following: Step 1. Open the /etc/infiniband/openib.conf file and include: E_IPOIB_LOAD=yes Step 2. Restart the InfiniBand drivers: /etc/init.d/openibd restart 4.9.2 Configuring the Ethernet Tunneling Over IPoIB Driver: When eth_ipoib is loaded, a number of eIPoIB interfaces are created with the following default naming scheme: ethX, where X represents the ETH port available on the system. To check which eIPoIB
270. ntory file with all nodes, ports and paths; a = run all validation tests (expecting an input inventory); v = only validate the given inventory file; s = run service registration, deregistration and lease test; e = run event forwarding test; f = flood the SA with queries according to the stress mode; m = multicast flow; q = QoS info: dump VLArb and SLtoVL tables; t = run trap 64/65 flow (this flow requires running of external tool). Default: all flows except QoS. -w, --wait: This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec. -d, --debug: This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows: -d0 = Ignore other SM nodes; -d1 = Force single-threaded dispatching; -d2 = Force log flushing after each log message; -d3 = Disable multicast support. -m, --max_lid: This option specifies the maximal LID number to be searched for during inventory file build. Default: 100. -g, --guid: This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port. -p, --port: This option displays a menu of possible local port GUID values with which osmtest could bind. -i, --inventory: This option spe
271. onization and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object. SHMEM libraries implement active messaging. The sending of data involves only one CPU, where the source processor puts the data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this. 5.1.1 Mellanox ScalableSHMEM: The ScalableSHMEM programming library is a one-sided communications library that supports a unique set of parallel programming features, including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application. Mellanox ScalableSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED), and also has the ability to utilize Mellanox Messaging libraries (MXM) as well as Mellanox Fabric Collective Accelerations (FCA), providing an unprecedented level of scalability for SHMEM programs running over InfiniBand. The latest ScalableSHMEM software can be downloaded from the Mellanox website. 5.1.2 Running SHMEM w
272. PhysLinkState: LinkUp; LinkWidthSupported: 1X or 4X; LinkWidthEnabled: 1X or 4X; LinkWidthActive: 4X; LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps; LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps; LinkSpeedActive: 5.0 Gbps. > ibportstate -C mthca0 -D 0 1 PortInfo: # Port info: DR path slid 65535; dlid 65535; 0 port 1 LinkState: Down; PhysLinkState: Polling; LinkWidthSupported: 1X or 4X; LinkWidthEnabled: 1X or 4X; LinkWidthActive: 4X; LinkSpeedSupported: 2.5 Gbps; LinkSpeedEnabled: 2.5 Gbps; LinkSpeedActive: 2.5 Gbps. 3. Change the speed of a port. First query for the current configuration: > ibportstate -C mlx4_0 -D 0 1 PortInfo: # Port info: DR path slid 65535; dlid 65535; 0 port 1 LinkState: Initialize; PhysLinkState: LinkUp; LinkWidthSupported: 1X or 4X; LinkWidthEnabled: 1X or 4X; LinkWidthActive: 4X; LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps; LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps; LinkSpeedActive: 5.0 Gbps. Now change the enabled link speed: > ibportstate -C mlx4_0 -D 0 1 speed 2 ibportstate -C mlx4_0 -D 0 1 speed 2 Initial PortInfo: # Port info: DR path slid 65535; dlid 65535; 0 port 1 LinkSpeedEnabled: 2.5 Gbps After PortInfo set: # Port info: DR path slid 65535; dlid 65535; 0 port 1 LinkSpeedEnabled: 5.0 Gbps IBA
273. 4.9.1 Enabling the eIPoIB Driver; 4.9.2 Configuring the Ethernet Tunneling Over IPoIB Driver; 4.9.3 VLAN Configuration Over an eIPoIB Interface; 4.9.4 Setting Performance Tuning; 4.10 Contiguous Pages; 4.11 Shared Memory Region; 4.12 XRC - eXtended Reliable Connected Transport Service for InfiniBand; 4.13 Flow Steering; 4.13.1 Enable/Disable Flow Steering; 4.13.2 Flow Domains and Priorities; 4.14 Single Root IO Virtualization (SR-IOV); 4.14.1 System Requirements; 4.14.2 Setting Up SR-IOV; 4.14.3 Enabling SR-IOV and Para Virtualization on the Same Setup; 4.14.4 Assigning a Virtual Function to a Virtual Machine; 4.14.5 Uninstalling SR-IOV Driver; 4.14.6 Configuring Pkeys and GUIDs under SR-IOV; 4.14.7 Running Network Diagnostic Tools on a Virtual Function; 4.15 Quantized Congestion Control; 4.15.1 QCN Tool - mlnx_qcn; 4.15.2 Setting QCN Configuration; 4.16 CORE-Direct
274. opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available. In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies; it is topology agnostic and fares well in the face of faults. It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path. The algorithm was developed by Simula Research Laboratory. Use the '-R lash -Q' option to activate the LASH algorithm. Note: QoS support has to be turned on in order that SL/VL mappings are used. Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead. For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. I
275. or counters after read. -k and -K can be used together to clear both errors and counters. --clear-counts, -K: Clear data counters after read. CAUTION: clearing data counters will occur regardless of whether they are printed or not. This is because data counters are only printed on ports which have errors. This means that if a port has 0 errors and the -K option is specified, the data counters will be cleared without any printed output. --details: Includes receive error and transmit discard details. --load-cache <filename>: Loads and uses the cached ibnetdiscover data stored in the specified filename. May be useful for outputting and learning about other fabrics or a previous state of a fabric. Cannot be used if the user specifies a direct route path. See ibnetdiscover for information on caching ibnetdiscover output. -R: This option is obsolete and has no effect. -d: Raises the IB debugging level. May be used several times (-ddd or -d -d -d). -e: Shows send and receive errors (time-outs and others). Table 27 - ibqueryerrors Flags and Options (Flags; Description): -h: Shows the usage message. -v: Increases the application verbosity level. May be used several times (-vv or -v -v -v). -C <ca_name>: Uses the specified ca_name. -P <ca_port>: Uses the specified ca_port. -t <timeout_ms>: Overrides the default timeout for the solicited MADs. E
276. or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details. The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file. 2.3.6 Installation Logging: While installing MLNX_OFED, the install log for each selected package will be saved in a separate log file. The path to the directory containing the log files will be displayed after running the installation script in the following format: "Logs dir: /tmp/MLNX_OFED_LINUX-<version>.<PID>.logs". Example: Logs dir: /tmp/MLNX_OFED_LINUX-2.2-0.0.2.31701.logs 2.3.7 openibd Script: As of MLNX_OFED v2.2-1.0.0, the openibd script supports pre/post start/stop scripts. This can be controlled by setting the variables below in the /etc/infiniband/openibd.conf file: OPENIB_PRE_START, OPENIB_POST_START, OPENIB_PRE_STOP, OPENIB_POST_STOP. Example: OPENIB_POST_START=/sbin/openibd_post_start.sh 2.4 Updating Firmware After Installation: In case you ran the mlnxofedinstall script with the '--without-fw-update' option and now you wish to (manually) update firmware on your adapter card(s), you need to perform the following steps. Note: If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image". The followin
277. or switch failures that result in a fabric for which torus-2QoS can generate credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, consider that same 2D 6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree: [Figure: master spanning tree for the 2D 6x5 torus with the failed link; axes x (0 to 5) and y] Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing. In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case:
278. ority in the fabric. ibdiagnet.pm: A dump of the PM counters values of the fabric links. ibdiagnet.pkey: A dump of the existing partitions and their member host ports. ibdiagnet.mcg: A dump of the multicast groups, their properties and member host ports. ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option. In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output. After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes: SM report; Number of nodes and systems; Hop-count information (maximal hop-count, an example path, and a hop-count histogram); All CA-to-CA paths traced; Credit loop report; mgid-mlid-HCAs multicast group and report; Partitions report; IPoIB report. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for ... Note: In case the IB fabric includes only one CA, the
279. orrectly; 173: Failed to start the mst driver. 2.3.3 Installation Procedure: Step 1. Log in to the installation machine as root. Step 2. Mount the ISO image on your machine: host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt Step 3. Run the installation script. Logs dir: /tmp/MLNX_OFED_LINUX-2.2-0.0.9.10694.logs This program will install the MLNX_OFED_LINUX package on your machine. Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed. Uninstalling the previous version of MLNX_OFED_LINUX Starting MLNX_OFED_LINUX-2.2-0.0.9 installation ... Installing mlnx-ofa_kernel RPM Preparing... mlnx-ofa_kernel Installing kmod-mlnx-ofa_kernel 2.2 RPM Preparing... kmod-mlnx-ofa_kernel Installing mlnx-ofa_kernel-devel RPM Preparing... mlnx-ofa_kernel-devel Installing kmod-kernel-mft-mlnx 3.6.0 RPM Preparing... kmod-kernel-mft-mlnx Installing knem-mlnx RPM Preparing... knem-mlnx Installing kmod-knem-mlnx 1.1.1.90mlnx RPM Preparing... kmod-knem-mlnx Installing ummunotify-mlnx RPM Preparing... ummunotify-mlnx Installing kmod-ummunotify-mlnx 1.0 RPM Preparing... kmod-ummunotify-mlnx Installing kmod-iser 1.2 RPM Preparing... Installing kmod-srp 1.3.2 RPM Preparing... Installing mpi-selector RP
280. otes for the Mellanox Firmware Tools: see under the docs/ folder of the installed package. Support and Updates Webpage: please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc. 1 Mellanox OFED Overview. 1.1 Introduction to Mellanox OFED: Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers. All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions. Mellanox OFED is certified with the following products: Mellanox Messaging Accelerator (VMA) software: socket acceleration library that performs OS bypass for standard socket-based applications. Mellanox Unified Fabric Manager (UFM) software: powerful platform for managing demanding scale-out computing fabric environments, built on top of the OpenSM industry-standard routing engine. Fabric Collective Accelerator (FCA): FCA is a Mellanox MPI-integrated software package that utilizes CORE-Direct technology for
281. ou would like to use if there is a non-Ethernet active port in the card: -x MXM_RDMA_PORTS=mlx4_0:1 or -x MXM_IB_PORTS=mlx4_0:1 5.4 Fabric Collective Accelerator: The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) process to the server CPUs. As a system-wide solution, FCA does not require any additional hardware. The FCA manager creates a topology-based collective tree and orchestrates an efficient collective operation using the CPUs in the servers that are part of the collective operation. FCA accelerates MPI collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime. Note: MLNX_OFED v2.0 or later comes with a pre-installed version of FCA v2.x. FCA is built on the following main principles: Topology-aware Orchestration: The MPI collective logical tree is matched to the physical topology. The collective logical tree is constructed to assure maximum utilization of fast inter-core communication and distribution of the results. Communication Isolation: Collective communications are isolated from the rest of the traffic in the fabric using a private virtual network (VLane), eliminating contention with other types of traffic. After MLNX_OFED installation, FCA can be found at /opt/m
282. ow version info. -C <ca_name> (Optional): Use the specified channel adapter or router. -P <ca_port> (Optional): Use the specified port. -t <timeout_ms> (Optional): Override the default timeout for the solicited MADs (msec). Table 34 - smpquery Flags and Options (Flag; Optional/Mandatory; Default If Not Specified; Description): <op> (Mandatory): Supported operations: nodeinfo <addr>; nodedesc <addr>; portinfo <addr> <portnum>; switchinfo <addr>; pkeys <addr> <portnum>; sl2vl <addr> <portnum>; vlarb <addr> <portnum>; guids <addr>; mepi <addr> <portnum>. <dest dr_path | lid | guid> (Optional): Destination's directed path, LID, or GUID. Examples: 1. Query PortInfo by LID, with port modifier: > smpquery portinfo 1 1 # Port info: Lid 1 port 1 Mkey: 0x0000000000000000 GidPrefix: 0xfe80000000000000 Lid: 0x0001 SMLid: 0x0001 CapMask: 0x251086a IsSM IsTrapSupported IsAutomaticMigrationSupported IsSLMappingSupported IsSystemImageGUIDsupported IsCommunicatonManagementSupported IsVendorClassSupported IsCapabilityMaskNoticeSupported IsClientRegistrationSupported DiagCode: 0x0000 MkeyLeasePeriod: 0 ...
283. owing example shows how to verify the configuration: host1$ ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0 UP BROADCAST MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Step 3. Repeat Step 1 and Step 2 on the remaining interface(s). 4.3.4 Subinterfaces: You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface. This section describes how to: Create a subinterface (Section 4.3.4.1); Remove a subinterface (Section 4.3.4.2). 4.3.4.1 Creating a Subinterface: In the following procedure, ib0 is used as an example of an IB subinterface. To create a child interface (subinterface), follow this procedure: Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 1 will give a PKey with the value 0x8001. Step 2. Create a child interface by running: host1$ echo <PKey> > /sys/class/net/<
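The PKey rule in Step 1 of excerpt 283 (a 16-bit number with the most significant bit set) amounts to a single bitwise OR. A minimal sketch follows; the function name is an assumption made for the example, and the code only reproduces the arithmetic described in the text.

```python
MSB_16 = 0x8000  # most significant bit of a 16-bit value


def effective_pkey(value):
    """Return the PKey actually used for a child interface created with 'value'."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("PKey must be a 16-bit unsigned value")
    return value | MSB_16


if __name__ == "__main__":
    print(hex(effective_pkey(1)))
```

For the value 1 used in the text, this gives 0x8001.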
284. p state DOWN mode DEFAULT group default qlen 1000 ... vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto vf 37 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto vf 38 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable vf 39 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable When a MAC is ff:ff:ff:ff:ff:ff, the VF is not assigned to the port of the net device it is listed under. In the example above, vf 38 is not assigned to the same port as p1p1, in contrast to vf0. However, even VFs that are not assigned to the net device can be used to set and change its settings. For example, the following is a valid command to change the spoof check: ip link set dev p1p1 vf 38 spoofchk on This command will affect only vf 38. The changes can be seen in ip link on the net device that this device is assigned to. 4.14.6.3.4 Mapping VFs to Ports using the mlnx_get_vfs.pl Tool: To map the PCI representation in BDF to the respective ports: mlnx_get_vfs.pl The output is as follows: BDF 0000:04:00.0 Port 1: 2 vf0 0000:04:00.1 vf1 0000:04:00.2 Port 2: 2 vf2 0000:04:00.3 vf3 0000:04:00.4 Both: 1 vf4 0000:04:00.5 4.14.6.3.5 RoCE Support: RoCE is supported on Virtual Functions, and VLANs may be used with it. For RoCE, the hy
285. partition assigned to the LUN (/dev/cciss/c0d0p9 in the example above) must not contain any valuable data, as this data will be destroyed by the installation process taking place later in this procedure. Step 4. Edit the /etc/sysconfig/iscsi-target file as follows: OPTIONS="-c /etc/iet/ietd.conf --address=12.7.6.30" Step 5. Start the iSCSI target service: [root@sqa030 ~]# /etc/init.d/iscsi-target start Step 6. Perform a sanity check by connecting to the iSCSI target from a remote PC on the 10GE network link. E.2 Configuring the DHCP Server: Edit a host declaration for your PXE client in the DHCP configuration file, serving it with pxelinux.0, and restart your DHCP server. Here is an example of such a host declaration inside the DHCP config file: host qadell011 { filename "pxelinux.0"; next-server 12.7.6.30; option host-name "qadell011"; fixed-address 12.7.6.11; hardware ethernet 00:02:C9:E5:D8:E0; } E.3 Configuring the PXE Server: Step 1. Download SLES11SP3-kISO-VPI.tgz from http://www.mellanox.com/page/products_dyn?&product_family=34&mtag=flexboot Step 2. Extract the .tgz file on the PXE server under the TFTP root directory. For example: [root@sqa030 ~]# cd /var/lib/tftpboot; tar xzvf SLES11SP3-kISO-VPI.tgz The following are examples of PXE server configuration. PXE server configuration: SLES11SP3-kISO-VPI/pxelinux.cfg/default. Kernel and initrd for the installation
286. pendently configured as either IB or Eth. 6.1 Port Type Management: ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded. Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices. Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart". Possible port types are: eth (Ethernet); ib (InfiniBand); auto (link sensing mode: detect the port type based on the attached network type; if no link is detected, the driver retries link sensing every few seconds). The port link type can be configured for each device in the system at run time using the "/sbin/connectx_port_config" script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device. This utility also has a non-interactive mode: /sbin/connectx_port_config -d|--device <PCI device ID> -c|--conf <port1,port2> 6.2 Auto Sensing: Auto Sensing enables the NIC to automatically sense the link typ
287. pensm Syntax"; Appendix C, "mlx5 Module Parameters". Added the following sections: Section 1.5, "RDMA over Converged Ethernet (RoCE)"; Section 4.5, "Quality of Service Ethernet", and its subsections; Section 4.12, "XRC - eXtended Reliable Connected Transport Service for InfiniBand"; Section 4.14.6, "Configuring Pkeys and GUIDs under SR-IOV", and its subsections; Section 4.17, "Ethtool". Rev 2.0-2.0.5, April 2013: Initial release. About this Manual: This preface provides general information concerning the scope and organization of this User's Manual. Intended Audience: This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers. Common Abbreviations and Acronyms: Table 2 - Abbreviations and Acronyms (Abbreviation/Acronym; Whole Word/Description): B: (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes). b: (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits). FW: Firmware. HCA: Host Channel Adapter. HW: Ha
288. pervisor GID table size is 16 entries, while the VFs share the remaining 112 entries. When the number of VFs is larger than 56, some of them will have a GID table with only a single entry, which is inadequate if the VF's Ethernet device is assigned an IP address. When setting num_vfs in the mlx4_core module parameter, it is important to check that the number of assigned IP addresses per VF does not exceed the limit for the GID table size. 4.14.7 Running Network Diagnostic Tools on a Virtual Function. 4.14.7.1 Overview: Until now, in MLNX_OFED, administrators were unable to run network diagnostics from a VF, since sending and receiving Subnet Management Packets (SMPs) from a VF was not allowed, for security reasons: SMPs are not restricted by network partitioning and may affect the physical network topology. Moreover, even the SM may be denied access from portions of the network by setting management keys unknown to the SM. However, it is desirable to grant SMP capability to certain privileged VFs, so certain network management activities may be conducted within virtual machines rather than only on the hypervisor. 4.14.7.2 Granting SMP Capability to a Virtual Function: To enable SMP capability for a VF, one must enable the Subnet Management Interface (SMI) for that VF. By default, the SMI interface is disabled for VFs. To enable SMI MADs for VFs, there are two new sysfs en
289. pg_ai_rate: 10 rpg_hai_rate: 50 rpg_gd: 8 rpg_min_dec_fac: 2 rpg_min_rate: 10 cndd_state_machine: 0 priority 7: rpg_enable: 0 rppp_max_rps: 1000 rpg_time_reset: 1464 rpg_byte_reset: 150000 rpg_threshold: 5 rpg_max_rate: 40000 rpg_ai_rate: 10 rpg_hai_rate: 50 rpg_gd: 8 rpg_min_dec_fac: 2 rpg_min_rate: 10 cndd_state_machine: 0 4.15.2 Setting QCN Configuration: Setting the QCN parameters requires updating its value for each priority. '-1' indicates no change in the current value. Example for setting 'rpg_enable' in order to enable QCN for priorities 3, 5, 6: mlnx_qcn -i eth2 --rpg_enable=-1 -1 -1 1 -1 1 1 -1 Example for setting 'rpg_hai_rate' for priorities 1, 6, 7: mlnx_qcn -i eth2 --rpg_hai_rate=60 -1 -1 -1 -1 -1 60 60 4.16 CORE-Direct. 4.16.1 CORE-Direct Overview: CORE-Direct provides a solution for off-loading the MPI collectives operations from the software library to the network. CORE-Direct accelerates MPI applications and solves the scalability issues in large-scale systems by eliminating the issues of operating system noise and jitter. It addresses the collectives communication scalability problem by off-loading a sequence of data-dependent communications to the Host Channel Adapter (HCA). This solution provides the hooks needed to support computation and communication overlap. Additionally, it provides a means to reduce the effects of system nois
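The "no change" convention from Section 4.15.2 in excerpt 289 can be sketched as a per-priority merge: every position in the update carries either a new value or the sentinel that keeps the current one. The function below is a hypothetical illustration of that rule only; it does not call the QCN tool, and which position corresponds to which priority is left as an assumption.

```python
NO_CHANGE = -1  # sentinel from the text: keep the current value for this priority


def apply_update(current, update):
    """Merge a per-priority update into the current per-priority values."""
    if len(current) != len(update):
        raise ValueError("one value per priority is required")
    return [cur if new == NO_CHANGE else new for cur, new in zip(current, update)]


if __name__ == "__main__":
    current = [0] * 8  # eight priorities, all currently disabled
    update = [-1, -1, -1, 1, -1, 1, 1, -1]
    print(apply_update(current, update))
```

Only the positions holding a value other than the sentinel change, which is why a value must be supplied for every priority even when a single one is being modified.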
290. plementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details. Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new shell (e.g., log out and log in again). Other packages (such as environment modules) provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality. The MPI selector functionality can be invoked in one of two ways: 1. The mpi-selector-menu command. This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended for all users. 2. The mpi-selector command. This command is a CLI equivalent of the mpi-selector-menu, allowing for the same functionality as mpi-selector-menu but wit
291. port-group end-port-groups qos-setup: This section of the policy file describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED; the section is parsed and ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (by default, /var/cache/opensm/opensm.opts). end-qos-setup qos-levels: Having a QoS Level named "DEFAULT" is a must; it is applied to PR/MPR requests that didn't match any of the matching rules. qos-level name: DEFAULT use: default QoS Level sl: 0 ... pkey: 0x0F00-0x0FFF qos-level-name: WholeSet end-qos-match-rule end-qos-match-rules 8.6.6 Simple QoS Policy - Details and Examples: Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query. Match rules include: Default match rule that is applied to a PR/MPR query that didn't match any of the other match rules; SDP; SDP application with a specific target TCP/IP port range; SRP with a specific target IB port GUID; RDS; IPoIB with a default PKey; IPoIB with a specific PKey; Any ULP/application with a specific Service ID in the PR/MPR query; Any ULP/appl
292. r Flags and Options (Flag; Description): --burst_size=<size>: Set the amount of messages to send in a burst when using the rate limiter. --rate_limit=<rate pps>: Set the maximum rate of sent packets. --rate_units=<units>: [Mgp] Set the units for the rate limit to MBps (M), Gbps (g) or pps (p). Raw Ethernet Options: The table below lists the Raw Ethernet flags of the command. Table 58 - raw_ethernet_bw Raw Ethernet Flags and Options (Flag; Description): -B, --source_mac: Source MAC address, in the format XX:XX:XX:XX:XX:XX (default: take the MAC address from the GID). -E, --dest_mac: Destination MAC address, in the format XX:XX:XX:XX:XX:XX (MUST be entered). -J, --dest_ip: Destination IP address, in the format X.X.X.X (used to send packets with an IP header). -j, --source_ip: Source IP address, in the format X.X.X.X (used to send packets with an IP header). -K, --dest_port: Destination port number (used to send packets with a UDP header by default, or use the --tcp flag to send a TCP header). -k, --source_port: Source port number (used to send packets with a UDP header by default, or use the --tcp flag to send a TCP header). -Z, --server: Choose the server side for the current machine (--server/--client must be selected). -P, --client: Choose the client side for the current machine (--server/--client must be selected). -v, --mac_fwd: Run the MAC forwarding test. --tcp: Send TCP packets (must include IP and ports
293. r example, only one target exists, so only one was discovered.

Step 11. Click Connect.
[Screenshot: iSCSI Initiator Discovery]
Step 12. Select onboot from the drop-down list.
[Screenshot: iSCSI Initiator Discovery]
Step 13. Click Next to exit the discovery screen.
Step 14. Go to the Connected Targets tab again to confirm the iSCSI connection with the target.
[Screenshot: iSCSI Initiator Overview]
Step 15. Click OK.
Step 16. Click Next back at the Disk Activation screen.
Step 17. Select New Installation.
[Screenshot: Installation Mode]
Step 18. Click Next.
Step 19. Complete the Clock and Time Zone configuration.
Step 20. Select Physical Machine.
[Screenshot: Server Base Scenario]
Step 21. Click Next
294. r this partition. Only the low 15 bits will be used. When omitted, the P_Key will be autogenerated.

flag: used to indicate IPoIB capability of this partition.
defmember=full|limited: specifies default membership for the port GUID list. Default is limited.

Currently recognized flags are:
- ipoib: indicates that this partition may be used for IPoIB; as a result, an IPoIB-capable MC group will be created.
- rate=<val>: specifies the rate for this IPoIB MC group (default is 3 (10 Gbps))
- mtu=<val>: specifies the MTU for this IPoIB MC group (default is 4 (2048))
- sl=<val>: specifies the SL for this IPoIB MC group (default is 0)
- scope=<val>: specifies the scope for this IPoIB MC group (default is 2 (link local))

Note that values for rate, MTU, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048). To use 4K MTU, edit that entry to "mtu=5" (5 indicates 4K MTU to that specific partition).

PortGUIDs list:
- PortGUID: GUID of the partition member EndPort. Hexadecimal numbers should start with 0x; decimal numbers are accepted too.
- full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.

There are two useful keywords for PortGUID definition:
- 'ALL' means all end ports in this subnet
- 'SELF' means the subnet manager's port

An empty list means that there are no ports in this partition.

Notes:
- White space is permitted between delimiters
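The encoding rules above (only the low 15 bits of the P_Key are used, and the MTU is given as an IBTA code rather than in bytes) can be illustrated with a short sketch. The helper names below are hypothetical and are not part of OpenSM; the code-to-bytes table is the standard IBTA MTU encoding, of which the text quotes the values 4 (2048) and 5 (4K).

```python
# Illustrative helpers only; these names are not OpenSM APIs.

# IBTA MTU encoding: code -> MTU in bytes (the text cites 4 -> 2048, 5 -> 4096).
IBTA_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def effective_pkey(pkey: int) -> int:
    """Return the part of a P_Key the partition file uses (low 15 bits)."""
    return pkey & 0x7FFF


def mtu_bytes(code: int) -> int:
    """Translate an IBTA MTU code, as written in 'mtu=<val>', to bytes."""
    if code not in IBTA_MTU_BYTES:
        raise ValueError(f"unknown IBTA MTU code: {code}")
    return IBTA_MTU_BYTES[code]


if __name__ == "__main__":
    print(hex(effective_pkey(0x8001)))  # the top (membership) bit is dropped
    print(mtu_bytes(4), mtu_bytes(5))
```

A check like this can catch a common mistake, writing mtu=4096 instead of mtu=5, before the partition file is handed to the subnet manager.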
295. rbose mode

-t <topo-file>: Specifies the topology file name
-s <sys-name>: Specifies the local system name. Meaningful only if a topology file is specified
-c <count>
-i <dev-index>: Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
-p <port-num>: Specifies the local device's port number used to connect to the IB fabric
-o <out-dir>: Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>: Specifies the expected link width
-ls <2.5|5|10>: Specifies the expected link speed
-pm: Dump all the fabric links' pm Counters into ibdiagnet.pm
-pc: Reset all the fabric links' pmCounters
-P <PM=<Trash>>: If any of the provided pm is greater than its provided value, print it to screen
-h|--help: Prints the help page information
-V|--version: Prints the version of the tool
--vars: Prints the tool's environment variables and their values

Output Files

Table 24 - ibdiagpath Output Files

Output File: Description
ibdiagpath.log: A dump of all the application reports generated according to the provided flags
ibdiagnet.pm: A dump of the Performance Counters values of the fabric links

Error Codes
1 - The path traced is unhealthy
2 - Failed to parse command line options
3 - More th
296. rdware
IB: InfiniBand
iSER: iSCSI RDMA Protocol
LSB: Least significant byte
lsb: Least significant bit
MSB: Most significant byte
msb: Most significant bit
NIC: Network Interface Card
SW: Software
VPI: Virtual Protocol Interconnect
IPoIB: IP over InfiniBand
PFC: Priority Flow Control
PR: Path Record
RDS: Reliable Datagram Sockets
RoCE: RDMA over Converged Ethernet

Table 2 - Abbreviations and Acronyms (Sheet 2 of 2)

Abbreviation / Acronym: Whole Word / Description
SDP: Sockets Direct Protocol
SL: Service Level
SRP: SCSI RDMA Protocol
MPI: Message Passing Interface
EoIB: Ethernet over InfiniBand
QoS: Quality of Service
ULP: Upper Level Protocol
VL: Virtual Lane
vHBA: Virtual SCSI Host Bus Adapter
uDAPL: User Direct Access Programming Library

Glossary

The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.

Table 3 - Glossary (Sheet 1 of 2)

Channel Adapter (CA), Host Channel Adapter (HCA): An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).
HCA Card: A network adapter card based on an InfiniBand channel adapter device.
IB Devices: Integrated circuit implementing InfiniBand compliant commu
297. rees with non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.

The algorithm also dumps the compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.

1. Ports that are connected to the same remote switch are referenced as a 'port group'.
2. The list of compute nodes (CNs) can be specified by the '-u' or '--cn_guid_file' OpenSM options.

8.5.4.1 Routing between non-CN Nodes

The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.

Spine1 Spine2
298. rface API provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.

The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabilities provide an excellent low-level interface for PGAS applications.

A SHMEM program is of a single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem, and periodically communicate with other PEs to exchange information on which the next communication phase depends.

The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).

SHMEM routines support remote data transfer through:
- "put" operations: data transfer to a different PE
- "get" operations: data transfer from a different PE
- remote pointers, allowing direct references to data objects owned by another PE

Additional supported operations are collective broadcast and reduction, barrier synchr
299. rget it detects and then exit.

Note: srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file.

- To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric).
- To execute SRP daemon as a daemon on all the ports:
  - srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log.
  - Start the srpd service script; run service srpd start

It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRP_DAEMON_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 4.1.2.6).

For the changes in openib.conf to take effect, run:
/etc/init.d/openibd restart

4.1.2.5 Multiple Connections from Initiator InfiniBand Port to the Target

Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA. In case of a single Target IB port, i.e. SRP connections use the same path, the
300. rmation, please see the '--add-kernel-support' option below.

2.3.2 Installation Script

Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure," on page 33.

Usage
./mnt/mlnxofedinstall [OPTIONS]

Options
--add-kernel-support: Add kernel support (Run mlnx_add_kernel_support.sh)
--skip-distro-check: Do not check MLNX_OFED vs Distro matching
--hugepages-overcommit: Setting 80% of MAX_MEMORY as overcommit for huge page allocation
-q: Set quiet - no messages will be printed
--without-<package>: Do not install package
--with-fabric-collector: Install fabric-collector package

2.3.2.1 mlnxofedinstall Return Codes

Table 2 lists the mlnxofedinstall script return codes and their meanings.

Table 2 - mlnxofedinstall Return Codes

Return Code: Meaning
0: The installation ended successfully
1: The installation failed
2: No firmware was found for the adapter device
22: Invalid parameter
28: Not enough free space
171: Not applicable to this system configuration. This can occur when the required hardware is not present on the system.
172: Prerequisites are not met. For example, missing the required software installed or the hardware is not configured c
301. routes. When the fabric contains switches connected with multiple parallel links, routes are distributed in a round-robin fashion across such links, and so changing the order in which CA ports are visited changes the distribution of routes across such links. This may be advantageous for some specific traffic patterns. The default is to visit CA ports in increasing port order on destination switches. Duplicate values in the list will be ignored.

EXAMPLE

# Look for a 2D (since x radix is one) 4x5 torus.
torus 1 4 5

# y is radix-4 torus dimension, need both
# ym_link and yp_link configuration.
yp_link 0x200000 0x200005   # sw @ y=0,z=0 -> sw @ y=1,z=0
ym_link 0x200000 0x20000f   # sw @ y=0,z=0 -> sw @ y=3,z=0

# z is not radix-4 torus dimension, only need one of
# zm_link or zp_link configuration.
zp_link 0x200000 0x200001   # sw @ y=0,z=0 -> sw @ y=0,z=1

next_seed

yp_link 0x20000b 0x200010   # sw @ y=2,z=1 -> sw @ y=3,z=1
ym_link 0x20000b 0x200006   # sw @ y=2,z=1 -> sw @ y=1,z=1
zp_link 0x20000b 0x20000c   # sw @ y=2,z=1 -> sw @ y=2,z=2

y_dateline -2   # Move the dateline for this seed
z_dateline -1   # back to its original position.

# If OpenSM failover is configured, for maximum resiliency
# one instance should run on a host attached to a switch
# from the first seed, and another instance should run
# on a host attached to a switch from the second seed.
# Both instances should use this torus-2QoS.conf to ensure
# path SL values do not change in the event of SM f
302. 5.1.5 Running ScalableSHMEM Application
5.2 Message Passing Interface
5.2.1 Overview
5.2.2 Prerequisites for Running MPI
5.2.3 MPI Selector - Which MPI Runs
5.2.4 Compiling MPI Applications
5.3 MellanoX Messaging
5.3.1 Compiling OpenMPI with MXM
5.3.2 Enabling MXM in OpenMPI
5.3.3 Tuning MXM Settings
5.3.4 Configuring Multi-Rail Support
5.3.5 Configuring MXM over the Ethernet Fabric
5.4 Fabric Collective Accelerator
5.5 ScalableUPC
5.5.1 Installing ScalableUPC
5.5.2 FCA Runtime Parameters
5.5.3 Various Executable Examples
Chapter 6 Working With VPI
6.1 Port Type Management
6.2 Auto Sensing
6.2.1 Enabling Auto Sensing
Chapter 7 Performance
Chapter 8 OpenSM - Subnet Manager
8.1 Overview
303. rt_num>/gid_idx/0

The value returned indicates which GUID index to modify on Dom0.

Step 2. Modify the physical GUID table via the admin_guids sysfs interface.
To configure the GUID at index <n> on port <port_num>:
cd /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids
echo <your_desired_guid> > n
Example:
cd /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids
echo "0x002fffff8118" > 3

Note:
- echo "0x0" means let the SM assign a value to that GUID
- echo "0xffffffffffffffff" means delete that GUID
- echo <any other value> means request the SM to assign this GUID to this index

Step 3. Read the administrative status of the GUID index.
To read the administrative status of GUID index m on port n:
cat /sys/class/infiniband/mlx4_0/iov/ports/<n>/admin_guids/<m>

Step 4. Check the operational state of a GUID.
/sys/class/infiniband/mlx4_0/iov/ports/<n>/gids (where n = 1 or 2)
The values indicate which GIDs are actually configured on the firmware/hardware, and all the entries are R/O.

Step 5. Compare the value you read under the "admin_guids" directory at that index with the value under the "gids" directory, to verify that the change requested in Step 3 has been accepted by the SM and programmed into the hardware port GID table.
If the value under admin_guids/<m> is different from the value under gids/<m>, the request is still in progress.

4.14.6.2.3 Multi-GUID Support in InfiniBand

As of MLNX_OFED v2.2-1.0.0, Inf
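The three cases for values written to admin_guids (0x0, all ones, anything else) can be summarized in a small sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the device name mlx4_0 is taken from the example above, and the code only builds the path and classifies the value; it does not touch sysfs.

```python
# Hypothetical helpers; they only format strings and do not write to sysfs.

DELETE_GUID = 0xFFFFFFFFFFFFFFFF  # "delete that GUID" per the note above


def admin_guid_path(port_num: int, index: int, device: str = "mlx4_0") -> str:
    """Build the admin_guids path for a given port and GUID index."""
    return f"/sys/class/infiniband/{device}/iov/ports/{port_num}/admin_guids/{index}"


def describe_admin_guid(value: str) -> str:
    """Explain what writing `value` (a hex string) to admin_guids requests."""
    guid = int(value, 16)
    if guid == 0:
        return "let the SM assign a value to that GUID"
    if guid == DELETE_GUID:
        return "delete that GUID"
    return "request the SM to assign this GUID to this index"


if __name__ == "__main__":
    print(admin_guid_path(1, 3))
    print(describe_admin_guid("0x002fffff8118"))
```

Comparing the value read back from admin_guids/<m> with gids/<m>, as Step 5 describes, remains the way to confirm that the SM accepted the request.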
304. rts are configured as Ethernet.

When an InfiniBand port exists, only the probe_vf=a syntax is valid, where "a" is a single value that represents the number of VFs. The second parameter in a triplet is valid only when there is more than one physical port. Every value (either a value in a triplet or a single value) should be less than or equal to the respective value of the num_vfs parameter.

The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per a single VM. However, the number of VFs varies upon the working mode requirements.

The protocol types are:
- Port 1 = IB, Port 2 = Ethernet
- port_type_array=2,2 (Ethernet, Ethernet)
- port_type_array=1,1 (IB, IB)
- port_type_array=1,2 (VPI: IB, Ethernet)
- NO port_type_array module parameter: ports are IB

Step 9. Reboot the server.
If SR-IOV is not supported by the server, the machine might not come out of boot/load.

Step 10. Load the driver and verify that SR-IOV is supported. Run:
lspci | grep Mellanox
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
03:00.3 InfiniBan
305. rus-2QoS cannot route without deadlock two failed switches adjacent in a dimension that is not the last dimension routed by DOR; here the failed switches are O and T:

[Figure: 2D torus grid, x = 0..5 and y = 0..4, showing switches S, D, I, m, n, O, T, p, q and r, with failed switches O and T]

In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

8.5.7.2 Multicast Routing

Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example
306. rx-usecs N] [rx-frames N]: Sets the interrupt coalescing settings when the adaptive moderation is disabled. Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.
ethtool -a eth<x>: Queries the pause frame settings.
ethtool -A eth<x> [rx on|off] [tx on|off]: Sets the pause frame settings.
ethtool -g eth<x>: Queries the ring size values.
ethtool -G eth<x> [rx <N>] [tx <N>]: Modifies the ring sizes.
ethtool -S eth<x>: Obtains additional device statistics.
ethtool -t eth<x>: Performs a self-diagnostics test.
ethtool -s eth<x> msglvl [N]: Changes the current driver message level.
ethtool -T eth<x>: Shows time stamping capabilities.
ethtool -l eth<x>: Shows the number of channels.
ethtool -L eth<x> [rx <N>] [tx <N>]: Sets the number of channels.
ethtool --show-priv-flags eth<x>: Shows driver private flags and their states (on/off). Private flags are:
- pm_qos_request_low_latency
- mlx4_rss_xor_hash_function
- qcn_disable_32_14_4_e
ethtool --set-priv-flags eth<x> <priv flag> <on/off>: Enables/disables the driver feature matching the given private flag.

4.18 Dynamically Connected Transport Service

Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining h
307. s Interface for making ib_srp connect to a new target. One can request ib_srp to connect to a new target by writing a comma-separated list of login parameters to this sysfs attribute. The supported parameters are:

id_ext: A 16-digit hexadecimal number specifying the eight-byte identifier extension in the 16-byte SRP target port identifier. The target port identifier is sent by ib_srp to the target in the SRP_LOGIN_REQ request.
ioc_guid: A 16-digit hexadecimal number specifying the eight-byte I/O controller GUID portion of the 16-byte target port identifier.
dgid: A 32-digit hexadecimal number specifying the destination GID.
pkey: A four-digit hexadecimal number specifying the InfiniBand partition key.
service_id: A 16-digit hexadecimal number specifying the InfiniBand service ID used to establish communication with the SRP target. How to find out the value of the service ID is specified in the documentation of the SRP target.
max_sect: A decimal number specifying the maximum number of 512-byte sectors to be transferred via a single SCSI command.
max_cmd_per_lun: A decimal number specifying the maximum number of outstanding commands for a single LUN.
io_class: A hexadecimal number specifying the SRP I/O class. Must be either 0xff00 (rev 10) or 0x0100 (rev 16a). The I/O clas
308. s directly with the driver and thus does not require setting up a DCBX daemon on the system.

The mlnx_qos tool enables the administrator of the system to:
- Inspect the current QoS mappings and configuration. The tool will also display maps configured by the TC and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
- Set UP to TC mapping
- Assign a transmission algorithm to each TC (strict or ETS)
- Set minimal BW guarantee to ETS TCs
- Set rate limit to TCs

For unlimited ratelimit, set the ratelimit to 0.

Usage
mlnx_qos -i <interface> [options]

Options
--version: show program's version number and exit
-h, --help: show this help message and exit
-p LIST, --prio_tc=LIST: maps UPs to TCs. LIST is 8 comma-separated TC numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1
-s LIST, --tsa=LIST: Transmission algorithm for each TC. LIST is comma-separated algorithm names for each TC. Possible algorithms: strict, ets. Example: ets,strict,ets sets TC0,TC2 to ETS and TC1 to strict. The rest are unchanged.
-t LIST, --tcbw=LIST: Set minimal guaranteed %BW for ETS TCs. LIST is comma-separated percents for each TC. Values set to TCs that are not configured to ETS algorithm are ignored, but must be present. Example: if TC0,TC2 are set to ETS, then 10,0,90 will set TC0 to 10% and TC2 to 90%. Percents must sum to 100.
-r LIST, --ratelimit=LIST: Rate
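The LIST arguments described above follow simple rules that can be checked before running the tool. The sketch below is an illustration only: the function names are hypothetical and are not part of mlnx_qos, and it assumes that "percents must sum to 100" applies to the entries of TCs set to ETS, since the text says values for other TCs are ignored.

```python
# Hypothetical validators for mlnx_qos LIST arguments; not part of the tool.

def parse_prio_tc(arg: str) -> dict:
    """Parse a prio_tc LIST (8 comma-separated TC numbers) into {UP: TC}."""
    tcs = [int(item) for item in arg.split(",")]
    if len(tcs) != 8:
        raise ValueError("prio_tc needs exactly 8 TC numbers, one per UP")
    return dict(enumerate(tcs))


def tcbw_is_valid(tsa: str, tcbw: str) -> bool:
    """Check that the percents of the ETS TCs sum to 100.

    Assumption: entries for non-ETS TCs must be present but are ignored.
    """
    algorithms = tsa.split(",")
    percents = [int(item) for item in tcbw.split(",")]
    if len(algorithms) != len(percents):
        raise ValueError("tsa and tcbw lists must have the same length")
    ets_total = sum(p for a, p in zip(algorithms, percents) if a == "ets")
    return ets_total == 100


if __name__ == "__main__":
    mapping = parse_prio_tc("0,0,0,0,1,1,1,1")
    print(mapping[3], mapping[4])          # UP 3 and UP 4
    print(tcbw_is_valid("ets,strict,ets", "10,0,90"))
```

With the examples from the option descriptions, UPs 0-3 land on TC0, UPs 4-7 on TC1, and the 10,0,90 split passes the check.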
309. s defines the format of the SRP initiator and target port identifiers.
initiator_ext: A 16-digit hexadecimal number specifying the identifier extension portion of the SRP initiator port identifier. This data is sent by the initiator to the target in the SRP_LOGIN_REQ request.
cmd_sg_entries: A number in the range 1..255 that specifies the maximum number of data buffer descriptors stored in the SRP_CMD information unit itself. With allow_ext_sg=0 the parameter cmd_sg_entries defines the maximum S/G list length for a single SRP_CMD, and commands whose S/G list length exceeds this limit after S/G list collapsing will fail.
allow_ext_sg: Whether ib_srp is allowed to include a partial memory descriptor list in an SRP_CMD instead of the entire list. If a partial memory descriptor list has been included in an SRP_CMD, the remaining memory descriptors are communicated from initiator to target via an additional RDMA transfer. Setting allow_ext_sg to 1 increases the maximum amount of data that can be transferred between initiator and target via a single SCSI command. Since not all SRP target implementations support partial memory descriptor lists, the default value for this option is 0.
sg_tablesize: A number in the range 1..2048 specifying the maximum S/G list length the SCSI layer is allowed to pass to ib_srp. Specifying a value that exceeds cmd_sg_entries is only safe with partial memory descriptor list support enabled (allow_ext_sg=1).
comp_vector: A number in the range 0..n-1 specifying the MSI-X completion vector
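To illustrate the "comma-separated list of login parameters" described for this sysfs attribute, here is a minimal sketch that assembles such a string. It assumes each parameter is written as name=value, which is an assumption not stated in the text above; the function name and the sample values are hypothetical. The digit counts checked are the ones given in the parameter descriptions.

```python
import re

# Hex-digit counts taken from the parameter descriptions above.
HEX_DIGITS = {"id_ext": 16, "ioc_guid": 16, "dgid": 32, "pkey": 4,
              "service_id": 16, "initiator_ext": 16}


def build_login_string(params: dict) -> str:
    """Join login parameters as 'name=value' pairs separated by commas.

    Assumption: the name=value form; only the comma separation is stated
    in the text. Values with a documented length are validated.
    """
    for name, value in params.items():
        digits = HEX_DIGITS.get(name)
        if digits and not re.fullmatch(r"[0-9a-fA-F]{%d}" % digits, str(value)):
            raise ValueError(f"{name} must be {digits} hexadecimal digits")
    return ",".join(f"{name}={value}" for name, value in params.items())


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_login_string({"id_ext": "0" * 16, "pkey": "ffff",
                              "max_sect": 1024}))
```

Validating lengths up front gives a clearer error than a rejected write to the sysfs attribute.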
310. s for additional verbosity (-vvv or -v -v -v)
-V(ersion) (Optional): Show version info.
-a(ll) (Optional): Show all LIDs in range, including invalid entries.
-n(o_dests) (Optional): Do not try to resolve destinations.
-D(irect) (Optional): Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port), '0,1,2,1,4' (out via port 1, then 2, ...)
-G(uid) (Optional): Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-M(ulticast) (Optional): Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
-s <smlid> (Optional): Use <smlid> as the target LID for SM/SA queries.
-C <ca_name> (Optional): Use the specified channel adapter or router.
-P <ca_port> (Optional): Use the specified port.

Table 33 - ibportstate Flags and Options
(Columns: Flag; Optional / Mandatory; Default If Not Specified; Description)

-t <timeout_ms> (Optional): Override the default timeout for the solicited MADs [msec].
<dest dr_path | lid | guid> (Optional): Destination's directed path, LID, or GUID.
<startlid> (Optional): Starting LID in an MLID range.
<endlid> (Optional): Ending LID in an MLID range.

Examples
1. Dump all LIDs with valid out ports of the switch with LID 2:
> ibroute 2
Unicast lids [0x0-0x8] of
311. s/infiniband/<infiniband device>/iov

Under this directory, the following subdirectories can be found:
- ports: The actual (physical) port resource tables. Port GID tables:
  - ports/<n>/gids/<n>, where 0 <= n <= 127 (the physical port gids)
  - ports/<n>/admin_guids/<n>, where 0 <= n <= 127 (allows examining or changing the administrative state of a given GUID)
  - ports/<n>/pkeys/<n>, where 0 <= n <= 126 (displays the contents of the physical pkey table)
- <pci id> directories: one for Dom0 and one per guest. Here, you may see the mapping between virtual and physical pkey indices, and the virtual to physical gid 0. Currently, the GID mapping cannot be modified, but the pkey virtual-to-physical mapping can. These directories have the structure:
  - <pci_id>/port/<m>/gid_idx/0, where m = 1..2 (this is read-only), and
  - <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126

For instructions on configuring pkey_idx, please see below.

4.14.6.2.2 Configuring an Alias GUID (under ports/<n>/admin_guids)

Step 1. Determine the GUID index of the PCI Virtual Function that you want to pass through to a guest.
For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially need to see which GUID index is used for this function. To do so:
cat /sys/class/infiniband/iov/0000:02:00.3/port/<po
312. sages to multicast group with <num of qps> qps attached to it.
-h, --help: Show this help screen.
-i, --ib-port=<port>: Use port <port> of IB device (default 1).
-I, --inline_size=<size>: Max size of message to be sent in inline.
--post_list=<list size>: Post list of WQEs of <list size> size (instead of single post).
-m, --mtu=<mtu>: MTU size: 256 - 4096 (default port mtu).
-M, --MGID=<multicast_gid>: In multicast, uses <multicast_gid> as the group MGID.
-n, --iters=<iters>: Number of exchanges (at least 5, default 1000).
-N, --no peak-bw: Cancel peak-bw calculation (default with peak).
-O, --dualport: Run test in dual-port mode.
-p, --port=<port>: Listen on/connect to port <port> (default 18515).
-q, --qp=<num of qp's>: Num of qp's (default 1).
-Q, --cq-mod: Generate Cqe only after <--cq-mod> completion.
-r, --rx-depth=<dep>: Rx queue size (default 512). If using srq, rx-depth controls max-wr size of the srq.
-R, --rdma_cm: Connect QPs with rdma_cm and run test on those QPs.
-s, --size=<size>: Size of message to exchange (default 65536).
-S, --sl=<sl>: SL (default 0).
-t, --tx-depth=<dep>: Size of tx queue (default 128).
-T, --tos=<tos value>: Set <tos_value> to RDMA-CM QPs. Available only with the -R flag. Values 0-256 (default off).
313. secret
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 10
node.conn[0].timeo.noop_out_timeout = 15
node.session.initial_login_retry_max = 10
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.session.iscsi.FastAbort = No

Step 3. Start the iSCSI Initiator service.
[root@sqa070 ~]# service iscsi start

Step 4. Discover the iSCSI target host. In the example below, the IP address 12.7.6.30 is the iSCSI target.
[root@sqa070 ~]# iscsid start
[root@sqa070 ~]# iscsiadm -m discovery -t st -p 12.7.6.30
Starting iscsid: [ OK ]
12.7.6.30:3260,1 iqn.2013-10.qalab.com:sqa030.prt9

Achieving a successful target discovery at this stage is mandatory for proceeding with the process of iSCSI boot. A failure at this stage is probably a result of an erroneous target or network configuration, and troubleshooting that is out of the scope of this document.

Step 5. Log into the target.
[root@sqa070 ~]# iscsiadm -m node -p 12.7.6.30 -T iqn.2013-10.qalab.com:sqa030.prt9 --login
Logging in to [iface: default, target: iqn.2013-10.qalab.com:sqa030.prt9, port
314. sed
tx_pause_prio_<i>: The total number of PAUSE frames sent to the far-end port
tx_pause_duration_prio_<i>: The total time in microseconds that transmission of packets has been paused

Table 10 - Port Pause (where <i> is in the range 0...7)

Counter: Description
tx_pause_transition_prio_<i>: The number of transmitter transitions from XON state (non-paused) to XOFF state (paused)

Table 11 - VPort Statistics (where <i>=<empty_string> is the PF, and ranges 1...NumOfVf per VF)

Counter: Description
vport<i>_rx_unicast_packets: Unicast packets received successfully
vport<i>_rx_unicast_bytes: Unicast packet bytes received successfully
vport<i>_rx_multicast_packets: Multicast packets received successfully
vport<i>_rx_multicast_bytes: Multicast packet bytes received successfully
vport<i>_rx_broadcast_packets: Broadcast packets received successfully
vport<i>_rx_broadcast_bytes: Broadcast packet bytes received successfully
vport<i>_rx_dropped: Received packets discarded due to out-of-buffer condition
vport<i>_rx_errors: Received packets discarded due to receive error condition
vport<i>_tx_unicast_packets: Unicast packets sent successfully
vport<i>_tx_unicast_bytes: Unicast packet bytes s
315. sent in inline

Table 47 - ib_write_bw Flags and Options

Flag: Description
--post_list=<list size>: Post list of WQEs of <list size> size (instead of single post).
-m, --mtu=<mtu>: MTU size: 256 - 4096 (default port mtu).
-n, --iters=<iters>: Number of exchanges (at least 5, default 5000).
-N, --no peak-bw: Cancel peak-bw calculation (default with peak).
-O, --dualport: Run test in dual-port mode.
-p, --port=<port>: Listen on/connect to port <port> (default 18515).
-q, --qp=<num of qp's>: Num of qp's (default 1).
-Q, --cq-mod: Generate Cqe only after <--cq-mod> completion.
-R, --rdma_cm: Connect QPs with rdma_cm and run test on those QPs.
-s, --size=<size>: Size of message to exchange (default 65536).
-S, --sl=<sl>: SL (default 0).
-t, --tx-depth=<dep>: Size of tx queue (default 128).
-T, --tos=<tos value>: Set <tos_value> to RDMA-CM QPs. Available only with the -R flag. Values 0-256 (default off).
-u, --qp-timeout=<timeout>: QP timeout. The timeout value is 4 usec * 2^(timeout), default 14.
-V, --version: Display version number.
-w, --limit_bw: Set verifier limit for bandwidth.
-x, --gid-index=<index>: Test uses GID with GID index (Default: IB - no gid; ETH - 0).
-y, --limit_msgrate: Set verifier limit
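The QP timeout formula quoted in the table, 4 usec * 2^(timeout), is exponential, so small changes to the flag value have a large effect. The sketch below simply evaluates that formula as written in the table; the function name is hypothetical, and the InfiniBand specification's base unit is 4.096 usec, so real values are slightly larger than the rounded ones computed here.

```python
# Evaluates the formula as quoted in the flag table: 4 usec * 2^(timeout).

def qp_timeout_usec(timeout: int) -> int:
    """Return the QP timeout in microseconds for a given exponent value."""
    if timeout < 0:
        raise ValueError("timeout exponent must be non-negative")
    return 4 * 2 ** timeout


if __name__ == "__main__":
    # Default exponent from the table is 14.
    print(qp_timeout_usec(14), "usec")
    print(qp_timeout_usec(15), "usec")  # each step doubles the timeout
```

For the default of 14 this gives 65536 usec, roughly 65 ms per timeout period.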
316. sociate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.

The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier; see Section A.3.2, "Configuring the DHCP Server," on page 259.

4.3.3.1.1 DHCP Server

In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf.

The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:
dhcpd <IB network interface name> -d
Example:
host1# dhcpd ib0 -d

4.3.3.1.2 DHCP Client (Optional)

A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux)".

In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file usi
317. states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio to UP mapping. Mapping the sk_prio to the UP is done by using tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7
4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
Socket applications can use setsockopt (SK_PRIO, value) to directly set the sk_prio of the socket. In this case the ToS to sk_prio fixed mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS. In case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.
4.5.4 RoCE Quality of Service Mapping
Applications use the RDMA-CM API to create and use QPs. The following is the RoCE QoS mapping flow:
1. The application sets the ToS of the QP using the rdma_set_option option (option = RDMA_OPTION_ID_TOS, value).
2. ToS is translated into the Socket Priority (sk_prio) using a fixed translation:
TOS 0 <=> sk_prio 0
TOS 8 <=> sk_prio 2
TOS 24 <=> sk_prio 4
TOS 16 <=> sk_prio 6
3. The Socket Priority is mapped to the User Priority (UP) using the tc command. In case of a VLAN device, the parent real device is used for the purpose of this mapping. If the underlying device is a VLAN device, and the parent real device was not used for the mapping, the VLAN device's egress map is used
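The fixed ToS to sk_prio translation in step 2, and the direct socket-priority route described for socket applications, can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that the "SK_PRIO" option named in the text corresponds to the Linux SO_PRIORITY socket option (numeric value 12 is used as a fallback).

```python
import socket

# Fixed ToS -> sk_prio translation, as listed in the mapping flow above.
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def tos_to_sk_prio(tos: int) -> int:
    """Return the socket priority for one of the four listed ToS values."""
    if tos not in TOS_TO_SK_PRIO:
        raise ValueError(f"ToS {tos} is not in the fixed translation table")
    return TOS_TO_SK_PRIO[tos]


def set_socket_priority(sock: socket.socket, sk_prio: int) -> None:
    """Set sk_prio directly on a socket (assumption: Linux SO_PRIORITY)."""
    so_priority = getattr(socket, "SO_PRIORITY", 12)  # 12 is the Linux value
    sock.setsockopt(socket.SOL_SOCKET, so_priority, sk_prio)


if __name__ == "__main__":
    for tos in (0, 8, 24, 16):
        print(f"TOS {tos} <=> sk_prio {tos_to_sk_prio(tos)}")
```

Setting the priority directly avoids the four-value limit of the ToS table, which is the point made in the text.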
318. sts by various ULPs and applications running on top of these ULPs.
8.6.2 Advanced QoS Policy File
The QoS policy file has the following sections:
I) Port Groups (denoted by port-groups). This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
- Port GUID
- Port name, which is a combination of NodeDescription and IB port number
- PKey, which means that all the ports in the subnet that belong to the partition with a given PKey belong to this port group
- Partition name, which means that all the ports in the subnet that belong to the partition with a given name belong to this port group
- Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)
II) QoS Setup (denoted by qos-setup). This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
III) QoS Levels (denoted by qos-levels). Each QoS Level defines a Service Level (SL) and a few optional fields:
- MTU limit
- Rate limit
- PKey
- Packet lifetime
When a path search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAU
319. switch Lid 2 guid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies
Lid Out Destination Port Info
0x0002 000 Switch portguid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies
0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies
0x0006 007 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1
0x0007 021 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1
0x0008 008 Channel Adapter portguid 0x0002c902002582cd sw136 HCA 1
5 valid lids dumped
2. Dump all Lids with valid out ports of the switch with Lid 2:
> ibroute 2
Unicast lids 0x0 0x8 of switch Lid 2 guid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies
Lid Out Destination Port Info
0x0002 000 Switch portguid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies
0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies
0x0006 007 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1
0x0007 021 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1
0x0008 008 Channel Adapter portguid 0x0002c902002582cd sw136 HCA 1
5 valid lids dumped
3. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2:
> ibroute 2 3 7
Unicast lids 0x3 0x7 of switch Lid 2 guid 0x0002c
320. swl GUID
These keywords are used to seed the torus/mesh topology. For example, "xp_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the positive x direction, while "xm_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch. In general, it is not necessary to configure both the positive and negative directions for a given coordinate; either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases both the positive and negative coordinate directions must be specified. Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration.
x_dateline position
y_dateline position
z_dateline position
In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior. The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinat
321. t -t <milliseconds>
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
--retries <number>
This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.
--maxsmps, -n <number>
This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.
--rereg_on_guid_migr
This option, if enabled, forces OpenSM to send port info with the client reregister bit set to all nodes in the fabric when an alias GUID migrates from one physical port to another.
--aguid_inout_notice
This option enables sending GID IN/OUT notices on Alias GUIDs register/delete request to registered clients.
--sm_assign_guid_func [uniq_count | base_port]
Specifies the algorithm that the SM will use when it comes to choosing SM-assigned alias GUIDs. The default is uniq_count.
--console, -q [off | local]
This option activates the OpenSM console (default off).
--ignore-guids, -i <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by guid) that will be ignored by the link load equalization algorithm.
--hop_weights_file, -w <path to file>
This option provides the means to define a wei
322. t, rather than by just running /sys/class/infiniband_mad/umad<N>, where <N> is a digit.
srpd
The srpd service script allows automatic activation and termination of the srp_daemon utility on all system live InfiniBand ports.
srp_daemon
The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:
- Establish an SRP connection by itself (without the need to issue the "echo" command described in Section 4.1.2.2)
- Continue running in background, detecting new targets and establishing SRP connections with them (daemon mode)
- Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N>, where <N> is a digit
- Enable High Availability operation (together with Device-Mapper Multipath)
- Have a configuration file that determines the targets to connect to
srp_daemon commands equivalent to ibsrpdm:
"srp_daemon -a -o" is equivalent to "ibsrpdm"
"srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.
srp_daemon extensions to ibsrpdm:
To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num>, and to generate output suitable for "echo", you may execute
323. t timeout value is 4 usec * 2^(timeout), default 14
(implies -H) print out unsorted results (default sorted)
-V, --version: Display version number
-x, --gid-index <index>: Test uses GID with GID index (Default: IB - no gid, ETH - 0)
-z, --com_rdma_cm: Communicate with rdma_cm module to exchange data - use regular QPs
Additional Options
The table below lists the additional flags of the command.
Table 50 - Additional ib_write_lat Flags and Options
Flag: Description
--output <units>: Set verbosity output level: bandwidth, message_rate, latency_typical
--pkey_index <pkey index>: PKey index to use for QP
9.5.7 ib_atomic_bw
Calculates the BW of RDMA Atomic transactions between a pair of machines. One acts as a server and the other as a client. The client sends RDMA atomic operations to the server and calculates the BW by sampling the CPU each time it receives a successful completion. The test supports features such as Bidirectional (in which both sides send RDMA atomic operations to each other at the same time), change of MTU size, tx size, number of iterations, message size and more. Using the -a flag provides results for all message sizes.
Synopsis
Server: ib_atomic_bw [options]
Client: ib_atomic_bw [options] <hostname>
Options
The table below lists the various flags of the command.
Table
324. t types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1'). Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A. If only a single port is available, use the N/A port type for port2 (e.g. '1,4').
log maximum number of QPs per HCA (default: 19) (int)
log maximum number of SRQs per HCA (default: 16) (int)
log number of RDMARC buffers per QP (default: 4) (int)
log maximum number of CQs per HCA (default: 16) (int)
log maximum number of multicast groups per HCA (default: 13) (int)
log maximum number of memory protection table entries per HCA (default: 19) (int)
log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs for register all of the host memory limited to 30)) (int)
Enable Quality of Service support in the HCA (default: off) (bool)
Reset device on internal errors if non-zero (default is 0) (int)
Threshold for using inline data (int). Default and max value is 104 bytes. Saves a PCI read operation transaction: a packet smaller than the threshold size will be copied to the hw buffer directly.
Enable RSS for incoming UDP traffic (uint). On by default. Once disabled, no RSS for incoming UDP traffic will be done.
Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
Appendix C: mlx5 Module Parameters
The mlx5_ib module supports a single parameter used to select the profile which d
325. tc_wrap.py
The tc tool is used to set up the sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The tc_wrap.py tool will use either the sysfs or the tc tool to configure the sk_prio to UP mapping.
Usage: tc_wrap.py -i <interface> [options]
Options:
--version: show program's version number and exit
-h, --help: show this help message and exit
-u SKPRIO_UP, --skprio_up=SKPRIO_UP: maps sk_prio to UP. LIST is 16 comma separated UP. The index of an element is the sk_prio.
-i INTF, --interface=INTF: Interface name
Example: set skprio 0-2 to UP0, and skprio 3-7 to UP1 on eth4
[Example output: per-UP listing of the skprio values assigned to each UP, with the equivalent ToS values (tos 8, tos 24, tos 16)]
4.5.8.3 Additional Tools
A tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is a part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.
- mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
- tc_wrap.py (package: ofed-scripts) requires python >= 2.5
4.6 Ethernet Time
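The -u argument described above is a positional list in which the index of each element is the sk_prio and the value is the UP. A small Python sketch of how such a list could be assembled for the quoted example (skprio 0-2 to UP0, skprio 3-7 to UP1 on eth4) is shown below. The helper name and the (first, last, up) range format are hypothetical; only the -i and -u flags come from the usage text.

```python
def build_skprio_up_list(ranges):
    """Build the comma separated UP list for 'tc_wrap.py -u'.

    ranges: iterable of (first_skprio, last_skprio, up) tuples.
    The list index is the sk_prio; the value at that index is the UP.
    """
    mapping = {}
    for first, last, up in ranges:
        if not 0 <= up <= 7:
            raise ValueError(f"UP {up} is outside 0-7")
        for skprio in range(first, last + 1):
            if not 0 <= skprio <= 15:
                raise ValueError(f"sk_prio {skprio} is outside 0-15")
            mapping[skprio] = up
    if not mapping:
        raise ValueError("no sk_prio ranges given")
    length = max(mapping) + 1
    missing = [i for i in range(length) if i not in mapping]
    if missing:
        raise ValueError(f"no UP given for sk_prio values {missing}")
    return ",".join(str(mapping[i]) for i in range(length))


if __name__ == "__main__":
    up_list = build_skprio_up_list([(0, 2, 0), (3, 7, 1)])
    print(f"tc_wrap.py -i eth4 -u {up_list}")
```

Building the list from ranges rather than typing it by hand makes it harder to shift an entry by one position, which would silently remap a priority.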
326. therwise SR-IOV cannot be loaded.
Step 5. Install the MLNX_OFED driver for Linux that supports SR-IOV. SR-IOV can be enabled and managed by using one of the following methods:
- Burn firmware with SR-IOV support, where the number of virtual functions (VFs) will be set to 16: --enable-sriov
- Run the mlxconfig tool and set the SRIOV_EN parameter to "1" without re-burning the firmware: SRIOV_EN = 1
For further information, please refer to section "mlxconfig - Changing Device Configuration Tool" in the MFT User Manual (www.mellanox.com > Products > Software > Firmware Tools).
Step 6. Verify the HCA is configured to support SR-IOV:
[root@selene ~]# mstflint -dev <PCI Device> dc
1. Verify in the [HCA] section the following fields appear:
[HCA]
num_pfs = 1
total_vfs = <0-126>
sriov_en = true
Parameter: Recommended Value
num_pfs: 1. Note: This field is optional and might not always appear.
total_vfs: When using firmware version 2.31.5000 and above, the recommended value is 126. When using firmware version 2.30.8000 and below, the recommended value is 63. Note: Before setting the number of VFs in SR-IOV, please make sure your system can support that amount of VFs. Setting a number of VFs larger than what your hardware and software can support may cause your system to cease working.
sriov_en: true
2. Add the above f
327. tic settings; all values provided by this file
IPADDR_ib0=11.4.3.175
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
# Based on eth0; each '*' will be replaced with a corresponding octet from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
# Based on the first eth<n> interface that is found (for n=0,1,...); each '*' will be replaced with a corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
4.3.3.3 Manually Configuring IPoIB
This manual configuration persists only until the next reboot or driver restart.
To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:
Step 1. To configure the interface, enter the ifconfig command with the following items:
- The appropriate IB interface (ib0, ib1, etc.)
- The IP address that you want to assign to the interface
- The netmask keyword
- The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The foll
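The wildcard substitution used in the "Based on eth0" settings above (each '*' octet in the IPoIB address template is replaced with the corresponding octet of the LAN interface address) can be illustrated with a few lines of Python. This is a sketch of the substitution rule only, not of the actual configuration script; the function name and the sample LAN address are made up for the example.

```python
def fill_wildcard_octets(template: str, lan_ip: str) -> str:
    """Replace each '*' octet in template with the matching octet of lan_ip.

    Quotes around '*' (as written in the configuration file) are ignored.
    """
    template_octets = template.replace("'", "").split(".")
    lan_octets = lan_ip.split(".")
    if len(template_octets) != 4 or len(lan_octets) != 4:
        raise ValueError("both addresses must have four dotted octets")
    return ".".join(
        lan if tmpl == "*" else tmpl
        for tmpl, lan in zip(template_octets, lan_octets)
    )


if __name__ == "__main__":
    # Hypothetical eth0 address; the template is the one from the file above.
    print(fill_wildcard_octets("11.4.'*'.'*'", "192.168.3.175"))
```

With the hypothetical eth0 address above, the template resolves to the same 11.4.3.175 address used in the static example.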
328. tion queue entry and manipulated by the driver. Valid for RC transport. Default is 1, otherwise disabled.
1.3.3 Mid-layer Core
Core services include management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.
1.3.4 ULPs
IPoIB
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB pre-appends the IP datagrams with an encapsulation header, and sends the outcome over the InfiniBand transport service. The transport service is Unreliable Datagram (UD) by default, but it may also be configured to be Reliable Connected (RC). The interface supports unicast, multicast and broadcast. For details, see Chapter 4.3, "IP over InfiniBand".
iSER
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies. For further information, please refer to Chapter 4.2, "iSCSI Extensions for RDMA (iSER)".
SRP
SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to
329. tionally, it can route around multiple failed fabric links or a single failed fabric switch without introducing deadlocks, and without changing path SL values granted before the failure.
OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require a new routing calculation, e.g. when one or more CAs/RTRs/leaf switches go down, or one or more of these nodes come back after being down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
OpenSM also supports a file method which can load routes from a table; see "Modular Routing Engine" below.
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used.
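To illustrate the first stage, the following Python sketch fills a hop-count table by running a breadth-first search from every target. It is a simplified illustration, not OpenSM's implementation: it works on a plain switch adjacency map (the topology and switch names are invented for the example) and ignores the direction restrictions that Up/Down routing adds on top of the search.

```python
from collections import deque


def min_hops_from(target, links):
    """Return {node: hop count to target} using a breadth-first search."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in hops:  # first visit is the shortest path
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def min_hop_matrix(links):
    """Run one BFS per target to build the full hop-count matrix."""
    return {target: min_hops_from(target, links) for target in links}


if __name__ == "__main__":
    # Hypothetical four-switch ring: sw1-sw2-sw4-sw3-sw1.
    links = {
        "sw1": ["sw2", "sw3"],
        "sw2": ["sw1", "sw4"],
        "sw3": ["sw1", "sw4"],
        "sw4": ["sw2", "sw3"],
    }
    for target, row in min_hop_matrix(links).items():
        print(target, row)
```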
330. tly.
8.2.1 opensm Syntax
opensm [OPTIONS]
where OPTIONS are:
--version
Prints OpenSM version and exits.
--config, -F <file-name>
The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists).
--create-config, -c <file-name>
OpenSM will dump its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template.
--guid, -g <GUID in hex>
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
--lmc, -l <LMC>
This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e. multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.
--priority, -p <PRIORITY>
This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. Range goes from 0 (lowest priority) to 15 (highest).
--smkey, -k <SM
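As a quick worked example of the LMC rule above (2^LMC LIDs per port, LMC in the range 0-7), the following Python sketch tabulates the number of LIDs per port for each permitted value. The function name is illustrative only.

```python
def lids_per_port(lmc: int) -> int:
    """Number of LIDs assigned to each port for a given LMC (2^LMC)."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 1 << lmc


if __name__ == "__main__":
    for lmc in range(8):
        print(f"LMC {lmc}: {lids_per_port(lmc)} LID(s) per port")
```

At the default LMC of 0 each port has a single LID, hence a single path between any two ports; at the maximum of 7 each port is assigned 128 LIDs.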
331. tries per VF per port on the Hypervisor, under /sys/class/infiniband/mlx4_X/iov/<b.d.f>/ports/<1 or 2>. These entries are displayed only for VFs (not for the PF), and only for IB ports (not ETH ports).
The first entry, enable_smi_admin, is used to enable SMI on a VF. By default, the value of this entry is zero (disabled). When set to "1", the SMI will be enabled for the VF on the next rebind or openibd restart on the VM that the VF is bound to. If the VF is currently bound, it must be unbound and then re-bound.
The second sysfs entry, smi_enabled, indicates the current enablement state of the SMI. 0 indicates disabled, and 1 indicates enabled. This entry is read-only.
When a VF is initialized (bound), during the initialization sequence, the driver copies the requested SMI state (enable_smi_admin) for that VF/port to the operational SMI state (smi_enabled) for that VF/port, and operates according to the operational state.
Thus, the sequence of operations on the hypervisor is:
Step 1. Enable SMI for any VF/port that you wish.
Step 2. Restart the VM that the VF is bound to (or just run /etc/init.d/openibd restart on that VM).
The SMI will be enabled for the VF/port combinations that you set in Step 1 above. You will then be able to run network diagnostics from that VF.
4.14.7.3 Installing MLNX_OFED with Network Diagnostics on a VM
To install mlnx_ofed on a VF which will be enabled to run the tools, run the following on the VM
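Step 1 above amounts to writing "1" to the enable_smi_admin entry, and the result can later be read back from smi_enabled. The Python sketch below shows one way this could be scripted on the hypervisor, assuming the sysfs layout quoted in the text. The helper names, the device name mlx4_0 and the b.d.f value are placeholders, and the sysfs_root parameter exists only so the logic can be tried against a scratch directory.

```python
from pathlib import Path

SMI_ENTRIES = ("enable_smi_admin", "smi_enabled")


def smi_entry(device, bdf, port, entry, sysfs_root="/sys"):
    """Build the path of a per-VF, per-port SMI sysfs entry."""
    if port not in (1, 2):
        raise ValueError("port must be 1 or 2")
    if entry not in SMI_ENTRIES:
        raise ValueError(f"unknown entry {entry!r}")
    return (Path(sysfs_root) / "class" / "infiniband" / device
            / "iov" / bdf / "ports" / str(port) / entry)


def request_smi(device, bdf, port, sysfs_root="/sys"):
    """Step 1: request SMI for a VF/port (takes effect on the next rebind)."""
    smi_entry(device, bdf, port, "enable_smi_admin", sysfs_root).write_text("1\n")


def smi_is_enabled(device, bdf, port, sysfs_root="/sys"):
    """Read the operational (read-only) SMI state for a VF/port."""
    path = smi_entry(device, bdf, port, "smi_enabled", sysfs_root)
    return path.read_text().strip() == "1"


if __name__ == "__main__":
    # Placeholder device and b.d.f; prints the path that would be written.
    print(smi_entry("mlx4_0", "0000:04:00.1", 1, "enable_smi_admin"))
```

Keeping the requested and operational entries separate in the helpers mirrors the text: writing enable_smi_admin does not change smi_enabled until the VF is re-bound.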
332. ts: Number of received 512 to 1023 octet frames
rx_1518_bytes_packets: Number of received 1024 to 1518 octet frames
rx_1522_bytes_packets: Number of received 1519 to 1522 octet frames
rx_1548_bytes_packets: Number of received 1523 to 1548 octet frames
rx_gt_1548_bytes_packets: Number of received 1549 or greater octet frames
Table 8 - Port OUT Counters
Counter: Description
tx_packets: Total packets successfully transmitted
tx_bytes: Total bytes in successfully transmitted packets
tx_multicast_packets: Total multicast packets successfully transmitted
tx_broadcast_packets: Total broadcast packets successfully transmitted
tx_errors: Number of frames that failed to transmit
tx_dropped: Number of transmitted frames that were dropped
tx_lt_64_bytes_packets: Number of transmitted 64 or less octet frames
tx_127_bytes_packets: Number of transmitted 65 to 127 octet frames
tx_255_bytes_packets: Number of transmitted 128 to 255 octet frames
tx_511_bytes_packets: Number of transmitted 256 to 511 octet frames
tx_1023_bytes_packets: Number of transmitted 512 to 1023 octet frames
tx_1518_bytes_packets: Number of transmitted 1024 to 1518 octet frames
tx_1522_bytes_packets: Number of transmitte
333. ts and others
-h: Shows the usage message
-v: Increases the application verbosity level. Can be used several times (-vv or -v -v -v)
-V: Shows the version info
Addressing Flags
Flags: Description
-D: Uses directed path address arguments. The path is a comma separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)
-G: Uses GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
-s <smlid>: Uses 'smlid' as the target lid for SM/SA queries
Flags: Description
-C <ca_name>: Uses the specified ca_name
-P <ca_port>: Uses the specified ca_port
-t <timeout_ms>: Overrides the default timeout for the solicited mads
Multiple CA/Multiple Port Support
When no IB device or port is specified, the port to use is selected by the following criteria:
1. The first port that is ACTIVE.
2. If not found, the first port that is UP (physical link up).
If a port and/or CA name is specified, the user request is attempted to be fulfilled, and will fail if it is not possible.
Examples
Direct Routed Examples:
smpdump -D 0,1,2,3,5 16 # NODE DESC
smpdump -D 0,1,2 0x15 2 # PORT INFO, port 2
LID Routed Examples:
smpdump 3 0x15 2 # PORT INFO, lid 3 port 2
smpdump 0xa0 0x11 # NODE INFO, lid 0xa0
9.4.9 ibv_devices
Lists InfiniBand devices available for use from userspace, including node GUIDs
334. ual Function InfiniBand Ports" on page 101; Section 4.17, "Ethtool", on page 111; Appendix D, "Lustre Compilation over MLNX_OFED", page 255. Removed the following sections: Ibcheckerrs, FlexBoot Package, Mellanox FlexBoot, Burning Firmware with SR-IOV.
Table 1 - Document Revision History
Release, Date: Description
2.1-1.0.6, March 04, 2014: No changes
2.1-1.0.0, February 18, 2014: Updated the following sections:
- Section 2.3.3, "Installation Procedure", on page 33
- Section 4.14.2, "Setting Up SR-IOV", on page 91
- Section 8.6.1, "Overview", on page 165
December 2013: Added the following sections:
- Section 2.3.6, "Installation Logging", on page 39
- Section 4.6.2, "RoCE Time Stamping", on page 76 and its subsections
- Section 4.19, "PeerDirect", on page 114
- Section 4.20, "Inline-Receive", on page 114
- Section 4.21, "Ethernet Performance Counters", on page 115
- Section 4.22, "Memory Window", on page 119
- Section 4.1.2.1.1, "SRP Module Parameters", on page 46
- Section 4.1.2.1.2, "SRP Remote Ports Parameters", on page 46
- Section 4.1.2.2.1, "SRP sysfs Parameters", on page 47
- "srpd" on page 50
- Section 4.6.1.3, "Querying Time Stamping Capabilities via ethtool", on page 76
Updated the following sections:
- Section 1.5, "RDMA over Converged Ethernet (RoCE)", on page 26
- Section 2.3.3, "Installation Procedure", on page 33
335. ueryerrors [options]
Options
Table 27 - ibqueryerrors Flags and Options
Flags: Description
-s <err1,err2,...>: Suppresses the errors listed in the comma separated list provided.
Suppresses some of the common "side effect" counters. These counters usually do not indicate an error condition and can usually be safely ignored.
-G <port_guid>, -S <port_guid>, --port-guid: Reports results for the port specified. For switches, results are printed for all ports, not just switch port 0. -S is the same as -G and is provided only for backward compatibility.
-D <direct_route>: Reports results for the port specified. For switches, results are printed for all ports, not just switch port 0.
-r: Reports the port information. This includes LID, port, external port (if applicable), link speed setting, remote GUID, remote port, remote external port (if applicable), and remote node description information.
--data: Includes the optional transmit and receive data counters.
--threshold-file: Specifies an alternate threshold file. The default is /opt/ufm/files/conf/infiniband-diags/error_thresholds
--switch: Prints data for switches only.
--ca: Prints data for CAs only.
--router: Prints data for routers only.
--clear-errors, -k: Clear err
336. ult 1000)
-N, --no peak-bw: Cancel peak-bw calculation (default with peak)
-O, --dualport: Run test in dual-port mode
-p, --port <port>: Listen on/connect to port <port> (default 18515)
-q, --qp <num of qp's>: Num of qp's (default 1)
-Q, --cq-mod: Generate Cqe only after <cq-mod> completion
-r, --rx-depth <dep>: Rx queue size (default 512). If using srq, rx-depth controls max-wr size of the srq
-R, --rdma_cm: Connect QPs with rdma_cm and run test on those QPs
-s, --size <size>: Size of message to exchange (default 65536)
Table 42 - ib_send_bw Flags and Options
Flag: Description
-S, --sl <sl>: SL (default 0)
-t, --tx-depth <dep>: Size of tx queue (default 128)
-T, --tos <tos value>: Set <tos_value> to RDMA-CM QPs. Available only with -R flag. Values 0-256 (default off)
-u, --qp-timeout <timeout>: QP timeout; timeout value is 4 usec * 2^(timeout), default 14
-V, --version: Display version number
-w, --limit_bw: Set verifier limit for bandwidth
-x, --gid-index <index>: Test uses GID with GID index (Default: IB - no gid, ETH - 0)
-y, --limit_msgrate: Set verifier limit for Msg Rate
-z, --com_rdma_cm: Communicate with rdma_cm module to exchange data - use regular QPs
Additional Options
The table below lists the additional flags of the command.
Table 43 - Additional ib_s
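The QP timeout flag above takes an exponent rather than a duration: per the table, the timeout is 4 usec * 2^(timeout). A short Python sketch of that arithmetic, using the 4 usec base exactly as printed in the table (the function name is illustrative):

```python
def qp_timeout_usec(timeout: int) -> int:
    """QP timeout in microseconds for a given exponent: 4 usec * 2^timeout."""
    if timeout < 0:
        raise ValueError("timeout exponent must be non-negative")
    return 4 * 2 ** timeout


if __name__ == "__main__":
    default = qp_timeout_usec(14)
    print(f"timeout=14 -> {default} usec (about {default / 1000:.1f} ms)")
```

With the default exponent of 14 this gives 65536 usec, roughly 65.5 ms, and each increment of the exponent doubles the timeout.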
337. uman readable form.
Sample output:
IO Unit Info:
port LID: 0103
port GID: fe800000000000000002c90200402bd5
change ID: 0002
max controllers: 0x10
controller[1]
GUID: 0002c90200402bd4
vendor ID: 0002c9
device ID: 005a44
IO class: 0100
ID: LSI Storage Systems SRP Driver 200400a0b81146a1
service entries: 1
service[0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:
ibsrpdm -d <umad device>
2. Assistance in creating an SRP connection
To generate output suitable for utilization in the "echo" command of Section 4.1.2.2, add the "-c" option to ibsrpdm:
ibsrpdm -c
Sample output:
id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
To establish a connection with an SRP Target using the output from the "ibsrpdm -c" example above, execute the following command:
echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the "fdisk -l" command.
3. Discover reachable SRP Targets given an InfiniBand HCA name and por
338. uration - OK
FW image verification succeeded. Image is bootable.
9.4.18 ibv_asyncwatch
Display asynchronous events forwarded to userspace for an InfiniBand device.
Synopsis
ibv_asyncwatch
Examples
1. Display asynchronous events:
> ibv_asyncwatch
mlx4_0: async event FD 4
9.4.19 ibdump
Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX family adapters' InfiniBand ports. The dump file can be loaded by the Wireshark tool for graphical traffic analysis.
The following describes a work flow for local HCA (adapter) sniffing:
- Run ibdump with the desired options
- Run the application whose traffic you wish to analyze
- Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode)
- Open Wireshark and load the generated file
How to Get Wireshark: Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details. Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system.
In order for ibdump to function with RoCE, Flow Steering must be enabled. To do so, add the following to the /etc/modprobe.d/mlnx.conf file:
options mlx4_core log_num_mgm_entry_size=-1
and then restart the drivers.
Synopsis
ibdump [options]
Output Files
-d, --ib-dev <dev
339. urn, query: Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
-nofs (burn): Burn image in a non-failsafe manner.
-skip_is (burn): Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant sector difference is detected.
-byte_mode (burn, write): Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
-s[ilent] (burn): Do not print burn progress messages.
-y[es] (All): Non-interactive mode: Assume the answer is "yes" to all questions.
-no (All): Non-interactive mode: Assume the answer is "no" to all questions.
Table 36 - mstflint Switches (Sheet 3 of 3)
Switch (Affected/Relevant Commands): Description
-vsd <string> (burn): Write this string of up to 208 characters to VSD upon a burn command.
-use_image_ps (burn): Burn vsd as it appears in the given image; do not keep existing VSD on Flash.
-dual_image (burn): Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
-v: Print version info.
Table 37 - mstflint Commands
Comman
340. us-2QoS routing engine can provide the following functionality on a 2D/3D torus:
- Free of credit loops routing
- Two levels of QoS, assuming switches support 8 data VLs
- Ability to route around a single failed switch, and/or multiple failed links, without introducing credit loops or changing path SL values
- Very short run times, with good scaling properties as fabric size increases
8.5.7.1 Unicast Routing
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses, as follows:
  sl = 0;
  for (d = 0; d < torus_dimensions; d++)
      /* path_crosses_dateline(d) returns 0 or 1 */
      sl |= path_crosses_dateline(d) << d;
For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows:
  for (sl = 0; sl < 16; sl++)
      /* cdir(port) reports which torus coordinate direction a switch port
       * "points" in, and returns 0, 1, or 2 */
      sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport));
Thus, on a pristine 3D
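The two encodings described above can be tried out with a short Python sketch. It is a direct transcription of the two rules (one SL bit per crossed dateline, and one VL bit selected by the output port's coordinate direction), not the OpenSM source; the function names and the sample path are illustrative.

```python
def path_sl(crosses_dateline):
    """Encode which datelines a path crosses into SL bits.

    crosses_dateline: sequence of booleans, one per torus dimension
    (index 0 = x, 1 = y, 2 = z). Bit d of the result is set when the
    path crosses the dateline of dimension d.
    """
    sl = 0
    for d, crosses in enumerate(crosses_dateline):
        sl |= (1 if crosses else 0) << d
    return sl


def sl2vl(sl, oport_cdir):
    """VL bit for an output port pointing in coordinate direction 0, 1 or 2."""
    if oport_cdir not in (0, 1, 2):
        raise ValueError("coordinate direction must be 0, 1 or 2")
    return 0x1 & (sl >> oport_cdir)


if __name__ == "__main__":
    # Hypothetical path that crosses the x and z datelines but not y.
    sl = path_sl([True, False, True])
    print(f"sl = {sl:#x}")
    for cdir, axis in enumerate("xyz"):
        print(f"port pointing in {axis}: VL bit {sl2vl(sl, cdir)}")
```

For the sample path the SL is 0x5, so links pointing in x or z use VL bit 1 while links pointing in y use VL bit 0, which shows how three SL bits collapse to a single VL bit per output port.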
341. ...w Specific Parameters
  Table 10: ethtool Supported Options
  Table 11: Port IN Counters
  Table 12: Port OUT Counters
  Table 13: Port VLAN Priority Tagging (where <i> is in the range 0...7)
  Table 14: Port Pause (where <i> is in the range 0...7)
  Table 15: VPort Statistics (where <i>=<empty_string> is the PF, and ranges 1...NumOfVf per VF)
  Table 16: SW Statistics
  Table 17: Per Ring (SW) Statistics (where <i> is the ring I, per configuration)
  Table 18: Useful MPI Links
  Table 19: Runtime Parameters
  Table 20: Adaptive Routing Manager Options File
  Table 21: Adaptive Routing Manager Pre-Switch Options File
  Table 22: Congestion Control Manager General Options File
  Table 23: Congestion Control Manager Switch Options File
  Table 24: Congestion Control Manager CA Options File
  Table 25: Congestion Control Manager CC MGR Options File
  Table 26: ibdiagnet (of ibutils2) Output Files
  Table
342. ...w time stamps are generated. SOF_TIMESTAMPING_RAW/SYS determine how they are reported.
To enable time stamping for a net device: an admin-privileged user can enable or disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values.
Send-side time stamping: enabled by ifreq.hwtstamp_config.tx_type, where the possible values for hwtstamp_config->tx_type are:
    enum hwtstamp_tx_types {
        /*
         * No outgoing packet will need hardware time stamping;
         * should a packet arrive which asks for it, no hardware
         * time stamping will be done.
         */
        HWTSTAMP_TX_OFF,
        /*
         * Enables hardware time stamping for outgoing packets;
         * the sender of the packet decides which are to be
         * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
         * before sending the packet.
         */
        HWTSTAMP_TX_ON,
        /*
         * Enables time stamping for outgoing packets just as
         * HWTSTAMP_TX_ON does, but also enables time stamp insertion
         * directly into Sync packets. In this case, transmitted Sync
         * packets will not receive a time stamp via the socket error
         * queue.
         */
        HWTSTAMP_TX_ONESTEP_SYNC,
    };
Note: for send-side time stamping, currently only HWTSTAMP_TX_OFF and HWTSTAMP_TX_ON are supported.
Receive-side time stamping: enabled by ifreq.hwtstamp_config.rx_filter, where the possible values for hwtstamp_config->rx_filter are:
    enum hwtstamp_rx_filters {
        /* time stamp no incoming packet at
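As a rough illustration of the ioctl call described in this excerpt, the sketch below builds the request structures in Python with ctypes. Several things here are assumptions rather than content of the manual: the numeric constants are taken from the Linux headers linux/sockios.h and linux/net_tstamp.h, the `Ifreq` layout assumes a 64-bit Linux host, HWTSTAMP_FILTER_ALL is one receive filter value from the kernel header, and the helper names and the interface name passed by a caller are hypothetical. Running `enable_hw_timestamping` requires administrative privileges and a driver that supports hardware time stamping.

```python
import ctypes
import fcntl
import socket

SIOCSHWTSTAMP = 0x89B0        # <linux/sockios.h>
HWTSTAMP_TX_OFF = 0           # enum hwtstamp_tx_types
HWTSTAMP_TX_ON = 1
HWTSTAMP_FILTER_NONE = 0      # enum hwtstamp_rx_filters
HWTSTAMP_FILTER_ALL = 1
IFNAMSIZ = 16


class HwtstampConfig(ctypes.Structure):
    """Mirror of struct hwtstamp_config (three C ints)."""
    _fields_ = [("flags", ctypes.c_int),
                ("tx_type", ctypes.c_int),
                ("rx_filter", ctypes.c_int)]


class Ifreq(ctypes.Structure):
    """Simplified struct ifreq: name plus data pointer.
    Assumption: 64-bit Linux, where the union after the name is 24 bytes."""
    _fields_ = [("ifr_name", ctypes.c_char * IFNAMSIZ),
                ("ifr_data", ctypes.c_void_p),
                ("_pad", ctypes.c_char * 16)]


def build_request(ifname, tx_type, rx_filter):
    """Return (ifreq, config); keep config alive while ifreq is in use."""
    cfg = HwtstampConfig(flags=0, tx_type=tx_type, rx_filter=rx_filter)
    ifr = Ifreq()
    ifr.ifr_name = ifname.encode()
    ifr.ifr_data = ctypes.addressof(cfg)
    return ifr, cfg


def enable_hw_timestamping(ifname):
    """Ask the driver to time stamp all sent and received packets."""
    ifr, cfg = build_request(ifname, HWTSTAMP_TX_ON, HWTSTAMP_FILTER_ALL)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        fcntl.ioctl(sock.fileno(), SIOCSHWTSTAMP, bytes(ifr))
    # The driver may adjust the values it actually applied.
    return cfg.tx_type, cfg.rx_filter
```

The config structure is returned alongside the request so that the memory the pointer refers to stays valid for the duration of the call.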
343. ...which indicates that, if enabled, the atomic operation replied value is big-endian and contradicts the host endianness. To enable atomic operation with this endianness contradiction, use ibv_exp_create_qp to create the QP and set the IBV_EXP_QP_CREATE_ATOMIC_BE_REPLY flag on exp_create_flags.
4.8.2 Enhanced Atomic Operations
ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations are the same as those for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).
4.8.2.1 Masked Compare and Swap (MskCmpSwap)
The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64-bit target data for the "compare" check as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:
    atomic_response = *va
    if (!((compare_add ^ *va) & compare_add_mask)) then
        *va = (*va & ~(swap_mask)) | (swap & swap_mask)
    return atomic_response
The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSw
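The masked compare-and-swap semantics can be modelled in a few lines of Python. This is a software illustration of the described behavior, not a call into the verbs API; the dictionary standing in for target memory and the function name `masked_cmp_swap` are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1


def masked_cmp_swap(memory, va, compare, compare_mask, swap, swap_mask):
    """Model of MskCmpSwap on a 64-bit word stored at memory[va].

    Only the bits selected by compare_mask take part in the comparison,
    and only the bits selected by swap_mask are replaced on a match.
    Returns the original value, as the atomic response does.
    """
    old = memory[va] & MASK64
    if ((compare ^ old) & compare_mask) == 0:
        memory[va] = ((old & ~swap_mask) | (swap & swap_mask)) & MASK64
    return old


mem = {0: 0x1122334455667788}
# Compare only the low byte, swap only the top 16 bits.
old = masked_cmp_swap(mem, 0, 0x88, 0xFF, 0xAABB << 48, 0xFFFF << 48)
print(hex(old), hex(mem[0]))
```

Setting both masks to all ones reduces this to the ordinary CmpSwap behavior.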
344. ...will not have SR-IOV enabled. Triplets and single-port VFs are only valid when all ports are configured as Ethernet. When an InfiniBand port exists, only the num_vfs=a syntax is valid, where "a" is a single value that represents the number of VFs. The second parameter in a triplet is valid only when there is more than one physical port. In a triplet, x+z<=63 and y+z<=63; the maximum number of VFs on each physical port must be 63.
Parameter / Recommended Value:
port_type_array: Specifies the protocol type of the ports. It is either one array of 2 port types 't1,t2' for all devices, or a list of BDF to port_type_array 'bb:dd.f-t1;t2,...' (string). Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A. If only a single port is available, use the N/A port type for port2 (e.g. '1,4').
probe_vf:
  - If absent or zero: no VF interfaces will be loaded in the Hypervisor/host.
  - If num_vfs is a number in the range of 1-63, the driver running on the Hypervisor will itself activate that number of VFs. All these VFs will run on the Hypervisor. This number will apply to all ConnectX HCAs on that host.
  - If it is a triplet x,y,z (applies only if all ports are configured as Ethernet), the driver probes:
      x single-port VFs on physical port 1
      y single-port VFs on physical port 2 (applies only if such a port exists)
      z n-port VFs (where n is the number of physical ports on the device). Those VFs are attache
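A small sanity check for a triplet can be sketched as below. It assumes the constraints in this excerpt read as x+z <= 63 and y+z <= 63 (the operators are damaged in the extracted text), and the function name and the `num_ports` argument are hypothetical.

```python
MAX_VFS_PER_PORT = 63  # per-port limit stated in the text


def triplet_is_valid(x, y, z, num_ports=2):
    """Check an x,y,z VF triplet: x single-port VFs on port 1,
    y single-port VFs on port 2, z VFs spanning all ports."""
    if min(x, y, z) < 0:
        raise ValueError("VF counts must be non-negative")
    # The second value only makes sense when a second physical port exists.
    if num_ports < 2 and y != 0:
        return False
    return x + z <= MAX_VFS_PER_PORT and y + z <= MAX_VFS_PER_PORT


print(triplet_is_valid(8, 8, 4))    # within limits on both ports
print(triplet_is_valid(60, 0, 4))   # port 1 would carry 64 VFs
```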
345. Exit Status: If a failure to scan the fabric occurs, return -1. If the scan succeeds without errors beyond thresholds, return 0. If errors are found on ports beyond thresholds, return 1.
Files: /opt/ufm/files/conf/infiniband-diags/error_thresholds
Define threshold values for errors. File format is simple "name=val". Comments begin with '#'.
Example:
    # Define thresholds for error counters
    SymbolErrorCounter=10
    LinkErrorRecoveryCounter=10
    VL15Dropped=100
9.4.7 saquery
saquery issues the selected SA query. Node records are queried by default.
Synopsis:
    saquery [...] [-t(imeout) <msec>] [--src-to-dst <src:dst>] [--sgid-to-dgid <sgid-dgid>] [--node-name-map <node-name-map>] [<name> | <lid> | <guid>]
Options:
Table 28 - saquery Flags and Options
  -p: Gets PathRecord info
  -N: Gets NodeRecord info
  --list | -D: Gets NodeDescriptions of CAs only
  -S: Gets ServiceRecord info
  -I: Gets InformInfoRecord (subscription) info
  -L: Returns the Lids of the name specified
  -l: Returns the unique Lid of the name specified
  -G: Returns the Guids of the name specified
  -O: Returns the name for the Lid specified
  -U: Returns the na
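The threshold file format lends itself to a very small parser. The sketch below assumes "name=val" entries with '#' starting a comment, as in the ibqueryerrors convention; the punctuation is damaged in this excerpt, so treat both separators as assumptions. The function name `parse_thresholds` is hypothetical.

```python
def parse_thresholds(text):
    """Parse 'name=val' lines into a dict, ignoring '#' comments
    and blank lines."""
    thresholds = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, val = line.partition("=")
        if not sep:
            raise ValueError("expected name=val, got: %r" % raw)
        thresholds[name.strip()] = int(val)
    return thresholds


example = """
# Define thresholds for error counters
SymbolErrorCounter=10
LinkErrorRecoveryCounter=10
VL15Dropped=100
"""
print(parse_thresholds(example))
```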
346. ...ying Inline Receive capability and Inline Receive activation, the feature is transparent to the user application. When Inline Receive is active, the user application must provide a valid virtual address for the receive buffers to allow the driver to move the inline-received message to these buffers. The validity of these addresses is not checked; therefore the result of providing non-valid virtual addresses is unexpected. Connect-IB supports Inline Receive on both the requestor and the responder sides. Since data is copied at the poll CQ verb, Inline Receive on the requestor side is possible only if the user chooses IBV_SIGNAL_ALL_WR.
4.20.1 Querying Inline Receive Capability
The user application can use the ibv_exp_query_device function to get the maximum possible Inline Receive size. To get the size, the application needs to set the IBV_EXP_DEVICE_ATTR_INLINE_RECV_SZ bit in the ibv_exp_device_attr comp_mask.
4.20.2 Activating Inline Receive
To activate Inline Receive, set the required message size in the max_inl_recv field of the ibv_exp_qp_init_attr struct when calling the ibv_exp_create_qp function. The value returned in the same field is the actual Inline Receive size applied. Setting the message size may affect the WQE/CQE size.
4.21 Ethernet Performance Counters
Counters are used to provide information about how well an
347. ...ys and GUIDs under SR-IOV
4.14.6.1 Port Type Management
Port Type management is static when enabling SR-IOV (the connectx_port_config script will not work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core. This parameter may be used to set the port type uniformly for all installed ConnectX HCAs, or it may specify an individual configuration for each HCA. This parameter should be specified as an options line in the file /etc/modprobe.d/mlx4_core.conf. For example, to configure all HCAs to have Port1 as IB and Port2 as ETH, insert the following line:
    options mlx4_core port_type_array=1,2
To set HCAs individually, you may use a string of Domain:bus:device.function-x;y. For example, if you have a pair of HCAs whose PFs are 0000:04:00.0 and 0000:05:00.0, you may specify that the first will have both ports as IB and the second will have both ports as ETH as follows:
    options mlx4_core port_type_array=0000:04:00.0-1;1,0000:05:00.0-2;2
Note: Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.
4.14.6.2 Virtual Function InfiniBand Ports
Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem
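The two forms of the options line can be generated with a short helper, sketched here in Python. It assumes the mlx4_core conventions of a comma between uniform port types and 'BDF-x;y' entries joined by commas for per-device settings, with port type codes 1 for IB, 2 for ETH, 3 for auto and 4 for N/A; the separators are damaged in this excerpt, and the function names are hypothetical.

```python
# Port type codes used by port_type_array (assumed from the mlx4_core convention).
PORT_TYPES = {"ib": 1, "eth": 2, "auto": 3, "na": 4}


def uniform_option(port1, port2):
    """Options line applying the same port types to every HCA."""
    return "options mlx4_core port_type_array=%d,%d" % (
        PORT_TYPES[port1], PORT_TYPES[port2])


def per_device_option(devices):
    """Options line with an individual port type pair per PF.

    devices maps a PF address (Domain:bus:device.function) to a
    (port1, port2) pair of port type names.
    """
    entries = ["%s-%d;%d" % (bdf, PORT_TYPES[p1], PORT_TYPES[p2])
               for bdf, (p1, p2) in devices.items()]
    return "options mlx4_core port_type_array=" + ",".join(entries)


print(uniform_option("ib", "eth"))
print(per_device_option({"0000:04:00.0": ("ib", "ib"),
                         "0000:05:00.0": ("eth", "eth")}))
```

The resulting line would be placed in the mlx4_core modprobe configuration file; only PF addresses belong in the mapping, since VFs inherit their port types.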