here - Mellanox

Contents

1. ... 185
   8.17 ibdump 186
   Appendix A: Mellanox FlexBoot 189
   A.1 Overview 189
   A.2 Burning the Expansion ROM Image 191
   A.3 Subnet Manager - OpenSM 196
   A.4 TFTP Server 196
   A.5 BIOS Configuration 196
   A.6 Operation 196
   A.7 Command Line Interface (CLI) 198
   A.8 Diskless Machines 200
   A.9 iSCSI Boot 205
   A.10 WinPE 222
   Appendix B: SRP Target Driver 223
   B.1 Prerequisites and Installation 223
   B.2 How to run 223
   B.3 How to Unload/Shutdown 226
   Appendix C: mlx4 Module Parameters 227
   C.1 mlx4_core Parameters 227
   C.2 mlx4_ib Parameters 228
   C.3 mlx4_en Parameters 228
   C.4 mlx4_fc Parameters 228
   Appendix D: ib-bonding Driver for Systems using SLES10 SP3 229
   D.1 Using the ib-bonding Driver 229

   Mellanox OFED for Linux User's Manual 1 5 2 2 1 0 1 1 1000 (Mellanox Technologies Confidential)

   List of Tables (Table 1 to Table 27)
   Table 1: Typographical Conventions 12
   Table 2: Abbreviations and Acronyms
2. ... 151
   8.3.1 SYNOPSYS 152
   8.3.2 Output Files 153
   8.3.3 ERROR CODES 153
   8.4 ibdiagnet (of ibutils) - IB Net Diagnostic 153
   8.4.1 SYNOPSYS 153
   8.4.2 Output Files 155
   8.4.3 ERROR CODES 155
   8.5 ibdiagpath - IB diagnostic path 156
   8.5.1 SYNOPSYS 156
   8.5.2 Output Files 157
   8.5.3 ERROR CODES 157
   8.6 ibv_devices 158
   8.7 ibv_devinfo 158
   8.8 ibdev2netdev 160
   8.8.1 SYNOPSYS 160
   8.9 ibstatus 160
   8.10 ibportstate 163
   8.11 ibroute 168
   8.12 smpquery 172
   8.13 perfquery 175
   8.14 ibcheckerrs 179
   8.15 mstflint 181
   8.16 ibv_asyncwatch
3. ... 93
   4.2 InfiniBand Driver 94
   4.3 Ethernet Driver 94
   4.3.1 ... 94
   4.3.2 Loading the Ethernet Driver 94
   4.3.3 Unloading the Driver 94
   4.3.4 Ethernet Driver Usage and Configuration 95
   Chapter 5: Performance 97
   5.1 General System Configurations 97
   5.1.1 PCI Express (PCIe) Capabilities 97
   5.1.2 BIOS Power Management Settings 97
   5.1.3 Intel Hyper-Threading Technology 97
   5.2 Performance Tuning for Linux 97
   5.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance 97
   5.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance 98
   5.2.3 Interrupt Moderation 98
   5.2.4 Interrupt Affinity 99
   5.2.4.1 Example: Script for Setting Interrupt Affinity 99
   5.2.5 Preserving Your Performance Settings After A Reboot 100
   5.3 Performance Troubleshooting 100
   5.3.1 PCI Express Performance Trou
4. • IPoIB configuration (Section 3.8.3)
   • How to create and remove subinterfaces (Section 3.8.4)
   • How to verify IPoIB functionality (Section 3.8.5)
   • The ib-bonding driver (Section 3.8.6)

   3.8.2 IPoIB Mode Setting

   IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Connected mode. To switch to Datagram mode, edit the file /etc/infiniband/openib.conf and set SET_IPOIB_CM=no. After changing the mode, restart the driver by running:
       /etc/init.d/openibd restart
   To check the current mode used for outgoing connections, enter:
       cat /sys/class/net/ib<n>/mode

   3.8.3 IPoIB Configuration

   Unless you have run the installation script mlnxofedinstall with the flag -n, IPoIB has not been configured by the installation. Configuring IPoIB requires assigning an IP address and a subnet mask to each HCA port, as for any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on. An IPoIB configuration can be based on DHCP (Section 3.8.3.1) or on a static configuration (Section 3.8.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 3.8.3.3).

   3.8.3.1 IPoIB Configuration Based on DHCP

   Setting a
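The mode switch described in this excerpt comes down to a one-line change in openib.conf. The following is a minimal sketch of that edit, assuming the file consists of plain KEY=value lines; the helper name `set_ipoib_cm` is hypothetical and is not part of Mellanox OFED. The driver still has to be restarted afterwards, as the excerpt states.

```python
def set_ipoib_cm(conf_text: str, connected: bool) -> str:
    """Return conf_text with SET_IPOIB_CM set to 'yes' (Connected mode)
    or 'no' (Datagram mode). Assumes plain KEY=value lines."""
    wanted = "SET_IPOIB_CM=" + ("yes" if connected else "no")
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Only uncommented assignments are rewritten.
        if line.strip().startswith("SET_IPOIB_CM="):
            lines[i] = wanted
            replaced = True
    if not replaced:
        # The setting was absent, so append it.
        lines.append(wanted)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "ONBOOT=yes\nSET_IPOIB_CM=yes\n"
    print(set_ipoib_cm(sample, connected=False), end="")
```

Keeping the function free of file I/O makes it easy to check against a copy of the configuration before touching the real file.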
5. Note: These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.

   2. srp_daemon extensions to ibsrpdm

   • To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num>, and to generate output suitable for 'echo', you may execute:
       host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
     Note: To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'.
   • To both discover the SRP Targets and establish connections with them, add the -e option to the above command.
   • Executing srp_daemon over a port without the -a option displays only the targets that are reachable via the port and to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
   • It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. See Section 3.6.2.5 for more details.
   • srp_daemon has a configuration file; the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that specifies the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per
6.         perror("listen failed");
           exit(EXIT_FAILURE);
       }

       /* accept the client connection */
       struct sockaddr_in client_addr;
       socklen_t client_addr_len = sizeof(client_addr);
       cd = accept(sd, (struct sockaddr *)&client_addr, &client_addr_len);
       if (cd == -1) {
           perror("accept failed");
           exit(EXIT_FAILURE);
       }
       printf("accepted connection from %s:%u\n",
              inet_ntoa(client_addr.sin_addr), ntohs(client_addr.sin_port));

       /* read data from the client */
       nr = read(cd, rx_buffer, RXBUFSZ);
       if (nr == -1) {
           perror("read failed");
           exit(EXIT_FAILURE);
       } else if (nr == 0) {
           printf("socket was closed by remote host\n");
       }
       printf("read %d bytes\n", nr);
       printf("end of test\n");
       close(cd);
       close(sd);
       return 0;
   }

   3.5.6 BZCopy - Zero Copy Send

   BZCOPY mode is only effective for large block transfers. Setting the sys parameter sdp_zcopy_thresh to a non-zero value enables a non-standard SDP speedup. Messages longer than sdp_zcopy_thresh bytes cause the user space buffer to be pinned and the data to be sent directly from the original buffer. This results in less CPU usage and, on many systems, much higher bandwidth. Note that the default value of sdp_zcopy_thresh is 64KB, but it may be too low for some systems. You will need to experiment with your hardware to find the best value.

   3.5.7 Using RDMA for Small B
7. It is possible to have multiple IPoIB bonding masters, and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.

   Note: Restarting openibd does not keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart

   3.9 Quality of Service

   3.9.1 Quality of Service Overview

   Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.

   [Figure 2: I/O Consolidation Over InfiniBand]

   QoS over Mellanox OFED for Linux is discussed in Chapter 7, "OpenSM - Subnet Manager". The basic need is to differentiate the service levels provided to different traffic flows, so that a policy can be enforced and can control each flow's utilization of fabric resources. The InfiniBand Architecture Specification defines several hardware features and management interfaces for sup
8.     > modprobe 8021q
   Add a VLAN device:
       > vconfig add eth2 7
   Assign an IP address to the VLAN interface. This should create a new entry in the GID table (as index 1):
       > ifconfig eth2.7 7.10.11.12
   Verbs test:
   On server:
       > ibv_rc_pingpong -g 1
   On client:
       > ibv_rc_pingpong -g 1 server
   For rdma_cm applications, the user only needs to specify the IP address of a VLAN device for the traffic to go out with VLAN-tagged frames.

   3.1.8 Reading Port Counters Statistics

   Port statistics can be read in the same way as for regular InfiniBand ports. The information is available from sysfs at /sys/class/infiniband/<device>/ports/<port number>/counters/, and the supported counters are port_rcv_packets, port_xmit_packets, port_rcv_data and port_xmit_data. These counters count InfiniBand data only and do not account for Ethernet traffic. For example, to read the number of transmitted packets, run:
       > cat /sys/class/infiniband/<device>/ports/<port number>/counters/port_xmit_packets
   Note: RoCE traffic is not shown in the associated Ethernet device's counters, since it is offloaded by the hardware and does not go through the Ethernet network driver.

   3.1.9 A Detailed Example

   This section provides a step-by-step example of using InfiniBand over Ethernet (RoCE). Installation and Dri
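Reading all four counters named in this excerpt can be scripted. The following is a sketch only, assuming the sysfs layout quoted in the excerpt (device directory, then ports, port number, counters); the function name `read_port_counters` and its `sysfs_root` parameter are hypothetical and exist so the code can also be pointed at a test directory.

```python
from pathlib import Path

# Counter names as listed in the excerpt (underscores assumed).
COUNTERS = ("port_rcv_packets", "port_xmit_packets",
            "port_rcv_data", "port_xmit_data")


def read_port_counters(device: str, port: int,
                       sysfs_root: str = "/sys/class/infiniband") -> dict:
    """Return {counter name: integer value} for one port of one device."""
    counters_dir = Path(sysfs_root) / device / "ports" / str(port) / "counters"
    return {name: int((counters_dir / name).read_text().strip())
            for name in COUNTERS}


if __name__ == "__main__":
    # "mlx4_0" is a placeholder device name; use the one present on your host.
    try:
        print(read_port_counters("mlx4_0", 1))
    except OSError as exc:
        print("counters not available:", exc)
```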
9. ibroute on page 168
   • Section 8.12, "smpquery" on page 172
   • Section 8.13, "perfquery" on page 175
   • Section 8.14, "ibcheckerrs" on page 179
   • Section 8.15, "mstflint" on page 181
   • Section 8.16, "ibv_asyncwatch" on page 185
   • Section 8.17, "ibdump" on page 186

   Utilities Usage

   This section first describes the common configuration, interface and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.

   Common Configuration, Interface and Addressing

   Topology File (Optional)

   An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is not a user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this purpose, the IB Diagnostic Tools can be provided with a "topology file", an optional configuration file that specifies the IB fabric topology in user-given names. For the diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file). To specify a topology file to a diagnostic tool, use one of the following two options:
10. ... [-C ca_name] [-P ca_port] [-t timeout_ms] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]

    Table 17 lists the various flags of the command.

    Table 17: ibportstate Flags and Options
    (columns: Flag; Optional / Mandatory; Default (If Not Specified); Description)

    -d(ebug)       Optional   Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
    -a(ll)         Optional   Show all LIDs in range, including invalid entries.
    -v(erbose)     Optional   Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
    -D(irect)      Optional   Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...).
    -G(uid)        Optional   Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
    -M(ulticast)   Optional   Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
    -s <smlid>     Optional   Use <smlid> as the target LID for SM/SA queries.
    -t             Optional   Override the default timeout f
11. B. On Initiator Machines

    On Initiator machines, manually perform the following steps:
    1. Run modprobe ib_srp
    2. Run ibsrpdm -c -d /dev/infiniband/umadX to discover a new SRP target, where:
         umad0: port 1 of the first HCA
         umad1: port 2 of the first HCA
         umad2: port 1 of the second HCA
    3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
    4. fdisk -l (will show the newly discovered SCSI disks)

    Example: Assume that you use port 1 of the first HCA in the system, i.e. mthca0:
        [root@lab104]# ibsrpdm -c -d /dev/infiniband/umad0
        id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
        [root@lab104]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

    OR

    • You can edit /etc/infiniband/openib.conf to load the SRP driver and the SRP High Availability (HA) daemon automatically, that is, set "SRP_LOAD=yes" and "SRPHA_ENABLE=yes".
    • To set up and use the HA feature, you need the dm-multipath driver and the multipath tool.
    • Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable and use the HA feature.

    The following is an example of
12. IV. Matching Rules

    A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions, all of which must match for the rule to apply. The matching expressions are defined for the following fields:
    • SRC and DST to lists of port groups
    • Service ID to a list of Service ID values or ranges
    • QoS Class to a list of QoS Class values or ranges

    3.9.4 CMA Features

    The CMA interface supports Service ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS Class. The CMA uses the provided QoS Class and Service ID in the sent PR/MPR.

    3.9.4.1 IPoIB

    IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.

    3.9.4.2 SDP

    SDP uses CMA for building its connections. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hexadecimal digits holding the remote TCP/IP Port Number to connect to.

    3.9.4.3 RDS

    RDS uses CMA and thus it is very close to SDP. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hexadecimal digits holding the TCP/IP Port Number that the protocol connects to. The default port number for RDS is 0x48CA, which makes a default
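The Service ID layouts quoted above for SDP and RDS can be worked through in a few lines. This is an illustrative sketch only: the constant and function names are hypothetical, and the prefixes are read directly off the two patterns in the excerpt, with the low 16 bits (PPPP) carrying the port number.

```python
# Prefixes taken from the patterns 0x000000000001PPPP (SDP) and
# 0x000000000106PPPP (RDS); the low 16 bits hold the port number.
SDP_PREFIX = 0x0000000000010000
RDS_PREFIX = 0x0000000001060000
RDS_DEFAULT_PORT = 0x48CA  # default RDS port stated in the excerpt


def service_id(prefix: int, port: int) -> int:
    """Combine a ULP prefix with a 16-bit TCP/IP port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hexadecimal digits")
    return prefix | port


def fmt(sid: int) -> str:
    """Format a Service ID as a 16-digit hexadecimal string."""
    return f"0x{sid:016x}"


if __name__ == "__main__":
    print(fmt(service_id(SDP_PREFIX, 0x1234)))
    print(fmt(service_id(RDS_PREFIX, RDS_DEFAULT_PORT)))
```

Values built this way are the kind that a Service ID match expression in a QoS policy rule would be written against.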
13. Note: If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.

    To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.

    Synopsis
        mstflint [switches] <command> [parameters...]

    Table 21 lists the various switches of the utility, and Table 22 lists its commands.

    Table 21: mstflint Switches (Sheet 1 of 2)
    (columns: Switch; Affected / Relevant Commands; Description)

    -h                    Print the help menu.
    -hh                   Print an extended help menu.
    -d[evice] <device>    Specify the device to which the Flash is connected.
    -guid <GUID>          GUID base value. 4 GUIDs are automatically assigned to the following values:
                            guid   -> node GUID
                            guid+1 -> port1
                            guid+2 -> port2
                            guid+3 -> system image GUID
                          Note: The port2 GUID is assigned even for a single-port HCA; the HCA ignores this value. It can be set to 0x0.
    -mac <MAC>            MAC address base value. Two MACs are automatically assigned to the following values:
                            mac   -> port1
                            mac+1 -> port2
                          Note: This switch is applicable only to Mellanox Technologies Ethernet products.
    -blank_guids          No co
14. The VLAN implementation used by EoIB uses operating systems unaware of VLANs. This is in many ways similar to switch tagging, in which an external Ethernet switch adds/strips tags on traffic, preventing the need for OS intervention. EoIB does not support OS-aware VLANs in the form of vconfig.

    Configuring VLANs

    To configure a VLAN tag for a vNic, add the VLAN tag property to the configuration file in host administered mode, or configure the vNic on the appropriate vHub in network administered mode. In host administered mode, when a vHub with the requested VLAN tag is not available, the vNic's login request will be rejected.

    • Host administered VLAN configuration in the centralized configuration file can be modified as follows: add "vid=<VLAN tag>", or remove the vid property for no VLAN.
    • Host administered VLAN configuration with ifcfg-ethX configuration files can be modified as follows: add "VNICVLAN=<VLAN tag>", or remove the VNICVLAN property for no VLAN.

    Note: Using a VLAN tag value of 0 is not recommended, because the traffic using it would not be separated from non-VLAN traffic.

    Note: For host administered vNics, the VLAN entry must be set in the BridgeX first. For further information, please refer to the BridgeX documentation.

    3.7.2.4 EoIB Multicast Configuration

    Configuring Multicast for EoIB interfaces is identical t
15. help

    3.7.3.2 ethtool

    The ethtool application is another method to retrieve interface information and change its configuration. EoIB interfaces support ethtool similarly to hardware Ethernet interfaces. The supported ethtool options include the following:
    • -c, -C: Show and update interrupt coalesce options
    • -g: Query RX/TX ring parameters
    • -k, -K: Show and update protocol offloads
    • -i: Show driver information
    • -S: Show adapter statistics
    For more information on ethtool, run: ethtool -h

    3.7.3.3 Link State

    An EoIB interface can report two different link states:
    • The physical link state of the interface, which is made up of the actual HCA port link state and the status of the vNic's connection with the BridgeX. If the HCA port link state is down, or the EoIB connection with the BridgeX has failed, the link is reported as down, because without the connection to the BridgeX the EoIB protocol cannot work and no data can be sent on the wire.
    • The mlx4_vnic driver can also report the status of the external BridgeX port by using the mlx4_vnic_info script. If the eport_state_enforce module parameter is set, the external port state is reported as the vNic interface link state. If the connection between the vNic and the BridgeX is broken (hence the external port state is unknown), the link will be reported as dow
16. http://www.mellanox.com/downloads/ofed/mpi_pinned.zip

    Please make sure that OpenSM is running in your subnet prior to running the mpi_pinned application.

    This application runs over 2 hosts using Open MPI. To run the test, perform the following steps:
    1. Run:
           export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/mpi/gcc/openmpi-1.4.2/lib64:/usr/local/cuda/lib64
       since the application is dynamically linked to the Open MPI and CUDA libraries. Here /usr/local/cuda/lib64 is the directory in which the CUDA library files are located.
    2. Run:
           export IB_USE_GPU=1
    3. Run the test on one of the hosts using the following command:
           /usr/mpi/gcc/openmpi-1.4.2/bin/mpirun -x LD_LIBRARY_PATH -x IB_USE_GPU -host <HOST 1>,<HOST 2> mpi_pinned
       where HOST 1 and HOST 2 are the hosts on which you want to run the application. Both hosts must have the hardware and the software required for GPUDirect support, as described in Section 3.2.2, "GPUDirect Installation" on page 46.

    If GPUDirect is not functional or not enabled, the test stops responding after the following lines:
        Process 0 is on l reg 6107
        Process 1 is on l reg 6108
        Host->device bandwidth for process 0: 5999.160118 MB/sec
        Host->device bandwidth for process 1: 5928.385108 MB/sec
    If GPUDirect is functional and enabled, the test's output would b
17. (ibsrpdm -c sample output)
        id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1

    b. To establish a connection with an SRP Target using the output from the 'ibsrpdm -c' example above, execute the following command:
        echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
    The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the 'fdisk -l' command.

    srp_daemon

    The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:
    • Establish an SRP connection by itself (without the need to issue the echo command described in Section 3.6.2.2)
    • Continue running in the background, detecting new targets and establishing SRP connections with them (daemon mode)
    • Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N>, where <N> is a digit
    • Enable High Availability operation (together with Device-Mapper Multipath)
    • Have a configuration file that determines the targets to connect to

    1. srp_daemon commands equivalent to ibsrpdm:
       "srp_daemon -a -o" is equivalent to "ibsrpdm"
       "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
18. [Screenshot: SLES installer Expert Partitioner, with the confirmation dialog "Really delete device /dev/sda2"]

    Step 13: In the pop-up window, click No to approve deleting the swap partition. You will be returned to the Installation Settings window. See image below.

    [Screenshot: SLES installer Expert Partitioner, partition table for /dev/sda (IET VIRTUAL DISK)]
19. 4. LASH Routing Algorithm
       Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.

    5. DOR Routing Algorithm
       Based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh.

    6. Torus-2QoS Routing Algorithm
       Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Additionally, it can route around multiple failed fabric links or a single failed fabric switch without introducing deadlocks, and without changing path SL values granted before the failure.

    OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches go down, or one or more of these nodes come back after being down. A very comm
20. ...
        qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
        qos_swe_max_vls 15
        qos_swe_high_limit 0
        qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
        qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
        qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

    VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15, or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.

    Note that the same VL may be listed multiple times in the High or Low priority arbitration tables, and it can also be listed in both tables. The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes. A high_limit value of 255 indicates
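The arithmetic in this excerpt (64-byte units per weight, 4K bytes per high_limit step) can be checked with a short sketch. It assumes a comma-separated "VL:weight" table string; all function names are hypothetical. The meaning of a high_limit of 255 is cut off in the excerpt, so the sketch returns None for that value instead of guessing.

```python
def parse_vlarb(table: str) -> list:
    """Parse 'VL:weight,VL:weight,...' into a list of (vl, weight) tuples."""
    entries = []
    for item in table.split(","):
        vl_text, weight_text = item.strip().split(":")
        vl, weight = int(vl_text), int(weight_text)
        if not 0 <= vl <= 14:
            raise ValueError(f"VL {vl} outside 0-14")
        if not 0 <= weight <= 255:
            raise ValueError(f"weight {weight} outside 0-255")
        entries.append((vl, weight))
    return entries


def bytes_per_turn(weight: int) -> int:
    """Weight counts 64-byte units; a weight of 0 means the entry is skipped."""
    return weight * 64


def high_limit_bytes(high_limit: int):
    """Bytes of high-priority traffic allowed before a low-priority packet
    gets an opportunity: high_limit times 4K bytes. 255 is a special value
    whose meaning is truncated in the excerpt, so None is returned."""
    if not 0 <= high_limit <= 255:
        raise ValueError("high_limit outside 0-255")
    if high_limit == 255:
        return None
    return high_limit * 4096


if __name__ == "__main__":
    for vl, weight in parse_vlarb("0:4,1:0,2:255"):
        print(vl, weight, bytes_per_turn(weight))
    print(high_limit_bytes(2))
```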
21.     bit_adder(ci, b1, b2, *co)
        {
            value = ci + b1 + b2;
            *co = !!(value & 2);
            return value & 1;
        }

        #define MASK_IS_SET(mask, attr) (!!((mask) & (attr)))

        bit_position = 1;
        carry = 0;
        atomic_response = 0;
        for i = 0 to 63
        {
            if (i != 0)
                bit_position = bit_position << 1;
            bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
                                    MASK_IS_SET(compare_add, bit_position), &new_carry);
            if (bit_add_res)
                atomic_response |= bit_position;
            carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)));
        }
        return atomic_response;

    3.11 Socket Acceleration

    3.11.1 Overview

    Socket acceleration reduces the latency of the supported functionalities (recv/poll/select/epoll) for all the sockets which match the policy rules mechanism.

    3.11.2 Software Dependencies

    To use the MLX4 Socket Acceleration module over Mellanox ConnectX hardware, you must load the mlx4_en driver.

    3.11.3 MLX4 Socket Acceleration Module Configuration

    To configure the MLX4 Socket Acceleration module, perform the following steps:
    1. Load the mlx4_en kernel module:
           modprobe mlx4_en enable_rx_accl=1
    2. Load the mlx4 acceleration modules:
           modprobe mlx4_accl_sys
           modprobe mlx4_accl
    3. Set the policy rules according to the usa
22. Note: The mlx4_core module can be removed only after the mlxfc service is stopped and the mlx4_en module is removed.

    3.3.2.4 Enabling/Disabling FCoE and FCoIB Services

    To enable/disable FCoE and/or FCoIB upon boot, edit the file /etc/mlxfc/mlxfc.conf and set the following variables to either YES or NO:
        # Start FCoE
        FCOE=<Yes|No>
        # Start FCoIB
        FCOIB=<Yes|No>

    3.3.3 FCoE Advanced Usage

    Advanced usage will probably be needed when connected to FCoE switches that do not support the Cisco-like FCoE DCBX auto-negotiation.

    3.3.3.1 Manual vHBA Control

    Manual control allows creating and destroying vHBAs, and signaling link-up and link-down to existing vHBAs. This is done using sysfs operations.
    When using the pre-T11 stack, the sysfs directory is located at /sys/class/mlx4_fc.
    When using the T11 stack, the sysfs directory is located at /sys/module/fcoe.
    Both directories contain the same entries. In the following, the sysfs directory is referred to as $FCSYSFS.

    To create a new vHBA on an Ethernet interface (e.g., eth3), run:
        > echo "eth3" > $FCSYSFS/create
    To destroy a previously created vHBA on an interface (e.g., eth3), run:
        > echo "eth3" > $FCSYSFS/destroy
    To signal link-up to an existing vHBA (e.g., on eth3), run:
        > echo "eth3" > $FCSYSFS/link_up
    To signal l
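The sysfs operations listed in this excerpt follow one pattern: write the interface name into an entry under the stack-specific directory. The following is a sketch only, with hypothetical helper names. The entry names create, destroy and link_up are taken from the commands above, with the underscore in link_up assumed; the entry for signaling link-down is cut off in the excerpt and is therefore left out.

```python
# Directory per stack, as given in the excerpt.
SYSFS_DIRS = {
    "pre-t11": "/sys/class/mlx4_fc",
    "t11": "/sys/module/fcoe",
}
# Entry names taken from the commands above (underscore in link_up assumed).
ACTIONS = ("create", "destroy", "link_up")


def vhba_control_path(stack: str, action: str) -> str:
    """Return the sysfs entry for a vHBA action on the given stack."""
    if stack not in SYSFS_DIRS:
        raise ValueError(f"unknown stack: {stack}")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return f"{SYSFS_DIRS[stack]}/{action}"


def vhba_control(stack: str, action: str, interface: str) -> None:
    """Equivalent of: echo "<interface>" > $FCSYSFS/<action> (needs root)."""
    with open(vhba_control_path(stack, action), "w") as entry:
        entry.write(interface + "\n")


if __name__ == "__main__":
    print(vhba_control_path("t11", "create"))
```

Separating path construction from the write keeps the privileged step small and lets the path logic be checked on a machine without the driver loaded.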
23. an optional field. If it exists, the vNic will be assigned the VLAN ID specified. This value must be between 0 and 4095. If no vid is specified, or value -1 is set, the vNic will be assigned to the default vHub associated with the GW.
    • vnic_id: A unique number per vNic, between 0 and 32K
    • bx: The BridgeX box system GUID or system name string
    • eport: The string describing the eport name

    vNic Specific Configuration Files - ifcfg-ethX

    EoIB configuration can use the ifcfg-ethX files used by the network service to derive the needed configuration. In that case, a separate file is required per vNic. Additionally, you need to update the ifcfg-ethX file and add some new attributes to it. On Red Hat Linux, the new file will be of the form:
        DEVICE=eth2
        HWADDR=00:30:48:7d:de:e4
        BOOTPROTO=dhcp
        ONBOOT=yes
        BXADDR=BX001
        BXEPORT=A10
        VNICIBPORT=mlx4_0:1
        VNICVLAN=3 (Optional field)
    The fields used in the file for vNic configuration have the following meaning:

    Table 3: Red Hat Linux mlx4_vnic.conf file format
    (columns: Field; Description)

    DEVICE     An optional field. The name of the interface that is displayed when running ifconfig. If it is not present, the trailer of the configuration file name (e.g. ifcfg-eth47 => "eth47") is used instead.
    HWADDR     The MAC address to assign to the vNic.
    BXADDR     The BridgeX box system GUID or system name string.
    BXEPORT    The string describing the eport name.
    VNICVLAN   An optional field. If it exists, the v
24. dmesg

    4 Working With VPI

    VPI allows ConnectX ports to be independently configured as either IB or Eth. If a ConnectX port is configured as Eth, it may also function as a Fibre Channel HBA.

    4.1 Port Type Management

    ConnectX ports can be individually configured to work as InfiniBand or Ethernet (or Fibre Channel over Ethernet) ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded. Running "/sbin/connectx_port_config -s" shows the current port configuration for all ConnectX devices. Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".

    Possible port types are:
    • "eth": Ethernet
    • "ib": InfiniBand
    • "auto": Link sensing mode. The port type is detected based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.

    Table 4 lists the ConnectX port configurations supported by VPI.

    Table 4: Supported ConnectX Port Configurations
    (columns: Port 1 Configuration; Port 2 Configuration)

    ib    eth

    Note that the configuration Port1 = eth and Port2 = ib is not supported. Also note that FCoE can run only on a port configured as
25. <version>.rom (IB Port 1)
    • IHOST3EX_PORT2_ROM-<version>.rom (IB Port 2)
    InfiniHost III Lx image:
    • IHOST3LX_ROM-<version>.rom

    2. Additional documents under docs/:
    • dhcpd.conf: sample DHCP configuration file
    • dhcp.patch: patch file for DHCP v3.1.3

    A.2 Burning the Expansion ROM Image

    H.17.4 Burning the Image on ConnectX / ConnectX-2

    Note: This section is valid for ConnectX / ConnectX-2 devices with firmware versions 2.7.000 or later. For earlier firmware versions, please follow the instructions in Section H.17.5 on page 191.

    Prerequisites
    1. Expansion ROM Image
       The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file FlexBoot_release_notes.txt.
    2. Firmware Burning Tools
       You need to install the Mellanox Firmware Tools (MFT) package, version 2.6.0 or later, in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Downloads.

    Image Burning Procedure
    To burn the composite image, perform the following steps:
    1. Obtain the MST device name. Run:
           # mst start
           # mst status
       The device name will be of the form mt<dev_id>_pci{_cr0|conf0}.
    2. Create and burn the composite image. Run:
           flint -dev <mst device name> brom <expansion ROM image>
    Exam
26. sl=2, ipoib : ALL=full;

7.9 Congestion Control

7.9.1 Congestion Control Overview

Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of Mellanox OFED installation.

The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).

7.9.2 Running OpenSM with Congestion Control Manager

Congestion Control (CC) Manager can be enabled or disabled through the SM options file. To do so, perform the following:
1. Create the file. Run:
    opensm -c <options-file-name>
2. Find the event_plugin_name option in the file and add ccmgr to it:
    # Event plugin name(s)
    event_plugin_name ccmgr
3. Run the SM with the new options file:
    opensm -F <options-file-name>

Congestion Control Manager can be given an options file to fine-tune the Congestion Control mechanism and the Congestion Control Manager behavior. To do so, perform the following:
1. Find the event_plugin_options option in the file and add the following: conf_file <cc-mgr-options-file-name>
Options str
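Step 2 of the procedure above (adding ccmgr to the event_plugin_name option) can be scripted. The following is a minimal sketch under stated assumptions: the function name add_ccmgr_plugin is hypothetical, the option is assumed to sit on a single line of the form "event_plugin_name <value>", an unset value is assumed to appear as "(null)", and multiple plugin names are assumed to be separated by whitespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenSM or Mellanox OFED).
# Adds "ccmgr" to the event_plugin_name line of an OpenSM options file.
# Assumptions: one "event_plugin_name <value>" line, "(null)" meaning unset,
# and whitespace-separated plugin names.
add_ccmgr_plugin() {
    opts_file=$1
    tmp_file="${opts_file}.tmp.$$"
    awk '
        $1 == "event_plugin_name" {
            found = 1
            present = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "ccmgr") present = 1
            }
            if (NF < 2 || $2 == "(null)") {
                print "event_plugin_name ccmgr"    # option was unset
            } else if (present) {
                print                              # already enabled
            } else {
                print $0 " ccmgr"                  # append to existing plugins
            }
            next
        }
        { print }
        END {
            # Option line missing altogether: add it at the end of the file.
            if (!found) print "event_plugin_name ccmgr"
        }
    ' "$opts_file" > "$tmp_file" && mv "$tmp_file" "$opts_file"
}

# Intended usage, following the three steps in the text:
#   opensm -c /tmp/opensm.opts        # step 1: create the options file
#   add_ccmgr_plugin /tmp/opensm.opts # step 2: enable the plug-in
#   opensm -F /tmp/opensm.opts        # step 3: run the SM with it
```

The function is idempotent, so running it twice leaves a single ccmgr entry. The file path /tmp/opensm.opts in the usage comment is only an illustration.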
27. and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it to a pre-saved one. See Section 7.3.2.

osmtest has the following test flows:
• Multicast Compliancy test
• Event Forwarding test
• Service Record registration test
• RMPP stress test
• Small SA Queries stress test

7.3.1 Syntax

osmtest [OPTIONS]

where OPTIONS are:

-f, --flow
This option directs osmtest to run a specific flow:
    Flow    Description
    c       create an inventory file with all nodes, ports and paths
    a       run all validation tests (expecting an input inventory)
    v       only validate the given inventory file
    s       run service registration, deregistration, and lease test
    e       run event forwarding test
    f       flood the SA with queries according to the stress mode
    m       multicast flow
    q       QoS info: dump VLArb and SLtoVL tables
    t       run trap 64/65 flow (this flow requires running of external tool)
Default: all flows except QoS.

-w, --wait
This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec.

-d, --debug
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:
    OPT     Description
    -d0     Ignore other SM nodes
    -d1     Force single threaded dispatching
    -d2     Force log flushing after each log message
    -d3     Disable multica
28. 2. Query SwitchInfo by GUID

3. Query NodeInfo by direct route
    > smpquery -D nodeinfo 0
    NodeType:................Channel Adapter
    DevId:...................0x634a
    Revision:................0x000000a0
    ...

8.13 perfquery

Applicable Hardware
All InfiniBand devices.

Description
Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them or simply reset them.

Synopsis
    perfquery [-h] [-d] [-G] [-a] [-l] [-r] [-C <ca_name>] [-P <ca_port>] [-R] [-t <timeout_ms>] [-V] [<lid|guid> [[port] [reset_mask]]]

Table 19 lists the various flags of the command.

Table 19 - perfquery Flags and O
29. Table 20 - ibcheckerrs Flags and Options (continued)
    Flag      Optional / Mandatory          Description
    <port>    Mandatory without -G flag     Use the specified port

Examples
1. Check aggregated node counter for LID 0x2
    > ibcheckerrs 2
    #warn: counter SymbolErrors = 65535 ...
    #warn: counter LinkDowned = 12 (threshold 10) lid 2 port ...
    ...

2. Check port counters for LID 2 Port 1
    > ibcheckerrs -v 2 1
    ...

3. Check the LID 2 Port 1 using the specified threshold file
    > cat <threshold file>
    SymbolErrors=100
    LinkRecovers=10
    LinkDowned=10
    RcvErrors=10
    RcvRemotePhysErrors=100
    RcvSwRelayErrors=100
    XmtDiscards=100
    ...
    LinkIntegrityErrors=10
    ExcBufOverrunErrors=10
    VL15Dropped=100

8.15 mstflint

Applicable Hardware
Mellanox InfiniBand and Ethernet devices and network adapter cards.

Description
Queries and burns a binary firmware image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.
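The threshold file in example 3 can be read by a script to decide whether a counter value is over its limit. The following is a minimal sketch; the function names threshold_for and exceeds_threshold are hypothetical, and the file is assumed to hold one NAME=VALUE entry per line. It illustrates the comparison only and is not how ibcheckerrs itself is implemented.

```shell
#!/bin/sh
# Hypothetical helpers (not part of infiniband-diags).
# Assumes a threshold file with one NAME=VALUE entry per line.

# Print the threshold configured for a counter; fail if it is not listed.
threshold_for() {
    awk -F= -v name="$2" '
        $1 == name { print $2; found = 1; exit }
        END { exit !found }
    ' "$1"
}

# Succeed (exit 0) when the observed value is above the configured threshold.
exceeds_threshold() {
    limit=$(threshold_for "$1" "$2") || return 2
    [ "$3" -gt "$limit" ]
}

# Intended usage:
#   exceeds_threshold ./thresholds LinkDowned 12 && echo "LinkDowned over limit"
```

With the thresholds from example 3, a LinkDowned value of 12 is above the limit of 10, while a SymbolErrors value of 5 is below the limit of 100. The file name ./thresholds in the usage comment is only an illustration.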
30. Driver Version
Host Driver RPM Check
HCA Firmware on HCA 0
HCA Firmware Check on HCA 0
Host Driver Initialization
Number of HCA Ports Active
Port State of Port 1 on HCA 0
Port State of Port 2 on HCA 0
Error Counter Check on HCA 0
Kernel Syslog Check
Node GUID on HCA 0

Note: After the installer completes, information about the Mellanox OFED installation, such as prefix, kernel version, and installation parameters, can be retrieved by running the command /etc/infiniband/info.

2.3.4 Installation Results

Software
• The OFED and MFT packages are installed under the /usr directory.
• The kernel modules are installed under:
  • InfiniBand subsystem:
    /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/
  • mlx4 driver:
    Under /lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4 you will find mlx4_core.ko, mlx4_en.ko, mlx4_ib.ko, mlx4_vnic.ko and mlx4_fc.ko
  • IPoIB:
    /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
  • SDP:
    /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko
  • SRP:
    /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/ulp/srp/ib_srp.ko
    /lib/modules/`uname -r`/updates/kernel/driver
31. IPoIB

SDP uses the same IP addresses and interface names as IPoIB (see IPoIB configuration in Section 9.3.3 and Section 9.3.3.3).

In case of RoCE, SDP uses the same IP addresses and interface names as the corresponding mlx4_en interfaces (see mlx4_en configuration in Section 4.3 and Section 4.3.4).

3.5.3.1 How to Know SDP Is Working

Since SDP is a transparent TCP replacement, it can sometimes be difficult to know that it is working correctly. To check whether traffic is passing through SDP or TCP, monitor the file /proc/net/sdpstats and see which counters are running.

Alternative Method: Using the sdpnetstat Program

The sdpnetstat program can be used to verify both that SDP is loaded and that it is being used. The following command shows all active SDP sockets using the same format as the traditional netstat program. Without the -S option, it shows all the information that netstat does plus SDP data.

    host1# sdpnetstat -S

Assuming that the SDP kernel module is loaded and is being used, the output of the command will be as follows:

    host1# sdpnetstat -S
    Proto Recv-Q Send-Q Local Address    Foreign Address
    sdp   0      0      ...
    sdp   0      ...                     ...:filenet-rmi

The example output above shows two active SDP sockets and contains details about the connections. If the SDP kernel module is not loaded, then the output of the command will be something like the following
32. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:

    host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
    host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0

Step 2. Enter the ping command from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command:

    host1# ping -c 5 11.4.3.176
    PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
    64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
    64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
    64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
    64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
    64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
    --- 11.4.3.176 ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 3999ms
    rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2

3.8.6 Bonding IPoIB

To create an interface configuration script for the ibX and bondX interfaces, use the standard syntax for your OS. Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver.
• Network script files for IPoIB slaves are named after the IPoIB interfaces (e.g. ifcfg-ib0).
• The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup).
• Bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence the only supported value
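The ping check in Step 2 can be turned into a scripted pass/fail test by extracting the packet-loss figure from the summary line. The following is a minimal sketch; the function name packet_loss_percent is hypothetical, and it assumes the summary format shown in the example output above ("N% packet loss" with an integer percentage).

```shell
#!/bin/sh
# Hypothetical helper (not part of Mellanox OFED).
# Reads ping output on stdin and prints the integer packet-loss percentage.
# Assumes a summary line of the form shown in the manual's example:
#   "5 packets transmitted, 5 received, 0% packet loss, time 3999ms"
packet_loss_percent() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Intended usage, with the addresses from the example above:
#   loss=$(ping -c 5 11.4.3.176 | packet_loss_percent)
#   [ "$loss" = "0" ] && echo "IPoIB link to 11.4.3.176 is up"
```

Parsing the summary line rather than relying only on the ping exit status makes partial loss visible, which can be useful when checking a newly configured IPoIB interface.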
33. InfiniBand devices.

Description
Validates an IB port (or node) and reports errors in counters above threshold. Checks the specified port (or node) and reports errors that surpassed their predefined threshold. The port address is a LID unless the -G option is used to specify a GUID address. The predefined thresholds can be dumped using the -s option, and a user-defined threshold file (using the same format as the dump) can be specified using the -T <file> option.

Synopsis
    ibcheckerrs [-h] [-b] [-v] [-G] [-T <threshold_file>] [-s] [-N | -nocolor] [-C ca_name] [-P ca_port] [-t timeout_ms] <lid|guid> [<port>]

Table 20 lists the various flags of the command.

Table 20 - ibcheckerrs Flags and Options
    Flag                   Optional / Mandatory    Description
    -h(help)               Optional                Print the help menu
    -b                     Optional                Print in brief mode. Reduce the output to show only if errors are present, not what they are
    -v                     Optional                Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
    -T <threshold_file>    Optional                Use specified threshold file
    -t <timeout_ms>        Optional                Override the default timeout for the solicited MADs [msec]
    <lid|guid>             Mandatory               Use the specified port's or node's LID/GUID (GUID with the -G option)
    -G(uid)                Optional                Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
34. [Screenshot: iSCSI Initiator Overview window, with columns Portal Address, Target Name, and Start-Up]

Step 3. Click the Add tab in the iSCSI Initiator Overview window. An iSCSI Initiator Discovery window will pop up. Enter the IP Address of your iSCSI target and click Next.

[Screenshot: iSCSI Initiator Discovery window, with IP address 10.4.3.7 and port 3260 entered, and the options No Authentication, Incoming Authentication, and Outgoing Authentication]

Step 4. Details of the discovered iSCSI target(s) will be displayed in the iSCSI Initiator Discovery window. Selec
35. -M2 Short Multicast Flow (multiple mode)
-M3 Long Multicast Flow (single mode)
-M4 Long Multicast Flow (multiple mode)

Single mode: osmtest is tested alone, with no other applications that interact with OpenSM MC.
Multiple mode: can be run with other applications using MC with OpenSM.
Without -M, default flow testing is performed.

-t
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.

-l, --log_file
This option defines the log to be the given file. By default, the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout.

-v, --verbose
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.

-V
This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to "-vf 0xFF -d 2". See the -vf option for more information about log verbosity.

-vf
This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:
    BIT     LOG LEVEL ENABLED
    0x01    ERROR (error messages)
    0x02    INFO (basic messages, low volume)
    0x04    VERBOSE (interesting
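Because each log level is one bit in the -vf flags field, a combined value is the bitwise OR of the individual bits. The following is a minimal sketch of that arithmetic; the function name log_flags is hypothetical, and it covers only the three levels legible in the table above (ERROR, INFO, VERBOSE).

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenSM).
# ORs together the log-level bits listed in the manual's table and prints
# the result as a hex flags value. Only the three levels visible in the
# excerpt are known here; any other name is rejected.
log_flags() {
    mask=0
    for level in "$@"; do
        case $level in
            ERROR)   mask=$((mask | 0x01)) ;;
            INFO)    mask=$((mask | 0x02)) ;;
            VERBOSE) mask=$((mask | 0x04)) ;;
            *)
                echo "unknown log level: $level" >&2
                return 1
                ;;
        esac
    done
    printf '0x%02X\n' "$mask"
}

log_flags ERROR INFO    # prints 0x03
```

For example, the value printed for ERROR and INFO (0x03) is what would follow the -vf option to enable both levels at once.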
36. P1
    end-port-group

    # using partitions defined in the partition policy
    port-group
        name: Partitions
        partition: Part1
        pkey: 0x1234
    end-port-group

    # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
    # or ALL (for all the nodes in the subnet)
    port-group
        name: CAs and SM
        node-type: CA, SELF
    end-port-group

end-port-groups

qos-setup
    # This section of the policy file describes how to set up SL2VL and VL
    # Arbitration tables on various nodes in the fabric.
    # However, this is not supported in OFED - the section is parsed
    # and ignored. SL2VL and VLArb tables should be configured in the
    # OpenSM options file (by default - /var/cache/opensm/opensm.opts).
end-qos-setup

qos-levels

    # Having a QoS Level named "DEFAULT" is a must - it is applied to
    # PR/MPR requests that didn't match any of the matching rules.
    qos-level
        name: DEFAULT
        use: default QoS Level
        sl: 0
    end-qos-level

    # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
    qos-level
        name: WholeSet
        sl: 1
        mtu-limit: 4
        rate-limit: 5
        pkey: 0x1234
        packet-life: 8
    end-qos-level

end-qos-levels

# Match rules are scanned in order of their appearance in the policy file.
# First matched rule takes precedence.
qos-match-rules

    # matching by single criteria: QoS class
    qos-match-rule
        use: by QoS class
        qos-cla
37. 2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class, Service-ID, and PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.
4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE and Lifetime.
6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection provide the CMA with the target IP and port number. ULPs might also provide a QoS-Class. The CMA then creates a Service-ID for the ULP and passes this ID and optional Q
38. Test ibv_rc_pingpong

Start the server first:

    # ibv_rc_pingpong -g 0 -i 2
    local address:  LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID ...
    remote address: LID 0x0000, QPN 0x04004f, PSN 0x2cdede, GID ...
    8192000 bytes in 0.01 seconds = 4730.13 Mbit/sec
    1000 iters in 0.01 seconds = 13.85 usec/iter

Then start the client:

    # ibv_rc_pingpong -g 0 -i 2 sw419
    local address:  LID 0x0000, QPN 0x04004f, PSN 0x2cdede, GID ...
    remote address: LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID ...
    8192000 bytes in 0.01 seconds = 4787.84 Mbit/sec
    1000 iters in 0.01 seconds = 13.69 usec/iter

Add VLANs

Make sure that the 8021q module is loaded:

    # modprobe 8021q

Add the VLAN device:

    # vconfig add eth2 7
    Added VLAN with VID == 7 to IF -:eth2:-

Configure an IP address for it:

    # ifconfig eth2.7 7.4.3.220

Examine the GID table:

    # cat /sys/class/infiniband/mlx4_0/ports/2/gids/0
    fe80:0000:0000:0000:0202:c9ff:fe08:e811
    # cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
    fe80:0000:0000:0000:0202:c900:0708:e811

According to the output, we now have two entries.

Run the Example Again, Now on VLAN

On the server:

    # ibv_rc_pingpong -g 1 -i 2
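The GID table inspection above reads one sysfs file per index. The following is a minimal sketch that walks a whole GID directory and prints each populated entry with its index, which helps when choosing the value for the -g option. The function name list_gids is hypothetical, and unused table entries are assumed to read as an all-zero GID.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mellanox OFED).
# Prints "<index> <gid>" for every populated entry in a GID table directory
# such as /sys/class/infiniband/mlx4_0/ports/2/gids (path from the manual).
# Assumption: unused entries read as an all-zero GID and are skipped.
list_gids() {
    gid_dir=$1
    for gid_file in "$gid_dir"/*; do
        if [ ! -f "$gid_file" ]; then
            continue
        fi
        gid=$(cat "$gid_file")
        if [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ]; then
            continue
        fi
        printf '%s %s\n' "${gid_file##*/}" "$gid"
    done
}

# Intended usage:
#   list_gids /sys/class/infiniband/mlx4_0/ports/2/gids
```

In the example above, index 0 is the GID of the untagged interface and index 1 appeared after VLAN 7 was added, so the VLAN run of ibv_rc_pingpong uses -g 1.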
39. Webpage

Chapter 1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
1.2 Introduction to Mellanox VPI Adapters
1.3 Mellanox OFED Package
1.3.1 ISO Image
1.3.2 Software Components
1.3.3 Firmware
1.3.4 Directory Structure
1.4 Architecture
1.4.1 mthca HCA Driver
1.4.2 mlx4 VPI Driver
...
1.4.7 InfiniBand Subnet Manager
1.4.8 Diagnostic Utilities
1.4.9 Mellanox Firmware Tools
1.5 Quality of Service

Chapter 2 Installation
2.1 Hardware and Software Requirements
2.1.1 Hardware Requirements
2.1.2 Software Requirements
40. a proper operation of a Subnet Manager (SM) running on the InfiniBand fabric, at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 7, OpenSM Subnet Manager.

Diagnostic Utilities

Mellanox OFED includes the following two diagnostic packages for use by network and data center managers:
• ibutils: Mellanox Technologies diagnostic utilities
• infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools

Mellanox Firmware Tools

The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for:
• Generating a standard or customized Mellanox firmware image
• Querying for firmware information
• Burning a firmware image to a single InfiniBand node

MFT includes the following tools:

• mlxburn: This tool provides the following functions:
  • Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format
  • Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device
  • Querying the firmware version loaded on an HCA board
  • Displaying the VPD (Vital Product Data) of an HCA board

• flint: This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to th
41. [Installer progress output]

    02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
    Link Width: 8x
43. an SRP Target setup file:

    ********** srpt.sh **********
    #!/bin/sh
    modprobe scst scst_threads=1
    modprobe scst_vdisk scst_vdisk_ID=100
    echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
    modprobe ib_srpt
    echo "add mgmt" > /proc/scsi_tgt/trace_level
    echo "add mgmt_dbg" > /proc/scsi_tgt/trace_level
    echo "add out_of_mem" > /proc/scsi_tgt/trace_level
    ********** End srpt.sh **********

B.3 How to Unload/Shutdown

1. Unload ib_srpt:
    > modprobe -r ib_srpt
2. Unload scst and its dev_handlers first:
    > modprobe -r scst_vdisk scst
3. Unload ofed:
    > /etc/rc.d/openibd stop
44. and Ethernet to Fibre Channel gateways (Fibre Channel over Ethernet and Fibre Channel over InfiniBand)
• A single firmware image for dual-port ConnectX / ConnectX-2 adapters that supports independent access to different convergence networks (InfiniBand, Ethernet or Data Center Ethernet) per port
• A unified application programming interface with access to communication protocols including: Networking (TCP, IP, UDP, sockets), Storage (NFS, CIFS, iSCSI, SRP, Fibre Channel, Clustered Storage, and FCoE), Clustering (MPI, DAPL, RDS, sockets), and Management (SNMP, SMI-S)
• Communication protocol acceleration engines including: networking, storage, clustering, virtualization and RDMA with enhanced quality of service
• RDMA over Converged Ethernet (RoCE). The following ULPs can be used over RoCE: uDAPL, SDP, RDS, MPI

1.3 Mellanox OFED Package

1.3.1 ISO Image

Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images, one per supported Linux distribution, that include source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that performs the necessary steps to accomplish the following:
• Discover the currently installed kernel
• Uninstall any InfiniBand stacks that are part of the standard operating system distr
45. bytes to use are:

    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

5.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance

The following changes are recommended for improving IPv6 traffic performance.
• Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0
• Disable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=0

5.2.3 Interrupt Moderation

Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.

    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: on  TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 16
    rx-frames: 88
    rx-usecs-irq: 0
    rx-frames-irq: 0
    ...
    > ethtool -C eth1 adaptive-rx
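Settings applied with "sysctl -w" last only until the next reboot. The following is a minimal sketch of one common way to keep the two TCP settings above: emit them as a fragment that can be appended to /etc/sysctl.conf. The function name tcp_tuning_fragment is hypothetical, and using /etc/sysctl.conf for persistence is an assumption based on standard Linux practice; the manual's own section on preserving settings is not part of this excerpt.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mellanox OFED).
# Prints the two TCP settings from the text in sysctl.conf syntax.
# Persisting them through /etc/sysctl.conf is an assumption, not something
# this excerpt of the manual specifies.
tcp_tuning_fragment() {
    cat <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
EOF
}

# Intended usage (as root):
#   tcp_tuning_fragment >> /etc/sysctl.conf
#   sysctl -p    # reload the settings from /etc/sysctl.conf
```

Keeping the fragment in a function makes it easy to review or diff the intended settings before they are written to a system file.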
46. continue? [y/N]: y
    Removing OFED RPMs...
    Running mkisofs...
    Created /tmp/MLNX_OFED_LINUX-1.5.1-rhel5.4.iso

Installation Script

Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, Installation Procedure.

Usage
    /mnt/mlnxofedinstall [OPTIONS]

Note: If no options are provided to the script, then all available RPMs are installed.

Options
    -c|--config <packages config_file>
        Example of the configuration file can be found under docs
    -n|--net <network config_file>
        Example of the network configuration file can be found under docs
    -p|--print-available
        Print available packages for the current platform and create a corresponding ofed.conf file. The installation script exits after creating ofed.conf.
    --without-32bit
        Skip 32-bit libraries installation
    --without-depcheck
        Skip Distro's libraries check
    --without-fw-update
        Skip firmware update
    --force-fw-update
        Force firmware update
    --force
        Force installation (without querying the user)
    --all
        Install all kernel modules, libibverbs, libibumad, librdmacm, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, MVAPICH, Open MPI, MPI tests, MPI selector, perftest, sdpnetstat and libsdp, srptools, rds-tools, static and dynamic libraries
    --hpc
        Install all kernel modules, libibverbs, libibumad, librdmacm, mft, mstflint, diagnostic tools, Open
47. • Database Cluster traffic
  • RDS
  • Min BW of 30%
• SRP
  • Min BW 30%
  • Bottleneck at storage nodes

Administration
• OpenSM QoS policy file

Note: In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.

    qos-ulps
        default                                      :0
        ipoib, pkey 0x8001                           :1
        ipoib, pkey 0x8002                           :2
        rds                                          :3
        srp, target-port-guid SRPT1,SRPT2,SRPT3      :4
    end-qos-ulps

• OpenSM options file

    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:96,3:96,4:96
    qos_vlarb_low 0:1
    qos_sl2vl ...

• Partition configuration file

    Default=0x7fff, ipoib : ALL=full;
    PartA=0x8001, sl=1, ipoib : ALL=full;

7.8 Adaptive Routing

7.8.1 Overview

Note: Adaptive Routing is at beta stage.

Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
• Free AR: No constraints on output port selection.
• Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.

7.8.2 Running OpenSM With AR Manager

To enable AR Manager in OpenSM, run:
    opensm ar

AR Manager scans all the fabric switches, determines which switches support AR, and configures the AR functionality on these switches. Note that if some switches do not support AR, they will slow down the AR Ma
48. following command:

    lsmod | grep mlx4_en

If the module is loaded, mlx4_en should be displayed as shown in the example below:

    # lsmod | grep mlx4_en
    mlx4_en 75276 0

• Run ibv_devinfo. There is a new field named link_layer which can be either Ethernet or IB. If the value is IB, then you need to use connectx_port_config to change the ConnectX / ConnectX-2 port's designation to eth (see mlx4_release_notes.txt for details).
• Configure the IP address of the interface so that the link will become active.
• All IB verbs applications which run over IB verbs should work on RoCE links as long as they use GRH headers (that is, as long as they specify use of GRH in their address vector).

3.1.5 Ported Applications

The following applications are ported with RoCE:
• ibv_*_pingpong examples are ported. The user must specify the GID of the remote peer using the new -g option. The GID has the same format as that in /sys/class/infiniband/mlx4_0/ports/1/gids/0.
Note: Care should be taken when using ibv_ud_pingpong. The default message size is 2K, which is likely to exceed the MTU of the RoCE link. Use ibv_devinfo to inspect the link MTU and specify an appropriate message size.
• All rdma_cm applications should work seamlessly without any change.
• libsdp works without any change.
• Performance tests GI
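The link_layer check described above can be scripted by filtering the ibv_devinfo output. The following is a minimal sketch; the function name link_layers is hypothetical, and it assumes each port's value is printed on a line of the form "link_layer: <value>", with one such line per port.

```shell
#!/bin/sh
# Hypothetical helper (not part of libibverbs).
# Reads ibv_devinfo output on stdin and prints the link_layer value of each
# port, one per line. The "link_layer: <value>" line format is an assumption.
link_layers() {
    awk '$1 == "link_layer:" { print $2 }'
}

# Intended usage:
#   ibv_devinfo | link_layers
# A value of IB on a port meant for RoCE indicates that the port type still
# has to be changed to eth with connectx_port_config.
```

Filtering on the first field rather than using a plain text search avoids matching other lines that merely mention the field name.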
49. for Lustre MDS
    end-qos-ulps

• OpenSM options file

    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 2:1
    qos_vlarb_low 0:96,1:224
    qos_sl2vl ...

7.7.2 EDC SOA (2-tier): IPoIB and SRP

The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.

QoS Levels
• Application traffic
  • IPoIB (UD and CM) and SDP
  • Isolated from storage
  • Min BW of 50%
• SRP
  • Min BW 50%
  • Bottleneck at storage nodes

Administration
• OpenSM QoS policy file

Note: In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.

    qos-ulps
        default                                      :0
        ipoib                                        :1
        sdp                                          :1
        srp, target-port-guid SRPT1,SRPT2,SRPT3      :2
    end-qos-ulps

• OpenSM options file

    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:32
    qos_vlarb_low 0:1
    qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

7.7.3 EDC (3-tier): IPoIB, RDS, SRP

The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.

QoS Levels
• Management traffic (ssh)
  • IPoIB management VLAN (partition A)
  • Min BW 10%
• Application traffic
  • IPoIB application VLAN (partition B)
  • Isolated from storage and database
  • Min BW of 30%
50. > [<portnum>]
    switchinfo <addr>
    pkeys <addr> [<portnum>]
    sl2vl <addr> [<portnum>]
    vlarb <addr> [<portnum>]
    guids <addr>

<dest dr_path | lid | guid> | Optional | Destination's directed path, LID, or GUID.

Examples

1. Query PortInfo by LID, with port modifier:

[Command line and PortInfo output garbled in extraction. Legible values: Mkey 0x0000000000000000; GidPrefix 0xfe80000000000000; capability flags IsSLMappingSupported, IsSystemImageGUIDsupported, IsCommunicatonManagementSupported, IsVendorClassSupported, IsCapabilityMaskNoticeSupported, IsClientRegistrationSupported; link width 1X or 4X (supported) and 4X; link state Active; physical state LinkUp; MTU 2048.]
51. iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:

    Lun 0 Path=/dev/sda5,Type=fileio

Tip: The following is an example of an iSCSI Target iqn line:

    Target iqn.2007-08.7.3.4.10:iscsiboot

Step 4: Start your iSCSI Target. Example:

    host1# /etc/init.d/iscsitarget start

Configuring the DHCP Server to Boot From an iSCSI Target

Configure DHCP as described in Section 9.3.3.1, "IPoIB Configuration Based on DHCP".

Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target:

    filename "";
    option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn";

The following is an example for configuring an IB/ETH device to boot from an iSCSI target:

    host host1 {
        filename "";
        # For a ConnectX device with ports configured as InfiniBand, comment out
        # the following line
        # option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
        # For a ConnectX device with ports configured as Ethernet, comment out
        # the following line
        # hardware ethernet 00:02:c9:00:00:bb;
        # For an InfiniHost III Ex, comment out the follo
52. ibdev2netdev

    [first entry illegible in source]
    mlx4_0 port 1 ==> ib0 (Down)
    mlx4_0 port 2 ==> ib1 (Down)
    mlx4_1 port 1 ==> eth2 (Down)
    mlx4_1 port 2 ==> eth3 (Down)

ibstatus

Applicable Hardware
All InfiniBand devices.

Description
Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.

Synopsis

    ibstatus [-h] [<device name>[:<port>]]

Table 15 lists the various flags of the command.

Table 15 - ibstatus Flags and Options

Flag | Optional / Mandatory | Default (If Not Specified) | Description
<device> | Optional | All devices | Print information for the specified device. May specify more than one device.
<port> | Optional, but requires specifying a device name | All ports of the specified device | Print information for the specified port only (of the specified device).

Examples

1. List the status of all available InfiniBand devices and their ports:

[Command output garbled in extraction. Legible fields: default gid, base lid, sm lid, state 4: ACTIVE, phys state, rate 20 Gb/sec (4X DDR).]
53. ibdiagnet2

    -h|--help       Print this help message.
    -V|--version    Print the version of the tool.

8.3.2 Output Files

Table 11 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.

Table 11 - ibdiagnet (of ibutils2) Output Files

An ibdiagnet run performs the following stages:
• Fabric discovery
• Duplicated GUIDs detection
• Links in INIT state and unresponsive links detection
• Counters fetch
• Error counters check
• Routing checks
• Link width and speed checks

8.3.3 Return Codes

    0 - Success
    1 - Failure (with description)

8.4 ibdiagnet (of ibutils) - IB Net Diagnostic

Note: This version of ibdiagnet is included in the ibutils package, and it is run by default after installing Mellanox OFED. To use this ibdiagnet version, run: ibdiagnet

ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).

8.4.1 SYNOPSYS

    ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>]
              [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt]
              [-pm] [-pc] [-P <<PM>=<Value>>] [-lw <1x|4x|12x>]
              [-ls <2.5|5|10>] [-skip <ibdiag_c
54. initrd_ib/lib/modules/ib
    infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib

IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:

    host1# cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko \
           /tmp/initrd_ib/lib/modules

Step 6: To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:

    host1# cp /sbin/insmod /tmp/initrd_ib/sbin/

Step 7: If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.

    host1# cp /sbin/ifconfig /tmp/initrd_ib/sbin

Step 8: If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step.

To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with. Copy the DHCP client v3.1.3 file and all the relevant files as described below:

    host1# cp <path to DHCP client v3.1.3>/dhclient /tmp/initrd_ib/sbin
    host1# cp <path to DHCP client v3.1.3>/dhclient-script /tmp/initrd_ib/sbin
    host1# mkdir -p /tmp/initrd_ib/var/state/dhcp
    host1# touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
    host1# cp /bin/uname /tmp/initrd_ib/bin
    host1# cp /usr/bin/expr /tmp/initrd_ib/bin
    host1# cp /sbin/ifconfig /tmp/initrd_ib/bin
    host1# c
55. is 2 (link local).

Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).

PortGUIDs list:
• PortGUID: GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
• full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.

There are two useful keywords for PortGUID definition:
• 'ALL' means all end ports in this subnet.
• 'SELF' means the subnet manager's port.

An empty list means that there are no ports in this partition.

Notes:
• White space is permitted between delimiters ('=', ',', ':', ';').
• The line can be wrapped after ':' after a Partition Definition and between.
• A PartitionName does not need to be unique, but PKey does need to be unique. If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also next note).
• It is possible to split a partition configuration in more than one definition, but then the PKey should be explicitly specified (otherwise, different PKey values will be generated for those definitions).

Examples:

    Default=0x7fff : ALL, SELF=full ;
    NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af23
56. matching criteria besides Service ID.

7.6.3 Simple QoS Policy Definition

A simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion: its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and a QoS Level has only one constraint, the Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below.

7.6.4 Policy File Syntax Guidelines

• Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
• Comments are started with the pound sign (#) and terminated by EOL.
• Any keyword should be the first non-blank in the line, unless it is a comment.
• Keywords that denote section/subsection start have matching closing keywords.
• Having a QoS Level named "DEFAULT" is a must. It is applied to PR/MPR requests that did not match any of the matching rules.
• Any section/subsection of the policy file is optional.

7.6.5 Examples of Advanced Policy File

As mentioned earlier, any section of the policy file is optio
57. option defines maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.

--erase_log_file
    This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.

--Pconfig, -P <partition config file>
    This option defines the optional partition configuration file. The default name is /etc/opensm/partitions.conf.

--no_part_enforce, -N
    This option disables partition enforcement on switch external ports.

[option name illegible in source]
    This option enables Adaptive Routing Manager in OpenSM.

--ar_config_file <path to file>
    This option specifies the optional Adaptive Routing config file. The default name is /etc/opensm/osm-ar.conf.

--qos, -Q
    This option enables QoS setup.

--qos_policy_file, -Y <QoS policy file>
    This option defines the optional QoS policy file. The default name is /etc/opensm/qos-policy.conf.

--stay_on_fatal, -y
    This option will cause the SM not to exit on fatal initialization issues: if the SM discovers duplicated GUIDs or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.

--daemon, -B
    Run in daemon mode. OpenSM will run in the background.

--inactive, -I
    Start the SM in inactive rather than normal init SM state.

--prefix_routes_file <path to file>
    This option specifies the
58. off rx-usecs 0 rx-frames 0

    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: off TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 0
    rx-frames: 0
    rx-usecs-irq: 0
    rx-frames-irq: 0

5.2.4 Interrupt Affinity

The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.

The following command turns off the IRQ balancer:

    > /etc/init.d/irqbalancer stop

The following command assigns the affinity of a single interrupt vector:

    > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity

where bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

5.2.4.1 Example Script for Setting Interrupt Affinity

Note: On systems that support NUMA, it is recommended to set IRQs from different network devices to processor cores that reside on different physical CPU sockets.

    #! /bin/bash
    CORES=$((`cat /proc/cpuinfo | grep processor | tail -1 | awk '{print $3}'`+1))
    while
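The bit-mask rule from Section 5.2.4 (bit i set means core i services the interrupt) can be illustrated with a minimal Python sketch. The function names and the IRQ vector number 42 are hypothetical and chosen only for illustration; the sketch also assumes at most 32 cores, so that the mask fits in a single hexadecimal group.

```python
def affinity_mask(cores):
    """Return the hex mask string in which bit i is set for every core i in `cores`."""
    mask = 0
    for core in cores:
        # Assumption of this sketch: only core indexes 0..31 are handled.
        if not 0 <= core < 32:
            raise ValueError("this sketch handles core indexes 0..31 only")
        mask |= 1 << core
    return format(mask, "x")


def smp_affinity_path(irq_vector):
    """Return the procfs file that holds the affinity mask of `irq_vector`."""
    return "/proc/irq/{}/smp_affinity".format(irq_vector)


if __name__ == "__main__":
    # Hypothetical IRQ vector 42, serviced by cores 1 and 3.
    print("echo {} > {}".format(affinity_mask([1, 3]), smp_affinity_path(42)))
```

The sketch only prints the echo command; writing the mask still has to be done as root, with the IRQ balancer stopped as described above.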
59. prefix routes file. Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. Default file is /etc/opensm/prefix-routes.conf.

--consolidate_ipv6_snm_req
    Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.

--log_prefix <prefix text>
    Prefix to syslog messages from OpenSM.

--verbose, -v
    This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.

-V
    This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.

-D <flags>
    This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:

    BIT    LOG LEVEL ENABLED
    0x01   ERROR (error messages)
    0x02   INFO (basic messages, low volume)
    0x04   VERBOSE (interesting stuff, moderate volume)
    0x08   DEBUG (diagnostic, high volume)
    0x10   FUNCS (function entry/exit, very high volume)
    0x20   FRAMES (dumps all SMP and GMP frames)
    0x40   ROUTING (dump FDB routing information)
    0x80   currently unused

    Without -D, OpenSM defaults to ERROR + IN
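The -D flags value is a plain bit field, so a value can be composed or decoded mechanically from the table above. The following Python sketch shows the arithmetic; the helper names are hypothetical and are not part of OpenSM.

```python
# Bit values and level names as listed in the -D flags table.
LOG_LEVELS = [
    (0x01, "ERROR"),
    (0x02, "INFO"),
    (0x04, "VERBOSE"),
    (0x08, "DEBUG"),
    (0x10, "FUNCS"),
    (0x20, "FRAMES"),
    (0x40, "ROUTING"),
]


def decode_log_flags(flags):
    """Return the names of the log levels enabled by the bit field `flags`."""
    return [name for bit, name in LOG_LEVELS if flags & bit]


def encode_log_flags(names):
    """Return the bit field that enables exactly the log levels in `names`."""
    lookup = {name: bit for bit, name in LOG_LEVELS}
    flags = 0
    for name in names:
        flags |= lookup[name]
    return flags


if __name__ == "__main__":
    print(decode_log_flags(0x03))
    print(hex(encode_log_flags(["ERROR", "INFO", "DEBUG"])))
```

For instance, enabling ERROR, INFO and DEBUG corresponds to 0x01 | 0x02 | 0x08, which would be passed as -D 0x0b.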
60. produces the following files in the output directory (which is defined by the -o option described below).

8.3.1 SYNOPSYS

    ibdiagnet [-i <dev-name>] [-p <port-num>] [-pm] [-pc]
              [-P <<PM>=<Value>>] [-r] [-u] [-lw <1x|4x|8x|12x>]
              [-ls <2.5|5|10>] [-skip <ibdiag-stage>] [-o <out-dir>]
              [-h] [-V]

OPTIONS

-i|--device <dev-name>
    Specify the name of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).

-p|--port <port-num>
    Specify the local device's port number used to connect to the IB fabric.

-pm
    Dump all pmCounters values into ibdiagnet.pm.

-pc
    Reset all the fabric links pmCounters.

-P|--counter <<PM>=<Value>>
    Print any provided pm that is greater than its provided value.

-r|--routing
    Provide a report of the fabric qualities.

-u|--fat_tree
    Indicate that UpDown credit loop checking should be done against automatically determined roots.

-lw <1x|4x|8x|12x>
    Specify the expected link width.

-ls <2.5|5|10>
    Specify the expected link speed.

-skip <ibdiag-check>
    Skip the execution of the given stage. Applicable to the following stages: dup_guids | lids | links | sm | nodes_info | all (default: None).

-o|--output_path <out-dir>
    Specify the directory where the output files will be placed.

--screen_num_errs
    Specify the threshold for printing errors to screen (default: 5).

    Placed (default: /var/tmp
61. properly.

A.9 iSCSI Boot

Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd. There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.

Note: Linux distributions such as SuSE Linux Enterprise Server 10 SPx and Red Hat Enterprise Linux 5.1 (or above) can be directly installed on an iSCSI target. At the end of this direct installation, initrd is capable of continuing to load other parts of the OS on the iSCSI target. (Other distributions may also be suitable for direct installation on iSCSI targets.)

If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver as described in Section A.7.

Configuring an iSCSI Target in Linux Environment

Prerequisites

Step 1: Make sure that an iSCSI Target is installed on your server side.

Tip: You can download and install an iSCSI Target from the following location: http://sourceforge.net/projects/iscsitarget/files/iscsitarget/

Step 2: Dedicate a partition on your iSCSI Target on which you will later install the operating system.

Step 3: Configure your
62. recommended that both the FCoE switch and the mlx4_en driver be configured to use link pause (regular flow control). Otherwise, any FCoE packet drop will trigger SCSI errors and timeouts.

3.3.2.1 FCoE Configuration

After installation, please edit the file /etc/mlxfc/mlxfc.conf and set the following variables:

• FC_SPEC: set to "T11" or "pre-T11", as supported by your FCoE switch.
  Note: Only the pre-T11 format is offloaded in hardware.
• DCBX_IFS: provide a space-separated list of Ethernet devices to monitor, using the DCBX protocol, for FCoE feature availability. vHBAs are automatically created on these interfaces if the FCoE switch is configured for automatic FCoE negotiation.
• MTU: if the MTU of the Ethernet device is changed from the default (1500), put the correct value here.

Configure the mlx4_en Ethernet driver to support PFC. Add the following line to the file /etc/modprobe.conf and restart the network driver:

    options mlx4_en pfctx=0xff pfcrx=0xff

3.3.2.2 Starting FCoE Service

Make sure the network is up (modprobe mlx4_en). Then run:

    > /etc/init.d/mlxfc start

vHBAs will be instantiated on DCBX-monitored interfaces, and SCSI LUNs will get mapped. For manual instantiation of vHBAs, please see Section 3.3.3.1, "Manual vHBA Control".

3.3.2.3 Stopping FCoE Service

Run:

    > /etc/init.d/mlxfc stop
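As an illustration of the PFC line used in Section 3.3.2.1, the following Python sketch builds the options line from a set of priorities. It assumes that pfctx and pfcrx are 8-bit masks with one bit per priority (0 to 7), which is why 0xff enables PFC on all priorities; the helper names are hypothetical.

```python
def pfc_mask(priorities):
    """Return the 8-bit mask with bit p set for each priority p (0..7)."""
    mask = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError("priority must be in the range 0..7")
        mask |= 1 << prio
    return mask


def modprobe_options(tx_priorities, rx_priorities):
    """Return the mlx4_en options line for the given TX and RX priority sets."""
    return "options mlx4_en pfctx={:#04x} pfcrx={:#04x}".format(
        pfc_mask(tx_priorities), pfc_mask(rx_priorities)
    )


if __name__ == "__main__":
    # All eight priorities, which reproduces the line shown in Section 3.3.2.1.
    print(modprobe_options(range(8), range(8)))
```

Restricting PFC to a single priority, for example priority 3, would give a mask of 0x08 under the same assumption.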
63. Each vHub belongs to a specific gateway (BridgeX + eport), and each gateway has one default vHub and zero or more VLAN-associated vHubs. A specific gateway can have multiple vHubs distinguishable by their unique VLAN ID. Traffic coming from the Ethernet side on a specific eport will be routed to the relevant vHub group based on its VLAN tag (or to the default vHub for that GW if no VLAN ID is present).

3.7.1.3 Virtual NIC (vNic)

A virtual NIC is a network interface instance on the host side which belongs to a single vHub on a specific GW. The vNic behaves similarly to any regular hardware network interface. The host can have multiple interfaces that belong to the same vHub.

3.7.2 EoIB Configuration

The mlx4_vnic module supports two different modes of configuration, which are passed to the host mlx4_vnic driver using the EoIB protocol:

• host administration, where the vNic is configured on the host side
• network administration, where the configuration is done by the BridgeX

Both modes of operation require the presence of a BridgeX gateway in order to work properly. The EoIB driver supports a mixture of host and network administered vNics.

3.7.2.1 EoIB Host Administered vNic

In the host administered mode, vNics are configured using static configuration files located on the host side. These configuration files define the number of vNics and the vHub that each host administered vNic will belong to (i.e., the vN
64. so use it carefully.

--ucast_cache, -A
    This option enables the unicast routing cache to prevent routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. in case of host reboot. This option becomes very handy when the cluster size is thousands of nodes.

--lid_matrix_file, -M <file name>
    This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.

--lfts_file, -U <file name>
    This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.

--sadb_file, -S <file name>
    This option specifies the name of the SA DB dump file from where the SA database will be loaded.

--root_guid_file, -a <path to file>
    Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--cn_guid_file, -u <path to file>
    Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--io_guid_file, -G <path to file>
    Set the I/O nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--max_reverse_hops, -H <hop count>
    Set
65. specific functions and plugs into the InfiniBand mid-layer.

mlx4_en
A 10GigE driver under drivers/net/mlx4 that handles Ethernet specific functions and plugs into the netdev mid-layer.

mlx4_fc
Handles the FCoE functions using ConnectX/ConnectX-2 Fibre Channel hardware offloads.

1.4.3 Mid-layer Core

Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.

1.4.4 Open-FCoE

The FCoE feature is based on and interacts with the Open-FCoE project. Mellanox OFED includes the following open-fcoe.org modules: libfc and fcoe. See Section 3.3, "Fibre Channel over Ethernet".

1.4.5 ULPs

IPoIB
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB prepends an encapsulation header to the IP datagrams and sends the outcome over the InfiniBand transport service. The transport service is Reliable Connected (RC) by default, but it may also be configured to be Unreliable Datagram (UD). The interface supports unicast, multicast and broadcast. For details, see Chapter 3.8, "IP over InfiniBand".

RoCE
RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over Eth
66. stuff, moderate volume)
    0x08   DEBUG (diagnostic, high volume)
    0x10   FUNCS (function entry/exit, very high volume)
    0x20   FRAMES (dumps all SMP and GMP frames)
    0x40   ROUTING (dump FDB routing information)
    0x80   currently unused

    Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

-h, --help
    Display this usage info, then exit.

7.3.2 Running osmtest

To run osmtest in the default mode, simply enter:

    host1# osmtest

The default mode runs all the flows except for the Quality of Service flow (see Section 7.6).

After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the following command in order to generate the inventory file:

    host1# osmtest -f c

Immediately afterwards, run the following command to test opensm:

    host1# osmtest -f a

Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.

7.4 Partitions

OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, yo
67. such as node info, node description, switch info, and port info.

Synopsys

    smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V]
             [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>]
             [--node-name-map <node-name-map>]
             <op> <dest dr_path|lid|guid> [op params]

Table 18 lists the various flags of the command.

Table 18 - smpquery Flags and Options

Flag | Optional / Mandatory | Default (If Not Specified) | Description
-d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-e | Optional | | Show send and receive errors (timeouts and others).
-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...).
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'.
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec].
<op> | Mandatory | | Supported operations:
    nodeinfo <addr>
    nodedesc <addr>
    portinfo <addr
68. that the byte limit is unbounded.

Note: If the 255 value is used, the low priority VLs may be starved.

A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.

Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64.

Below is an example of SL2VL and VL Arbitration configuration on a subnet:

    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.

Deployment Example

Figure 4 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service le
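The arithmetic behind the VL arbitration example can be sketched in Python as follows. The sketch takes the two unit sizes from the text above as assumptions: one weight credit is 64 bytes (a 4KB packet needs 64 credits), and one unit of the high limit is 4KB (a limit of 6 gives 24KB). The function names are hypothetical.

```python
CREDIT_BYTES = 64             # one arbitration weight unit (4096 / 64 = 64 credits)
HIGH_LIMIT_UNIT_BYTES = 4096  # one unit of the high limit value


def credits_per_packet(mtu_bytes):
    """Return the number of credits one MTU-sized packet consumes (rounded up)."""
    return -(-mtu_bytes // CREDIT_BYTES)


def high_limit_bytes(high_limit):
    """Return the high-priority burst limit in bytes, or None when unbounded (255)."""
    if high_limit == 255:
        return None
    return high_limit * HIGH_LIMIT_UNIT_BYTES


def parse_vlarb(table):
    """Parse a 'VL:weight,VL:weight' string into a {vl: weight} dictionary."""
    entries = {}
    for item in table.split(","):
        vl, weight = item.split(":")
        entries[int(vl)] = int(weight)
    return entries


def packets_per_turn(table, mtu_bytes):
    """Return how many MTU-sized packets each VL may send per arbitration turn."""
    cost = credits_per_packet(mtu_bytes)
    return {vl: weight // cost for vl, weight in parse_vlarb(table).items()}


if __name__ == "__main__":
    low = "0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64"
    print(credits_per_packet(4096))
    print(high_limit_bytes(6))
    print(packets_per_turn(low, 4096))
```

With the low-priority table from the example and a 4KB MTU, VL3 (weight 192) gets three packets per turn while VL4 (weight 0) gets none, which matches the statement that VL4 is effectively turned off.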
69. the link_layer parameter of each port. In this case, port 1 is IB and port 2 is Ethernet. Nevertheless, port 2 appears in the list of the HCA's ports. You can also run the following commands to obtain the link layer of the two ports:

    cat /sys/class/infiniband/mlx4_0/ports/1/link_layer
    InfiniBand
    cat /sys/class/infiniband/mlx4_0/ports/2/link_layer
    Ethernet

3. The firmware version is 2.7.700 (appears at the top). You can also run the following command to obtain the firmware version:

    cat /sys/class/infiniband/mlx4_0/fw_ver
    2.7.700

4. The IB over Ethernet's Port MTU is 2K byte at maximum; however, the actual MTU cannot exceed the mlx4_en interface's MTU. Since the mlx4_en interface's MTU is 1560, port 2 will run with an MTU of 1K. Please note that RoCE's MTU values are subject to IB MTU restrictions. The RoCE MTU values are 256 byte, 512 byte, 1024 byte and 2K.

Association of IB Ports to Ethernet Ports

It is useful to know how IB ports associate to network ports:

    ibdev2netdev
    mlx4_0 port 2 <===> eth2
    mlx4_0 port 1 <===> ib0

Since both RoCE and mlx4_en use the Ethernet port of the adapter, one of the drivers must carry the task of controlling the port state. In this implementation, it is the task of the mlx4_en driver. The mlx4_ib driver holds a reference to the mlx4_en net device for getting notifications about the state of the port, as well as using the mlx4_en driver to resolve IP addresses to MAC that are requir
70. tree that torus-2QoS will construct, where "x" is the root switch and each "+" is a non-root switch.

[Diagram: master spanning tree on a torus, rooted at switch x; axis labels and node layout garbled in extraction.]

For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop.

In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.

Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.

In the presence of link or switch failures that result in a fabr
71. with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)

    a. modprobe scst
    b. modprobe scst_vdisk
    c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    f. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    g. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    h. echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices

Example 2: working with scst_vdisk FILEIO mode (using the md0 device and the file 10G-file)

    a. modprobe scst
    b. modprobe scst_vdisk
    c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
    d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
    e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices

2. Run:

    For all distributions except SLES 11:
    > modprobe ib_srpt

    For SLES 11:
    > modprobe -f ib_srpt

Note: For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:

    ib_srpt: no symbol version for scst_unregister
    ib_srpt: Unknown symbol scst_unregister
    ib_srpt: no symbol version for scst_register
    ib_srpt: Unknown symbol scst_register
    ib_srpt: no symbol version for scst_unregister_target_template
    ib_srpt: Unknown symbol scst_unregister_target_template
72. 0:02:c9:02:00:23:13:92

Step 4: The resulting client identifier is the concatenation, from left to right, of 20, the QP_Number, the subnet prefix, and the Port GUID. In the example above, this yields the following DHCP client identifier:

    20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92

Extracting the Client Identifier: Method II

An alternative method for obtaining the 20 bytes of QP Number and GID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 20 bytes can be captured from the boot session as shown in the figure below.

[Figure: FlexBoot boot screen ("Starting Mellanox Boot over IB for InfiniHost III Ex ver 1.0.0", "Loading via IB Port 2", "Waiting for Infiniband link-up...ok", gPXE 0.9.3 banner). The captured 20 bytes appear on the net0 line: 00550401 fe800000 00000000 0002c902 00231392.]

Concatenate the byte '20' to the left of the captured 20 bytes, then separate every byte (two hexadecimal digits) with a colon. You should obtain the same result shown in Step 4 above.

Placing Client Identifiers in /etc/dhcpd.conf

The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client
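The Method II procedure (prefix the captured 20 bytes with the byte 20, then separate every byte with a colon) can be sketched in Python as follows. The function name is hypothetical; the sample input is the 20-byte capture shown in the boot session above.

```python
def dhcp_client_identifier(captured_words):
    """Build the colon-separated client identifier from the captured hex words.

    `captured_words` holds the 20 captured bytes (QP number followed by the GID)
    as hexadecimal strings, for example the five 8-digit words of the boot screen.
    """
    data = bytes.fromhex("".join(captured_words))
    if len(data) != 20:
        raise ValueError("expected 20 bytes of QP number and GID")
    # Prefix the byte 0x20, then print every byte as two hexadecimal digits.
    return ":".join("{:02x}".format(b) for b in bytes([0x20]) + data)


if __name__ == "__main__":
    words = ["00550401", "fe800000", "00000000", "0002c902", "00231392"]
    print(dhcp_client_identifier(words))
```

The resulting string has 21 colon-separated bytes and is the value placed after dhcp-client-identifier in /etc/dhcpd.conf.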
73. Notes:

1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRPHA_ENABLE to yes).

3.6.2.6 High Availability (HA)

Overview

High Availability works using the Device-Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices.

Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.

Operation

When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another pa
74. 8.16 ibv_asyncwatch

Applicable Hardware: All InfiniBand devices

Description: Displays asynchronous events forwarded to userspace for an InfiniBand device.

Synopsis: ibv_asyncwatch

Examples:
1. Display asynchronous events:
    > ibv_asyncwatch

8.17 ibdump

Applicable Hardware: Mellanox ConnectX / ConnectX-2 adapter devices

Description: Dumps InfiniBand traffic that flows to and from the InfiniBand ports of Mellanox Technologies ConnectX / ConnectX-2 adapters. The dump file can be loaded by the Wireshark tool for graphical traffic analysis. The following describes a workflow for local HCA (adapter) sniffing:
• Run ibdump with the desired options.
• Run the application whose traffic you wish to analyze.
• Stop ibdump (CTRL-C), or wait for the data buffer to fill (in mem mode).
• Open Wireshark and load the generated file.

How to Get Wireshark: Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details.

Note: Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system.

Synopsis: ibdump [options]

Table 23 lists the var
75. 095

    YetAnotherOne = 0x300 : SELF=full ;
    YetAnotherOne = 0x300 : ALL=limited ;
    ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
    # 0x123453, 0x123454 will be limited
    ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
    # 0x123456, 0x123457 will be limited
    ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
    ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
    ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

Note: The following rule is equivalent to how OpenSM used to run prior to the partition manager:

    Default=0x7fff,ipoib:ALL=full;

7.5 Routing Algorithms

OpenSM offers six routing engines:

1. Min Hop Algorithm: Based on the minimum hops to each node, where the path length is optimized.
2. UPDN Algorithm: Based on the minimum hops to each node, but constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree and a deadlock may occur due to a loop in the subnet.
3. Fat-tree Routing Algorithm: This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules
76. Appendix C: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:

    options mlx4_core parameter=<value>
and/or
    options mlx4_ib parameter=<value>
and/or
    options mlx4_en parameter=<value>
and/or
    options mlx4_fc parameter=<value>

The following sections list the available mlx4 parameters.

C.1 mlx4_core Parameters

    set_4k_mtu:        Attempt to set 4K MTU to all ConnectX ports (int)
    debug_level:       Enable debug tracing if > 0 (default 0)
    block_loopback:    Block multicast loopback packets if > 0 (default 1)
    msi_x:             Attempt to use MSI-X if nonzero (default 1)
    log_num_mac:       log maximum number of MACs per ETH port (1-7) (int)
    use_prio:          Enable steering by VLAN priority on ETH ports (0/1, default 0) (bool)
    log_num_qp:        log maximum number of QPs per HCA (default is 17; max is 20)
    log_num_srq:       log maximum number of SRQs per HCA (default is 16; max is 20)
    log_rdmarc_per_qp: log number of RDMARC buffers per QP (default is 4; max is 7)
    log_num_cq:        log maximum number of CQs per HCA (default is 16; max is 19)
    log_num_mcg:       log maximum number of multicast groups per HCA (default is 13; max is 21)
    log_num_mpt:       log maximum number of memory protection table entries per HCA (default is 17; max is 20)
    log_num_mtt:       log maximum number of memory translation table segments per HCA (default is 20; max is 20)
    log_mtts_per_seg:  log number of MTT entries per segment (1-5) (int)
    enab
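Most of the mlx4_core parameters above are named log_*, which suggests they hold a base-2 logarithm of the resource count; under that assumption, the default log_num_qp of 17 corresponds to 2^17 QPs. The following Python sketch (helper names are hypothetical) converts such a value to a count and formats an "options" line of the kind shown at the start of this appendix.

```python
def resource_count(log_value: int) -> int:
    """Number of resources for a log_* parameter, assuming base-2 logarithm."""
    if log_value < 0:
        raise ValueError("log value must be non-negative")
    return 1 << log_value


def options_line(module: str, **params) -> str:
    """Format a modprobe 'options' line, e.g. for /etc/modprobe.conf."""
    if not params:
        raise ValueError("at least one parameter is required")
    settings = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options {module} {settings}"


if __name__ == "__main__":
    print(resource_count(17))                                  # default log_num_qp
    print(options_line("mlx4_core", log_num_qp=18, msi_x=1))
```

Generating the line programmatically is mainly useful when the same settings must be written to many hosts; for a single machine, editing the file by hand is just as good.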
77. 3.7.4.2 vNic Interface Naming

The mlx4_vnic driver enables the kernel to determine the name of the registered vNic. By default, the Linux kernel assigns each vNic interface the name eth<N>, where <N> is an incremental number that keeps the interface name unique in the system. The vNic interface name may not remain consistent among hosts or BridgeX reboots, as the vNic creation can happen in a different order each time. Therefore, the interface name may change because of a "first-come-first-served" kernel policy. In automatic network administered mode, the vNic MAC address may also change, which makes it difficult to keep the interface configuration persistent.

To control the interface name, you can use standard Linux utilities such as IFRENAME(8), IP(8) or UDEV(7). For example, to change the name of interface eth2 to eth.bx01.a10, run:

    ifrename -i eth2 -n eth.bx01.a10

To generate a unique vNic interface name, use the mlx4_vnic_info script with the -u flag. The script will generate a new name based on the scheme:

    eth<pci-id>.<ib-port-num>.<gw_port_id>.[vlan-id]

For example, if vNic eth2 resides on an InfiniBand card on PCI bus ID 0a:00.0, port 1, and is connected to GW port ID 3 without VLAN, its unique name will be:

    mlx4_vnic_info -u eth2
    eth2 eth10.1.3

You can add your own custom udev rule to use the output of the script and to rename the vNic interfaces automatically. To
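The unique-name scheme produced by the mlx4 vnic info script can be mimicked with a short Python sketch. It is not the script's actual implementation: the function name is hypothetical, and two details are assumptions inferred from the single example in the text (PCI bus 0a becomes 10, so the bus number is taken as hexadecimal and printed in decimal; the components are joined with dots).

```python
from typing import Optional


def unique_vnic_name(pci_bus_id: str, ib_port: int, gw_port_id: int,
                     vlan_id: Optional[int] = None) -> str:
    """Sketch of eth<pci-id>.<ib-port-num>.<gw_port_id>[.<vlan-id>].

    Assumption: <pci-id> is the PCI bus number printed in decimal.
    """
    fields = pci_bus_id.split(":")
    if len(fields) < 2:
        raise ValueError("expected a PCI address such as '0a:00.0'")
    bus = int(fields[-2], 16)            # "0a:00.0" or "0000:0a:00.0" -> 0x0a
    parts = [f"eth{bus}", str(ib_port), str(gw_port_id)]
    if vlan_id is not None:              # the VLAN component is optional
        parts.append(str(vlan_id))
    return ".".join(parts)


if __name__ == "__main__":
    # The example from the text: bus 0a:00.0, port 1, GW port 3, no VLAN.
    print(unique_vnic_name("0a:00.0", 1, 3))
```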
78. (Figure: FlexBoot 3.0.000 boot banner, based on gPXE (http://etherboot.org), showing the "Press Ctrl-B to configure" prompt.)

Alternatively, you may skip invoking the CLI right after POST and invoke it, instead, right after FlexBoot starts booting. Once the CLI is invoked, you will see the following prompt:

    gPXE>

H.17.11 Operation

The CLI resembles a Linux shell, where the user can run commands to configure and manage one or more PXE port network interfaces. Each port is assigned a network interface called net<i>, where i is 0, 1, 2, ..., <# of interfaces>. Some commands are general and are applied to all network interfaces. Other commands are port specific; therefore, the relevant network interface is specified in the command.

H.17.12 Command Reference

H.17.12.1 ifstat

Displays the available network interfaces (in a similar manner to Linux's ifconfig).

    gPXE> ifstat
    net0: 00:02:c9:00:00:00:aa:bc on PCI02:00.0 (closed)
      [Link:down, TX:0 TXE:0 RX:0 RXE:0]
      [Link status: Unknown]
    net1: 00 02 c9 00 124 35 on PCI02:00.0 (closed)
      [Link:down, TX:0 TXE:0 RX:0 RXE:0]
      [Link status: Unknown (0x1a066001)]
    gPXE>

H.17.12.2 ifopen

Opens the network interface net<x>. The list of network interfaces is available via the ifstat command. Examp
79. 3wNCUg6J2X3G uiuSWXeu bZmbXcMrP w4IWByfH8a jwo6A5W10NbFZElbYeeNfPZf4UNcgMOAMWNp64sL58tkt32F RGmyLX0Q0NWZL27Synsn6dHpxMqgBorXNC0ZBe4 kTnUqm63nQ2ziqVMdL9FrCmalxlIOu9 SQJAjwONevaMzFKEHe7YHg oYrNfXunfdbEurzB524TpPcrodZlfCQ <username>@host1

Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine:

    host1$ cat id_rsa.pub | xargs ssh host2 "echo >>/home/<username>/.ssh/authorized_keys2"
    <username>@host2's password:   (enter password)
    host1$

For a local machine, simply add the key to authorized_keys2:

    host1$ cat id_rsa.pub >> authorized_keys2

Step 5. Test:

    host1$ ssh host2 uname
    Linux

MPI Selector: Which MPI Runs

Mellanox OFED contains a simple mechanism for system administrators and end users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details. Note that MPI selector only affects the default MPI envi
80. 60 8000

    ib0.8000  Link encap:UNSPEC  HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
              BROADCAST MULTICAST  MTU:2044  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:128
              RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, follow the manual configuration procedure described in Section 3.8.3.3.

Step 5. To be able to use this interface, the Subnet Manager must be configured so that the chosen PKey, which defines a broadcast address, is recognized (see Chapter 7, "OpenSM - Subnet Manager").

3.8.4.2 Removing a Subinterface

To remove a child interface (subinterface), run:

    echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child

Using the example of Step 2:

    echo 0x8000 > /sys/class/net/ib0/delete_child

Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8000 in the example above).

3.8.5 Verifying IPoIB Functionality

To verify your configuration and your IPoIB functionality, perform the following steps:

Step 1. Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality
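The requirement that the PKey written to delete child has its most significant bit set can be made explicit with a small Python sketch. The helper names are hypothetical; the sysfs path and the 0x8000 example come from the text, and the PKey is assumed to be a 16-bit value.

```python
def with_msb_set(pkey: int) -> int:
    """Return the 16-bit PKey with its most significant bit (0x8000) set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKey must fit in 16 bits")
    return pkey | 0x8000


def delete_child_command(ib_interface: str, pkey: int) -> str:
    """Format the shell command that removes an IPoIB subinterface."""
    value = with_msb_set(pkey)
    return f"echo 0x{value:04x} > /sys/class/net/{ib_interface}/delete_child"


if __name__ == "__main__":
    # Both spellings of the PKey yield the same command.
    print(delete_child_command("ib0", 0x0000))
    print(delete_child_command("ib0", 0x8000))
```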
81. 68 1 0 24 A A

    # family  program  role  address:port[-range]

    # Use SDP by ttcp when it connects to port 5001 of any machine
    use sdp listen ttcp *:5001

    # Use TCP for any program with name starting with ttcp, serving ports 22 to 25
    use tcp server ttcp* *:22-25

    # Listen on both TCP and SDP by any server that listens on port 8080
    use both server * *:8080

    # Connect ssh through SDP and fall back to TCP to hosts on 11.4.8.*, port 22
    use both connect * 11.4.8.0/24:22

Explicit (Non-transparent) Conversion

Use explicit conversion if you need to maintain full control from your application while using SDP. To configure an explicit conversion to use SDP, simply recompile the application, replacing PF_INET (or AF_INET) with AF_INET_SDP (or PF_INET_SDP) when calling the socket() system call in the source code. The value of AF_INET_SDP is defined in the file sdp_socket.h, or you can define it inline:

    #define AF_INET_SDP 27
    #define PF_INET_SDP AF_INET_SDP

You can compile and execute the following very simple TCP application that has been converted explicitly to SDP:

Compilation:

    gcc sdp_server.c -o sdp_server
    gcc sdp_client.c -o sdp_client

Usage:

    Server: host1$ sdp_server
    Client: host1$ sdp_client <server IP addr>

Example:

    Server:
    host1$ sdp_server
    accepted connection from 15.2.2.42:48710
    read 2048 bytes
    en
82. mvapich_intel libibverbs-devel openmpi_intel libmverbs opensm-libs libmthca libmlx4 libibcm libibmad libsdp dapl dapl ibutils2 libmverbs ofed-scripts libibverbs-devel libibverbs-devel-static libibverbs-devel-static libibverbs-utils libmthca libmthca-devel-static libmthca-devel-static libmlx4 libmlx4-devel libmlx4-devel libmverbs-devel libmverbs-devel libmqe libmqe libibcm libibcm-devel libibcm-devel libibumad-devel libibumad-static libibumad-static libibmad-devel libibmad-devel libibmad-static libibmad-static ibsim librdmacm-utils librdmacm-devel librdmacm-devel
83. AMAGE

Mellanox Technologies
350 Oakmead Parkway
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
PO Box 586 Hermon Building
Yokneam 20692
Israel
Tel: +972-4-909-7200
Fax: +972-4-959-3245

© Copyright 2011. Mellanox Technologies, Inc. All Rights Reserved.

Mellanox, BridgeX, ConnectX, CORE-Direct, InfiniBlast, InfiniBridge, InfiniHost, InfiniRISC, InfiniScale, InfiniPCI, PhyX, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd. FabricIT and SwitchX are trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners.

Document Number: 3481

Table of Contents

Table of Contents
List of Tables
Revision History
Preface
    Intended Audience
    Documentation Conventions
        Typographical Conventions
        Common Abbreviations and Acronyms
    Glossary
    Related Documentation
    Support and Updates
84. 4. Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016:

    > ibroute -G 0x000b8cffff004016
    [output omitted]

5. Dump all non-empty mlids of switch with Lid 3:

    > ibroute -M 3
    [output omitted]

8.12 smpquery

Applicable Hardware: All InfiniBand devices

Description: Provides a basic subset of standard SMP queries to query Subnet management attributes
85. Emphasized words: These are emphasized words
Pop-up menu sequences: menu1 --> menu2 --> ... --> item

Common Abbreviations and Acronyms

Table 2 - Abbreviations and Acronyms

    Abbreviation / Acronym   Whole Word / Description
    B                        (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
    b                        (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
    FCoE                     Fibre Channel over Ethernet

Glossary

The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.

Table 3 - Glossary

    Channel Adapter (CA), Host Channel Adapter (HCA): An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).
    HCA Card: A network adapter card based on an InfiniBand channel adapter device.
    IB Devices: Integrated circuit implementing InfiniBand compliant communication.
    IB Cluster/Fabric: A set of IB devices connecte
86. ConnectX family or gPXE for InfiniHost III family has the highest priority in the BIOS boot sequence.

(Figure: SUSE installer "Finishing Basic Installation" progress screen; the system reboots when it completes.)

Step 20. Once the boot is complete, the Startup Options window will pop up. Select "SUSE Linux Enterprise Server 10 SP2", then press Enter.

(Figure: boot menu listing "SUSE Linux Enterprise Server 10", "Floppy", and "SUSE Linux Enterprise Server 10 (Failsafe)", with a Boot Options field.)

Step 21. The Hostname and Domain Name window will pop up. Continue configuring your machine until the operating system is up; then you can start running the machine in normal operation mode.

Step 22. (Optional) If you wish the second instance of connecting to the iSCSI Target to go through the IB driver, copy the initrd file under /boot to a new locati
87. D Tables

With RoCE, there may be several entries in a port's GID table. The first entry always contains the IPv6 link-local address of the corresponding Ethernet interface. The link-local address is formed in the following way:

    gid[0..7] = fe80000000000000
    gid[8]    = mac[0] ^ 2
    gid[9]    = mac[1]
    gid[10]   = mac[2]
    gid[11]   = ff
    gid[12]   = fe
    gid[13]   = mac[3]
    gid[14]   = mac[4]
    gid[15]   = mac[5]

If VLAN is supported by the kernel and there are VLAN interfaces on the main Ethernet interface (the interface that the IB port is tied to), then each such VLAN will appear as a new GID in the port's GID table. The format of the GID entry will be identical to the one described above, except for the following change:

    gid[11] = VLAN ID high byte (4 MS bits)
    gid[12] = VLAN ID low byte

Please note that the VLAN ID is 12 bits wide.

Priority Pause Frames

Tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived from the IB SL field by taking the 3 least significant bits of the SL field.

Using VLANs

In order for RoCE traffic to use VLAN tagged frames, the user needs to specify GID table entries that are derived from VLAN devices when creating address vectors. Consider the example below:

• Make sure VLAN support is enabled by the kernel. Usually this requires loading the 802.1q module
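The GID construction rules above (link-local prefix, MAC-derived bytes, the VLAN override of bytes 11 and 12, and the SL-to-priority mapping) translate directly into code. The following Python sketch is an illustration only; the function names and the example MAC address are hypothetical.

```python
from typing import Optional


def roce_gid(mac: str, vlan_id: Optional[int] = None) -> bytes:
    """Build the 16-byte RoCE GID for a MAC address, per the rules above."""
    m = bytes(int(octet, 16) for octet in mac.split(":"))
    if len(m) != 6:
        raise ValueError("MAC address must have six octets")

    gid = bytearray(16)
    gid[0:8] = bytes.fromhex("fe80000000000000")   # gid[0..7]: link-local prefix
    gid[8] = m[0] ^ 2                               # gid[8] = mac[0] ^ 2
    gid[9], gid[10] = m[1], m[2]
    gid[11], gid[12] = 0xFF, 0xFE                   # filler bytes for the non-VLAN entry
    gid[13], gid[14], gid[15] = m[3], m[4], m[5]

    if vlan_id is not None:
        if not 0 <= vlan_id <= 0xFFF:               # the VLAN ID is 12 bits wide
            raise ValueError("VLAN ID must be in the range 0..4095")
        gid[11] = (vlan_id >> 8) & 0x0F             # high byte: 4 MS bits
        gid[12] = vlan_id & 0xFF                    # low byte
    return bytes(gid)


def sl_to_priority(sl: int) -> int:
    """Ethernet priority: the 3 least significant bits of the IB SL."""
    return sl & 0x7


if __name__ == "__main__":
    print(roce_gid("00:02:c9:00:00:01").hex())
    print(roce_gid("00:02:c9:00:00:01", vlan_id=100).hex())
    print(sl_to_priority(13))
```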
88. DAPL over RDMA_CM over RoCE devices. To add the missing entries, perform the following:

Step 1. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the IB devices/ports.

Step 2. Add a new entry line, according to the format below, to the dat.conf file for each output line of the ibdev2netdev utility:

    <IA Name> u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "<ethX> <port>" ""

    Parameter    Description                                         Example
    <IA Name>    The device's IA name. The name must be unique.      ofa-v2-ethx
    <ethX>       The associated Ethernet device used by RoCE.        eth3
    <port>       The port number.                                    1

The following is an example of the ibdev2netdev utility's output and the entries added per each output line:

Example:

    sw419> ibdev2netdev
    mlx4_0 port 2 <=> eth2
    mlx4_0 port 1 <=> eth3

    ofa-v2-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 2" ""
    ofa-v2-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 1" ""

3.2 GPUDirect

3.2.1 GPUDirect Overview

GPUDirect provides Linux Kernel modifications to support sharing pinned pages between different drivers, thus allowing direct communication between drivers. The support of the Linux Kernel Memory Manager (MM) allows NVIDIA and Mellanox drivers to share the host memory a
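Step 2 above is mechanical, so it can be sketched as a small Python filter that turns ibdev2netdev output into dat.conf entry lines. This is an illustration, not a supported tool: the function name is hypothetical, the IA name is assumed to follow the ofa-v2-<ethX> pattern of the example, and the parser assumes each output line has the form "<ibdev> port <n> <=> <ethX>" (an "==>" arrow is also accepted).

```python
import re

# Assumed line shape: "mlx4_0 port 2 <=> eth2" (also accepts "==>").
_ASSOCIATION = re.compile(r"^(\S+)\s+port\s+(\d+)\s+<?=+>\s+(\S+)")


def dat_conf_entries(ibdev2netdev_output: str) -> list:
    """Return one dat.conf entry per ibdev2netdev association line."""
    entries = []
    for line in ibdev2netdev_output.splitlines():
        match = _ASSOCIATION.match(line.strip())
        if match is None:
            continue                      # skip blank or unrecognized lines
        _ib_device, port, eth = match.groups()
        entries.append(
            f'ofa-v2-{eth} u2.0 nonthreadsafe default '
            f'libdaplofa.so.2 dapl.2.0 "{eth} {port}" ""'
        )
    return entries


if __name__ == "__main__":
    sample = "mlx4_0 port 2 <=> eth2\nmlx4_0 port 1 <=> eth3\n"
    for entry in dat_conf_entries(sample):
        print(entry)
```

Review the generated lines before appending them to dat.conf, since IA names must be unique within the file.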
89. 3. Change the speed of a port:

    [output omitted: initial PortInfo query]

Now change the enabled link speed:

    [output omitted: PortInfo before and after the set operation]

Show the new configuration:

    [output omitted: resulting PortInfo]

8.11 ibroute

Applicable Hardware: InfiniBand switches

Description: Uses SMPs to display the forwarding tables (unicast: LinearForwardingTable or LFT; multicast: MulticastForwardingTable or MFT) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.

Synopsis: ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s <smlid>]
90. Description: Lists InfiniBand devices available for use from userspace, including node GUIDs.

Synopsis: ibv_devices

Examples:
1. List the names of all available InfiniBand devices:

    > ibv_devices
    device          node GUID
    ------          ----------------
    mthca0          0002c9000101d150
    mlx4_0          0000000000073895

8.7 ibv_devinfo

Applicable Hardware: All InfiniBand devices

Description: Queries InfiniBand devices and prints information about them that is available for use from userspace.

Synopsis: ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]

Table 14 lists the various flags of the command.

Table 14 - ibv_devinfo Flags and Options

    Flag                                Optional / Mandatory   Default (If Not Specified)   Description
    -d <device>, --ib-dev=<device>      Optional               First found device           Run the command for the provided IB device 'device'
    -i <port>, --ib-port=<port>         Optional               All device ports             Query the specified device port <port>
    -l, --list                          Optional               Inactive                     Only list the names of InfiniBand devices
    -v, --verbose                       Optional               Inactive                     Print all available information about the InfiniBand device(s)

Examples:
1. List the names of all available InfiniBand devices:

    > ibv_devinfo -l
    2 HCAs found:
        mthca0
        mlx4_0

2. Query t
91. FO 0x3

Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

--debug, -d <number>
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:

    OPT     Description
    -d0     Ignore other SM nodes
    -d1     Force single threaded dispatching
    -d2     Force log flushing after each log message
    -d3     Disable multicast support
    -d10    Put OpenSM in testability mode

Without -d, no debug options are enabled.

--help, -h, -?
Display this usage info, then exit.

7.2.2 Environment Variables

The following environment variables control opensm behavior:

• OSM_TMP_DIR: Controls the directory in which the temporary files generated by opensm are created. These files are opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
• OSM_CACHE_DIR: opensm stores certain data to the disk so that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
  guid2lid: stores the LID range assigned to each GUID

7.2.3 Signaling

When opensm receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topolo
92. GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target

See Section 3.6.2.3 for instructions on how the parameters in this echo command may be obtained.

Notes:
• Execution of the above echo command may take some time.
• The SM must be running while the command executes.
• It is possible to include additional parameters in the echo command:
  • max_cmd_per_lun: Default: 63
  • max_sect (short for max_sectors): sets the request size of a command
  • io_class: Default: 0x100 as in rev 16A of the specification (in rev 10, the default was 0xff00)
  • initiator_ext: Please refer to Section 9 (Multiple Connections)

To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
• Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
• Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.

3.6.2.3 SRP Tools: ibsrpdm and srp_daemon

To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
• Detect targets on the fabric reachable by the Initiator (for Step 1)
• Output target attributes in a format suitable for use in the above
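The add target string discussed above is a comma-separated list of key=value pairs written to a per-port sysfs file, so it is easy to assemble programmatically. The Python sketch below is an illustration under stated assumptions: the function name and all example values are hypothetical, and the leading field names (id_ext, ioc_guid, dgid) are taken from the ib_srp driver's documented login parameters rather than from this excerpt, which only shows the tail of the command.

```python
def srp_add_target_command(id_ext: str, ioc_guid: str, dgid: str, service_id: str,
                           hca: str = "mthca0", port: int = 1,
                           pkey: str = "ffff", **extra) -> str:
    """Format the echo command that asks ib_srp to connect to a target.

    Optional parameters such as max_cmd_per_lun, max_sect, io_class or
    initiator_ext may be passed as keyword arguments.
    """
    fields = [
        ("id_ext", id_ext),
        ("ioc_guid", ioc_guid),
        ("dgid", dgid),
        ("pkey", pkey),
        ("service_id", service_id),
    ]
    fields += sorted(extra.items())       # additional parameters, in stable order
    target = ",".join(f"{key}={value}" for key, value in fields)
    path = f"/sys/class/infiniband_srp/srp-{hca}-{port}/add_target"
    return f"echo -n {target} > {path}"


if __name__ == "__main__":
    # Hypothetical target attributes, e.g. as reported by ibsrpdm or srp_daemon.
    print(srp_add_target_command(
        id_ext="200400a0b81146a1",
        ioc_guid="0002c90200402bd4",
        dgid="fe800000000000000002c90200402bd5",
        service_id="0002c90200402bd4",
        max_cmd_per_lun=63,
    ))
```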
93. Hop algorithm, and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.

Use the '-R dor' option to activate the DOR algorithm.

Torus-2QoS Routing Algorithm

Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:
• Routing that is free of credit loops
• Two levels of QoS, assuming switches support 8 data VLs
• Ability to route around a single failed switch, and/or multiple failed links, without:
  • introducing credit loops
  • changing path SL
94. InfiniHost III Ex

    Starting Mellanox Boot over IB for InfiniHost III Ex ver 1.0.0
    Loading via IB Port 2
    Waiting for Infiniband link up ... ok

After configuring the IB/ETH port, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.

For ConnectX (InfiniBand):

    Mellanox ConnectX FlexBoot
    gPXE 0.9.9 Open Source Boot Firmware
    net0: 00:02:c9:00:00:00:aa:bc on PCI02:00.0 (open)
      [Link:down, TX:0 TXE:0 RX:0 RXE:0]
      [Link status: Not connected (0x38086001)]
    Waiting for link-up on net0 ... ok
    DHCP (net0 00:02:c9:00:00:00:aa:bc) ... ok
    net0: 11.4.3.130/255.255.255.0 gw 0.0.0.0
    Booting from filename "pxelinux.0"
    tftp://11.4.3.7/pxelinux.0

For InfiniHost III Ex:

    gPXE 0.9.3 Open Source Boot Firmware
    Features: TFTP iSCSI AoE PXE PXEXT
    net0: 11.4.3.130/255.255.255.0

Next, FlexBoot attempts to boot as directed by the DHCP server.

A.7 Command Line Interface (CLI)

H.17.10 Invoking the CLI

When the boot process begins, the computer starts its Power On Self Test (POST) sequence. Shortly after completion of the POST, the user will be prompted to press CTRL-B to invoke the Mellanox FlexBoot CLI. The user has a few seconds to press CTRL-B before the message disappears (see figure).

Mellanox ConnectX FlexBoot v
95. Topology Changes
    7.5.2 Min Hop Algorithm
    7.5.3 UPDN Algorithm
        7.5.3.1 UPDN Algorithm Usage
    7.5.4 Fat-tree Routing Algorithm
        7.5.4.1 Routing between non-CN Nodes
        7.5.4.2 Activation through OpenSM
    7.5.5 LASH Routing Algorithm
    7.5.6 DOR Routing Algorithm
    7.5.7 Torus-2QoS Routing Algorithm
        7.5.7.1 Unicast Routing
        7.5.7.2 Multicast Routing
        7.5.7.3 Torus Topology Discovery
        7.5.7.4 Quality Of Service Configuration
        7.5.7.5 Operational Considerations
7.6 Quality of Service Management in OpenSM
    7.6.1 Overview
    7.6.2 Advanced QoS Policy File
    7.6.3 Simple QoS Policy Definition
    7.6.4 Policy File Syntax Guide
    7.6.5 Examples of Advanced Policy File
    7.6.6 Simple QoS Policy: Details and Examples
        7.6.6.1 IPoIB
        7.6.6.2 SDP
        7.6.6.3 RDS
        7.6.6.4 iSER
        7.6.6.5 SRP
96. Mellanox TECHNOLOGIES

Mellanox OFED for Linux GPUDirect User Manual

1 5 2 2 1 0 1 1 1000

www.mellanox.com

Mellanox Technologies Confidential

NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH D
97. This driver was not tested by Mellanox Technologies.

• Documentation

1.3.3 Firmware

The ISO image includes the following firmware items:
• Firmware images (.mlx format) for all Mellanox standard network adapter devices
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX, ConnectX-2, InfiniHost III Ex in Mem-free mode, and InfiniHost III Lx HCA devices
• ConnectX EN PXE (gPXE boot) for ConnectX EN and ConnectX-2 EN devices

1.3.4 Directory Structure

The ISO image of MLNX_OFED_LINUX contains the following files and directories:
• mlnxofedinstall: This is the MLNX_OFED_LINUX installation script.
• uninstall.sh: This is the MLNX_OFED_LINUX un-installation script.
• <CPU architecture folders>: Directory of binary RPMs for a specific CPU architecture.
• firmware/: Directory of the Mellanox IB HCA firmware images (including Boot-over-IB).
• src/: Directory of the OFED source tarball and the Mellanox Firmware Tools (MFT) tarball.
• docs/: Directory of Mellanox OFED related documentation.

1.4 Architecture

Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user spaces. The application level also shows the versatility of markets that Mellanox OFED applies to.
98. Nic will be assigned the VLAN ID specified. This value must be between 0 and 4095.
Table 3 - Red Hat Linux mlx4_vnic.conf file format
Field: VNICIBPORT
Description: The device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value of 1 or 2.
Other fields available for regular eth interfaces in the ifcfg-ethX files may also be used.
mlx4_vnic_confd
Once the configuration files are updated, the host administered vNics can be created. To manage the host administered vNics, run the following script:
    Usage: /etc/init.d/mlx4_vnic_confd {start|stop|restart|reload|status}
To retrieve general information about the vNics on the system, including network administered vNics, refer to Section 3.7.3.1, "mlx4_vnic_info," on page 75.
3.7.2.2 EoIB Network Administered vNic
In network administered mode, the configuration of the vNic is done by the BridgeX. If a vNic is configured for a specific host, it will appear on that host once a connection is established between the BridgeX and the mlx4_vnic module. This connection between the mlx4_vnic modules and all available BridgeX boxes is established automatically when the mlx4_vnic module is loaded. If the BridgeX is confi
99. ORES -gt 0 ]; do
        limit=$((limit * 2))
        CORES=$((CORES - 1))
    done
    if [ -z "$1" ]; then
        IRQS=$(cat /proc/interrupts | grep eth-mlx | awk '{print $1}' | sed 's/://')
    else
        IRQS=$(cat /proc/interrupts | grep $1 | awk '{print $1}' | sed 's/://')
    fi
    echo Discovered irqs: $IRQS
    mask=1
    for IRQ in $IRQS; do
        echo $(printf "%x" $mask) > /proc/irq/$IRQ/smp_affinity
        mask=$((mask * 2))
        if [ $mask -ge $limit ]; then
            mask=1
        fi
    done
    echo irqs were set OK.
Preserving Your Performance Settings After A Reboot
To preserve your performance settings after a reboot, add them to the file /etc/sysctl.conf as follows:
    <sysctl name1>=<value1>
    <sysctl name2>=<value2>
    <sysctl name3>=<value3>
    <sysctl name4>=<value4>
For example, Section 5.2.1, "Tuning the Network Adapter for Improved IPv4 Traffic Performance," listed the following setting to disable the TCP timestamps option:
    sysctl -w net.ipv4.tcp_timestamps=0
To keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:
    net.ipv4.tcp_timestamps=0
Performance Troubleshooting
PCI Express Performance Troubleshooting
For the best performance on the PCI Express interface, the adapter card should be installed in an x8 slot with the following BIOS configuration parameters:
- Max_Read_Req: the maximum read request size is 512 or higher
- MaxPayloadSize: the maximum pa
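The round-robin assignment performed by the interrupt affinity script in this excerpt can be illustrated with a small Python sketch. It only computes the hexadecimal masks that the script would write to /proc/irq/<IRQ>/smp_affinity; it does not touch the system. The function name and the assumption that `limit` equals 2 to the power of the core count are illustrative, since the loop that computes `limit` is only partly visible in the excerpt.

```python
def affinity_masks(irqs, cores):
    """Return {irq: hex mask}, assigning one core per IRQ in round-robin order.

    Assumption: the script's `limit` is 2**cores, i.e. the first mask value
    that would point past the last core.
    """
    limit = 1 << cores
    mask = 1
    masks = {}
    for irq in irqs:
        # Same text the script produces with printf "%x" $mask
        masks[irq] = format(mask, "x")
        mask *= 2
        if mask >= limit:
            mask = 1  # wrap around to the first core
    return masks


if __name__ == "__main__":
    # Hypothetical IRQ numbers on a 4-core machine
    for irq, m in affinity_masks([50, 51, 52, 53, 54], 4).items():
        print(irq, m)
```

With five IRQs and four cores, the fifth IRQ wraps back to the first core, which is the behavior of the `if [ $mask -ge $limit ]` test in the script.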
100. P that did not appear in the qos-ulps section.
7.6.7 SL2VL Mapping and VL Arbitration
The OpenSM cached options file has a set of QoS related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
- Max VLs: the maximum number of VLs that will be on the subnet
- High limit: the limit of the High Priority component of the VL Arbitration table (IBA 7.6.9)
- VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template
- VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template
- SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15. (Note that VL15 used here means drop this SL.)
There are separate QoS configuration parameter sets for various target types: CAs, routers, switch external ports, and switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:
- qos_ca_: QoS configuration parameters set for CAs
- qos_rtr_: parameters set for routers
- qos_sw0_: parameters set for switches' port 0
- qos_swe_: parameters set for switches' external ports
Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):
    qos_ca_max_vls, qos_ca_high_limit, qos_ca_vlarb_high, qos_ca_vlarb_low (values illegible in this excerpt)
101. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging and running commands on remote computers and/or servers.
6.2.1 SSH Configuration
The following steps describe how to configure password-less access over SSH:
Step 1. Generate an ssh key on the initiator machine (host1).
    host1$ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/<username>/.ssh/id_rsa.
    Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
    The key fingerprint is:
    38:1b:29:df:4f:08:00:4a:0e:50:0:05:44:e7:9f:05 <username>@host1
Step 2. Check that the public and private keys have been generated.
    host1$ cd /home/<username>/.ssh/
    host1$ ls
    host1$ ls -la
    total 40
    drwx------ 2 root root 4096 Mar 5 04:57 .
    drwxr-x--- 13 root root 4096 Mar 4 18:27 ..
    -rw------- 1 root root 1675 Mar 5 04:57 id_rsa
    -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub
Step 3. Check the public key.
    host1$ cat id_rsa.pub
    ssh-rsa AAAAB3NzaC1lyc2HAAAABIWAAAQEA1zZVY8VBHOh90kZN70A11bU074RXm4zHeczyVxpYHaDPyDmgezbYMKrCIVzd10bH 2 kCOrpLYviU0oUHd3fvNTfMs0gcGg08PysUf 12FyYjira2P1xyg6mkHLGGgVut fEMmABZ
102. Service ID 0x00000000010648CA
3.9.4.4 SRP
The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service ID in the PR/MPR by itself and uses that information in setting up the QP.
The SRP Service ID is defined by the SRP target I/O Controller (it also complies with IBTA Service ID rules). The Service ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
3.9.5 OpenSM Features
The QoS related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 7) can be split into two main parts:
I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.
II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first the request is matched against the defined match rules such that the target QoS Level definition is found. Given the QoS Level, a path search is performed with the restrictions imposed by that level.
Note: QoS in OpenSM is described in detail in Chapter 7.
3.10 Atomic Operations
3.10.1 Enhanced Atomic Operations
ConnectX implements a set of Extended Atomic Operations beyond those define
103. Specifically, adapter products responding to the following PCI Device IDs are supported:
ConnectX / ConnectX-2 devices:
- Decimal 25408 (Hexadecimal: 6340)
- Decimal 25418 (Hexadecimal: 634a)
- Decimal 26418 (Hexadecimal: 6732)
- Decimal 26428 (Hexadecimal: 673c)
- Decimal 26438 (Hexadecimal: 6746)
- Decimal 26488 (Hexadecimal: 6778)
InfiniHost III Ex devices:
- Decimal 25218 (Hexadecimal: 6282)
InfiniHost III Lx devices:
- Decimal 25204 (Hexadecimal: 6274)
H 17 2 Tested Platforms
See the Mellanox FlexBoot Release Notes (FlexBoot_release_notes.txt).
H 17 3 FlexBoot in Mellanox OFED
The FlexBoot binary files are provided as part of Mellanox OFED for Linux. The following binary files are included:
1. A PXE ROM image file for each of the supported Mellanox network adapter devices. Specifically, the following images are included:
ConnectX / ConnectX-2 images:
- ConnectX_FlexBoot_25408_ROM-<version>.rom
- ConnectX_FlexBoot_25418_ROM-<version>.rom
- ConnectX_FlexBoot_26418_ROM-<version>.rom
- ConnectX_FlexBoot_26428_ROM-<version>.rom
- ConnectX_FlexBoot_26438_ROM-<version>.rom
- ConnectX_FlexBoot_26488_ROM-<version>.rom
where the number after the "ConnectX_FlexBoot" prefix indicates the corresponding PCI Device ID of the ConnectX / ConnectX-2 device.
InfiniHost III Ex images:
- IHOST3EX PORTI ROM
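The PCI Device ID list in this excerpt gives each ID in both decimal and hexadecimal. As a quick cross-check, the following Python sketch converts the decimal IDs and looks up the device family. The dictionary and function names are illustrative and not part of any Mellanox tool; the ID values themselves are taken from the list.

```python
# Decimal PCI Device IDs, grouped by device family as listed in the text.
SUPPORTED_IDS = {
    "ConnectX / ConnectX-2": [25408, 25418, 26418, 26428, 26438, 26488],
    "InfiniHost III Ex": [25218],
    "InfiniHost III Lx": [25204],
}


def to_hex(dev_id):
    """Return the lowercase hexadecimal form of a decimal PCI Device ID."""
    return format(dev_id, "x")


def family_of(dev_id):
    """Return the device family for a decimal ID, or None if it is not listed."""
    for family, ids in SUPPORTED_IDS.items():
        if dev_id in ids:
            return family
    return None


if __name__ == "__main__":
    for family, ids in SUPPORTED_IDS.items():
        for dev_id in ids:
            print(family, dev_id, to_hex(dev_id))
```

The hexadecimal form is the one that tools such as lspci typically display, so the conversion helps when matching an installed adapter against the decimal number embedded in a ROM image name.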
104. TXBUFSZ);
        if (nw < 0) {
            perror("write() failed");
            exit(EXIT_FAILURE);
        } else if (nw == 0) {
            printf("socket was closed by remote host\n");
        }
        printf("sent %zd bytes\n", nw);
        close(sd);
        return 0;
    }
sdp_server.c Code
    /*
     * Usage: ./sdp_server
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/epoll.h>
    #include <errno.h>
    #include <assert.h>

    #define RXBUFSZ 2048
    uint8_t rx_buffer[RXBUFSZ];

    #define DEF_PORT 22222
    #define AF_INET_SDP 27              /* SDP address family number */
    #define PF_INET_SDP AF_INET_SDP

    int main(int argc, char **argv)
    {
        /* Create an SDP stream socket instead of a TCP one */
        int sd = socket(PF_INET_SDP, SOCK_STREAM, 0);
        if (sd < 0) {
            perror("socket() failed");
            exit(EXIT_FAILURE);
        }
        struct sockaddr_in my_addr = {
            .sin_family = AF_INET,
            .sin_port = htons(DEF_PORT),
            .sin_addr.s_addr = INADDR_ANY,
        };
        int retbind = bind(sd, (struct sockaddr *)&my_addr, sizeof(my_addr));
        if (retbind < 0) {
            perror("bind() failed");
            exit(EXIT_FAILURE);
        }
        int retlisten = listen(sd, 5 /* backlog */);
        if (retlisten < 0
105. Using a Directed Route to the destination (Tool option '-d'): This option defines a directed route of output port numbers from the local port to the destination.
- Using port LIDs (Tool option '-l'): In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
- Using port names defined in the topology file (Tool option '-n'): This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the '-l' option.
8.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
Note: This version of ibdiagnet is included in the ibutils2 package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils package, you need to specify the full path: /opt/bin/ibdiagnet
Note: Please see ibutils2_release_notes.txt for additional information and known issues.
ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then
106. a_cm.ko
    /sbin/insmod /lib/modules/ib/rdma_ucm.ko
    /sbin/insmod /lib/modules/ib/mlx4_core.ko
    /sbin/insmod /lib/modules/ib/mlx4_ib.ko
    /sbin/insmod /lib/modules/ib/ib_mthca.ko
Note: The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release notes.
    /sbin/insmod /lib/modules/ib/ipoib_helper.ko
    /sbin/insmod /lib/modules/ib/ib_ipoib.ko
Note: In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
    /sbin/insmod /lib/modules/ib/ib_ipoib.ko lro=0
Step 10. Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the IB modules. For example, to use the DHCP client:
    /sbin/dhclient -cf /sbin/dhclient.conf ib1
Step 11. Save the init file.
Step 12. Close initrd.
    host1$ cd /tmp/initrd_ib
    host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
    host1$ gzip /tmp/new_initrd_ib.img
Step 13. At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_initrd_ib.img.gz. Copy it to the original initrd location and rename it properly.
H 17 14 Case II: Ethernet Ports
The Ethernet driver requires loading the following modules in th
107. [installation progress-bar output omitted]
libsdp libsdp-devel libsdp-devel opensm opensm-devel opensm-devel opensm-static opensm-static compat-dapl compat-dapl compat-dapl-devel compat-dapl-devel dapl-devel dapl-devel dapl-devel-static dapl-devel-static dapl-utils perftest mstflint mft sdpnetstat srptools rds-tools ibutils cc_mgr ibdump infiniband-diags qperf mlnxofed-docs mvapich_gcc mvapich_pgi openmpi_gcc openmpi_pgi mpi
108. able, in-order datagram delivery between sockets over RC or TCP/IP. RDS is intended for use with Oracle RAC 11g. For programming details, enter:
    host1$ man rds
RDS Configuration
The RDS ULP is installed as part of Mellanox OFED for Linux. To load the RDS module upon boot, edit the file /etc/infiniband/openib.conf and set "RDS_LOAD=yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
Sockets Direct Protocol
Overview
Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced protocol offload capabilities, SDP can provide lower latency, higher bandwidth, and lower CPU utilization than IPoIB or Ethernet running some sockets-based applications.
SDP can be used by applications and improve their performance transparently (that is, without any recompilation). Since SDP has the same socket semantics as TCP, an existing application is able to run using SDP; the difference is that the application's TCP socket gets replaced with an SDP socket.
It is also possible to configure the driver to automatically translate TCP to SDP based on the source IP/port, the destination, or the application name. See Section 3.5.5.
The SDP protocol is composed of a kernel module that implements SDP as a new address-family/protocol-family, and a library (see Section 3.5.2) that is used for replacing the TCP address family with SDP acco
109. ads. Specifically, you will be using the mlxburn tool to create and burn a composite image (from an adapter device's firmware and the PXE ROM image) onto the same Flash device of the adapter.
Image Burning Procedure
To burn the composite image, perform the following steps:
1. Obtain the MST device name. Run:
    mst start
    mst status
The device name will be of the form: mt<dev_id>_pci{_cr0|conf0} (see footnote 1).
2. Create and burn the composite image. Run:
    mlxburn -d <mst device name> -fw <FW .mlx file> -conf <.ini file> -exp_rom <expansion ROM image>
Example on Linux:
    mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MHGH28-XTC.ini -exp_rom ConnectX_IB_25418_ROM-X_X_XXX.rom
Example on Windows:
    mlxburn -dev mt25418_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MHGH28-XTC.ini -exp_rom ConnectX_IB_25418_ROM-X_X_XXX.rom
H 17 6 Preparing the DHCP Server in Linux Environment
The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB.
H 17 7 Configuring the DHCP Server
H 17 7 1 For ConnectX Family Devices
When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The
Footnote 1: Depending on the OS, the device name
110. host1$ sdpnetstat -S
    Proto Recv-Q Send-Q Local Address Foreign Address
    netstat: no support for `AF INET (tcp)' on this system.
To verify whether the module is loaded or not, you can use the lsmod command:
    host1$ lsmod | grep sdp
    ib_sdp 125020 0
The example output above shows that the SDP module is loaded. If the SDP module is loaded and the sdpnetstat command did not show SDP sockets, then SDP is not being used by any application.
3.5.3.2 Monitoring and Troubleshooting Tools
SDP has debug support for both the user space libsdp.so library and the ib_sdp kernel module. Both can be useful to understand why a TCP socket was not redirected over SDP and to help find problems in the SDP implementation.
User Space SDP Debug
User-space SDP debug is controlled by options in the libsdp.conf file. You can also have a local version and point to it explicitly using the following command:
    host1$ export LIBSDP_CONFIG_FILE=<path>/libsdp.conf
To obtain extensive debug information, you can modify libsdp.conf to have the log directive produce maximum debug output (provide the min-level flag with the value 1).
The log statement enables the user to specify the debug and error messages that are to be sent and their destination. The syntax of log is as follows:
    log [destination (stderr | syslog | file <filename>)] [min-level 1-9]
w
111. aptive interrupt moderation, use the following command:
    > ethtool -C eth<x> adaptive-rx on|off
- Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value. To set the values for packet rate limits and for moderation time high and low values, use the following command:
    > ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
- To set interrupt coalescing settings when adaptive moderation is disabled, use:
    > ethtool -C eth<x> [rx-usecs N] [rx-frames N]
Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.
- To query pause frame settings, run:
    > ethtool -a eth<x>
- To set pause frame settings, run:
    > ethtool -A eth<x> [rx on|off] [tx on|off]
- To query ring size values, run:
    > ethtool -g eth<x>
- To modify rings size, run:
    > ethtool -G eth<x> [rx N] [tx N]
- To obtain additional device statistics, run:
    > ethtool -S eth<x>
- To perform a self diagnostics test, run:
    > ethtool -t eth<x>
- The mlx4_en parameters can be found under /sys/module/mlx4_en (or /sys/module/mlx4_en/parameters, depending on t
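The adaptive moderation rule described in this excerpt (highest moderation time above the upper packet-rate limit, lowest below the lower limit) can be sketched in Python as follows. The parameter names mirror the ethtool options; the linear interpolation between the two limits is an assumption made for illustration, since the excerpt only specifies the behavior at and beyond the limits.

```python
def moderation_time(pkt_rate, pkt_rate_low, pkt_rate_high,
                    rx_usecs_low, rx_usecs_high):
    """Return the interrupt moderation time (usecs) for a measured packet rate.

    Outside the [pkt_rate_low, pkt_rate_high] window the value is clamped,
    as the text describes. Inside the window the value is interpolated
    linearly, which is an illustrative assumption.
    """
    if pkt_rate >= pkt_rate_high:
        return rx_usecs_high
    if pkt_rate <= pkt_rate_low:
        return rx_usecs_low
    span = pkt_rate_high - pkt_rate_low
    return (pkt_rate - pkt_rate_low) * (rx_usecs_high - rx_usecs_low) // span + rx_usecs_low


if __name__ == "__main__":
    # Hypothetical limits: 1000..5000 packets/s mapped onto 0..128 usecs
    for rate in (500, 3000, 6000):
        print(rate, moderation_time(rate, 1000, 5000, 0, 128))
```

The sketch is useful mainly for reasoning about which values to pass to `ethtool -C`: widening the packet-rate window makes the moderation time change more gradually under this assumed interpolation.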
112. automatically upgrades the firmware (see footnote 1).
2.3.1 Pre-installation Notes
- The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.
Note: Pre-existing configuration files will be saved with the extension ".conf.saverpm".
- If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).
- If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory.
Usage:
    mlnx_add_kernel_support.sh -i|--iso <mlnx iso> [-t|--tmpdir <local work dir>] [-v|--verbose]
Example:
The following command will create a MLNX_OFED_LINUX ISO image for RedHat 5.4 under the /tmp directory.
    MLNX_OFED_LINUX-1.5.1-rhel5.4/docs/mlnx_add_kernel_support.sh -i /mnt/MLNX_OFED_LINUX-1.5.1-rhel5.4.iso
    All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to
Footnote 1: The firmware will not be updated if you run the install script with the '--without-fw-update' option.
113. band/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 3.6.2.6).
Note: For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart
3.6.2.5 Multiple Connections from Initiator IB Port to the Target
Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA.
In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command.
Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, there is need for a different initiator_ext value on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path.
If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:
    id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,pkey=ffff,service_id=200500a0b81146a1,initiator_ext=8d2b400
114. based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. The ib-bonding package contains a bonding driver and a utility called ib-bond to manage and control the driver operation.
The ib-bonding driver comes with the ib-bonding package (run "rpm -qi ib-bonding" to get the package information).
The ib-bonding driver can be loaded manually or automatically.
Manual Operation
Use the utility ib-bond to start, query, or stop the driver. For details on this utility, please read the documentation for the ib-bonding package under:
- /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat
- /usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE
Automatic Operation
Automatic ib-bonding operation can be configured as follows:
1. Using a standard OS bonding configuration. For details on this, please read the documentation for the ib-bonding package under:
- /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat
- /usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE
Notes:
- If the bondX name is defined but one of bondX_SLAVES or bondX_IPs is missing, then that specific bond will not be created.
- The bondX name must not contain characters which are disallowed for bash variable names, such as '.' and '-'.
Note: On all newer OSes, bonding can be done with the inbox bonding module.
115. bleshooting
5.3.2 InfiniBand Performance Troubleshooting
5.3.3 System Performance Troubleshooting
Chapter 6 MPI - Message Passing Interface
6.1 Overview
6.2 Prerequisites for Running MPI
6.2.1 SSH Configuration
6.3 MPI Selector - Which MPI Runs
6.4 Compiling MPI Applications
Chapter 7 OpenSM - Subnet Manager
7.1 Overview
7.2 opensm Description
7.2.1 opensm Syntax
7.2.2 Environment Variables
7.2.3 Signaling
7.2.4 Running opensm
7.2.4.1 Running OpenSM As Daemon
7.3 osmtest Description
7.3.1 Syntax
7.3.2 Running osmtest
7.4 Partitions
7.4.1 File Format
7.5 Routing Algorithms
7.5.1 ...
116. col
3.5.1 Overview
3.5.2 libsdp.so Library
3.5.3 Configuring SDP
3.5.3.1 How to Know SDP Is Working
3.5.3.2 Monitoring and Troubleshooting Tools
3.5.4 Environment Variables
3.5.5 Converting Socket-based Applications
3.5.6 BZCopy - Zero Copy Send
3.5.7 Using RDMA for Small Buffers
3.6 SCSI RDMA Protocol
3.6.1 Overview
3.6.2 SRP Initiator
3.6.2.1 Loading SRP Initiator
3.6.2.2 Manually Establishing an SRP Connection
3.6.2.3 SRP Tools - ibsrpdm and srp_daemon
3.6.2.4 Automatic Discovery and Connection to Targets
3.6.2.5 Multiple Connections from Initiator IB Port to the Target
3.6.2.6 High Availability (HA)
3.6.2.7 Shutting Down SRP
3.7 Ethernet over IB (EoIB) vNic
3.7.1 Ethernet over IB Topology
3.7.1.1 External Ports (eports) and Gateway
3.7.1.2 Virtual Hubs (vHubs)
3.7.1.3 Vi
117. create a new udev rule file under /etc/udev/rules.d/61-vnic-net.rules; include the line:
    SUBSYSTEM=="net", PROGRAM=="/sbin/mlx4_vnic_info -u %k", NAME="%c{2+}"
The UDEV service is active by default; however, if it is not active, run:
    /sbin/udevd -d
When the vNic MAC address is consistent, you can statically name each interface using the following UDEV rule:
    SUBSYSTEM=="net", SYSFS{address}=="aa:bb:cc:dd:ee:ff", NAME="ethX"
For further information on the UDEV rules syntax, please refer to the udev man pages.
3.8 IP over InfiniBand
3.8.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following ConnectX/ConnectX-2 capabilities:
- Uses any CX IB ports (one or two)
- Inserts IP/UDP/TCP checksum on outgoing packets
- Calculates checksum on received packets
- Supports net device TSO through the CX LSO capability to defragment large datagrams to MTU quantas
- Dual operation mode: datagram and connected
- Large MTU support through connected mode
IPoIB also supports the following software based enhancements:
- Large Receive Offload
- NAPI
- Ethtool support
This chapter describes the following:
- IPoIB mode setting (Section 3.8.2
118. d 0x1234 : 5 # SRP when SRP Target is located on a specified IB port GUID
    any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with a specific target port GUID
    end-qos-ulps
Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query didn't match any other rule.
All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match, the query is matched against the rules in the qos-ulps section.
Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.
7.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition.
The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
    ipoib : <SL>
    ipoib, pkey 0x7fff : <SL>
    any, pkey 0x7fff : <SL>
7.6.6.2 SDP
An SDP PR query is matched by Servic
119. d by IB cables (term: ... Subnet)
In-Band: A term assigned to administration activities traversing the IB connectivity only.
LID: An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
Local Device/Node/System: The IB Host Channel Adapter (HCA) card installed on the machine running IBDIAG tools.
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch providing the list of ports to forward received multicast packets. The table is organized by MLID.
Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
Standby Subnet Manager: A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
Subnet Manager (SM): One of several entities involved in the configuration and control of the subnet.
Unicast Linear Forwarding Tables: A table that exists in every switch providing the port through which packets
120. d by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations are the same as those for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).
3.10.1.1 Masked Compare and Swap (MskCmpSwap)
The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64-bit target data for the "compare" check as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:
    atomic_response = *va
    if (!((compare_add ^ *va) & compare_add_mask)) then
        *va = (*va & ~(swap_mask)) | (swap & swap_mask)
    return atomic_response
The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap are as for standard IB Atomic operations.
3.10.1.2 Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:
    bit_adder(...
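The MskCmpSwap pseudocode in section 3.10.1.1 can be exercised with a small Python model. This is only a software sketch of the semantics (the real operation is executed atomically by the adapter); the dictionary standing in for memory and the function name are illustrative, while the operand names follow the pseudocode.

```python
MASK64 = (1 << 64) - 1  # operands are 64-bit values


def masked_cmp_swap(memory, va, compare_add, compare_add_mask, swap, swap_mask):
    """Model of MskCmpSwap on a 64-bit word stored at memory[va].

    Only the bits selected by compare_add_mask take part in the comparison,
    and only the bits selected by swap_mask are replaced. The original
    value is always returned, as in the pseudocode.
    """
    atomic_response = memory[va]
    if ((compare_add ^ atomic_response) & compare_add_mask) == 0:
        memory[va] = ((atomic_response & ~swap_mask) | (swap & swap_mask)) & MASK64
    return atomic_response


if __name__ == "__main__":
    mem = {0: 0x1122334455667788}
    # Compare only the low byte (0x88), swap only the second byte.
    old = masked_cmp_swap(mem, 0, 0x88, 0xFF, 0xAB00, 0xFF00)
    print(hex(old), hex(mem[0]))
```

In the usage above the compare succeeds because the low byte matches, so only the second byte of the target is replaced and every other bit is left untouched.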
121. d>, include the following statement:
    log min-level 1 destination file sdp_debug.log
Kernel Space SDP Debug
The SDP kernel module can log detailed trace information if you enable it using the debug_level variable in the sysfs filesystem. The following command performs this:
    host1$ echo 1 > /sys/module/ib_sdp/debug_level
Note: Depending on the operating system distribution on your machine, you may need an extra level, "parameters", in the directory structure, so you may need to direct the echo command to /sys/module/ib_sdp/parameters/debug_level.
Turning off kernel debug is done by setting the sysfs variable to zero using the following command:
    host1$ echo 0 > /sys/module/ib_sdp/debug_level
To display debug information, use the dmesg command:
    host1$ dmesg
3.5.4 Environment Variables
For the transparent integration with SDP, the following two environment variables are required:
1. LD_PRELOAD: this environment variable is used to preload libsdp.so and it should point to the libsdp.so library. The variable should be set by the system administrator to /usr/lib/libsdp.so (or /usr/lib64/libsdp.so).
2. LIBSDP_CONFIG_FILE: this environment variable is used to configure the policy for replacing TCP sockets with SDP sockets. By default it points to /etc/libsdp.conf.
3. SIMPLE_LIBSDP: ignore l
122. d of test
host1#

Client:
host2# ./sdp_client 15.2.2.43
connected to 15.2.2.43:22222
sent 2048 bytes
host2#

sdp_client.c Code

/*
 * usage: ./sdp_client <ip_addr>
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define DEF_PORT 22222
#define AF_INET_SDP 27
#define PF_INET_SDP AF_INET_SDP
#define TXBUFSZ 2048

uint8_t tx_buffer[TXBUFSZ];

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Usage: sdp_client <ip_addr>\n");
        exit(EXIT_FAILURE);
    }

    /* Open a stream socket in the SDP address family */
    int sd = socket(PF_INET_SDP, SOCK_STREAM, 0);
    if (sd < 0) {
        perror("socket() failed");
        exit(EXIT_FAILURE);
    }

    struct sockaddr_in to_addr = {
        .sin_family = AF_INET,
        .sin_port = htons(DEF_PORT),
    };

    int ip_ret = inet_aton(argv[1], &to_addr.sin_addr);
    if (ip_ret == 0) {
        printf("invalid ip address '%s'\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    int conn_ret = connect(sd, (struct sockaddr *) &to_addr, sizeof(to_addr));
    if (conn_ret < 0) {
        perror("connect() failed");
        exit(EXIT_FAILURE);
    }

    printf("connected to %s:%u\n", inet_ntoa(to_addr.sin_addr), ntohs(to_addr.sin_port));

    ssize_t nw = write(sd, tx_buffer
123. d scripts RPM

3 Driver Features

3.1 RDMA over Converged Ethernet

3.1.1 RoCE Overview

RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over Ethernet networks. It encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type. While the use of GRH is optional within IB subnets, it is mandatory when using RoCE. Verbs applications written over IB verbs should work seamlessly, but they require provisioning of GRH information when creating address vectors. The library and driver are modified to provide the mapping from GID to MAC addresses required by the hardware.

3.1.2 Software Dependencies

In order to use RoCE over Mellanox ConnectX(R) hardware, the mlx4_en driver must be loaded. Please refer to MLNX_EN_README.txt for further details.

3.1.3 Firmware Dependencies

RoCE over Mellanox ConnectX(R) hardware requires ConnectX firmware version 2.7.000 or higher. Features such as loopback require higher firmware versions.

3.1.4 General Guidelines

Since RoCE encapsulates InfiniBand traffic in Ethernet frames, the corresponding net device must be up and running. In the case of Mellanox hardware, mlx4_en must be loaded and the corresponding interface configured.

• Make sure that mlx4_en.ko is loaded
• To verify the module is loaded, run the
124. d switch does not need virtual layers, as deadlock will not arise between switch and HCA. In more detail, the algorithm works as follows:

1. LASH determines the shortest path between all pairs of source/destination switches. Note: LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.

2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.

3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.

Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available.

In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies, it is topology agnostic, and it fares well in the face of faults. It has been shown that for both regular and irregular topologies, LASH outperform
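Step 2 of the algorithm above (assigning a route to the first layer in which it does not close a cycle in that layer's channel dependency graph) can be sketched as follows. This is an illustration under simplifying assumptions, not the opensm implementation: routes are given as lists of switch names, a channel is a directed link between two switches, and the balancing of step 3 is left out. All function names are hypothetical.

```python
def channel_dependencies(route):
    """Dependency edges of a route: each link depends on the next link."""
    links = list(zip(route, route[1:]))
    return set(zip(links, links[1:]))


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    state = dict.fromkeys(graph, WHITE)

    def visit(node):
        state[node] = GREY
        for nxt in graph[node]:
            if state[nxt] == GREY:
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state[node] == WHITE and visit(node) for node in graph)


def assign_layers(routes):
    """Return, for each route, the index of the layer (SL) it was placed in."""
    layers = []       # one set of dependency edges per layer
    assignment = []
    for route in routes:
        deps = channel_dependencies(route)
        for index, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps
                assignment.append(index)
                break
        else:
            # Every existing layer would deadlock: open a new one.
            layers.append(set(deps))
            assignment.append(len(layers) - 1)
    return assignment


# Three two-hop routes around a ring of three switches.
print(assign_layers([["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]))
```

In the ring example the third route would complete a cycle of link dependencies in the first layer, so it is placed in a second layer.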
125. ding Tables (LFT) should be sent to each LID.

Virtual Protocol Interconnect (VPI)
    A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).

Related Documentation

Table 4 - Reference Documents

InfiniBand Architecture Specification, Vol. 1, Release 1.2.1
    The InfiniBand Architecture Specification that is provided by IBTA.

IEEE Std 802.3ae™-2002 (Amendment to IEEE Std 802.3-2002), Document # PDF: SS94996
    Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gb/s Operation.

Fibre Channel BackBone 5 standard for Fibre Channel over Ethernet, Document # INCITS xxx-200x Fibre Channel Backbone
    http://www.t11.org (draft)

Firmware Release Notes for Mellanox adapter devices
    See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.

MFT User's Manual
    Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.

MFT Release Notes
    Release Notes for the Mellanox Firmware T

Support and Updates Webpage
126. ding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path.

The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted, as it may change.

ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in the -p option. Otherwise, an error is reported.

Note: When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error.

Moreover, th
127. dit loops in general, all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing by running applications is not required when the fabric is routed with torus-2QoS.

Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover, provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details.

Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If no fallback was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case, all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored.

In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credi
128. e One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS level, it can also be explicitly referred to by any match rule.

IV) QoS Matching Rules (denoted by qos-match-rules)

Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, such that the first match takes precedence. Each rule has the name of the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule.

Queries can be matched by:
• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for the destination port)
• PKey
• QoS class
• Service ID

To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion - Service ID - it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other
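The matching behaviour described above (rules scanned in file order, first match wins, every criterion of a rule must be satisfied, and a DEFAULT level otherwise) can be modelled in a few lines. The sketch below is not the OpenSM policy parser: the dictionary layout, the field names and the level names are hypothetical, and a query is represented only by the fields that are switched on in its component mask.

```python
def rule_matches(rule, query):
    """A rule matches only if ALL of its criteria are met by the query.

    A criterion naming a field that is absent from the query (its bit is
    off in the component mask) cannot be met.
    """
    return all(
        field in query and query[field] in allowed
        for field, allowed in rule["criteria"].items()
    )


def qos_level_for(query, rules, default="DEFAULT"):
    """Scan rules in order of appearance; the first match takes precedence."""
    for rule in rules:
        if rule_matches(rule, query):
            return rule["level"]
    return default


rules = [
    {"level": "gold", "criteria": {"service_id": {0x156CE}, "pkey": {0x8001}}},
    {"level": "silver", "criteria": {"service_id": {0x156CE}}},
]

# A query carrying only a Service ID skips the first rule (it also wants a PKey).
print(qos_level_for({"service_id": 0x156CE}, rules))
```

Fields present in the query but not named by a rule are ignored by that rule, which mirrors the single-criterion Service ID case described in the text.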
129. e 186

Revision History

Rev 1 5 2 2 1 0 1 1 1000 (May 2011)
This is the initial version of the document.

Preface

This Preface provides general information concerning the scope and organization of this User's Manual. It includes the following sections:
• Section "Intended Audience"
• Section "Documentation Conventions"
• Section "Related Documentation"
• Section "Support and Updates Webpage"

Intended Audience

This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet, FCoE) adapter cards. It is also intended for application developers.

Documentation Conventions

Typographical Conventions

Table 1 - Typographical Conventions

File names
    file.extension
Directory names
    directory
Commands and their parameters
    command param1
Optional items
    [ ]
Mutually exclusive parameters
    { p1 | p2 | p3 }
Optional mutually exclusive parameters
    [ p1 | p2 | p3 ]
Prompt of a user command under bash shell
    hostname$
Prompt of a root command under bash shell
    hostname#
130. e ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:

sdp : <SL>
any, service-id 0x0000000000010000-0x000000000001ffff : <SL>

7.6.6.3 RDS

Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes a default Service ID of 0x00000000010648CA. The following two match rules are equivalent:

rds : <SL>
any, service-id 0x00000000010648CA : <SL>

7.6.6.4 SRP

The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent:

srp, target-port-guid 0x1234 : <SL>
any, target-port-guid 0x1234 : <SL>

Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port guid only) should be placed at the end of the qos-ulps match rules.

7.6.6.5 MPI

The SL for MPI is manually configured by the MPI admin. OpenSM is not forcing any SL on the MPI traffic, and that's why it is the only UL
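The Service ID layouts given above for SDP and RDS amount to placing the 16-bit port number in the low four hex digits of a fixed prefix. A minimal sketch of that arithmetic follows; the constant and function names are hypothetical, while the prefixes and the default RDS port are the ones quoted in the text.

```python
SDP_PREFIX = 0x0000000000010000   # 0x000000000001PPPP
RDS_PREFIX = 0x0000000001060000   # 0x000000000106PPPP
RDS_DEFAULT_PORT = 0x48CA


def _check_port(port):
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return port


def sdp_service_id(port):
    """Service ID for an SDP connection to the given remote port."""
    return SDP_PREFIX | _check_port(port)


def rds_service_id(port=RDS_DEFAULT_PORT):
    """Service ID for an RDS connection to the given remote port."""
    return RDS_PREFIX | _check_port(port)


def sdp_service_id_range():
    """Lowest and highest SDP Service IDs, as used by the 'any' rule."""
    return SDP_PREFIX, SDP_PREFIX | 0xFFFF


print(format(rds_service_id(), "#018x"))
print(format(sdp_service_id(22222), "#018x"))
```

Port 22222 is used here only because it is the port of the SDP sample client elsewhere in this manual.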
131. e binary image file.

spark
    This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).

Debug utilities
    A set of debug utilities (e.g., itrace, mstdump, isw, and i2c).

OpenSM is disabled by default. See Chapter 7, "OpenSM - Subnet Manager" for details on enabling it.

For additional details, please refer to the MFT User's Manual (docs/).

1.5 Quality of Service

Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 7, "OpenSM - Subnet Manager".

2 Installation

This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed. The chapter includes the following sections:
• Secti
132. e is zipped. Extract it using the following command:

host1# gzip -dc <initrd image> | cpio -id

The initrd files should now be found under /tmp/initrd_ib.

Step 4. Create a directory for the InfiniBand modules and copy them.

host1# mkdir -p /tmp/initrd_ib/lib/modules/ib
host1# cd /lib/modules/`uname -r`/updates/kernel/drivers
host1# cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
host1# cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
host1# cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp
133. e location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop, because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.

Extending this argument shows that, in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition that they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:

[Diagram: 6x6 2D torus showing the switches named in the text below; layout not recoverable]

Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.

As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent in a dimension that is not the last dimension routed by DOR; her
134. [Screenshot: installer dialog warning that no swap partition has been assigned]

Step 14. Select the Expert tab and click Booting.

[Screenshot: Installation Settings screen, Expert tab]
135. e or multiple devices on the local system)

-p <port-num>
    Specifies the local device's port number used to connect to the IB fabric
-o <out-dir>
    Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>
    Specifies the expected link width
-ls <2.5|5|10>
    Specifies the expected link speed
-pm
    Dump all the fabric links, pm Counters into ibdiagnet.pm
-pc
    Reset all the fabric links pmCounters
-P <PM=<Trash>>
    If any of the provided pm is greater than its provided value, print it to screen
-h|--help
    Prints the help page information
-V|--version
    Prints the version of the tool
--vars
    Prints the tool's environment variables and their values

8.5.2 Output Files

Table 13 - ibdiagpath Output Files
• A dump of all the application reports generated according to the provided flags
• A dump of the Performance Counters values of the fabric links

8.5.3 ERROR CODES

1 - The path traced is un-healthy
2 - Failed to parse command line options
3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
4 - Unable to traverse the LFT data from source to destination
5 - Failed to use Topology File
6 - Failed to load required Package

8.6 ibv_devices

Applicable Hardware: All InfiniBand devices
136. e port GUID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 8 bytes can be captured from the boot session as shown in the figure below.

[Figure: FlexBoot boot session screen; screen text not recoverable]

Placing Client Identifiers in /etc/dhcpd.conf

The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server.

host host1 {
    next-server 11.4.3.7;
    filename "pxelinux.0";
    fixed-address 11.4.3.130;
    option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}

H 17 7 2 For InfiniHost III Family Devices (PCI Device IDs: 25204, 25218)

When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions.

The value of the client identifier is composed of 21 bytes (separated by colons) having the following components:

20:<QP Number - 4 bytes>:<GID - 16 bytes>

Note: Bytes are represented as two hexadecimal digits.

Extractin
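The 21-byte client identifier layout for InfiniHost III devices described above (a leading 20, a 4-byte QP number, then the 16-byte GID) can be assembled mechanically. The sketch below assumes the QP number is written most significant byte first and that the leading 20 is a hexadecimal byte value; neither detail is stated in the text, and the function name and sample values are hypothetical.

```python
def infinihost_client_id(qp_number, gid):
    """Format a 21-byte DHCP client identifier as colon-separated hex.

    qp_number: integer that fits in 4 bytes (byte order is an assumption).
    gid: 16 raw bytes of the port GID.
    """
    if not 0 <= qp_number < (1 << 32):
        raise ValueError("QP number must fit in 4 bytes")
    if len(gid) != 16:
        raise ValueError("GID must be exactly 16 bytes")
    raw = bytes([0x20]) + qp_number.to_bytes(4, "big") + bytes(gid)
    return ":".join(f"{byte:02x}" for byte in raw)


# Illustrative values only.
sample_gid = bytes.fromhex("fe800000000000000002c90200402bd5")
print(infinihost_client_id(0x55, sample_gid))
```

The resulting string has the colon-separated form used for the dhcp-client-identifier option in the dhcpd.conf excerpt.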
137. e similar to the following (the numbers are only given as part of the example):

Process 0 is on l reg 6107
Process 1 is on l reg 6108
Host->device bandwidth for process 0: 6002.400960 MB/sec
Host->device bandwidth for process 1: 5940.005940 MB/sec
MPI send/recv bandwidth: 1432.808448 MB/sec

3.3 Fibre Channel over Ethernet

Note: Fibre Channel over Ethernet (FCoE) is still at beta level in this release.

3.3.1 FCoE Overview

The FCoE feature provided by Mellanox OFED allows connecting to Fibre Channel (FC) targets on an FC fabric using an FCoE-capable switch or gateway. Key features include:
• T11 and pre-T11 frame format
• Complete hardware offload of SCSI operations in pre-T11 format
• Hardware offload of FC-CRC calculations in pre-T11 format
• Zero copy FC stack in pre-T11 format
• VLANs and PFC (Priority flow control, that is PPP)

The FCoE feature is based on and interacts with the Open-FCoE project. The mlx4_fc module is designed to replace the original fcoe module and to allow using ConnectX hardware offloads. Mellanox OFED also includes the following open-fcoe.org modules:
• libfc - Used by the mlx4_fc module to handle FC logic, such as fabric login and logout, remote port login and logout, fc-ns transactions, etc.
• fcoe - Implements FCoE fully in software. Will load instead of mlx4_fc to sup
138. e specified order (see the example below):
• mlx4_core.ko
• mlx4_en.ko

H 17 14 1 Example: Adding an Ethernet Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 9 3 3 1 on page 93, and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.

Adding the Ethernet Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.

Step 2. Make a new working directory and change to it.

host1# mkdir /tmp/initrd_en
host1# cd /tmp/initrd_en

Step 3. Normally, the initrd image is zipped. Extract it using the following command:

host1# gzip -dc <initrd image> | cpio -id

The initrd files should now be found under /tmp/initrd_en.

Step 4. Create a directory for the ConnectX EN modules and copy them.

host1# mkd
139. e the failed switches are O and T:

[Diagram: 2D torus showing the switches named in the text below; layout not recoverable]

In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

7.5.7.2 Multicast Routing

Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning
140. e tool allows omitting the source node in LID-route addressing, in which case the local port on the machine running the tool is assumed to be the source.

SYNOPSIS

ibdiagpath {-n <[src-name,]dst-name>|-l <[src-lid,]dst-lid>|-d <p1,p2,p3,...>} [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]

OPTIONS

-n <[src-name,]dst-name>
    Names of the source and destination ports (as defined in the topology file; source may be omitted -> local port is assumed to be the source)
-l <[src-lid,]dst-lid>
    Source and destination LIDs (source may be omitted -> the local port is assumed to be the source)
-d <p1,p2,p3,...>
    Directed route from the local node (which is the source) and the destination node
-c <count>
    The minimal number of packets to be sent across each link (default = 100)
-v
    Enable verbose mode
-t <topo-file>
    Specifies the topology file name
-s <sys-name>
    Specifies the local system name. Meaningful only if a topology file is specified
-i <dev-index>
    Specifies the index of the device of the port used to connect to the IB fabric (in cas
141. 6 MPI - Message Passing Interface

6.1 Overview

Note: The PGI compiler does not support RHEL6.0, thus MLNX_OFED v1.5.2 will not include openmpi and mvapich with the PGI compiler on RHEL6.

Mellanox OFED for Linux includes the following MPI implementations over InfiniBand and RoCE:
• Open MPI - an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH - an MPI-1 implementation by Ohio State University

These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 6 lists some useful MPI links.

Table 6 - Useful MPI Links

MPI Standard
    http://www.unix.mcs.anl.gov/mpi
Open MPI
    http://www.open-mpi.org
MVAPICH MPI
    http://mvapich.cse.ohio-state.edu
MPI Forum
    http://www.mpi-forum.org

This chapter includes the following sections:
• Prerequisites for Running MPI
• MPI Selector - Which MPI Runs
• Compiling MPI Applications
• Please refer to http://www.open-mpi.org/faq/?category=mpi-apps

6.2 Prerequisites for Running MPI

For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. S
142. echo command.

Step 2. The utilities can be found under /usr/sbin/, and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages.

Below, several usage scenarios for these utilities are presented.

ibsrpdm

ibsrpdm is used for the following tasks:

1. Detecting reachable targets

a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:

ibsrpdm

This command will output information on each SRP Target detected, in human-readable form. Sample output:

IO Unit Info:
    port LID:        0103
    port GID:        fe800000000000000002c90200402bd5
    change ID:       0002
    max controllers: 0x10
    controller[ 1]
        GUID:      0002c90200402bd4
        vendor ID: 0002c9
        device ID: 005a44
        IO class : 0100
        ID:        LSI Storage Systems SRP Driver 200400a0b81146a1
        service entries: 1
            service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1

b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:

ibsrpdm -d <umad device>

2. Assistance in creating an SRP connection

a. To generate output suitable for utilization in the "echo" command of Section 3.6.2.2, add the "-c" option to ibsrpdm
143. ed for address vector creation. However, RoCE traffic does not go through the mlx4_en driver; it is completely offloaded by the hardware.

Configure an IP Address to mlx4_en Interface

Run the following on both sides of the link:

# ifconfig eth2 20.4.3.220
# ifconfig eth2
eth2  Link encap:Ethernet  HWaddr 00:02:C9:08:E8:11
      inet addr:20.4.3.220  Bcast:20.255.255.255  Mask:255.0.0.0
      UP BROADCAST MULTICAST  MTU:1500  Metric:1
      RX packets:0 errors:0 dropped:0 overruns:0 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000
      RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Make sure that ping is working:

# ping 20.4.3.219
PING 20.4.3.219 (20.4.3.219) 56(84) bytes of data.
64 bytes from 20.4.3.219: icmp_seq=1 ttl=64 time=0.873 ms
64 bytes from 20.4.3.219: icmp_seq=2 ttl=64 time=0.198 ms
64 bytes from 20.4.3.219: icmp_seq=3 ttl=64 time=0.167 ms
--- 20.4.3.219 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.167/0.412/0.873/0.326 ms

Inspecting the GID Table

# cat /sys/class/infiniband/mlx4_0/ports/2/gids/0
fe80:0000:0000:0000:0202:c9ff:fe08:e811
# cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
0000:0000:0000:0000:0000:0000:0000:0000

According to the output, we currently have one entry only.

Run an Example
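In the output above, GID entry 0 (fe80:0000:0000:0000:0202:c9ff:fe08:e811) lines up with the interface HWaddr 00:02:C9:08:E8:11 in the way an IPv6 link-local address is built from a MAC address: flip the universal/local bit of the first octet and insert ff:fe in the middle. The manual does not state this derivation, so the sketch below is an assumption that merely reproduces the example; the function name is hypothetical.

```python
def mac_to_link_local_gid(mac):
    """Build a link-local style GID string from a colon-separated MAC.

    Assumes the modified EUI-64 mapping; this matches the sample output
    but is not documented in this section of the manual.
    """
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 6-octet MAC address")
    octets[0] ^= 0x02                      # flip the universal/local bit
    interface_id = octets[:3] + [0xFF, 0xFE] + octets[3:]
    raw = [0xFE, 0x80] + [0] * 6 + interface_id
    return ":".join(f"{raw[i]:02x}{raw[i + 1]:02x}" for i in range(0, 16, 2))


print(mac_to_link_local_gid("00:02:C9:08:E8:11"))
```

Comparing the result with the gids/0 entry is a quick way to confirm which net device a RoCE port is bound to.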
144. emote boot technology. FlexBoot supports remote boot over InfiniBand (BoIB) and Boot over Ethernet (BoE).

Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products.

FlexBoot is based on the open source project Etherboot/gPXE, available at http://www.etherboot.org.

FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other service.

For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com > Products > InfiniBand/VPI SW/Drivers).

The binary code is exported by the device as an expansion ROM image.

H 17 1 Supported Mellanox Adapter Devices and Firmware

The package supports all ConnectX / ConnectX-2 network adapter devices and cards. It also supports the InfiniHost III Ex and InfiniHost Lx adapter devices and cards
145. ep 3. Repeat Step 1 and Step 2 on the remaining interface(s).

Subinterfaces

You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.

This section describes how to:
• Create a subinterface (Section 3.8.4.1)
• Remove a subinterface (Section 3.8.4.2)

3.8.4.1 Creating a Subinterface

To create a child interface (subinterface), follow this procedure:

Note: In the following procedure, ib0 is used as an example of an IB subinterface.

Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 0 will give a PKey with the value 0x8000.

Step 2. Create a child interface by running:

host1# echo <PKey> > /sys/class/net/<IB subinterface>/create_child

Example:

host1# echo 0 > /sys/class/net/ib0/create_child

This will create the interface ib0.8000.

Step 3. Verify the configuration of this interface by running:

host1# ifconfig <subinterface>.<subinterface PKey>

Using the example of Step 2:

host1# ifconfig i
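The relationship between the PKey value written in Step 2 and the name of the resulting child interface can be checked with a few lines of arithmetic: the most significant bit of the 16-bit value is set, and the result is appended to the parent name in hexadecimal. The helper names below are hypothetical; the create_child path is the one shown in Step 2.

```python
def effective_pkey(pkey):
    """Return the 16-bit PKey with its most significant bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKey must be a 16-bit unsigned value")
    return pkey | 0x8000


def child_interface_name(parent, pkey):
    """Name of the child interface created for the given PKey."""
    return f"{parent}.{effective_pkey(pkey):04x}"


def create_child_path(parent):
    """sysfs file that the PKey is written to in Step 2."""
    return f"/sys/class/net/{parent}/create_child"


print(child_interface_name("ib0", 0), create_child_path("ib0"))
```

With parent ib0 and PKey 0 this yields ib0.8000, the interface name given in the example of Step 2.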
146. eral such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e., not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it.
In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch.
For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the example above with failed switch T, th
147. ernet networks. It encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type.
RDS
Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. For more details, see Chapter 3.4, "Reliable Datagram Sockets".
SDP
Sockets Direct Protocol (SDP) is a byte-stream transport protocol that provides TCP stream semantics. SDP utilizes InfiniBand's advanced protocol offload capabilities. Because of this, SDP can have lower CPU and memory bandwidth utilization when compared to conventional implementations of TCP, while preserving the TCP APIs and semantics upon which most current network applications depend. For more details, see Chapter 3.5, "Sockets Direct Protocol".
SRP
SRP (SCSI RDMA Protocol) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. T
148. error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.
After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
• SM report
• Number of nodes and systems
• Hop-count information: maximal hop count, an example path, and a hop-count histogram
• All CA-to-CA paths traced
• Credit loop report
• mgid-mlid-HCAs multicast group and report
• Partitions report
• IPoIB report
Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.
8.4.5 ERROR CODES
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package
8.5 ibdiagpath - IB diagnostic path
ibdiagpath traces a path between two end-points and provides information regar
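When ibdiagnet is driven from a script, the error codes above can be turned into readable messages with a small lookup table. The sketch below assumes the six messages are numbered 1 through 6 in the order listed, and that 0 means success (a common convention that this section does not state); the function name is hypothetical.

```python
# Messages from the ERROR CODES list, assumed to be numbered 1..6 in order.
IBDIAGNET_ERRORS = {
    1: "Failed to fully discover the fabric",
    2: "Failed to parse command line options",
    3: "Failed to interact with IB fabric",
    4: "Failed to use local device or local port",
    5: "Failed to use Topology File",
    6: "Failed to load required Package",
}


def describe_exit_code(code: int) -> str:
    """Map an ibdiagnet exit code to a human-readable message."""
    if code == 0:
        return "Success"  # assumption: 0 is not listed in the manual
    return IBDIAGNET_ERRORS.get(code, f"Unknown error code {code}")


if __name__ == "__main__":
    for code in (0, 1, 5, 42):
        print(code, describe_exit_code(code))
```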
149. erver should be configured and started (see "IPoIB Configuration Based on DHCP"). Configure and start at least one of the services iSCSI Target (see Section A.9) and/or TFTP (see Section A.4).
Starting Boot
Boot the client machine and enter BIOS setup to configure MLNX FlexBoot (for the ConnectX family) or gPXE (for the InfiniHost III family) to be the first on the boot device priority list (see Section A.5).
Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.
If MLNX FlexBoot/gPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by the Subnet Manager.
Note: In case sensing the port protocol fails, the port will be configured as an InfiniBand port.
For ConnectX:
[Screen capture: Mellanox ConnectX FlexBoot banner (gPXE 0.9.9, Open Source Boot Firmware) showing interface net0 with link status "Not connected", followed by "Waiting for link-up on net0... ok"]
For
150. esolving multiple use of same LID.
-R, --routing_engine <engine names>
    This option chooses routing engine(s) to use instead of the default Min Hop algorithm. Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. If all configured routing engines fail, OpenSM will always attempt to route with Min Hop unless 'no_fallback' is included in the list of routing engines. Supported engines: updn, file, ftree, lash, dor, torus-2QoS.
--no_default_routing
    This option prevents OpenSM from falling back to default routing if none of the provided engines was able to configure the subnet.
--do_mesh_analysis
    This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing.
--lash_start_vl <vl number>
    Sets the starting VL to use for the lash routing algorithm. Defaults to 0.
--sm_sl <sl number>
    Sets the SL to use to communicate with the SM/SA. Defaults to 0.
-z, --connect_roots
    This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases this can violate pure deadlock free algorithm
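The fallback behaviour described for the routing engine list can be sketched as follows. This is not OpenSM code: `route_subnet` and the `try_engine` callback are hypothetical stand-ins for whatever actually attempts to route with a given engine, and the engine name "minhop" for the Min Hop fallback is an assumption.

```python
from typing import Callable, List, Optional


def route_subnet(engines: List[str],
                 try_engine: Callable[[str], bool]) -> Optional[str]:
    """Try each configured engine in order; fall back to Min Hop unless
    'no_fallback' appears in the list. Returns the engine that succeeded,
    or None if nothing could route the subnet."""
    allow_fallback = "no_fallback" not in engines
    for name in engines:
        if name == "no_fallback":
            continue  # a marker, not an engine
        if try_engine(name):
            return name
    # "minhop" is an assumed name for the default Min Hop algorithm.
    if allow_fallback and try_engine("minhop"):
        return "minhop"
    return None


if __name__ == "__main__":
    # Pretend only Min Hop can route this fabric.
    print(route_subnet(["ftree", "updn"], lambda name: name == "minhop"))
    print(route_subnet(["ftree", "no_fallback"], lambda name: name == "minhop"))
```

With the second call the list contains no_fallback, so the result is None rather than a silent fallback, which is the behaviour one would want when a specific routing algorithm is a hard requirement.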
151. eth and the mlx4_en driver must be loaded.
The port link type can be configured for each device in the system at run time using the /sbin/connectx_port_config script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.
Note: This utility also has a non-interactive mode:
    /sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]
4.2 InfiniBand Driver
The InfiniBand driver, mlx4_ib, handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
4.3 Ethernet Driver
4.3.1 Overview
The Ethernet driver, mlx4_en, exposes the following ConnectX/ConnectX-2 capabilities:
• Single/Dual port
• Up to 16 Rx queues per port
• 5 Tx queues per port
• Rx steering mode: Receive Core Affinity (RCA)
• Tx arbitration mode: VLAN user-priority (off by default)
• MSI-X or INTx
• Adaptive interrupt moderation
• HW Tx/Rx checksum calculation
• Large Send Offload (i.e., TCP Segmentation Offload)
• Large Receive Offload
• IP Reassembly Offload
• Multi-core NAPI support
• VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
• HW VLAN filtering
• HW multicast fi
152. fall back to TCP if the SDP connection fails.
<role> can be one of:
    server or listen: for defining the listening port address family
    client or connect: for defining the connected port address family
<program name>
    Defines the program name the rule applies to (not including the path). Wildcards with the same semantics as 'ls' are supported (* and ?). So db2* would match on any program with a name starting with db2; t?cp would match on ttcp, etc. If program name is not provided (default), the statement matches all programs.
<address>
    Either the local address to which the server binds, or the remote server address to which the client connects. The syntax for address matching is:
        <IPv4 address>[/<prefix length>]
        IPv4 address = [0-9]+.[0-9]+.[0-9]+.[0-9]+, each sub number < 255
        prefix length = [0-9]+, with value <= 32
    A prefix length of 24 matches the subnet mask 255.255.255.0. A prefix length of 32 requires matching of the exact IP.
<port range>
    start-port[-end-port], where port numbers are > 0 and < 65536
Note that rules are evaluated in the order of definition, so the first match wins. If no match is made, libsdp will default to both.
Examples:
Use SDP by clients connecting to machines that belong to subnet 192.168.1.*:
    use sdp connect 192.1
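The matching semantics described above (prefix-length address matching, port ranges, first match wins, default of both) can be modeled in Python. This is a sketch of the rules as stated, not libsdp's implementation: the tuple representation of a rule and all function names are hypothetical, and treating `*` as "match anything" for the address and port fields is an assumption.

```python
import ipaddress
from typing import List, Tuple

Rule = Tuple[str, str, str]  # (address family, address pattern, port range)


def address_matches(pattern: str, addr: str) -> bool:
    """Match addr against '<IPv4>[/<prefix length>]' ('*' assumed to match all)."""
    if pattern == "*":
        return True
    if "/" not in pattern:
        pattern += "/32"  # no prefix length: exact IP match
    network = ipaddress.ip_network(pattern, strict=False)
    return ipaddress.ip_address(addr) in network


def port_matches(pattern: str, port: int) -> bool:
    """Match port against 'start-port[-end-port]' ('*' assumed to match all)."""
    if pattern == "*":
        return True
    start, _, end = pattern.partition("-")
    low = int(start)
    high = int(end) if end else low
    if not 0 < low <= high < 65536:
        raise ValueError("port numbers must be > 0 and < 65536")
    return low <= port <= high


def choose_family(rules: List[Rule], addr: str, port: int) -> str:
    """Rules are evaluated in order; the first match wins; default is 'both'."""
    for family, addr_pattern, port_pattern in rules:
        if address_matches(addr_pattern, addr) and port_matches(port_pattern, port):
            return family
    return "both"


if __name__ == "__main__":
    rules = [("sdp", "192.168.1.0/24", "*"), ("tcp", "*", "*")]
    print(choose_family(rules, "192.168.1.7", 5000))  # sdp
    print(choose_family(rules, "10.0.0.1", 5000))     # tcp
```

Because the first match wins, a catch-all rule belongs at the end of the list; placed first, it would shadow every rule after it.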
153. first eth<n> interface that is found (for n=0,1,...); each '*' will be replaced with a corresponding octet from eth<n>.
    LAN_INTERFACE_ib0=
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
3.8.3.3 Manually Configuring IPoIB
To manually configure IPoIB for the default IB partition (VLAN), perform the following steps.
Note: This manual configuration persists only until the next reboot or driver restart.
Step 1: To configure the interface, enter the ifconfig command with the following items:
• The appropriate IB interface (ib0, ib1, etc.)
• The IP address that you want to assign to the interface
• The netmask keyword
• The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
    host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2: (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The following example shows how to verify the configuration:
    host1$ ifconfig ib0
    ib0  Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
         inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
         UP BROADCAST MULTICAST MTU:65520 Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:128
         RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
St
154. form the changes and click Accept when done.
Step 18: In the Confirm Installation window, click Install to start the installation. See image below.
[Image: Confirm Installation dialog shown over the Installation Settings window]
Step 19: At the end of the file copying stage, the Finishing Basic Installation window will pop up and ask for confirming a reboot. You can click OK to skip the countdown. See image below.
Note: Assuming that the machine has been correctly configured to boot from FlexBoot via its connection to the iSCSI target, make sure that MLNX IB for
155. g the Client Identifier Method
The following steps describe one method for extracting the client identifier.
Step 1: QP Number equals 00:55:04:01 for InfiniHost III Ex and InfiniHost III Lx HCAs.
Step 2: GID is composed of an 8-byte subnet prefix and an 8-byte Port GUID. The subnet prefix is fixed for the supported Mellanox HCAs, and is equal to fe:80:00:00:00:00:00:00. The next steps explain how to obtain the Port GUID.
Step 3: To obtain the Port GUID, run the following commands.
Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
    host1# mst start
    host1# mst status
The device name will be of the form /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via a query command:
    flint -d <MST_DEVICE_NAME> q
Example with InfiniHost III Ex as the HCA device:
    host1# flint -d /dev/mst/mt25218_pci_cr0 q
    Image type:    Failsafe
    FW Version:    …
    Rom Info:      type=GPXE version=1.0.0 devid=25218 port=2
    I.S. Version:  1
    Device ID:     25218
    Chip Revision: A0
    Description:   Node             Port1            Port2            Sys image
    GUIDs:         0002c90200231390 0002c90200231391 0002c90200231392 0002c90200231393
    Board ID:      (MT_0370110001)
    VSD:
    PSID:          MT_0370110001
Assuming that FlexBoot is connected via Port 2, then the Port GUID is 0
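Step 2's composition of the GID (fixed 8-byte subnet prefix followed by the 8-byte Port GUID) can be sketched as below. The function name is hypothetical. The sample GUID is the third value in the GUIDs row of the query output, taken as the Port 2 GUID on the assumption that the row follows the Node, Port1, Port2, Sys image column order. How the QP Number and GID are combined into the full client identifier is not shown in this excerpt, so the sketch stops at the GID.

```python
# Fixed subnet prefix for the supported HCAs, as given in Step 2.
SUBNET_PREFIX = bytes.fromhex("fe80000000000000")


def port_gid(port_guid_hex: str) -> str:
    """Return the 16-byte GID as colon-separated hex bytes."""
    guid = bytes.fromhex(port_guid_hex)
    if len(guid) != 8:
        raise ValueError("Port GUID must be exactly 8 bytes")
    return ":".join(f"{byte:02x}" for byte in SUBNET_PREFIX + guid)


if __name__ == "__main__":
    # Third GUID in the flint output, assumed to belong to Port 2.
    print(port_gid("0002c90200231392"))
```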
156. ges below echo add gt proc net accl policy Usage Add Remove rule add remove lt app gt tcp connect tcp accept udp bind a b c d lt n gt lt x gt lt y gt If running the rule as follow echo add vperf tcp connect 10 4 0 0 16 2000 2100 gt proc net accl policy the acceleration is applied on sockets of vperf application that connect to TCP ports 2000 2100 on destination IPs 10 4 Remove all the rules reset Removes all the rules at once For example hostl1 echo reset gt proc net accl policy Print all the policy rules to dmesg host1 cat proc net accl policy Mellanox Technologies 91 J Mellanox Technologies Confidential 1 5 2 2 1 0 1 1 1000 Driver Features 3 11 4 Kernel Space Socket Accelaration Debug The Socket Accelaration kernel module can log detailed trace information if you enable it using the mlx4 accl dbg and mlx4 accl sys _ dbg variables variable in the sysfs filesystem The fol lowing command performs this host1 echo 1 gt sys module mlx4 accl parameters mlx4 accl dbg host1 echo 1 gt sys module mlx4 accl sys parameters mlx4 accl sys dbg Turning off kernel debug is done by setting the sysfs variable to zero using the following com mand host1 echo 0 gt sys module mlx4 accl parameters mlx4 accl dbg host1 echo 0 gt sys module mlx4 accl sys parameters mlx4 accl sys dbg To display debug information use the dmesg command host1
157. gured to remove the vNic, or if the connection between the host and BridgeX is lost, the vNic interface will disappear (running ifconfig will not display the interface). Similar to host-administered vNics, a network-administered vNic resides on a specific vHub. For further information on how to configure a network-administered vNic, please refer to BridgeX documentation.
To disable network-administered vNics on the host side, load the mlx4_vnic module with the net_admin module parameter set to 0.
3.7.2.3 VLAN Configuration
A vNic instance is associated with a specific vHub group. This vHub group is connected to a BridgeX external port and has a VLAN tag attribute. When creating/configuring a vNic, you define the VLAN tag it will use via the vid or the VNICVLAN fields (if these fields are absent, the vNic will not have a VLAN tag). The vNic's VLAN tag will be present in all EoIB packets sent by the vNics and will be verified on all packets received on the vNic. When passed from the InfiniBand to Ethernet, the EoIB encapsulation will be disassembled but the VLAN tag will remain.
For example, if the vNic "eth23" is associated with a vHub that uses BridgeX "bridge01", eport "A10" and VLAN tag 8, all incoming and outgoing traffic on eth23 will use a VLAN tag of 8. This will be enforced by both BridgeX and destination hosts. When a packet is passed from the internal fabric to the Ethernet subnet through the BridgeX, it will have a "true" Ethernet VLAN tag of 8
158. guring IPoIB 0 0 0 0 ooo moon 82 A Seleucia bebe lan 82 3 59 41 Created SUDNICE irc edi das da ee i 83 3 50 42 Removing a Subinterface si bll SRI AA EI REM quere dq 83 3 8 5 Verifying IPoIB Functionality o oooooooooo eee 84 2100 Bondme POB ts a A Ad 84 3 9 Quality of Service 85 59 Quality of serate UNCIVISW dr A nas 85 3 9 2 SN A II Ple obese 86 3 93 Supported POLE da il fas es ies 87 35 94 OMA Pear nadadora ce bb dd 88 SORE SPOR cot iia oil a a a ia 88 S P A A A Pe A eee 88 RO Ideali aia ek bat da I 88 SOA V SP esL iLustacenrcASEUL LI Lalla ela 88 3 79 59 Open M TOI OS ittici A Bee nete aerae P og Bera dices sa 89 3 10 Atomic Operations 89 3 10 1 Enhanced Atomic Operations 0 0 0 0 000 ccc cc eee eee e eens 89 3 10 1 1 Masked Compare and Swap MskCmpSwap 0 0 0 0 eee ee eee 89 3 10 1 2 Masked Fetch and Add MFetchAdd 0 0 0 00 cee eee 90 3 11 Socket Acceleration 91 SN NES Sn eds A a a ai 91 11 2 Software Dependencies e 2er tr ER eR e dd e pl NU eR XS 91 3 11 3 MLXA Socket Acceleration Module Configuration 91 3 11 4 Kernel Space Socket Accelaration Debug oooooooooonoo eee 92 Chapter 4 Working With Viele caia Ad 93 4 1 Port Type Management
159. gy change has been found. Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.
7.2.4 Running opensm
The defaults of opensm were designed to meet the common-case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
To run opensm in the default mode, simply enter:
    host1# opensm
Note that opensm needs to be run on at least one machine in an IB subnet.
By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.
Note: If a fatal, non-recoverable error occurs, opensm exits.
7.2.4.1 Running OpenSM As Daemon
OpenSM can also run as a daemon. To run OpenSM in this mode, enter:
    host1# /etc/init.d/opensmd start
7.3 osmtest Description
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports
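The "SUBNET UP" check described for the two log files lends itself to a small helper. The sketch below works on log text passed in as strings, so it makes no assumption about file permissions or log rotation; the function names are hypothetical, and a plain substring search is assumed to be sufficient.

```python
from typing import Dict, List


def subnet_is_up(log_text: str) -> bool:
    """True if the log text contains the SUBNET UP message."""
    return "SUBNET UP" in log_text


def logs_missing_subnet_up(logs: Dict[str, str]) -> List[str]:
    """Return the names of the logs that do not report SUBNET UP."""
    return [name for name, text in logs.items() if not subnet_is_up(text)]


if __name__ == "__main__":
    sample = {
        "/var/log/messages": "... OpenSM: SUBNET UP ...",
        "/var/log/opensm.log": "... sweep in progress ...",
    }
    print(logs_missing_subnet_up(sample))  # ['/var/log/opensm.log']
```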
160. he IB HCA and Switches hardware support 4K MTU.
2. Configure the Mellanox low-level driver to support 4K MTU: add the mlx4_core module parameter set_4k_mtu=1.
3. Change the MTU value of the vNic; for example, run:
    ifconfig eth2 mtu 4038
Note: Due to EoIB protocol overhead, the maximum MTU value that can be set for the vNic interface is 4038 bytes.
3.7.4 Advanced EoIB Settings
3.7.4.1 Module Parameters
The mlx4_vnic driver supports the following module parameters. These parameters are intended to enable more specific configuration of the mlx4_vnic driver to customer needs. The mlx4_vnic driver is also affected by module parameters of other modules, such as set_4k_mtu of mlx4_core. These modules are not addressed in this section.
The available module parameters include:
• tx_rings_num: Number of TX rings, use 0 for #cores [default 0, max 16]
• tx_rings_len: Length of TX rings, must be power of two [default 1024, max 8K]
• rx_rings_num: Number of RX rings, use 0 for #cores [default 0, max 16]
• rx_rings_len: Length of RX rings, must be power of two [default 2048, max 8K]
• vnic_net_admin: Network administration enabled [default 1]
• eport_state_enforce: Bring vNic up only when corresponding External Port is up [default 0]
For the list and description of all module parameters, run:
    mlx4_vnic_info -I
To check the current module parameters, run:
    mlx4_vnic_info -P
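The constraints on the ring parameters (length must be a power of two with a stated maximum, 0 rings meaning "one per core") can be checked before loading the module. The helper names below are hypothetical, "8K" is read as 8192, and whether the driver caps the per-core default at 16 rings is not stated here, so the sketch does not cap it.

```python
def validate_ring_len(value: int, max_len: int = 8192) -> int:
    """Check a tx/rx ring length: positive power of two, at most max_len."""
    if value <= 0 or value & (value - 1) != 0:
        raise ValueError(f"ring length {value} is not a power of two")
    if value > max_len:
        raise ValueError(f"ring length {value} exceeds maximum {max_len}")
    return value


def resolve_ring_count(requested: int, cores: int, max_rings: int = 16) -> int:
    """0 means 'use the number of cores'; otherwise accept 1..max_rings."""
    if not 0 <= requested <= max_rings:
        raise ValueError(f"ring count must be between 0 and {max_rings}")
    # No cap is applied to the per-core default; the manual does not say.
    return cores if requested == 0 else requested


if __name__ == "__main__":
    print(validate_ring_len(1024))     # default TX ring length
    print(resolve_ring_count(0, 8))    # 0 resolves to the core count
```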
161. he OS and can be listed using the command:
    modinfo mlx4_en
To set non-default values to module parameters, the following line should be added to the file /etc/modprobe.conf:
    options mlx4_en <param_name>=<value> <param_name>=<value>
5 Performance
5.1 General System Configurations
The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.
5.1.1 PCI Express (PCIe) Capabilities
Table 5 - Recommended PCIe Configuration (rows: PCIe Generation, Speed, Width, Max Payload size, Max Read Request)
Note: For VPI/Ethernet adapters with ports configured to run 40Gb/s or above, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the system.
5.1.2 BIOS Power Management Settings
• Set BIOS power management to Maximum Performance.
• On Intel processors only: disable C-states of PCI Express.
Note that these performance optimizations may result in higher power consumption.
5.1.3 Intel Hyper-Threading Technology
Note: This section applies to Intel processors only supporting Hyper-Threading.
For latency and message rate sensitive applications, it is recommended to disable Hyper-Thread i
162. he SRP Target resides in an I/O unit and provides storage services. See Chapter 3.6, "SCSI RDMA Protocol" and Appendix B, "SRP Target Driver".
uDAPL
User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative. This release of the uDAPL reference implementation package for both DAT 1.2 and 2.0 specification is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack. For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org
1.4.6 MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:
• Open MPI: an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University
Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.
1.4.7 InfiniBand Subnet Manager
All InfiniBand-compliant ULPs require
163. he combination of a specific BridgeX box and a specific eport is referred to as a gateway. The gateway is an entity that is visible to the EoIB host driver and is used in the configuration of the network interfaces on the host side. For example, in the host-administered vNics, the user will request to open an interface on a specific gateway, identifying it by the BridgeX box and eport name. Distinguishing between gateways is essential because they determine the network topology and affect the path that a packet traverses between hosts. A packet that is sent from the host on a specific EoIB interface will be routed to the Ethernet subnet through a specific external port connection on the BridgeX box.
3.7.1.2 Virtual Hubs (vHubs)
Virtual hubs connect zero or more EoIB interfaces (on internal hosts) and an eport through a virtual hub. Each vHub has a unique virtual LAN (VLAN) ID. Virtual hub participants can send packets to one another directly, without the assistance of the Ethernet subnet (external side) routing. This means that two EoIB interfaces on the same vHub will communicate solely using the InfiniBand fabric. EoIB interfaces residing on two different vHubs (whether on the same gateway or not) cannot communicate directly.
There are two types of vHubs:
• a default vHub (one per gateway) without a VLAN ID
• vHubs with unique, different VLAN IDs
164. he device mlx4_0 and print user-available information for its Port 2:
[ibv_devinfo output; legible fields: hca_id: mlx4_0; vendor_part_id: 25418; hw_ver: 0xA0; board_id: MT_04A0140005; port: 2; max_mtu: 2048 (4); active_mtu: 2048 (4); port_lmc: 0x00]
8.8 ibdev2netdev
ibdev2netdev enables association between IB devices and ports and the associated net device. Additionally, it reports the state of the net device link.
8.8.1 SYNOPSIS
    ibdev2netdev [-v] [-h]
OPTIONS
    -v  Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
    -h  Print help messages.
Example:
    sw417:BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw … port 1 (ACTIVE) ==> eth5 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw … port 1 (ACTIVE) ==> ib0 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw … port 2 (DOWN) ==> ib1 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 1 (DOWN) ==> eth2 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN) ==> eth3 (Down)
    sw417:BXOFED-1.5.2-20101128-1524 #
165. he same remote switch are referenced as 'port group'.
2. List of compute nodes (CNs) can be specified by the '-u' or '--cn_guid_file' OpenSM options.
The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.
7.5.4.1 Routing between non-CN Nodes
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2, and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2, and N3. Such routes would require to use at least one of the switches the wrong way around.
[Diagram: Spine1, Spine2, and Spine3 at the top; N1, N2, and N3 attached at switch level; going down to compute nodes]
To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure f
166. heck/s>] [-load_db <db_file>]
OPTIONS
    -c <count>           Min number of packets to be sent across each link (default = 10)
    -v                   Enable verbose mode
    -r                   Provides a report of the fabric qualities
    -t <topo-file>       Specifies the topology file name
    -s <sys-name>        Specifies the local system name. Meaningful only if a topology file is specified
    -i <dev-index>       Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
    -p <port-num>        Specifies the local device's port num used to connect to the IB fabric
    -o <out-dir>         Specifies the directory where the output files will be placed (default = /tmp)
    -lw <1x|4x|12x>      Specifies the expected link width
    -ls <2.5|5|10>       Specifies the expected link speed
    -pm                  Dump all the fabric links, pm Counters into ibdiagnet.pm
    -pc                  Reset all the fabric links pmCounters
    -P <PM=<Trash>>      If any of the provided pm is greater than its provided value, print it to screen
    -skip <skip-option(s)>  Skip the executions of the selected checks. Skip options (one or more can be specified): dup_guids, zero_guids, pm, logical_state, part, ipoib, all
    -wt <file-name>      Write out the discovered topology into the given file. This flag is useful if you later want to check fo
167. here options are:
destination
    Send log messages to the specified destination:
    stderr: forward messages to the STDERR
    syslog: send messages to the syslog service
    file <filename>: write messages to the file /var/log/<filename> for root. For a regular user, write to /tmp/<filename>.<uid> if filename is not specified as a full path; otherwise, write to <path>/<filename>.<uid>
min-level
    Verbosity level of the log:
    9: print errors only
    8: print warnings
    7: print connect and listen summary (useful for tracking SDP usage)
    4: print positive match summary (useful for config file debug)
    3: print negative match summary (useful for config file debug)
    2: print function calls and return values
    1: print debug messages
Examples:
To print SDP usage per connect and listen to STDERR, include the following statement:
    log min-level 7 destination stderr
A non-root user can configure libsdp to record function calls and return values in the file /tmp/libsdp.log.<pid> (root log goes to /var/log/libsdp.log for this example) by including the following statement in libsdp.conf:
    log min-level 2 destination file libsdp.log
To print errors only to syslog, include the following statement:
    log min-level 9 destination syslog
To print maximum output to the file /tmp/sdp_debug.log.<pi
168. ibsdp.conf and always use SDP.
3.5.5 Converting Socket-based Applications
You can convert a socket-based application to use SDP instead of TCP in an automatic (also called transparent) mode or in an explicit (also called non-transparent) mode.
Automatic (Transparent) Conversion
The libsdp.conf configuration (policy) file is used to control the automatic transparent replacement of TCP sockets with SDP sockets. In this mode, socket streams are converted based upon a destination port, a listening port, or a program name.
Socket control statements in libsdp.conf allow the user to specify when libsdp should replace AF_INET/SOCK_STREAM sockets with AF_SDP/SOCK_STREAM sockets. Each control statement specifies a matching rule that applies if all its subexpressions evaluate as true (logical and).
The use statement controls which type of sockets to open. The format of a use statement is as follows:
    use <address-family> <role> <program name> <address>:<port range>
where <address-family> can be one of:
    sdp: for specifying when an SDP socket should be used
    tcp: for specifying when an SDP socket should not be matched
    both: for specifying when both SDP and AF_INET sockets should be used
Note that the semantics of both differ for server and client roles. For server, it means that the server will be listening on both SDP and TCP sockets. For client, the connect function will first attempt to use SDP and will silently
169. ibution or another vendor's commercial stack
• Install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
• Identify the currently installed InfiniBand HCAs and perform the required firmware updates
1.3.2 Software Components
MLNX_OFED_LINUX contains the following software components:
• Mellanox Host Channel Adapter Drivers
    • mthca (IB only)
    • mlx4 (VPI), which is split into multiple modules:
        - mlx4_core (low-level helper)
        - mlx4_ib (IB)
        - mlx4_en (Ethernet)
        - mlx4_fc (FCoE)
        - mlx4_vnic (EoIB)
• Mid-layer core
    • Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
    • IPoIB, RDS, SDP, SRP Initiator
• MPI
    • Open MPI stack supporting the InfiniBand, RoCE, and Ethernet interfaces
    • OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces
    • MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
• OpenSM: InfiniBand Subnet Manager
• Utilities
    • Diagnostic tools
    • Performance tests
• Firmware tools (MFT)
• Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
• QIB
    • Low-level driver implementation for all QLogic InfiniPath PCI Express HCAs. This driver was not tested by Mellanox Technologies.
• CXGB3
    • Provides RDMA and NIC support for the Chelsio S series adapters. This driver was not tested by Mellanox Technologies.
• NES
    • Support for the NetEffect Ethernet Cluster Server Adapters
170. ic Utilities
1. On the command line, specify the file name using the option -t <topology file name>
2. Define the environment variable IBDIAG_TOPO_FILE
To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option -s <local system name>
2. Define the environment variable IBDIAG_SYS_NAME
8.2.2 IB Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option -p <local port number> (see below)
2. Define the environment variable IBDIAG_PORT_NUM
If more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the option -i <index of local device>
2. Define the environment variable IBDIAG_DEV_IDX
8.2.3 Addressing
Note: This section applies to the ibdiagpath tool only.
A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports
171. ic for which torus-2QoS can generate credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, consider that same 2D 6x5 torus with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:
[Figure: master spanning tree for the 2D 6x5 torus with the link from (2,2) to (3,2) failed]
Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing. In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning t
172. ic's BridgeX box, eport, and VLAN id properties. The mlx4_vnic_confd service is used to read these configuration files and pass the relevant data to the mlx4_vnic module. EoIB Host Administered vNic supports two forms of configuration files:
• Central Configuration File: /etc/infiniband/mlx4_vnic.conf
• vNic Specific Configuration Files: ifcfg-ethX
Both forms of configuration supply the same functionality. If both forms of configuration files exist, the central configuration file has precedence and only this file will be used.
Central Configuration File: /etc/infiniband/mlx4_vnic.conf
The mlx4_vnic.conf file consists of lines, each describing one vNic. The following file format is used:
    name=eth47 mac=00:25:8B:27:16:84 ib_port=mlx4_0:1 vid=2 vnic_id=7 bx=BX001 eport=A11
The fields used in the file have the following meaning:
Table 2 - mlx4_vnic.conf file format
Field      Description
name       The name of the interface that is displayed when running ifconfig.
mac        The MAC address to assign to the vNic.
ib_port    The device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value of 1 or 2.
vid        VLAN ID
173. [ibstatus example output: a first port with state 4: ACTIVE and rate 20 Gb/sec (4X DDR), followed by a second InfiniBand device port with base lid 0x0 and rate 10 Gb/sec (4X)]
2. List the status of specific ports of specific devices.
[ibstatus example output for the selected ports: base lid 0x0, sm lid 0x0, state, phys state, rate 10 Gb/sec (4X), followed by a second port entry with default gid and base lid]
8.10 ibportstate
Applicable Hardware: All InfiniBand devices.
Description: Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port. If the queried port is a switch port, then ibportstate can be used to:
• disable, enable, or reset the port
• validate the port's link width and speed against the peer port
Synopsis:
    ibportstate [-d] [-e] [-v] [-V] [-D] [-G] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] <dest dr_path|lid|guid> <portnum> <op> <va
174. Step 16. In the "Optional Kernel Command Line Parameter" field, append the following string to the end of the line: "ibft_mode=off" (include a space before the string). Click OK and then Finish to apply the change.
[Screenshot: YaST Boot Loader Settings, Section Management dialog for the section "SUSE Linux Enterprise Server 10 SP2"; the Optional Kernel Command Line Parameter field ends with "showopts ibft_mode=off"; other fields: Kernel Image, Initial RAM Disk, Root Device, Vga Mode]
Step 17. If you wish to change additional settings, click the appropriate item and per
175. ile for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf. The value indicates a hexadecimal number.
    interface "ib1" {
        send dhcp-client-identifier yO Ot oo tA duces soe OU DOO sco Mss t 9 2925
    }
In order to use the configuration file, run:
    host1# dhclient -cf dhclient.conf ib1
3.8.3.2 Static IPoIB Configuration
If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the -n option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
• A static IPoIB configuration
• An IPoIB configuration based on an Ethernet configuration
Note: See your Linux distribution documentation for additional information about configuring IP addresses.
The following code lines are an excerpt from a sample IPoIB configuration file:
    # Static settings; all values provided by this file
    IPADDR_ib0=11.4.3.175
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on eth0; each '*' will be replaced with a corresponding octet from eth0.
    LAN_INTERFACE_ib0=eth0
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on the
176. iled Example
3.1.10 Configuring DAPL Over RoCE
3.2 GPUDirect
3.2.1 GPUDirect Overview
3.2.2 GPUDirect Installation
Installation Instructions
3.2.2.1 Enabling GPUDirect
3.2.2.2 How to Know GPUDirect Is Working
3.3 Fibre Channel over Ethernet
3.3.1 FCoE Overview
3.3.2 FCoE Basic Usage
3.3.2.1 FCoE Configuration
3.3.2.2 Starting FCoE Service
3.3.2.3 Stopping FCoE Services
3.3.2.4 Enabling/Disabling FCoE and FCoIB Services
3.3.3 FCoE Advanced Usage
3.3.3.1 Manual vHBA Control
3.3.3.2 Creating vHBAs That Use PFC
3.3.3.3 Creating vHBAs That Use Link Pause
3.4 Reliable Datagram Sockets
3.4.1 Overview
3.4.2 RDS Configuration
3.5 Sockets Direct Proto
177. in which case they shouldn't have UP-going ports at all
• Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches
• Switches of the same rank should have the same number of ports in each UP-going port group
• Switches of the same rank should have the same number of ports in each DOWN-going port group
• All the CAs have to be at the same tree level (rank)
If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules:
• Tree rank should be between two and eight (inclusive)
• All the Compute Nodes have to be at the same tree level (rank). Note that non-compute-node CAs are allowed here to be at different tree ranks.
Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree.
Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.
1. Ports that are connected to t
178. [Screenshot: YaST Installation Settings summary for SUSE Linux Enterprise Server 10 SP2, showing the software selection and the Booting section (Boot Loader Type: GRUB; Location: /dev/sda2 (/boot); Sections: SUSE Linux Enterprise Server 10 SP2 (default), Floppy, Failsafe)]
Step 15. Click Edit in the Boot Loader Settings window.
[Screenshot: YaST Boot Loader Settings, Section Management tab, listing the sections SUSE Linux Enterprise Server 10 SP2, Floppy, and Failsafe, with the buttons Add, Edit, Delete, and Set as Default]
179. ing that would be passed to the plugin(s):
    event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>
2. Run the SM with the new options file:
    opensm -F <options-file-name>
For a list of examples of the CC Manager options file with all the default values, see "Congestion Control Manager Options File" on page 148.
Note: Once Congestion Control is enabled on the fabric nodes, you need to actively turn it off to disable it completely. Running the SM without the CC Manager is not sufficient, as the hardware still continues to function in accordance with the previous CC configuration. To turn it off, set "enable" to FALSE in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.
7.9.2.1 Congestion Control Manager Options File
Table 7 - Congestion Control Manager General Options File
Option       Description                                                                                   Values
enable       Enables/disables the Congestion Control mechanism on the fabric nodes.                       Values: <TRUE | FALSE>. Default: True
cc_key       Congestion Control Key.                                                                       Default: 0
num_hosts    Indicates the number of nodes. The CC table values are calculated based on this number.       Values: [0-48K]. Default: 0 (based on the CCT calculation on the current subnet size)
Table 8 - Congestion Control Manager Switch Options File
thre
180. ink down to an existing vHBA (e.g., on eth3), run:
    #> echo eth3 > $FCSYSFS/link_down
3.3.3.2 Creating vHBAs That Use PFC
To create a vHBA that uses the PFC feature, it is required to configure the Ethernet driver to support PFC, create a VLAN Ethernet interface, assign it a priority, and start a vHBA on the interface. The following steps demonstrate the creation of such a vHBA.
To configure the mlx4_en Ethernet driver to support PFC, add the following line to the file /etc/modprobe.conf and restart the network driver:
    options mlx4_en pfctx=0xff pfcrx=0xff
To create a VLAN with an ID (e.g., 55) on an interface (e.g., eth3), run:
    #> vconfig add eth3 55
    #> ifconfig eth3.55 up
To map skb priority 0 to the requested VLAN priority (e.g., 6), run:
    #> vconfig set_egress_map eth3.55 0 6
To create the vHBA, enter:
    #> echo eth3.55 > $FCSYSFS/create
3.3.3.3 Creating vHBAs That Use Link Pause
The mlx4_en Ethernet driver supports link pause by default. To change this setting, you can use the following command:
    #> ethtool -A eth<x> [rx on|off] [tx on|off]
To create a vHBA, run:
    #> echo eth3.55 > $FCSYSFS/create
3.4 Reliable Datagram Sockets
3.4.1 Overview
Reliable Datagram Sockets (RDS) is a socket API that provides reli
181. ious flags of the command.
Table 23 - ibdump Options
Flag                           Optional/Mandatory   Default If Not Specified   Description
-d, --ib-dev=<dev>             Optional             First device found         Use IB device <dev>.
-b, --max-burst=<log2 burst>   Optional             12 (4096 entries)          log2 of the maximal burst size that can be captured with no packet loss. Each entry takes ~MTU bytes of memory.
--mem-mode <size>              Optional                                        When specified, packets are written to the dump file only after the capture is stopped. It is faster than the default mode (less chance for packet loss), but it uses more memory. In this mode, ibdump stops after <size> bytes are captured.
--decap                        Optional                                        Decapsulate port mirroring headers. Should be used ...
Examples:
1. Run ibdump
[ibdump startup output listing the IB device, the dump file, and capture settings, ending with: MR was registered with addr 0x60d850 ... rkey 0x28042601 flags 0x1]
Appendix A: Mellanox FlexBoot
A.1 Overview
Mellanox FlexBoot is a multiprotocol r
182. ipt
    host1# /mnt/mlnxofedinstall
    This program will install the MLNX_OFED_LINUX package on your machine.
    Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to continue? [y/N]: y
    Uninstalling the previous version of OFED
    Starting MLNX_OFED_LINUX-<ver> installation ...
    Installing kernel-ib RPM
    Preparing...    ###########################################
    kernel-ib       ###########################################
    Installing kernel-ib-devel RPM
    Preparing...    ###########################################
    kernel-ib-devel ###########################################
    Installing kernel-mft RPM
    Preparing...    ###########################################
    kernel-mft      ###########################################
    Installing mpi-selector RPM
    Preparing...    ###########################################
    mpi-selector    ###########################################
    Install user level RPMs:
    Preparing...    ###########################################
    libibverbs      ###########################################
    libibumad       ###########################################
    libibverbs      ###########################################
    libibumad       ###########################################
    libibumad-devel ###########################################
    librdmacm       ###########################################
    opensm-libs     ###########################################
    librdmacm       ###########################################
    libibmad        ###########################################
183. ir -p /tmp/initrd_en/lib/modules/mlnx_en
    host1# cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1# cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
    host1# cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en
Step 5. To load the modules, you need the insmod executable. If you do not have it in your initrd, add it using the following command:
    host1# cp /sbin/insmod /tmp/initrd_en/sbin/
Step 6. If you plan to give your Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.
    host1# cp /sbin/ifconfig /tmp/initrd_en/sbin
Step 7. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded.
Warning: The order of the following commands (for loading modules) is critical.
    echo "loading Mellanox ConnectX EN driver"
    /sbin/insmod lib/modules/mlnx_en/mlx4_core.ko
    /sbin/insmod lib/modules/mlnx_en/mlx4_en.ko
Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.
Step 9. Save the init file.
Step 10. Close initrd.
    host1# cd /tmp/initrd_en
    host1# find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
    host1# gzip /tmp/new_initrd_en.img
At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it
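The module-loading lines from Step 7 can also be generated rather than typed by hand, which keeps the load order fixed. The following is a minimal sketch only: the function name emit_mlx4_en_load_lines and its optional directory argument are hypothetical, and the default path spelling /lib/modules/mlnx_en is an assumption based on the steps above.

```shell
#!/usr/bin/env bash
# Sketch: print the init-script lines that load the ConnectX EN modules.
# mlx4_core must be loaded before mlx4_en, so the order is hard-coded.
# The function name and the directory argument are hypothetical helpers,
# not part of MLNX_OFED.
emit_mlx4_en_load_lines() {
    local moddir="${1:-/lib/modules/mlnx_en}"   # assumed module directory inside the initrd
    local mod
    echo 'echo "loading Mellanox ConnectX EN driver"'
    for mod in mlx4_core mlx4_en; do
        echo "/sbin/insmod ${moddir}/${mod}.ko"
    done
}
```

The output is plain text; it still has to be pasted into the init file at the point where the Ethernet driver should be loaded, since appending it to the end of init may be too late in the boot sequence.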
184. irect access
        bus:dev.fn=02:00.0 bar=0xdef00000 size=0x100000
        Chip revision is: A0
    /dev/mst/mt25418_pci_msix0 - PCI direct access
        bus:dev.fn=02:00.0 bar=0xdeefe000 size=0x2000
    /dev/mst/mt25418_pci_uar0 - PCI direct access
        bus:dev.fn=02:00.0 bar=0xdc800000 size=0x800000
2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.
Step 3. Burn firmware.
1. Burning a firmware binary image using mstflint (that is already installed on your machine). Please refer to MSTFLINT_README.txt under docs/.
2. Burning a firmware image from a .mlx file using the mlxburn utility (that is already installed on your machine). The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2.
    host1$ mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408/fw-25408-rel.mlx
Warning: Make sure that you have the correct device name, firmware path, and firmware file name before running this command. For help, please refer to the Mellanox Firmware Tools (MFT) User's Manual under /mnt/docs/.
Step 4. Reboot your machine after the firmware burning is completed.
2.5 Uninstalling Mellanox OFED
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofe
185. is the default (0 or none) in SLES11.
For a persistent bonding IPoIB network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:
• In the bonding master configuration file (e.g., ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520
Note: 65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 3.8.2, "IPoIB Mode Setting," on page 79) and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do not set MTU at all, performance of the interface might decrease.
• In the bonding slave configuration file (e.g., ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
• In the bonding slave configuration file (e.g., ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key).
For RHEL users: In /etc/modprobe.conf, add the following lines:
    alias bond0 bonding
For SLES users: It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g., ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
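The MTU rule above (65520 when all IPoIB slaves run in Connected mode, 2044 for datagram mode) is easy to get wrong when writing ifcfg-bond0 by hand. The following is a minimal sketch of a helper that emits the MTU line for the bonding master file; the function names ipoib_bond_mtu and emit_bond_mtu_line are hypothetical and not part of MLNX_OFED.

```shell
#!/usr/bin/env bash
# Sketch: choose the bonding-master MTU from the IPoIB mode of the slaves.
# Function names are hypothetical; the two MTU values come from the text above.
ipoib_bond_mtu() {
    case "$1" in
        connected) echo 65520 ;;   # valid only if ALL slaves are in Connected mode
        datagram)  echo 2044 ;;
        *) echo "unknown IPoIB mode: $1" >&2; return 1 ;;
    esac
}

# Print the line to place in the bonding master file (e.g., ifcfg-bond0).
emit_bond_mtu_line() {
    local mtu
    mtu="$(ipoib_bond_mtu "$1")" || return 1
    printf 'MTU=%s\n' "$mtu"
}
```

For example, `emit_bond_mtu_line connected` prints the line to add to the master file; an unrecognized mode returns a non-zero status and prints nothing to standard output, so a typo does not silently produce an empty MTU setting.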
186. itiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from http://www.t10.org.
The SRP Initiator supports:
• Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
• Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
• Basic functionality, task management, and limited error handling
3.6.2.1 Loading SRP Initiator
To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
Note: When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).
3.6.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually establish an SRP connection between the Initiator and an SRP Target. Section 3.6.2.4 explains how to do this automatically.
• Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:
    echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port
187. Link Speed: 2.5Gb/s
    Installation finished successfully.
    Programming HCA firmware for /dev/mst/mt25418_pci_cr0 device
    Running: mlxburn -d /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408-<ver>/fw-25408-rel.mlx -dev_type 25408
    -I- Querying device ...
    -I- Using auto detected configuration file: /mnt/firmware/fw-25408-<ver>/MHGH28-XTC_A4-A7.ini (PSID = MT_04A0140005)
    -I- Generating image ...
    Current FW version on flash: 2.7.0
    New FW version: 2.8.0
    Burning FW image without signatures - OK
    Restoring signature - OK
    -I- Image burn completed successfully.
    Please reboot your system for the changes to take effect.
    warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
Note: In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following:
    Installation finished successfully.
    The firmware version 2.8.0000 is up to date.
Note: To force a firmware update, use the --force-fw-update flag.
Note: In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
    -I- Querying device ...
    -E- Can't auto detect fw configu
188. le:
    gPXE> ifopen net1
H.17.12.3 ifclose
Closes the network interface net<x>. The list of network interfaces is available via the ifstat command.
Example:
    gPXE> ifclose net1
H.17.12.4 autoboot
Starts the boot process from the device(s).
H.17.12.5 sanboot
Starts the boot process of an iSCSI target.
Example:
    gPXE> sanboot 3189051 211 44 52 742 1210 2U007 00 7 92 4 111419c0891D00E
H.17.12.6 echo
Echoes an environment variable.
Example:
    gPXE> echo ${root-path}
H.17.12.7 dhcp
Attempts to open the network interface, and then tries to connect to and communicate with the DHCP server to obtain the IP address and the filepath from which the boot will occur.
Example:
    gPXE> dhcp net1
H.17.12.8 help
Displays the available list of commands.
H.17.12.9 exit
Exits from the command line interface.
A.8 Diskless Machines
Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the (remote) kernel or initrd image must include and be configured to load that driver. This can be achieved either by compiling the HCA driver into the kernel, or by adding the device driver module into the initrd image and loading it.
H.17.13 Case I: InfiniBand Ports
The IB driver requires loading the following modules in the specified order (see Section H.17.13.1 for an example):
ib_add
189. le_qos: Enable Quality of Service support in the HCA if > 0 (default 0)
enable_pre_t11_mode: For FCoXX, enable pre-T11 mode if non-zero (default 0)
internal_err_reset: Reset device on internal errors if non-zero (default 1)
C.2 mlx4_ib Parameters
debug_level: Enable debug tracing if > 0 (default 0)
C.3 mlx4_en Parameters
inline_thold: Threshold for using inline data (default is 128)
tcp_rss: Enable RSS for incoming TCP traffic (default 1: enabled)
udp_rss: Enable RSS for incoming UDP traffic (default 1: enabled)
num_lro: Number of LRO sessions per ring, or disabled (0) (default is 32)
ip_reasm: Allow the assembly of fragmented IP packets (default 1: enabled)
pfctx: Priority-based Flow Control policy on TX[7:0]. Per-priority bit mask (default is 0)
pfcrx: Priority-based Flow Control policy on RX[7:0]. Per-priority bit mask (default is 0)
C.4 mlx4_fc Parameters
log_exch_per_vhba: Max outstanding FC exchanges per virtual HBA (log). Default = 9 (int)
max_vhba_per_port: Max vHBAs allowed per port. Default = 2 (int)
Appendix D: ib-bonding Driver for Systems using SLES10 SP3
D.1 Using the ib-bonding Driver
The ib-bonding driver is a High Availability solution for IPoIB interfaces. It is
190. Installer Privileges
The installation requires administrator privileges on the target machine.
2.2 Downloading Mellanox OFED
Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display. The following example shows a system with an installed Mellanox HCA:
    host1# lspci -v | grep Mellanox
    02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
Step 2. Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>.iso. You can download it from http://www.mellanox.com > Products > IB SW/Drivers.
Step 3. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page.
    host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
2.3 Installing Mellanox OFED
The installation script, mlnxofedinstall, performs the following:
• Discovers the currently installed kernel
• Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
• Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
• Identifies the currently installed InfiniBand and Ethernet network adapters and
191. ltering
• ifconfig up/down + mtu changes (up to 10K)
• Ethtool support
• Net device statistics
• CX4, QSFP, and SFP connectors
4.3.2 Loading the Ethernet Driver
By default, the Mellanox OFED stack loads mlx4_en. Run "ifconfig -a" to verify that the module is listed.
4.3.3 Unloading the Driver
If /etc/infiniband/openib.conf had MLX4_EN_LOAD=yes at driver start-up, then you can unload the mlx4_en driver by running:
    /etc/init.d/openibd stop
Otherwise, unload mlx4_en by running:
    #> modprobe -r mlx4_en
4.3.4 Ethernet Driver Usage and Configuration
• To assign an IP address to the interface, run:
    #> ifconfig eth<x> <ip>
  where 'x' is the OS-assigned interface number.
• To check driver and device information, run:
    #> ethtool -i eth<x>
  Example:
    #> ethtool -i eth2
    driver: mlx4_en (MT_04A0140005)
    version: 1.5.1 (March 2010)
    firmware-version: 2.7.000
    bus-info: 0000213200450
• To query stateless offload status, run:
    #> ethtool -k eth<x>
• To set stateless offload status, run:
    #> ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
• To query interrupt coalescing settings, run:
    #> ethtool -c eth<x>
• By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.
• To enable/disable ad
192. lue>
Table 16 lists the various flags of the command.
Table 16 - ibportstate Flags and Options
Flag                        Optional/Mandatory   Default If Not Specified   Description
-d(ebug)                    Optional                                        Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-e(rr_show)                 Optional                                        Show send and receive errors (timeouts and others).
-v(erbose)                  Optional                                        Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-D(irect)                   Optional                                        Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...).
-G(uid)                     Optional                                        Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-t <timeout_ms>             Optional                                        Override the default timeout for the solicited MADs [msec].
<dest dr_path|lid|guid>     Optional                                        Destination's directed path, LID, or GUID.
<portnum>                   Optional             0                          Destination's port number.
<op> <value>                Optional             query                      Define the allowed port operations: enable, disable, reset, speed, and query.
In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:
1. The first ACTIVE port that is fo
193. lun max sect
• A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 3.6.2.4.
3.6.2.4 Automatic Discovery and Connection to Targets
• Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
• To connect to all the existing Targets in the fabric, run "srp_daemon -e -o". This utility will scan the fabric once, connect to every Target it detects, and then exit.
Note: srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file.
• To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute "srp_daemon -e". This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric).
• To execute SRP daemon as a daemon, you may run "run_srp_daemon" (found under /usr/sbin/), providing it with the same options used for running srp_daemon.
Note: Make sure only one instance of run_srp_daemon runs per port.
• To execute SRP daemon as a daemon on all the ports, run "srp_daemon.sh" (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log.
• It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in /etc/infini
194. machine for the DHCP server:
    host host1 {
        next-server 11.4.3.7;
        filename "pxelinux.0";
        fixed-address 11.4.3.130;
        option dhcp-client-identifier = 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
    }
    A.3 Subnet Manager - OpenSM
    Note: This section applies to ports configured as InfiniBand only.
    FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM - Subnet Manager" on page 106.
    A.4 TFTP Server
    When you set the 'filename' parameter in your DHCP configuration file to a non-empty filename, the client will ask for this file to be passed through TFTP. For this reason, you need to install a TFTP server.
    A.5 BIOS Configuration
    The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add to the list of boot devices "MLNX FlexBoot <ver>" for a ConnectX device or "gPXE" for an InfiniHost III device. The priority of this list can be modified through BIOS setup.
    A.6 Operation
    H.17.8 Prerequisites
    • Make sure that your client is connected to the server(s)
    • The FlexBoot image is already programmed on the adapter card (see Section A.2)
    • Start the Subnet Manager as described in Section A.3
    • The DHCP s
195. may be superseded by a prefix
    value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID (all separated by colons and represented in hexadecimal digits).
    Extracting the Port GUID - Method I
    To obtain the port GUID, run the following commands.
    Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
        host1# mst start
        host1# mst status
    The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via the following query command:
        flint -d <MST_DEVICE_NAME> q
    Example with ConnectX-2 QDR (MHJH29B-XTR Dual 4X IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R6 HCA Card, CX4 Connectors) as the adapter device:
        host1# flint -d /dev/mst/mt26428_pci_cr0 q
        Image type:     ConnectX
        FW Version:     Z a O
        Device ID:      26428
        Chip Revision:  B0
        Description:    Node             Port1            Port2            Sys image
        GUIDs:          0002c90300001038 0002c90300001039 0002c9030000103a 0002c9030000103b
        MACs:           0002c9001039     0002c900103a
        Board ID:       n/a (MT_0D20110009)
        VSD:            n/a
        PSID:           MT_0D20110009
    Assuming that FlexBoot is connected via Port 1, then the Port GUID is 00:02:c9:03:00:00:10:39.
    Extracting the Port GUID - Method II
    An alternative method for obtaining th
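The rule described above (the fixed prefix followed by the 8-byte port GUID, all colon-separated) can be sketched as a small shell helper. This is an illustrative sketch only: the function names guid_to_colon_hex and build_client_id are hypothetical and are not part of MFT or FlexBoot. The prefix is the one quoted in the text, and the GUID is assumed to be given as 16 hexadecimal digits, as printed by the flint query.

```shell
#!/bin/sh
# Prefix quoted in the text above.
CLIENT_ID_PREFIX="ff:00:00:00:00:00:02:00:00:02:c9:00"

# guid_to_colon_hex GUID  (hypothetical helper)
# Accepts 16 hex digits, optionally prefixed with 0x, and prints them
# as 8 colon-separated lowercase bytes.
guid_to_colon_hex() {
    guid=$(printf '%s' "$1" | sed 's/^0[xX]//' | tr 'A-F' 'a-f')
    case "$guid" in
        ""|*[!0-9a-f]*) echo "invalid GUID: $1" >&2; return 1 ;;
    esac
    if [ "${#guid}" -ne 16 ]; then
        echo "GUID must be 16 hex digits: $1" >&2
        return 1
    fi
    # Insert a colon after every two digits, then drop the trailing one.
    printf '%s\n' "$guid" | sed 's/../&:/g; s/:$//'
}

# build_client_id GUID  (hypothetical helper)
# Prints the prefix followed by the colon-separated port GUID.
build_client_id() {
    bytes=$(guid_to_colon_hex "$1") || return 1
    printf '%s:%s\n' "$CLIENT_ID_PREFIX" "$bytes"
}
```

For the Port 1 GUID from the flint example, `build_client_id 0002c90300001039` prints the prefix followed by 00:02:c9:03:00:00:10:39, which is the value that would go into the DHCP server's host declaration.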
196. mmands Force clear the Flash semaphore on the device No command is allowed clear semaphor allowed when this switch is used e Warning May result in system instability or Flash corruption if the device or another application is currently using the Flash 1 mage burn verify Binary image file lt image gt burn query Run a quick query When specified mstflint will not perform full image integrity checks during the query operation This may shorten execution time when running over slow interfaces e g I2C MTUSB 1 nofs Burn image in a non failsafe manner Allow burning the firmware image without updating the invariant sector This is to ensure failsafe burning even when an invariant sector difference is detected Mellanox Technologies Two MACs must be specified here The specified MACs are assigned to portl and port2 repectively Note This switch is applicable only for Mellanox Technologies Ethernet products Burn the image with blank GUIDs and MACS where applicable These val ues can be set later using the sg command see Table 22 below guids burn s 4 GUIDs must be specified here The specified GUIDs are assigned the fol lt GUIDs gt lowing values repectively node portl port2 and system image GUID Note Port2 guid must be specified even for a single port HCA the HCA Mellanox Technologies Confidential Mellanox OFED for Linux User s Manual 1 5 2 2 1 0 1 1 1000 Table 21 mstflin
197. n the link state of the external port associated with the vNic interface.
    Note: If the link state of a host administrated vNic is down while the BridgeX is connected and the InfiniBand fabric appears to be functional, the issue might result from a misconfiguration of BXADDR and/or BXEPORT in the configuration file.
    To query the link state, run the following command and look for "Link detected":
        ethtool <interface name>
    Example:
        ethtool eth10
        Settings for eth10:
            Supported ports:
            Supported link modes:
            Supports auto-negotiation: No
            Advertised link modes: Not reported
            Advertised auto-negotiation: No
            Speed: Unknown! (10000)
            Duplex: Full
            Port: Twisted Pair
            PHYAD: 0
            Transceiver: internal
            Auto-negotiation: off
            Supports Wake-on: d
            Wake-on: d
            Current message level: 0x00000000 (0)
            Link detected: yes
    3.7.3.4 Bonding Driver
    EoIB uses the standard Linux bonding driver. For more information on the Linux Bonding driver, please refer to <kernel-source>/Documentation/networking/bonding.txt. Currently not all bonding modes are supported (e.g., LACP is not supported).
    3.7.3.5 Jumbo Frames
    EoIB supports jumbo frames up to the InfiniBand limit of 4K bytes. The default Maximum Transmit Unit (MTU) for EoIB driver is 1500 bytes. To configure EoIB to work with jumbo frames:
    1. Make sure that t
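The "look for Link detected" step in the ethtool example above can be scripted. The sketch below is illustrative only: the function name link_detected is hypothetical, and it assumes the standard ethtool output layout in which the field is printed as "Link detected: yes" or "Link detected: no".

```shell
#!/bin/sh
# link_detected  (hypothetical helper)
# Reads `ethtool <interface>` output on stdin and prints the value of
# the "Link detected" field. Prints nothing if the field is absent.
link_detected() {
    sed -n 's/^[[:space:]]*Link detected:[[:space:]]*//p'
}
```

A possible use is `[ "$(ethtool eth10 | link_detected)" = "yes" ] || echo "eth10: link is down"`, where eth10 is the interface name from the example.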
198. n IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:
    • For RedHat: BOOTPROTO=dhcp
    • For SLES: BOOTPROTO=dhcp
    Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
        /etc/sysconfig/network-scripts/ on a RedHat machine
        /etc/sysconfig/network/ on a SuSE machine
    Note: A patch for DHCP is required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory.
    Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
    The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier (see Section H.17.7, "Configuring the DHCP Server", on page 192).
    DHCP Ser
199. nager as it may get timeouts on the AR related queries to these switches To run AR Manager with an AR configuration file enter 7 Opensm eedr ssd COn io Elle pata TO Tle Currently there are two options in the config file 1 Enable disable AR on fabric switches by including the following line to the AR configuration file enable lt truelfalse gt where the default value is true which is also valid for cases when the AR config file is not provided This option 1s different from the OpenSM command line option ar The former controls AR on fabric switches while the latter specifies whether AR Manager in OpenSM should be launched or not Note that once AR is enabled you will need to actively turn it off in order to disable it To turn it off set enable to false in the AR configuration file and run OpenSM as follows OPensm sean c Contig ta le path VO tle 2 AR Mode In the configuration file set ar mode bounded free where the default value is bounded 7 8 2 1 AR Configuration File Example The following is an example of AR configuration file content 146 Mellanox Technologies Mellanox Technologies Confidential Mellanox OFED for Linux User s Manual 1 5 2 2 1 0 1 1 1000 Begin AR configuration file enable true ar mode bounded End AR configuration file The above file has options with default values which 1s equivalent to not having the AR configuration file at all partB 0x8002
200. nal, and the only mandatory part of the policy file is a default QoS Level. Here's an example of the shortest policy file:
        qos-levels
            qos-level
                name: DEFAULT
                sl: 0
            end-qos-level
        end-qos-levels
    The port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, and there is no need to define them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.
    The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the example; they explain the different keywords and their meaning.
        #
        # Port Groups
        #
        port-groups
            port-group # using port GUIDs
                name: Storage
                # "use" is just a description that is used for logging
                # Other than that, it is just a comment
                use: SRP Targets
                port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
                port-guid: 0x1000000000FFFF
            end-port-group
            port-group
                name: Virtual Servers
                # The syntax of the port name is as follows:
                #   "node_description/Pnum".
                # node_description is compared to the NodeDescription of the node,
                # and "Pnum" is a port number on that node.
                port-name: vs1 HCA-1/P1, vs2 HCA-1/
201. nd provide direct access for the latter to the buffers allocated by the NVIDIA CUDA library, hence providing Zero Copy of data and better performance.
    The buffers allocated by the CUDA library are managed by the NVIDIA driver, which provides the ability to mark pages that are shared so that the Kernel MM can allow the Mellanox driver to access them and use them for transportation without the need to copy or re-pin. Mellanox driver modifications enable the driver to query the memory and to share it with the NVIDIA Tesla driver using the new Linux Kernel MM API. Additionally, it supports callbacks that allow other drivers sharing the memory to notify it of any run-time changes in the state of the shared buffers, so that the driver can use the memory accordingly and avoid invalid access to any shared pinned buffers.
    For further information, see MLNX_OFED_for_Linux_GPUDirect_Release_Notes.txt.
    3.2.2 GPUDirect Installation
    Mellanox and NVIDIA's GPUDirect solution is composed of several software components. All components must be installed on your machine in the following order to enable GPUDirect support:
    1. Special kernel that enables memory sharing between drivers
    2. NVIDIA's CUDA driver
    3. Mellanox driver (MLNX_OFED)
    GPUDirect Installation Instructions
    1. Download and extract nvidia gpudi
202. nd was aborted because firmware is current Examples 1 Find Mellanox Technologies s ConnectX VPI cards with PCI Express running at 2 5GT s and InfiniBand ports at DDR or Ethernet ports at 10GigE Ae 04 00 0 InfiniBand Mellanox Technologies MT25418 Con nectX IB DDR PCIe 2 0 2 5GT s rev a0 In the example above 15b3 is Mellanox Technologies s vendor number in hexadecimal and 634a is the devices PCI Device ID in hexadecimal The number string 04 00 0 identifies the device in the form bus dev fn Note The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http pci ids ucw cz read PC 15b3 1 Verify the ConnectX firmware using its ID using the results of the example above 184 Mellanox Technologies Mellanox Technologies Confidential Mellanox OFED for Linux User s Manual 1 5 2 2 1 0 1 1 1000 MS eno 0 OCA ConnectX failsafe image Start address 80000 Chunk size 80000 NOTE The addresses below are contiguous logical addresses Physical addresses on flash may be different based on the image start address and chunk size OASIS A BOO O AUS EONO OA 0008 3000 NA SS O A OO SO Sis als oe e OJO OK 1O 0000n 22 USOS SOI CUT OK 102 00 0055 0200000542680 0 00 2497 image like Ok ANS ESO ASS a A Ae CIPIDES MSS TS 10230000063 0 OSO DOS td E ADDR OE 0200 00 290000040 OSOS 2005 DDR OK NOG Ae MO OO SS cue MIO dL T ecd I O E 150290100 5 920
203. ng
    5.2 Performance Tuning for Linux
    You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results are significantly dependent on the CPU and chipset efficiency.
    5.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
    The following changes are recommended for improving IPv4 traffic performance:
    • Disable the TCP timestamps option for better CPU utilization:
        sysctl -w net.ipv4.tcp_timestamps=0
    • Disable the TCP selective acks option for better CPU utilization:
        sysctl -w net.ipv4.tcp_sack=0
    • Increase the maximum length of processor input queues:
        sysctl -w net.core.netdev_max_backlog=230000
    • Increase the TCP maximum and default buffer sizes using setsockopt():
        sysctl -w net.core.rmem_max=16777216
        sysctl -w net.core.wmem_max=16777216
        sysctl -w net.core.rmem_default=16777216
        sysctl -w net.core.wmem_default=16777216
        sysctl -w net.core.optmem_max=16777216
    • Increase memory thresholds to prevent packet dropping:
        sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
    • Increase Linux's auto-tuning of TCP buffer limits. The minimum, default, and maximum number of
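Because the effect of these settings depends on the system, it can help to review the full command list before applying anything. The following is a minimal dry-run sketch: the function names are hypothetical, the key/value pairs are the ones recommended in the list above, and nothing is written to the kernel.

```shell
#!/bin/sh
# recommended_ipv4_settings  (hypothetical helper)
# Prints the key=value pairs from the list above, one per line.
recommended_ipv4_settings() {
    cat <<'EOF'
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.core.netdev_max_backlog=230000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=16777216
net.ipv4.tcp_mem=16777216 16777216 16777216
EOF
}

# print_sysctl_commands  (hypothetical helper)
# Dry run: prints one `sysctl -w` command per setting without running it.
print_sysctl_commands() {
    recommended_ipv4_settings | while IFS='=' read -r key value; do
        printf 'sysctl -w %s="%s"\n' "$key" "$value"
    done
}
```

Running `print_sysctl_commands` only prints the commands. Piping the output to a root shell (for example `print_sysctl_commands | sh`) would apply them, which is a choice left to the administrator.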
204. Step 11. Select "Base Partition Setup on This Proposal", then click Next.
    [Screenshot: Suggested Partitioning dialog. Proposed setup: create boot partition /dev/sda1 (70.5 MB) with ext2; create swap partition /dev/sda2 (502.0 MB); create root partition /dev/sda3 (7.4 GB) with reiserfs. Options: Accept Proposal, Base Partition Setup on This Proposal, Create Custom Partition Setup.]
    Step 12. In the Expert Partitioner window, select from the IET VIRTUAL DISK device the row that has its Mount column indicating 'swap', then click Delete. Confirm the delete operation and click Finish.
    [Screenshot: Expert Partitioner window. Columns: Device, Size, F, Type, Mount, Mount By, Start, End, Used By, Label. Rows: /dev/sda, 8.0 GB, IET VIRTUAL DISK, 0, 1045; /dev/sda1, 70.5 MB, F, Linux native, Ext, /boot, 8
205. o multicast configuration for native Ethernet interfaces.
    Note: EoIB maps Ethernet multicast addresses to InfiniBand MGIDs (Multicast GIDs). It ensures that different vHubs use mutually exclusive MGIDs, thus preventing vNics on different vHubs from communicating with one another.
    3.7.2.5 EoIB and Quality of Service
    EoIB enables the use of InfiniBand service levels. The configuration of the SL is performed through the BridgeX and lets you set different data/control service level values per BridgeX box. For further information on the use of non-default service levels, please refer to BridgeX documentation.
    IP Configuration Based on DHCP
    Setting an EoIB interface configuration based on DHCP (v3.1.2, which is available via www.isc.org) is performed similarly to the configuration of Ethernet interfaces. When setting the EoIB configuration files, verify that they include the following lines:
    • For RedHat: BOOTPROTO=dhcp
    • For SLES: BOOTPROTO=dhcp
    Note: If EoIB configuration files are included, ifcfg-eth<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine and under /etc/sysconfig/network/ on a SuSE machine.
    DHCP Server
    Using a DHCP server with EoIB does not require special configuration. The DHCP server can run on a server located on the Ethernet side (using any Etherne
206. oS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.
    PathRecord and MultiPathRecord Enhancement for QoS
    As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided. A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for existing access layer and ULPs by not introducing a new set of PR/MPR attributes.
    3.9.3 Supported Policy
    The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:
    I. Port Group
       A set of CAs, Routers or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.
    II. Fabric Setup
       Defines how the SL2VL and VLArb tables should be set up.
       Note: In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).
    III. QoS-Levels Definition
       This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits.
       Note: Path Bits are not implemented in OFED
207. [Figure 1: Mellanox OFED Stack]
    The following sub-sections briefly describe the various components of the Mellanox OFED stack.
    1.4.1 mthca HCA (IB) Driver
    mthca is the low level driver implementation for the following Mellanox Technologies HCA (InfiniBand) devices: InfiniHost, InfiniHost III Ex and InfiniHost III Lx.
    1.4.2 mlx4 VPI Driver
    mlx4 is the low level driver implementation for the ConnectX and ConnectX-2 adapters designed by Mellanox Technologies. ConnectX/ConnectX-2 can operate as an InfiniBand adapter, as an Ethernet NIC, or as a Fibre Channel HBA. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into four modules:
    mlx4_core
       Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.
    mlx4_ib
       Handles InfiniBand
208. ogy of 4 loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and report when it has insufficient configuration for a torus with radix-4 dimensions.
    In the event the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.
    7.5.7.4 Quality Of Service Configuration
    OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines. Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations.
    For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used fo
209. on, add the IB driver into it after the load commands of the iSCSI Initiator modules, and continue as described in Section A.7 on page 198.
    Warning: Take extra care when changing initrd, as any mistake may prevent the client machine from booting. It is recommended to have a back-up iSCSI Initiator on a machine other than the client you are working with, to allow for debug in case initrd gets corrupted.
    In addition, edit the init file (that is in the initrd zip) and look for the following string:
        if [ "$iSCSI_TARGET_IPADDR" ] ; then
            iscsiserver="$iSCSI_TARGET_IPADDR"
        fi
    Now add before the string the following line:
        iSCSI_TARGET_IPADDR=<IB IP Address of iSCSI Target>
    Example:
        iSCSI_TARGET_IPADDR=11.4.3.7
    WinPE
    Mellanox FlexBoot enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe.
    Appendix B: SRP Target Driver
    The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support a lot of IO modes on real or virtual devices in the backend:
    1. scst_vdisk: fileio and blockio modes. This allows turning software raid vol
210. on instead of old '-u') to activate the UPDN algorithm. Use '-a <root_guid_file>' for adding a UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.
    Notes on the guid list file:
    1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
    2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
    7.5.4 Fat-tree Routing Algorithm
    The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, and any Constant Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks.
    If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:
    • Tree rank should be between two and eight (inclusively)
    • Switches of the same rank should have the same number of UP-going port groups, unless they are root switches
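The guid file rule above (one guid per line, badly formatted lines discarded) can be checked before handing the file to OpenSM. The sketch below is an approximation under a stated assumption: it treats a line as valid only if it holds a single value of the form 0x followed by 1 to 16 hexadecimal digits. OpenSM's actual parsing rules are not described in the text and may differ, and the function name filter_guid_lines is hypothetical.

```shell
#!/bin/sh
# filter_guid_lines  (hypothetical helper)
# Reads a candidate guid file on stdin and prints only the lines that
# hold exactly one GUID. Assumption: a GUID is "0x" followed by 1 to 16
# hex digits; this is not taken from OpenSM's own parser.
filter_guid_lines() {
    grep -E '^[[:space:]]*0[xX][0-9a-fA-F]{1,16}[[:space:]]*$'
}
```

Comparing `wc -l < root_guids.txt` with `filter_guid_lines < root_guids.txt | wc -l` (root_guids.txt being a placeholder file name) gives a quick indication of how many lines would likely be discarded.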
211. on 2.1, "Hardware and Software Requirements", on page 24
    • Section 2.2, "Downloading Mellanox OFED", on page 25
    • Section 2.3, "Installing Mellanox OFED", on page 25
    • Section 2.5, "Uninstalling Mellanox OFED", on page 36
    2.1 Hardware and Software Requirements
    2.1.1 Hardware Requirements
    Platforms
    A server platform with an adapter card based on one of the following Mellanox Technologies InfiniBand HCA devices:
    • MT25408 ConnectX®-2 (VPI, IB, EN, FCoE) (firmware: fw-ConnectX2)
    • MT25408 ConnectX® (VPI, IB, EN, FCoE) (firmware: fw-25408)
    • MT25208 InfiniHost® III Ex (firmware: fw-25218 for Mem-Free cards, and fw-25208 for cards with memory)
    • MT25204 InfiniHost® III Lx (firmware: fw-25204)
    • MT23108 InfiniHost® (firmware: fw-23108)
    Note: For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file.
    Required Disk Space for Installation
    • 400 MB
    Device ID
    Note: For the latest list of device IDs, please visit the Mellanox website.
    For InfiniBand Cards, go to www.mellanox.com > Products > InfiniBand Cards > Overview.
    For Ethernet Cards, go to www.mellanox.com > Products > Products > Ethernet Cards > Overview.
    2.1.2 Software Requirements
    Operating System
    • Linux operating system
    Note: For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
212. on case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
    OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below).
    The basic routing algorithm is comprised of two stages:
    1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would go up after a down step was used.
    2. Once MinHop matrices exist, each switch is visited and, for each target LID, a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned ports is selected. If LMC > 0, more checks are added. Within each group of LIDs assigned to same targe
213. on from Ethernet layer 2 (MAC) addresses (48 bits long) to InfiniBand layer 2 addresses made of LID/GID and QPN. This translation is totally invisible to the OS and user, which differentiates EoIB from IPoIB, which exposes a 20-byte HW address to the OS.
    The mlx4_vnic module is designed for Mellanox's ConnectX family of HCAs and intended to be used with Mellanox's BridgeX gateway family. Having a BridgeX gateway is a requirement for using EoIB. It performs the following operations:
    • Enables the layer 2 address translation required by the mlx4_vnic module
    • Enables routing of packets from the InfiniBand fabric to a 1 or 10 GigE Ethernet subnet
    3.7.1 Ethernet over IB Topology
    EoIB is designed to work over an InfiniBand fabric and requires the presence of two entities:
    • Subnet Manager (SM)
      The required subnet manager configuration is not unique to EoIB but rather similar to other InfiniBand applications and ULPs.
    • BridgeX gateway
      The BridgeX gateway is at the heart of EoIB. On one side, usually referred to as the "internal" side, it is connected to the InfiniBand fabric by one or more links. On the other side, usually referred to as the "external" side, it is connected to the Ethernet subnet by one or more ports. The Ethernet connections on the BridgeX's external side are called external ports or eports. Every BridgeX that is in use with EoIB needs to have one or more eports connected.
    3.7.1.1 External Ports (eports) and Gateway T
214. ools: See under docs/ folder of installed package.
    Please visit http://www.mellanox.com > Products > IB/VPI SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.
    1 Mellanox OFED Overview
    1.1 Introduction to Mellanox OFED
    Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack based on the OpenFabrics (OFED) Linux stack. It operates across all Mellanox network adapter solutions supporting 10, 20 and 40Gb/s InfiniBand (IB); 10Gb/s Ethernet (10GigE); Fibre Channel over Ethernet (FCoE); and 2.5 or 5.0 GT/s PCI Express 2.0 uplinks to servers.
    All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions.
    1.2 Introduction to Mellanox VPI Adapters
    Mellanox VPI adapters, which are based on Mellanox ConnectX and ConnectX-2 adapter devices, provide leading server and storage I/O performance with flexibility to support the myriad of communication protocols and network fabrics over a single device, without sacrificing functionality when consolidating I/O. For example, VPI-enabled adapters can support:
    • Connectivity to 10, 20 and 40Gb/s InfiniBand switches, Ethernet switches, emerging Data Center Ethernet switches, InfiniBand to Ethernet and Fibre Channel Gateways
215. or the solicited MADs lt timeout_ms gt msec Starting LID in an MLID range Ending LID in an MLID range dest dr path Optional Destination s directed path LID or GUID lid guid gt Examples 1 Dump all Lids with valid out ports of the switch with Lid 2 robe Unt oa ot ETIS oe E gua Quse OCDE SOL L EU MTA T 0 DEI ds ete veda nex Technologies WE TORE Destination Port ET OO OS ao E O E cio DO a Oe MPAT SO Sonic pee deem MS ano Teecno L ogena OLOCO TO EE o e aa 00000 MAY Soo hk ns Sale FII Mellano Technologies Ox000G 9D Ge mamme Ada tes s posto Us lt 0002 c SES OVO OMS oc ss PIE M Oeo EU cnn esee AG te D Os quad QUESO OU DA Sd So eA es Mellanox Technologies 169 Mellanox Technologies Confidential 2 Dump all Lids with valid out ports of the switch with Lid 2 gt ibroute 2 Ulisse Seco 0 0002 e9 09 2665566000 MO dus Oo nana ale e ee Me ner Technologies RENZO Lire Destination IONS Haro SIA eee DE oe 959101092 SO pcr mie E DOS MPAT 3 SO inch ments dee OPE o Fe cima ollo giles AS O oe echo eto 991901 ovee1e a6 esc CNOA OIG MEAS OG Vin CIR OEA 200 7 es la seus loro eme pulido AMAS SO SS A Ox0007 021 Channel Adaprer porkquid CGO Ae PD Sas lo eA as 3 Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2 IDL OUTAT DSST Unicas clas oos EE Strait o ACT OOOO OE teret OO MAS oia ase o be EDS GE mo Technologies I RoHS IDEE Destination ROTE P
216. ort number. This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.
    Now it is possible to access the SRP LUNs on /dev/mapper/.
    Note: It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.
    Note: It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the 'blacklist' section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed.
    Automatic Activation of High Availability
    • Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".
      Note: For the changes in openib.conf to take effect, run:
          /etc/init.d/openibd restart
    • From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/.
      Note: It is possible that regular (non-SRP) LUNs may also be present; the SRP LUNs may be identified by their name.
    • It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log.
    3.6.2.7 Shutting Down SRP
    SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. Prior to shu
217. ox Technologies Confidential 1 5 2 2 1 0 1 1 1000 Driver Features local address LID 0x0000 OPN 0x04004f PSN Oxbdde2c GID lecUurt202509005 PUDO eS remote address LID 0x0 0 005 OPN Ox08004F PSN O0xo9dgd900 GILD fesUt 202370900 TOO ree LI 8192000 bytes in 0 01 seconds 4824 50 Mbit sec 1000 terg Im 06m T seconds 18 58 usec liter On Client bv Ye pingpong g L L1 2 sw419 local address LID 0x0000 OPN 0x08004f PSN 0xc9d800 GID TeS0 2272027 90071098 e08l 1 remote address LID 0x0000 OPN 0x04004f PSN Oxbdde2c GID lesu i c2027209007 71709768799 8192000 bytes in 0 01 seconds 4844 83 Mbit sec 1000 4 ters in 0 01 seconde 19 59 useccrber Defining Ethernet Priority PCP in 802 1q Headers On Server EID re pingpong i 1 2 4 local address LED 0sQUU00 OPN UxLcOOJPT PSN Ox9daroco GELD fe80 202 c900 708 e799 remote address LID 0x0000 OPN 0x1c004f PSN 0xb0a49b GID IepUOrT702 v1 0900 7 T09 6911 8192000 bytes in 0 01 seconds 4840 89 Mbit sec 1000 Itere xn 30 0 Seconds 13 54 usSsc rter On Client if LOVE Epi rig Done g L 2L Ao Sado local address LID 0x0000 OPN 0x1c004f PSN 0xb0a49b GID FESO 3707220900 TUS eol r remote address LID 0xOUO0 OPN OxlicOUOJdT PSN O0x9daroc GID Leg0rt f202 0900s 70926799 8192000 bytes in 0 01 seconds 4855 96 Mbit sec TODO Geers mu 0d Seconds do 50 see ex Using rdma cm Tests On Server ucmatose cmato
218. p /bin/hostname /tmp/initrd_ib/bin. Create a configuration file for the DHCP client (as described in Section 9.3.3.1) and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):
    # The value indicates a hexadecimal number
    # For a ConnectX device
    interface "ib0" { send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; }
    # For an InfiniHost III Ex device
    interface "ib1" { send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92; }
Step 9. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded. Warning: The order of the following commands (for loading modules) is critical.
    echo "loading ipv6"
    /sbin/insmod /lib/modules/ipv6.ko
    echo "loading IB driver"
    /sbin/insmod /lib/modules/ib/ib_addr.ko
    /sbin/insmod /lib/modules/ib/ib_core.ko
    /sbin/insmod /lib/modules/ib/ib_mad.ko
    /sbin/insmod /lib/modules/ib/ib_sa.ko
    /sbin/insmod /lib/modules/ib/ib_cm.ko
    /sbin/insmod /lib/modules/ib/ib_uverbs.ko
    /sbin/insmod /lib/modules/ib/ib_ucm.ko
    /sbin/insmod /lib/modules/ib/ib_umad.ko
    /sbin/insmod /lib/modules/ib/iw_cm.ko
    /sbin/insmod /lib/modules/ib/rdm
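The ConnectX example above appears to be a fixed 12-byte prefix followed by the 8-byte port GUID. The following is a minimal sketch of building that string, assuming the prefix can be taken verbatim from the example; the function name and the GUID value are illustrative only and are not part of Mellanox OFED.

```python
# Prefix copied from the ConnectX dhclient.conf example above (assumption:
# the same prefix applies to other ConnectX ports).
CONNECTX_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def connectx_client_id(port_guid: int) -> str:
    """Return a dhcp-client-identifier string for a 64-bit port GUID."""
    guid_bytes = port_guid.to_bytes(8, "big")  # GUID is written most significant byte first
    return CONNECTX_PREFIX + ":" + ":".join(f"{b:02x}" for b in guid_bytes)


if __name__ == "__main__":
    # GUID taken from the last 8 bytes of the example identifier
    print(connectx_client_id(0x0002C90300001039))
```

The InfiniHost III Ex identifier in the example uses a different layout and is not covered by this sketch.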
219. pecific Service ID in the PR/MPR query; any ULP/application with a specific PKey in the PR/MPR query; any ULP/application with a specific target IB port GUID in the PR/MPR query. Since any section of the policy file is optional, as long as basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file. The shortest policy file in this case would be as follows:
    qos-ulps
        default : 0 # default SL
    end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all. Below is an example of a simple QoS policy with all the possible keywords:
    qos-ulps
        default : 0 # default SL
        sdp, port-num 30000 : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
        sdp, port-num 10000-20000 : 0
        sdp : 1 # default SL for any other application running on top of SDP
        rds : 2 # SL for RDS traffic
        ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with pkey 0x0001
        ipoib : 4 # default IPoIB partition, pkey=0x7FFF
        any, service-id 0x6234 : 6 # match any PR/MPR query with a specific Service ID
        any, pkey 0x0ABC : 6 # match any PR/MPR query with a specific PKey
        srp, target-port-gui
220. penSM defaults to 3 retries for transactions. --maxsmps, -n <number>: This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs. --console, -q [off|local]: This option activates the OpenSM console (default off). --ignore-guids, -i <equalize-ignore-guids-file>: This option provides the means to define a set of ports (by guid) that will be ignored by the link load equalization algorithm. --hop_weights_file, -w <path to file>: This option provides the means to define a weighting factor per port for customizing the least weight hops for the routing. --dimn_ports_file, -O <path to file>: This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR). --honor_guid2lid, -x: This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE. --log_file, -f <log-file-name>: This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output use -f stdout. --log_limit, -L <size in MB>: This
221. ple on Linux: flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.rom. Example on Windows: flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.rom. H 17 5 Burning the Image on InfiniHost III Ex/Lx Products. Prerequisites: 1. Firmware packages: The appropriate firmware .mlx packages (ConnectX: fw-25408, InfiniHost III Ex: fw-25208, and/or InfiniHost III Lx: fw-25204) can be downloaded from the Mellanox Technologies Web site; see www.mellanox.com > Downloads > Firmware > Customized Firmware. 2. Firmware Configuration (.ini) Files: For standard Mellanox products, .ini files are included in the firmware .mlx packages. For help in identifying the correct .ini file of your adapter hardware, please refer to the MFT User's Manual, which is provided in Mellanox OFED for Linux. 3. Expansion ROM Image: The expansion ROM images are provided as part of the SW package and are listed in the release notes file FlexBoot_release_notes.txt. 4. Firmware Burning Tools: You need to install the Mellanox Firmware Tools (MFT) package (version 2.6.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Downlo. Footnotes: 1. Depending on the OS, the device name may be superseded with a prefix. 2. Relevant only if your ConnectX EN devices are currently burnt with a firmware version earlier than 2.7.000.
222. plications: Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html. To review the default configuration of the installation, check the default configuration file /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf. Compiling Open MPI Applications: Please refer to http://www.open-mpi.org/faq/?category=mpi-apps. 7 OpenSM Subnet Manager. 7.1 Overview: OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters Management Model (13), Subnet Management (14), and Subnet Administration (15). 7.2 opensm Description: opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet. opensm also provides an experimental version of a performance manager. opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for change
223. port T11 frame format. Works on top of standard Ethernet NICs (including mlx4_en). See http://www.open-fcoe.org for further information on the Open-FCoE project. 3.3.2 FCoE Basic Usage: After loading the driver, userspace operations should create/destroy vHBAs on required Ethernet interfaces. This can be done manually by issuing commands to the driver using simple sysfs operations. Alternatively, it can be handled automatically by the dcbxd daemon if the interface is connected to an FCoE switch supporting DCBX negotiation of the FCoE feature (e.g., Cisco Nexus). Once a vHBA is instantiated on an Ethernet interface, it immediately attempts to log into the FC fabric. Provided that the FC fabric and FC targets are well configured, LUNs will map to SCSI disk devices (/dev/sdXXX). vHBAs instantiated automatically by the dcbxd daemon are created on a VLAN 0 interface with VLAN priority set to the value negotiated with the switch. This takes advantage of PFC, which allows pausing FCoE traffic when needed without pausing the entire Ethernet link. Also, with proper configuration of the FCoE switch, the link's maximum bandwidth can be divided as needed between FCoE and regular Ethernet traffic. Instantiating vHBAs manually allows creating them on VLAN interfaces with any arbitrary VLAN id and priority, as well as on the regular (without VLAN) Ethernet interfaces. Using the regular interface means that PFC cannot be used. In this case it is highly
224. porting QoS: Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner. Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served. Packets carry class of service marking in the range 0 to 15 in their header SL field. Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table VL = SL-to-VL-MAP(in-port, out-port, SL). The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries. DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack. 3.9.2 QoS Architecture: QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the chronology approach to describe how the overall system works. 1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application or ULP or service use
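The SL-to-VL lookup described above, VL = SL-to-VL-MAP(in-port, out-port, SL), can be pictured as a per-port-pair table indexed by SL. The following is a minimal sketch only; the table contents and the function name are hypothetical and do not reflect any real switch configuration.

```python
def select_vl(sl2vl_map, in_port, out_port, sl):
    """Return the output VL for a packet, given its ports and SL (0..15)."""
    if not 0 <= sl <= 15:
        raise ValueError("SL must be in the range 0 to 15")
    return sl2vl_map[(in_port, out_port)][sl]


# Hypothetical table: for traffic entering port 1 and leaving port 2,
# SLs 0-7 map to VL 0 and SLs 8-15 map to VL 1.
example_map = {(1, 2): [0] * 8 + [1] * 8}

if __name__ == "__main__":
    print(select_vl(example_map, 1, 2, 3))
    print(select_vl(example_map, 1, 2, 9))
```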
225. ptions (table columns: Flag; Optional / Mandatory; Default If Not Specified; Description). -h (help): Print the help menu. Raise the IB debug level; may be used several times for higher debug levels (-ddd or -d -d -d). Use GUID address argument; in most cases, it is the Port GUID. Use the specified channel adapter or router (optional). Use the specified port (optional). Reset the counters. <timeout_ms> (optional): Override the default timeout for the solicited MADs [msec]. Show version info. <lid|guid> [[port] [reset_mask]] (optional): LID or GUID. Examples:
    perfquery -r 32 1 # read performance counters and reset
    perfquery -e -r 32 1 # read extended performance counters and reset
    perfquery -R 0x20 1 # reset performance counters of port 1 only
    perfquery -e -R 0x20 1 # reset extended performance counters of port 1 only
    perfquery -R -a 32 # reset performance counters of all ports
    perfquery -R 32 2 0x0fff # reset only error counters of port 2
    perfquery -R 32 2 0xf000 # reset only non-error counters of port 2
1. Read local port's performance counters: > perfquery [counter output not recoverable]
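The last two examples suggest that the 16-bit reset mask splits into a low 12-bit group (error counters, 0x0fff) and a high 4-bit group (non-error counters, 0xf000). The following is a minimal sketch of classifying a mask under that assumption; the function and constant names are illustrative and are not part of perfquery.

```python
ERROR_COUNTER_BITS = 0x0FFF      # from the "reset only error counters" example
NON_ERROR_COUNTER_BITS = 0xF000  # from the "reset only non-error counters" example


def describe_reset_mask(mask):
    """Report which counter groups a 16-bit reset mask touches."""
    if not 0 <= mask <= 0xFFFF:
        raise ValueError("reset mask must fit in 16 bits")
    return {
        "error": bool(mask & ERROR_COUNTER_BITS),
        "non_error": bool(mask & NON_ERROR_COUNTER_BITS),
    }


if __name__ == "__main__":
    for m in (0x0FFF, 0xF000, 0xFFFF):
        print(hex(m), describe_reset_mask(m))
```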
226. r changes from the current state of the fabric. A directory named ibdiag_ibnl is also created by this option, and holds the IBNL files required to load this topology. To use these files you will need to set the environment variable named IBDM_IBNL_PATH to that directory. The directory is located in /tmp or in the output directory provided by the -o flag. -load_db <file name>: Load subnet data from the given .db file, and skip the subnet discovery stage. Note: Some of the checks require actual subnet discovery, and therefore would not run when load_db is specified. These checks are: duplicated/zero guids, link state, SMs status. -h|--help: Prints the help page information. -V|--version: Prints the version of the tool. --vars: Prints the tool's environment variables and their values. 8.4.2 Output Files. Table 12 - ibdiagnet (of ibutils) Output Files: ibdiagnet.mcfdbs: A dump of the multicast forwarding tables of the fabric switches. ibdiagnet.masks: In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs. ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option. In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an
227. r.ko, ib_core.ko, ib_mad.ko, ib_sa.ko, ib_cm.ko, ib_uverbs.ko, ib_ucm.ko, ib_umad.ko, iw_cm.ko, rdma_cm.ko, rdma_ucm.ko, mlx4_core.ko, mlx4_ib.ko, ib_mthca.ko, ipoib_helper.ko (this module is not required for all OS kernels; please check the release notes), ib_ipoib.ko. H 17 13 1 Example: Adding an IB Driver to initrd (Linux). Prerequisites: 1. The FlexBoot image is already programmed on the HCA card. 2. The DHCP server is installed and configured as described in Section 9.3.3.1, IPoIB Configuration Based on DHCP, and is connected to the client machine. 3. An initrd file. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run. Adding the IB Driver to the initrd File. Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting. Step 1. Back up your current initrd file. Step 2. Make a new working directory and change to it: host1# mkdir /tmp/initrd_ib; host1# cd /tmp/initrd_ib. Step 3. Normally, the initrd imag
228. r unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS. Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated. Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the range 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc. 7.5.7.5 Operational Considerations: Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement. If a change in fabric topology causes changes in path SL values required to route without cre
229. ration file. Step 4. In case the installation script performed firmware updates to your network adapter hardware, it will ask you to reboot your machine. Step 5. The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI:
    * soft memlock unlimited
    * hard memlock unlimited
These settings remove the limit on the amount of memory that can be pinned by a user space application. If desired, tune the value unlimited to a specific amount of RAM. Step 6. For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 7, OpenSM Subnet Manager. Step 7. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as: HCA firmware version, kernel architecture, driver version, number of active HCA ports along with their states, and Node GUID. Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/. host1# /usr/bin/hca_self_test.ofed ---- Performing InfiniBand HCA Self Test ---- Number of HCAs Detected; PCI Device Check; Kernel Arch; Host
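One way to confirm that the memlock settings from Step 5 are in effect for the current session is to read the process resource limits. The following is a minimal sketch using Python's standard resource module on Linux; the helper names are illustrative and are not part of Mellanox OFED.

```python
import resource


def format_limit(value):
    """Render an rlimit value the way limits.conf spells it."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def memlock_limits():
    """Return (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("soft memlock:", soft)
    print("hard memlock:", hard)
```

Since limits.conf is applied at login, the values typically reflect the new settings only in a session started after the installation.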
230. rding to a policy. This chapter includes the following sections: libsdp.so Library on page 53; Section 3.5.3, Configuring SDP, on page 53; Section 3.5.4, Environment Variables, on page 55; Section 3.5.5, Converting Socket-based Applications, on page 56; Section 3.5.6, BZCopy - Zero Copy Send, on page 62; Section 3.5.7, Using RDMA for Small Buffers, on page 62. 3.5.2 libsdp.so Library: libsdp.so is a dynamically linked library, which is used for transparent integration of applications with SDP. The library is preloaded, and therefore takes precedence over glibc for certain socket calls. Thus, it can transparently replace the TCP socket family with SDP socket calls. The library also implements a user-level socket switch. Using a configuration file, the system administrator can set up the policy that selects the type of socket to be used. libsdp.so also has the option to allow server sockets to listen on both SDP and TCP interfaces. The various configurations with SDP/TCP sockets are explained inside the /etc/libsdp.conf file. 3.5.3 Configuring SDP: To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set SDP_LOAD=yes. Note: For the changes to take effect, run: /etc/init.d/openibd restart. SDP can work over IPoIB interfaces or RoCE interfaces. In case of
231. rect 3 2 1 tar gz tarball See the link in Step 2 2 Install the special kernel that enables memory sharing between drivers The kernel RPMs and Patches are available from the following link http developer download nvidia com compute cuda 3 2 GPUDirect nvidia gpudirect 3 2 1 tar gz If you are using standard kernel please replace the kernel RPMs with the RPMs provided in this package If you are using a non standard kernel please use the provided patches to patch your kernel For further information please see the README file provided with the tarball 3 Download and install both NVIDIA s CUDA toolkit and driver The CUDA Toolkit v4 0 with R270 drivers or later are available at NVIDIA s website at the fol lowing link http developer nvidia com cuda downloads 4 Install MLNX OFED GPUDirect 1 5 2 2 1 0 1 1 1000 driver For further information please see Section 2 3 Installing Mellanox OFED on page 25 For further information about NVIDIA s GPUDirect feature please go to http developer nvidia com gpudirect 3 2 2 1 Enabling GPUDirect To enable GPUDirect feature set the IB USE GPU environment variable whenever running an application that uses the GPUDirect feature Run in you shell export IB USE GPU 1 3 2 2 2 How to Know GPUDirect Is Working To verify the GPUDirect capability is working properly please download and extract the mpi pinned application The application can be downloaded from the following link
232. ree for that case: [diagram of the spanning tree on the torus, with x and y axes labeled 0 to 4, not recoverable]. Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example. 7.5.7.3 Torus Topology Discovery: The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be seeded; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus. Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of interswitch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations. Because the algorithm is based on examining the topol
233. ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]: [diagram of the 6x5 torus, with x axis labeled 0 to 5 and y axis labeled 0 to 4, showing switches S, D, T, m, n, o, p, r; layout not recoverable]. For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics. Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of sev
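The "no 1D ring is broken into disjoint segments" condition above can be illustrated with a small check on a single ring. The following is a minimal sketch, not the torus-2QoS implementation; the function name is hypothetical, and the ring is assumed to have at least three switches with a single link between neighbors.

```python
def ring_segments(ring, failed_links):
    """Split a cyclic list of switches into segments separated by failed links."""
    n = len(ring)
    failed = {frozenset(link) for link in failed_links}
    # Index i marks the link between ring[i] and its successor on the ring.
    cuts = [i for i in range(n)
            if frozenset((ring[i], ring[(i + 1) % n])) in failed]
    if not cuts:
        return [list(ring)]  # intact ring
    segments = []
    for k, cut in enumerate(cuts):
        end = cuts[(k + 1) % len(cuts)]  # segment runs up to the next failed link
        i = (cut + 1) % n
        segment = [ring[i]]
        while i != end:
            i = (i + 1) % n
            segment.append(ring[i])
        segments.append(segment)
    return segments


if __name__ == "__main__":
    ring = ["m", "S", "n", "T", "o", "p"]  # the 1D ring m-S-n-T-o-p-m from the text
    print(ring_segments(ring, [("n", "T")]))
    print(ring_segments(ring, [("n", "T"), ("T", "o")]))
```

With one failed link the ring stays in a single segment, so routing around the failure remains possible; with links n-T and T-o both failed, the result is the two segments T and o-p-m-S-n named in the text.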
234. ronment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new shell (e.g., log out and log in again). Other packages (such as environment modules) provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality. The MPI selector functionality can be invoked in one of two ways: 1. The mpi-selector-menu command. This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended for all users. 2. The mpi-selector command. This command is a CLI-equivalent of the mpi-selector-menu, allowing for the same functionality as mpi-selector-menu but without the interactive menus and prompts. It is suitable for scripting. 6.4 Compiling MPI Applications. Note: A valid Fortran compiler must be present in order to build the MVAPICH MPI stack and tests. The following compilers are supported by Mellanox OFED's MVAPICH and Open MPI packages: Gcc, Intel and PGI. The install script prompts the user to choose the compiler with which to install the MVAPICH and Open MPI RPMs. Note that more than one compiler can be selected simultaneously, if desired. Compiling MVAPICH Ap
235. rtual NICS NTC taras pss CUSCO oa oa 71 3 552 BOLD COMICUrANONe a4 4a a a E es 71 3 72 1 BoIB HostAdmiunstered VNIC oppioidi 71 3 7 2 2 EoIB Network Administered VNic 0 0 0 00 ccc eens 73 31 25 NEAN o cheno ue dee E A ee lees ioe oe 73 3 7 2 4 EolIB Multicast Configuration LL 74 57 25 EolB and Quality of Service oia asada 74 4 Mellanox Technologies J Mellanox Technologies Confidential Mellanox OFED for Linux User s Manual 1 5 2 2 1 0 1 1 1000 3 7 2 6 IP Configuration Based on DHCP 74 3 L2 Static BEoIB Conmeuration 5c vez DE hho I4 EAM iii 75 3052 65 SUD Interfaces V GAN eus vam Ee ed oleate eei deme ikea 19 3 75 3 Retrieving EoIB Information smart 75 SA E A Bul Cale oa eh ee I ERU EE I d pads TD 30590 Colite E otha e eae des Cea Loc qiia 75 3 5 9 9 EINK Sides eiut A ears batt hath AS 76 3 924 BONO Diye reduc o A Cathe eater TT 39 5 Jumbo eean e AA lana davi 77 314 Advanced BolB3SetuBS ucs aon Rich sed ieu i dubai n UE do ede ol ea TI Ar Module Parameters 4 4 a5 ace de NS hit TT 9 74 2 Nic Intertace Naming sh cre Ade 78 3 8 IP over InfiniBand 78 It OMOGENEE lina ilari 78 3 5 2 APOUB Mode Sete araa e ode Ea ee Bee eee are edo 79 3 85 IPo B Conhigiranon 4 AA UE DERE EU etatis DE Se ue fs 79 3 8 3 1 IPoIB Configuration Based on DHCP 79 3 85 92 Sue POB COn HP lOs raider a rt iaa 81 3 8 3 3 Manually Confi
236. s. opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to. By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly. 7.2.1 opensm Syntax: opensm [OPTIONS], where OPTIONS are: --version: Prints OpenSM version and exits. --config, -F <file-name>: The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists). --create-config, -c <file-name>: OpenSM will dump its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template. --guid, -g <GUID in hex>
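The "SUBNET UP" check described above can be scripted. The following is a minimal sketch that scans log text for the message; the function name and the sample log lines are made up for illustration and do not reproduce real opensm output.

```python
def subnet_is_up(log_text):
    """Return True if any line of the log contains the SUBNET UP message."""
    return any("SUBNET UP" in line for line in log_text.splitlines())


if __name__ == "__main__":
    # Hypothetical log excerpt, used only to exercise the function.
    sample = "sweep started\nSUBNET UP\n"
    print(subnet_is_up(sample))
```

To check a real run, the contents of /var/log/opensm.log (the default log path named above) could be read and passed to the function.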
237. s Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path. The algorithm was developed by Simula Research Laboratory. Use the '-R lash -Q' option to activate the LASH algorithm. Note: QoS support has to be turned on in order that SL/VL mappings are used. Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead. For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches, and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order before the rest of the LASH algorithm runs. 7.5.6 DOR Routing Algorithm: The Dimension Order Routing algorithm is based on the Min
238. [perfquery counter output for the local port; values not recoverable] 2. Read performance counters from LID 2, all ports: [command and counter output not recoverable] 3. Read then reset performance counters from LID 2, port 1: > perfquery -r 2 1 [counter output not recoverable] 8.14 ibcheckerrs. Applicable Hardware: All
239. s/infiniband/ulp/srpt/ib_srpt.ko. RDS: /lib/modules/`uname -r`/updates/kernel/net/rds/rds.ko, /lib/modules/`uname -r`/updates/kernel/net/rds/rds_rdma.ko, /lib/modules/`uname -r`/updates/kernel/net/rds/rds_tcp.ko. The package kernel-ib-devel include files are placed under /usr/src/ofa_kernel/include/. These include files should be used when building kernel modules that use the stack. (Note that the include files, if needed, are "backported" to your kernel.) The raw package (un-backported) source files are placed under /usr/src/ofa_kernel-<ver>. The script openibd is installed under /etc/init.d/. This script can be used to load and unload the software stack. The script connectx_port_config is installed under /sbin. This script can be used to configure the ports of ConnectX network adapter cards to Ethernet and/or InfiniBand. For details on this script, please see Section 4.1, Port Type Management. The directory /etc/infiniband is created with the files info, openib.conf and connectx.conf. The info script can be used to retrieve Mellanox OFED installation information. The openib.conf file contains the list of modules that are loaded when the openibd script is used. The connectx.conf file saves the ConnectX adapter card's ports configuration to Ethernet and/or InfiniBand. This file is used at driver start/restart (/etc/init.d/openibd start). The file 90-ib.rules is installed under /etc/udev/rules.d/. If OpenSM i
240. s installed, the daemon opensmd is installed under /etc/init.d/ and opensm.conf is installed under /etc. If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under: /etc/sysconfig/network-scripts/ on a RedHat machine; /etc/sysconfig/network/ on a SuSE machine. The installation process removes the limit on the amount of memory that can be pinned by a user space application. See Step 5. Man pages will be installed under /usr/share/man/. Firmware: The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled: 1. You run the installation script in default mode; that is, without the option '--without-fw-update'. 2. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image. Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image. In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates. Error message: I Q
241. sM 1b bonding 26 Mellanox Technologies Mellanox Technologies Confidential Mellanox OFED for Linux User s Manual 1 5 2 2 1 0 1 1 1000 MVAPICH Open MPI MPI tests MPI selector dynamic librar ies pasic Install all kernel modules libibverbs libibumad mft mstflint dynamic libraries msm Install all kernel modules libibverbs libibumad mft mstflint diagnostic tools OpenSM ib bonding dynamic libraries NOTE With msm flag the OpenSM daemon is configured to run upon boot vy vv vvv Set verbosity level SPEC lt 0 bitmask gt Priority based Flow Control policy on TX and RX 7 0 0 Set quiet no messages will be printed 2 3 2 1 minxofedinstall Return Codes Table 1 lists the mlnxofedinstall script return codes and their meanings Table 1 mlinxofedinstall Return Codes The Installation ended successfully No firmware was found for the adapter device The installation failed Failed to start the ms t driver Mellanox Technologies 2 Mellanox Technologies Confidential 1 5 2 2 1 0 1 1 1000 Installation 2 3 3 28 Mellanox Technologies Mellanox Technologies Confidential Installation Procedure hostl mount o ro loop MLNX OFED LINUX ver OS label gt iso Login to the installation machine as root After mounting the ISO image mnt will be a Read Only folder Step 1 Step 2 Mount the ISO image on your machine mnt Note Step 3 Run the installation scr
242. se starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose: disconnecting
disconnected
test complete
return status 0
On Client:
ucmatose a 20 443 219
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
This server-client run is without PCP or VLAN because the IP address used does not belong to a VLAN interface. If you specify a VLAN IP address, then traffic should go over VLAN.
Type Of Service (TOS)
The TOS field for rdma_cm sockets can be set using the rdma_set_option() API, just as it is set for regular sockets. If the user does not set a TOS, the default value (0) will be used. Within the rdma_cm kernel driver, the TOS field is converted into an SL field. The conversion formula is as follows:
  SL = TOS >> 5 (i.e., take the 3 most significant bits of the TOS field)
In the hardware driver, the SL field is converted into PCP by the following formula:
  PCP = SL & 7 (take the 3 least significant bits of the SL field)
Note: SL affects the PCP only when the traffic goes over tagged VLAN frames.
3.1.10 Configuring DAPL over RoCE
The default dat.conf file, which contains entries for the DAPL devices, does not contain entries for the
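The two conversions quoted in the excerpt above (TOS to SL in the rdma_cm kernel driver, SL to PCP in the hardware driver) can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of any Mellanox or librdmacm API; only the two bit operations come from the manual text.

```python
def tos_to_sl(tos: int) -> int:
    """Return the SL for an 8-bit TOS value: SL = TOS >> 5 (3 most significant bits)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in 8 bits")
    return tos >> 5


def sl_to_pcp(sl: int) -> int:
    """Return the PCP for an SL value: PCP = SL & 7 (3 least significant bits)."""
    return sl & 7


if __name__ == "__main__":
    # Illustrative TOS values only; they are not taken from the manual.
    for tos in (0, 32, 96, 224):
        sl = tos_to_sl(tos)
        print(f"TOS={tos:#04x} SL={sl} PCP={sl_to_pcp(sl)}")
```

Because an SL derived from an 8-bit TOS is already in the range 0 to 7, the mask in the second step leaves it unchanged; the mask only matters for SL values that arrive from elsewhere.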
243. 2.2 Downloading Mellanox OFED
2.3 Installing Mellanox OFED
2.3.1 Pre-installation Notes
2.3.2 Installation [illegible]
2.3.2.1 mlnxofedinstall Return Codes
2.3.3 Installation Procedure
2.3.4 Installation Results
2.3.5 Post-installation Notes
2.4 Updating Firmware After Installation
2.5 Uninstalling Mellanox OFED
Chapter 3 Driver Features
3.1 RDMA over Converged Ethernet
3.1.1 RoCE [illegible]
3.1.2 Software Dependencies
3.1.3 Firmware Dependencies
3.1.4 General Guidelines
3.1.5 Ported Applications
3.1.6 [illegible]
3.1.7 [illegible] Pause Frames
3.1.8 [illegible]
3.1.9 Reading Port Counters Statistics
244. shold: Indicates how aggressive the congestion marking should be. Values: 0 - 0xf (0: no packet marking; 0xf: very aggressive). Default: 0xf
marking_rate: The mean number of packets between marking eligible packets with a FECN. Values: 0 - 0xffff. Default: 0xa
packet_size: Any packet less than this size [bytes] will not be marked with FECN. Values: 0 - 0x3fc0. Default: 0x200
Table 9 - Congestion Control Manager CA Options File
port_control: Specifies the Congestion Control attribute for this port. Values: 0: QP based congestion control; 1: Port based congestion control. Default: 0
ca_control_map: An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified. Values: 0xffff
Sets the CC Table Index (CCTI) increase. Default: 1
trigger_threshold: Sets the trigger threshold. Default: 2
Sets the CC Table Index (CCTI) minimum. Default: 0
Sets all the CC table entries to a specified value. The first entry will remain 0, whereas last value will be set to the rest of the table. Values: <comma-separated list>. Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
ccti_timer: Sets for all SLs the given ccti timer. Default: 0. When the value is set to 0, the CCT calcula
245.     # Name of qos-level to apply to the matching PR/MPR
        qos-level-name: WholeSet
    end-qos-match-rule

    # show matching by destination group and service id
    qos-match-rule
        use: Storage targets
        destination: Storage
        service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
        qos-level-name: WholeSet
    end-qos-match-rule

    qos-match-rule
        source: Storage
        use: match by source group only
        qos-level-name: DEFAULT
    end-qos-match-rule

    qos-match-rule
        use: match by all parameters
        qos-class: [illegible]
        source: Virtual Servers
        destination: Storage
        service-id: 0x0000000000010000-0x000000000001FFFF
        pkey: 0x0F00-0x0FFE
        qos-level-name: WholeSet
    end-qos-match-rule
end-qos-match-rules
7.6.6 Simple QoS Policy - Details and Examples
Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query. Match rules include:
• Default match rule that is applied to PR/MPR query that didn't match any of the other match rules
• SDP
• SDP application with a specific target TCP/IP port range
• SRP with a specific target IB port GUID
• RDS
• IPoIB with a default PKey
• IPoIB with a specific PKey
• Any ULP/application with a s
246. [illegible]
Glossary
Reference Documents
mlnxofedinstall Return Codes
mlx4_vnic.conf file format
Red Hat Linux mlx4_vnic.conf file format
Supported ConnectX Port Configurations
Recommended PCIe Configuration
[illegible]
Congestion Control Manager General Options File
Congestion Control Manager Switch Options File
Congestion Control Manager CA Options File
Congestion Control Manager CC MGR Options File
ibdiagnet (of ibutils2) Output Files
ibdiagnet (of ibutils) Output Files
ibdiagpath Output Files
ibv_devinfo Flags and Options
ibstatus Flags and Options
ibportstate Flags and Options
ibportstate Flags and Options
smpquery Flags and Options
perfquery Flags and Options
ibcheckerrs Flags and Options
mstflint Switches
mstflint Commands
ibdump Options
247. st support
-m, --max_lid
  This option specifies the maximal LID number to be searched for during inventory file build. Default: 100.
-g, --guid
  This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-p, --port
  This option displays a menu of possible local port GUID values with which osmtest could bind.
-i, --inventory
  This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See G option for related information.
-s, --stress
  This option runs the specified stress test instead of the normal test suite. Stress test options are as follows:
    OPT  Description
    -s1  Single-MAD response SA queries
    -s2  Multi-MAD (RMPP) response SA queries
    -s3  Multi-MAD (RMPP) Path Record SA queries
  Without -s, stress testing is not performed.
-M, --Multicast_Mode
  This option specifies the length of the Multicast test:
    OPT  Description
    -M1  Short Multicast Flow (default), single mode
248. [Figure: iSCSI Initiator Discovery authentication window (No Authentication, Incoming Authentication, Outgoing Authentication; Username, Password)]
Step 6. The iSCSI Initiator Discovery window will show the iSCSI target that got connected to. Note that the Connected column must indicate True for this target. Click Next. See figure below.
[Figure: iSCSI Initiator Discovery window. Portal Address: 10.4.3.7:3260,1; Target Name: iqn.2007-08.7.3.4.10:iscsiboot; Connected: True]
Step 7. The iSCSI Initiator Overview window will pop up. Click Toggle Start-Up to change start up from manual to automatic. Click Finish.
249. t Switches (Sheet 2 of 2)
Switch | Affected/Relevant Commands | Description
-byte_mode | burn, write | Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
-yes | | Non-interactive mode: Assume the answer is "yes" to all questions.
-no | | Non-interactive mode: Assume the answer is "no" to all questions.
-vsd <string> | burn | Write this string of up to 208 characters to VSD upon a burn command.
-use_image_ps | | Burn vsd as it appears in the given image; do not keep existing VSD on Flash.
-dual_image | | Make the burn process burn two images on Flash. The current default fail-safe burn process burns a single image (in alternating locations).
Table 22 - mstflint Commands
bb: Burn Block. Burn the given image as is, without running any checks.
dc <out-file>: Dump Configuration. Print a firmware configuration file for the given image to the specified output file.
wbne <addr> <size> <data ...>: Write a data block to Flash without sector erase.
rb <addr> <size> [out-file]: Read a data block from Flash.
swreset: SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.
Possible command return values are:
0 - successful completion
1 - error has occurred
7 - the burn comma
250. t hardware or on a server located on the InfiniBand side and running EoIB module.
3.7.2.7 Static EoIB Configuration
To configure a static EoIB, you can use an EoIB configuration that is not based on DHCP. Static configuration is similar to a typical Ethernet device configuration. For further information on how to configure IP addresses, please refer to your Linux distribution documentation.
Note: Ethernet configuration files are located at /etc/sysconfig/network-scripts/ on a RedHat machine and at /etc/sysconfig/network/ on a SuSE machine.
3.7.2.8 Sub Interfaces (VLAN)
EoIB interfaces do not support creating sub interfaces via the vconfig command. To create interfaces with VLAN, refer to Section "Configuring VLANs".
3.7.3 Retrieving EoIB Information
3.7.3.1 mlx4_vnic_info
To retrieve information regarding EoIB interfaces, use the script mlx4_vnic_info. This script provides detailed information about a specific vNic or all EoIB vNic interfaces, such as: BX info, IOA info, SL, PKEY, Link state and interface features. If network administered vNics are enabled, this script can also be used to discover the available BridgeX(s) from the host side.
• To discover the available BridgeXs, run: mlx4_vnic_info -g
• To receive the full vNic information of eth10, run: mlx4_vnic_info -i eth10
• To receive a shorter information report on eth10, run: mlx4_vnic_info -s eth10
• To get help and usage information, run: mlx4_vnic_info
251. t loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath. To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.
7.6 Quality of Service Management in OpenSM
7.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the '-Q' or '--qos' flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows:
• The request is matched against the defined matching rules such that the QoS Level definition is found.
• Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.
Figure 3: QoS Manager
There are two ways to define QoS policy:
• Advanced: the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request and to enforce various QoS cons
252. t port:
a. Use only ports which have same MinHop.
b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC group.
c. If none, prefer those which go through another NodeGuid.
d. Fall back to the number of paths method (if all go to same node).
7.5.1 Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in the fabric switches unless the -r (--reassign_lids) option is specified.
-r, --reassign_lids
  This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of same LID.
If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.
In the case of using the file based routing, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).
7.5.2 Min Hop Algorithm
The Min Hop algorithm is invoked by defa
253. t the target that you wish to connect to and click Connect.
[Figure: iSCSI Initiator Discovery window. Portal Address: 10.4.3.7:3260,1; Target Name: iqn.2007-08.7.3.4.10:iscsiboot; Connected: False]
If no iSCSI target was recognized, then either the target was not properly installed or no connection was found between the client and the iSCSI target. Open a shell to ping the iSCSI target (you can use CTRL-ALT-F2) and verify that the target is or is not accessible. To return to the graphical installation screen, press CTRL-ALT-F7.
Step 5. The iSCSI Initiator Discovery window will now request authentication to access the iSCSI target. Click Next to continue without authentication unless authentication is required.
254. ter the driver is loaded. To check for symbol errors, enter:
  cat /sys/class/infiniband/<device>/ports/1/counters/symbol_error
The command above is performed on Port 1 of the device <device>. The output value should be 0 if no symbol errors were recorded.
3. Bandwidth is expected to vary between systems. It heavily depends on the chipset, memory, and CPU. Nevertheless, the full wire speed should be achieved by the host.
• With IB @ SDR, the expected unidirectional full wire speed bandwidth is ~900MB/sec.
• With IB @ DDR and PCI Express Gen 1, the expected unidirectional full wire speed bandwidth is ~1400MB/sec.
• With IB @ DDR and PCI Express Gen 2, the expected unidirectional full wire speed bandwidth is ~1800MB/sec.
• With IB @ QDR and PCI Express Gen 2, the expected unidirectional full wire speed bandwidth is ~3000MB/sec.
To check the adapter's maximum bandwidth, use the ib_write_bw utility. To check the adapter's latency, use the ib_write_lat utility.
Note: The utilities ib_write_bw and ib_write_lat are installed as part of Mellanox OFED.
5.3.3 System Performance Troubleshooting
On some systems, it is recommended to change the power-saving configuration in order to achieve better performance. This configuration is usually handled by the BIOS. Please contact the system vendor for more information.
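The symbol-error check described in the excerpt above can also be scripted. The following is a minimal Python sketch under stated assumptions: the helper names are hypothetical, the device name `mlx4_0` in the usage note is only an example, and the sysfs path layout is the one quoted in the `cat` command above.

```python
from pathlib import Path

# Root of the InfiniBand sysfs tree, as used by the cat command quoted above.
SYSFS_ROOT = "/sys/class/infiniband"


def symbol_error_path(device: str, port: int = 1, root: str = SYSFS_ROOT) -> Path:
    """Build the path of the symbol_error counter for one port of one device."""
    return Path(root) / device / "ports" / str(port) / "counters" / "symbol_error"


def parse_counter(text: str) -> int:
    """Convert the text read from a counter file into an integer."""
    return int(text.strip())


def has_symbol_errors(device: str, port: int = 1, root: str = SYSFS_ROOT) -> bool:
    """Return True if the counter is non-zero, i.e. symbol errors were recorded."""
    return parse_counter(symbol_error_path(device, port, root).read_text()) != 0
```

On a host with the driver loaded, `has_symbol_errors("mlx4_0")` would read the counter for Port 1; a result of False corresponds to the expected value of 0.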
255. tests_mvapich_gcc mpitests_mvapich_pgi mpitests_mvapich_intel mpitests_openmpi_gcc mpitests_openmpi_pgi mpitests_openmpi_intel
Device (15b3:634a):
256. th to this target from another port/HCA.
• When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).
Prerequisites
Installation for RHEL4/5: (Execute once)
• Verify that the standard device-mapper-multipath rpm is installed. If not, install it from the RHEL distribution.
Installation for SLES10: (Execute once)
• Verify that multipath is installed. If not, take it from the installation (you may use 'yast').
Update udev: (Execute once, for manual activation of High Availability only)
• Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules). This file should have one line:
  ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High Availability below), this file is created upon each boot of the driver and is deleted when the driver is unloaded.
Manual Activation of High Availability
Initialization: (Execute after each boot of the driver)
1. Execute modprobe dm-multipath
2. Execute modprobe ib-srp
3. Make sure you have created file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and each HCA:
  srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <p
257. the max number of hops the wrong way around an I/O node is allowed to do (connectivity for I/O nodes on top switches).
--ids_guid_file, -G <path to file>
  Name of the map file with set of the IDs which will be used by Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).
--guid_routing_order_file, -X <path to file>
  Set the order port guids will be routed for the MinHop and Up/Down routing algorithms to the guids provided in the given file (one to a line).
--torus_config <path to file>
  This option defines the file name for the extra configuration info needed for the torus-2QoS routing engine. The default name is /etc/opensm/torus-2QoS.conf.
--once, -o
  This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.
--sweep, -s <interval>
  This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.
--timeout, -t <milliseconds>
  This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
--retries <number>
  This option specifies the number of retries used for transactions. Without --retries, O
258. then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.
Note: The user can override the node list manually.
Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values. At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Note: Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.
7.5.3.1 UPDN Algorithm Usage
Activation through OpenSM
• Use the '-R updn' opti
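The ranking stage described in the excerpt above (root switches get rank 0, the remaining switches are ranked incrementally by BFS) can be sketched in Python. This is an illustrative simplification, not OpenSM's implementation: the function names and the toy topology are hypothetical, and the path check treats a hop between equal-rank switches as an "up" hop, whereas the excerpt says the real rules also use guid values.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """Assign rank 0 to the root switches and BFS-rank every reachable switch."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1  # one step further from the roots
                queue.append(neighbor)
    return rank


def follows_up_down(path, rank):
    """Check that a path never makes a non-down hop after it has gone down."""
    gone_down = False
    for here, there in zip(path, path[1:]):
        if rank[there] > rank[here]:  # moving away from the roots: a "down" hop
            gone_down = True
        elif gone_down:  # any other hop after a "down" hop breaks the rule
            return False
    return True


if __name__ == "__main__":
    # Hypothetical two-spine, two-leaf fabric.
    fabric = {
        "spine1": ["leaf1", "leaf2"],
        "spine2": ["leaf1", "leaf2"],
        "leaf1": ["spine1", "spine2"],
        "leaf2": ["spine1", "spine2"],
    }
    ranks = rank_switches(fabric, ["spine1", "spine2"])
    print(ranks)
    print(follows_up_down(["leaf1", "spine1", "leaf2"], ranks))
    print(follows_up_down(["spine1", "leaf1", "spine2"], ranks))
```

In this toy fabric the spine-to-spine path through a leaf is rejected, which matches the note above that switches inside spine systems cannot reach each other by LID routing without breaking the Up/Down rule.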
259. tion is based on the number of nodes.
Table 10 - Congestion Control Manager CC MGR Options File
max_errors 5, error_window 5: When the number of errors exceeds max_errors of send/receive errors or timeouts in less than error_window seconds, the CC MGR will abort and will allow OpenSM to proceed. Values: max_errors = 0: zero tolerance, abort configuration on first error; error_window = 0: mechanism disabled, no error checking.
cc_statistics_cycle 20: Enables CC MGR to collect statistics from all nodes every cc_statistics_cycle [seconds]. When the value is set to 0, no statistics are collected. Default: 0
8 InfiniBand Fabric Diagnostic Utilities
8.1 Overview
The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric. The tools are:
• Section 8.3, "ibdiagnet (of ibutils2) - IB Net Diagnostic"
• Section 8.4, "ibdiagnet (of ibutils) - IB Net Diagnostic"
• Section 8.5, "ibdiagpath - IB diagnostic path"
• Section 8.6, "ibv_devices"
• Section 8.7, "ibv_devinfo"
• Section 8.8, "ibdev2netdev"
• Section 8.9, "ibstatus"
• Section 8.10, "ibportstate"
• Section 8.11
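The max_errors / error_window rule in Table 10 above amounts to a sliding-window error counter. The sketch below is one possible reading in Python, not the CC MGR's actual code: the class name is hypothetical, timestamps are passed in explicitly so the behavior is easy to check, and it assumes that error_window = 0 takes precedence when both options are 0 (the table does not say which wins).

```python
from collections import deque


class ErrorWindow:
    """Track error timestamps and report when the abort condition is met."""

    def __init__(self, max_errors: int = 5, error_window: float = 5):
        self.max_errors = max_errors
        self.error_window = error_window
        self._times = deque()

    def record(self, now: float) -> bool:
        """Record an error at time `now` (seconds); return True if CC MGR should abort."""
        if self.error_window == 0:
            return False  # mechanism disabled, no error checking
        self._times.append(now)
        # Forget errors that are error_window seconds old or older.
        while self._times and now - self._times[0] >= self.error_window:
            self._times.popleft()
        # Abort once the number of recent errors exceeds max_errors.
        return len(self._times) > self.max_errors
```

With max_errors = 0 the first recorded error already exceeds the limit, which matches the "abort configuration on first error" row of the table.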
260. traints on the requested PR/MPR.
• Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs.
7.6.2 Advanced QoS Policy File
The QoS policy file has the following sections:
I) Port Groups (denoted by port-groups)
This section defines zero or more port groups that can be referred later by matching rules (see below). Port group lists ports by:
• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to partition with a given PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to partition with a given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)
II) QoS Setup (denoted by qos-setup)
This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
III) QoS Levels (denoted by qos-levels)
Each QoS Level defines Service Level (SL) and a few optional fields:
• MTU limit
• Rate limit
• PKey
• Packet lifetime
When path(s) search is performed, it is done with regards to restriction that these QoS Level parameters impos
261. [Figure: iSCSI Initiator Overview window, Connected Targets. Portal Address: 10.4.3.7:3260,1; Target Name: iqn.2007-08.7.3.4.10:iscsiboot; Start-Up: manual; buttons: Add, Log Out, Toggle Start-Up]
Step 8. Select New Installation, then click Finish in the Installation Mode window.
[Figure: Installation Mode window. Select Mode: New Installation]
Step 9. Select the appropriate Region and Time Zone in the Clock and Time Zone window, then click Finish.
262. tting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases:
1. Without High Availability
When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.
2. After Manual Activation of High Availability
If you manually activated SRP High Availability, perform the following steps:
a. Unmount all SRP partitions that were mounted.
b. Kill the SRP daemon instances.
c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
d. Run: multipath -F
3. After Automatic Activation of High Availability
If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
3.7 Ethernet over IB (EoIB) vNic
The Ethernet over IB (EoIB) mlx4_vnic module is a network interface implementation over InfiniBand. EoIB encapsulates Layer 2 datagrams over an InfiniBand Datagram (UD) transport service. The InfiniBand UD datagrams encapsulate the entire Ethernet L2 datagram and its payload.
To perform this operation, the module performs an address translati
263. u can use opensm with the '--Pconfig' or '-P' flags. The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed. The default partition has a P_Key value of 0x7fff. The port out of which runs OpenSM is assigned full membership in the default partition. All other end ports are assigned partial membership.
7.4.1 File Format
Notes:
• Line content followed after '#' character is comment and ignored by parser.
General File Format:
  <Partition Definition>:<PortGUIDs list> ;
Partition Definition:
  [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where
  PartitionName: string, will be used with logging. When omitted, an empty string will be used.
  PKey: P_Key value for this partition. Only low 15 bits will be used. When omitted, P_Key will be autogenerated.
  flag: used to indicate IPoIB capability of this partition.
  defmember=full|limited: Specifies default membership for port guid list. Default is limited.
Currently recognized flags are:
  ipoib: indicates that this partition may be used for IPoIB; as a result, IPoIB capable MC group will be created.
  rate=<val>: specifies rate for this IPoIB MC group (default is 3 (10GBps))
  mtu=<val>: specifies MTU for this IPoIB MC group (default is 4 (2048))
  sl=<val>: specifies SL for this IPoIB MC group (default is 0)
  scope=<val>: specifies scope for this IPoIB MC group (default
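To make the Partition Definition fields above concrete, here is a minimal Python sketch that parses only the definition part of an entry (the text to the left of the ':' that precedes the PortGUIDs list). It is an illustration under assumptions, not OpenSM's parser: the function name and the returned dictionary layout are hypothetical, comments and the PortGUIDs list are not handled, and flag values are kept as strings.

```python
def parse_partition_definition(text: str) -> dict:
    """Parse '[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]'."""
    fields = [field.strip() for field in text.split(",")]
    head, flags = fields[0], fields[1:]
    name, _, pkey_text = (part.strip() for part in head.partition("="))

    result = {"name": name, "pkey": None, "defmember": "limited", "flags": {}}
    if pkey_text:
        # Only the low 15 bits of the P_Key are used.
        result["pkey"] = int(pkey_text, 0) & 0x7FFF
    for flag in flags:
        if not flag:
            continue
        key, _, value = (part.strip() for part in flag.partition("="))
        if key == "defmember":
            result["defmember"] = value
        else:
            result["flags"][key] = value or None  # flags without a value map to None
    return result


if __name__ == "__main__":
    # Illustrative definition; the partition name is only an example.
    print(parse_partition_definition("Default=0x7fff,ipoib,defmember=full"))
```

A pkey of None in the result stands for the "when omitted, P_Key will be autogenerated" case; the sketch does not generate one.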
264. [Figure: Clock and Time Zone window. Region and Time Zone lists; Hardware Clock Set To: UTC; Time and Date: 07:52:06 - 24-03-2008]
Step 10. In the Installation Settings window, click Partitioning to get the Suggested Partitioning window.
[Figure: Installation Settings window, Overview tab. Keyboard Layout: English (US). Partitioning: create boot partition /dev/sda1 (70.5 MB) with ext2; create swap partition /dev/sda2 (502.0 MB); create root partition /dev/sda3 (7.4 GB) with reiserfs. Software: SUSE Linux Enterprise Server 10, X Window System, GNOME Desktop Environment for Server, Server Base System, Novell AppArmor, Print Server; size of packages to install: 1.3 GB. Language: English (US)]
265. uerying device ...
-E- Can't auto detect fw configuration file
2.3.5 Post-installation Notes
• Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.
• The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.
2.4 Updating Firmware After Installation
In case you ran the mlnxofedinstall script with the '--without-fw-update' option and now you wish to manually update firmware on your adapter card(s), you need to perform the following steps:
Note: If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image".
Note: The following steps are also appropriate in case you wish to burn newer firmware that you have downloaded from Mellanox Technologies' Web site (http://www.mellanox.com > Downloads > Firmware).
Step 1. Start mst:
  host1# mst start
Step 2. Identify your target InfiniBand device for firmware update.
  1. Get the list of InfiniBand device names on your machine:
  host1# mst status
  MST modules:
      MST PCI module loaded
      MST PCI configuration module loaded
      MST Calibre (I2C) module is not loaded
  MST devices:
  /dev/mst/mt25418_pciconf0 - PCI configuration cycles access. bus:dev.fn=02:00.0 addr.reg=88 data.reg=92 Chip revision is: A0
  /dev/mst/mt25418_pci_cr0 - PCI d
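When Step 2 above is scripted, the device names can be pulled out of the `mst status` output. The sketch below is a hedged Python illustration: the function name is hypothetical, and it assumes only that each device line starts with a /dev/mst/ path, as in the sample output quoted above.

```python
import re


def mst_device_names(status_output: str) -> list:
    """Return every /dev/mst/... device path that starts a line of the output."""
    return re.findall(r"^\s*(/dev/mst/\S+)", status_output, flags=re.MULTILINE)
```

The captured text of `mst status` would be passed in as a string; how that text is obtained (for example, by running the command as root) is left to the caller.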
266. uffers. For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is too big; therefore, it is more efficient to use BCopy. Large buffers can also be sent using RDMA, but they lower CPU utilization. This mode is called "ZCopy combined" mode. The sendmsg syscall is blocked until the buffer is transferred to the socket's peer, and the data is copied directly from the user buffer at the source side to the user buffer at the sink side. To set the threshold, use the module parameter sdp_zcopy_thresh. This parameter can be accessed through sysfs (/sys/module/ib_sdp/parameters/sdp_zcopy_thresh). Setting it to 0 disables ZCopy.
3.6 SCSI RDMA Protocol
3.6.1 Overview
As described in Section 1.4.5, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services. Section 3.6.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target.
3.6.2 SRP Initiator
This SRP In
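The ZCopy threshold behavior described in the SDP paragraph above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name sdp_send_mode is hypothetical, and treating a buffer exactly at the threshold as "large" is an assumption, since the excerpt does not say whether the comparison is inclusive. The only rule taken directly from the text is that a threshold of 0 disables ZCopy.

```python
def sdp_send_mode(buf_len: int, zcopy_thresh: int) -> str:
    """Return "zcopy" or "bcopy" for a send of buf_len bytes.

    zcopy_thresh mirrors the sdp_zcopy_thresh module parameter.
    """
    if buf_len < 0 or zcopy_thresh < 0:
        raise ValueError("sizes must be non-negative")
    if zcopy_thresh == 0:
        # Stated in the text: setting the threshold to 0 disables ZCopy.
        return "bcopy"
    # Assumption: buffers at or above the threshold count as "large".
    return "zcopy" if buf_len >= zcopy_thresh else "bcopy"
```

With a hypothetical threshold of 65536 bytes, a 1 MB send would be classified as ZCopy and a 4 KB send as BCopy.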
267. ull connectivity in the Fat Tree. In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.
Note: Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.
7.5.4.2 Activation through OpenSM
Use the '-R ftree' option to activate the fat-tree algorithm.
Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.
7.5.5 LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest-path routing algorithm that enables topology-agnostic, deadlock-free routing within communication networks. When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock.
Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link from HCA between an
268. ult if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'. The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
  -i <equalize-ignore-guids-file>
  --ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by GUIDs) that will be ignored by the link load equalization algorithm. LMC awareness routes based on a (remote) system or switch basis.
7.5.3 UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high pressure). The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node
269. umes, LVM volumes, IDE disks, block devices and normal files into SRP LUNs.
2. NULLIO mode allows measuring the performance without sending IOs to real devices.
B.1 Prerequisites and Installation
1. For the supported distributions, please see the Mellanox OFED release notes.
Note: On distribution default kernels you can run scst_vdisk blockio mode to obtain good performance.
2. Download and install the SCST driver. The supported version is 1.0.1.1.
  a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloads.html
  b. Untar scst-1.0.1.1:
    $ tar zxvf scst-1.0.1.1.tar.gz
    $ cd scst-1.0.1.1
  c. Install scst-1.0.1.1 as follows:
    $ make && make install
B.2 How to run
A. On an SRP Target machine:
1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).
Note: Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0. The lun numbers are not required to be in ascending order, except that the first lun must always be 0.
Note: Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the ib_srpt module but loads neither scst nor its dev_handlers.
Note: The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.
Example 1: Working
270. und.
2. If not found, the first port that is UP (physical link state is LinkUp).
Examples
1. Query the status of Port 1 of CA mlx4_0 using ibstatus, and use its output (the LID, 3 in this case) to obtain additional link information using ibportstate.
[Command output not recoverable from the extracted text: ibstatus output for mlx4_0 port 1 (base lid, sm lid, state, phys state, rate), followed by ibportstate port information (link width and link speed fields).]
2. Query the status of two channel adapters using directed paths.
[Command output not recoverable from the extracted text: ibportstate port information for two ports, including port states such as Initialize, LinkUp and Down, and link width fields (1X or 4X).]
271. values. Very short run times, with good scaling properties as fabric size increases.
Unicast Routing
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses, as follows:
  sl = 0;
  for (d = 0; d < torus_dimensions; d++)
      /* path_crosses_dateline(d) returns 0 or 1 */
      sl |= path_crosses_dateline(d) << d;
For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows:
  for (sl = 0; sl < 16; sl++)
      /* cdir(port) reports which torus coordinate direction a switch port "points" in, and returns 0, 1, or 2 */
      sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport));
Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any 1D
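The two torus-2QoS rules described above (encoding dateline crossings into SL bits, and collapsing them into one VL bit per output port) can be sketched in Python. This is a minimal illustration, not OpenSM's implementation: the function names and the list-of-booleans input are assumptions made for the example, and the QoS-level bit is left out.

```python
from itertools import product


def path_sl(crossings):
    """Encode dateline crossings into SL bits.

    crossings[d] is truthy if the path crosses the dateline of
    torus dimension d (the role of path_crosses_dateline(d)).
    """
    sl = 0
    for d, crosses in enumerate(crossings):
        sl |= (1 if crosses else 0) << d
    return sl


def sl2vl(sl, cdir_oport):
    """Pick the VL bit for an output port.

    cdir_oport is the torus coordinate direction (0, 1 or 2) that the
    output port points in; the VL bit is the SL bit for that direction.
    """
    return 0x1 & (sl >> cdir_oport)


def distinct_sl_values(dimensions=3):
    """Count the SL values used by dateline encoding alone."""
    return len({path_sl(c) for c in product((0, 1), repeat=dimensions)})
```

For a path that crosses the datelines of dimensions 0 and 2, path_sl gives SL bits 0 and 2 set, and only ports pointing in directions 0 or 2 map that SL to VL bit 1. For three dimensions, distinct_sl_values counts the 8 SL values mentioned in the text.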
272. vels for various ULPs.
Figure 4: Example QoS Deployment on
[Figure: App A and App B servers in Partition A, with service access points; traffic classes SDP, SRP and IPoIB are mapped to service levels with minimum-bandwidth policies.]
7.7 QoS Configuration Examples
The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.
7.7.1 Typical HPC Example: MPI and Lustre
Assignment of QoS Levels
- MPI: separate from I/O load; Min BW of 70%
- Storage Control (Lustre MDS): low latency
- Storage Data (Lustre OST): Min BW 30%
Administration
MPI is assigned an SL via the command line:
  host1# mpirun -sl 0
OpenSM QoS policy file
Note: In the following policy file example, replace OST* and MDS* with the real port GUIDs.
  qos-ulps
      default :0 # default SL (for MPI)
      any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
      any, target-port-guid MDS1,MDS2 :2 # SL
273. ver
In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf.
The DHCP server must run on a machine which has loaded the IPoIB module. To run the DHCP server from the command line, enter:
  dhcpd <IB network interface name> -d
Example:
  host1# dhcpd ib0 -d
DHCP Client (Optional)
Note: A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux)".
In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:
  dhclient -cf <client conf file> <IB network interface name>
Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:
  # The value indicates a hexadecimal number
  interface "ib1" {
      send dhcp-client-identifier [value illegible in source];
  }
Example of a configuration f
274. ver Loading
The MLNX_OFED installation script installs RoCE as part of mlx4 and mlx4_en and other modules. See Section 2.3, "Installing Mellanox OFED", for details on installation.
Note: The list of the modules that will be loaded automatically upon boot can be found in the configuration file /etc/infiniband/openib.conf.
Enter the following command to display the current run of MLNX_OFED:
  # ibv_devinfo
  hca_id: mlx4_0
      transport:       InfiniBand (0)
      fw_ver:          [illegible in source]
      node_guid:       0002:c903:0008:e810
      sys_image_guid:  0002:c903:0008:e813
      vendor_id:       0x02c9
      vendor_part_id:  26428
      hw_ver:          0xB0
      board_id:        MT_0DD0120009
      phys_port_cnt:   2
          port: 1
              state:       PORT_INIT (2)
              max_mtu:     2048 (4)
              active_mtu:  2048 (4)
              sm_lid:      0
              port_lid:    0
              port_lmc:    0x00
              link_layer:  IB
          port: 2
              state:       PORT_ACTIVE (4)
              max_mtu:     2048 (4)
              active_mtu:  1024 (3)
              sm_lid:      0
              port_lid:    0
              port_lmc:    0x00
              link_layer:  Ethernet
Notes regarding the command output:
1. The InfiniBand port (port 1) is in PORT_INIT state, and the Ethernet port (port 2) is in PORT_ACTIVE state. You can also run the following commands to obtain the port state:
  # cat /sys/class/infiniband/mlx4_0/ports/1/state
  2: INIT
  # cat /sys/class/infiniband/mlx4_0/ports/2/state
  4: ACTIVE
2. Look at
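The sysfs state files shown above return a numeric code and a name separated by a colon (for example "2: INIT" or "4: ACTIVE"). A minimal sketch of reading and parsing them follows. The helper names are hypothetical, and the sysfs_root parameter is an assumption added so that the function can be pointed at a different directory; only the path layout and the two sample strings come from the text.

```python
from pathlib import Path


def parse_port_state(text):
    """Split a sysfs state string such as "4: ACTIVE" into (4, "ACTIVE")."""
    code, sep, name = text.strip().partition(":")
    if not sep:
        raise ValueError(f"unexpected state format: {text!r}")
    return int(code), name.strip()


def read_port_state(device, port, sysfs_root="/sys/class/infiniband"):
    """Read <sysfs_root>/<device>/ports/<port>/state and parse it."""
    path = Path(sysfs_root) / device / "ports" / str(port) / "state"
    return parse_port_state(path.read_text())
```

On a host with the driver loaded, read_port_state("mlx4_0", 2) would read the same file as the second cat command above.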
275. wing line:
  option dhcp-client-identifier [value illegible in source];
  option root-path [value illegible in source];
H 17 16 iSCSI Boot Example of SLES 10 SP2 OS
This section provides an example of installing the SLES 10 SP2 operating system on an iSCSI target and booting from a diskless machine via FlexBoot. Note that the procedure described below assumes the following:
- The client's LAN card is recognized during installation
- The iSCSI target can be connected to the client via LAN and InfiniBand
Prerequisites
See Section H 17 8 on page 196.
Warning! The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.
Procedure
Step 1. Load the SLES 10 SP2 installation disk and enter the following parameters as boot options:
  netsetup=1 WithISCSI=1
[Screenshot: SLES 10 boot menu (Boot from Hard Disk, Installation, Installation - ACPI Disabled, Installation - Local APIC Disabled, Installation - Safe Settings, Rescue System, Memory Test) with the boot options field set to the parameters above.]
Step 2. Continue with the procedure as instructed by the installation program until the iSCSI Initiator Overview window appears.
[Screenshot: installer window]
276. x
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
--lmc, -l <LMC value>
This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e., multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.
--priority, -p <PRIORITY>
This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. The range goes from 0 (lowest priority) to 15 (highest).
--smkey, -k <SM_Key>
This option specifies the SM's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in host byte order. This is fixed now, but you may need this option to interoperate with an old OpenSM running on a little-endian machine.
--reassign_lids, -r
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments.
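The relation between the LMC value and the number of LIDs per port (2^LMC, with LMC in the range 0-7) can be checked with a few lines of Python. The function name is hypothetical; the range check and the formula are taken from the option description above.

```python
def lids_per_port(lmc):
    """Number of LIDs assigned to each port for a given LMC value."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 2 ** lmc
```

For example, the default LMC of 0 gives a single LID per port (one path between any two ports), while an LMC of 2 gives four LIDs per port.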
277.
7.6.7 SL2VL Mapping and VL Arbitration 141
7.6.8 Deployment Example 142
7.7 QoS Configuration Examples 143
7.7.1 Typical HPC Example: MPI and Lustre 143
7.7.2 EDC SOA (2-tier): IPoIB and SRP 144
7.7.3 EDC (3-tier): IPoIB, RDS, SRP 145
7.8 Adaptive Routing 146
7.8.1 Overview 146
7.8.2 Running OpenSM With AR Manager 146
7.8.2.1 AR Configuration File Example 146
7.9 Congestion Control 147
7.9.1 Congestion Control Overview 147
7.9.2 Running OpenSM with Congestion Control Manager 147
7.9.2.1 Congestion Control Manager Options File 148
Chapter 8 InfiniBand Fabric Diagnostic Utilities 150
8.1 Overview 150
8.2 Utilities Usage 150
8.2.1 Common Configuration, Interface and Addressing 150
8.2.2 IB Interface Definition 151
8.2.3 Addressing 151
8.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
278. yload size is 128 or higher.
Note: A Max_Read_Req of 128 and/or installing the card in an x4 slot will significantly limit bandwidth.
To obtain the current setting for Max_Read_Req, enter:
  setpci -d "15b3:" 68.w
To obtain the PCI Express slot (link) width and speed, enter:
  setpci -d "15b3:" 72
1. If the output is neither 81 nor 82, then the card is NOT installed in an x8 PCI Express slot.
2. The least significant digit indicates the link speed:
  - 1 for PCI Express Gen 1 (2.5 GT/s)
  - 2 for PCI Express Gen 2 (5 GT/s)
Note: If you are running InfiniBand at QDR (40Gb/s 4X IB ports), you must run PCI Express Gen 2.
5.3.2 InfiniBand Performance Troubleshooting
InfiniBand (IB) performance depends on the health of the IB link(s) and on the IB card type. IB link speed (10Gb/s or SDR, 20Gb/s or DDR, 40Gb/s or QDR) also affects performance.
Note: A latency-sensitive application should take into account that each switch on the path adds 200nsec at SDR and 150nsec at DDR.
1. To check the IB link speed, enter:
  ibstat
Check the value indicated after the "Rate:" string: 10 indicates SDR, 20 indicates DDR, and 40 indicates QDR.
2. Check that the link has NO symbol errors, since these errors result in the re-transmission of packets and therefore in bandwidth loss. This check should be conducted for each port af
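The interpretation of the link width and speed output described above (81 or 82 for an x8 slot, with the least significant digit giving the PCI Express generation) can be sketched as a small decoder. The function names are hypothetical. Reading the most significant hex digit as the lane count is an assumption that generalizes the two values given in the text; the text itself only states that 81 and 82 indicate an x8 slot.

```python
# Speed codes taken from the text above.
LINK_SPEED = {1: "Gen 1 (2.5 GT/s)", 2: "Gen 2 (5 GT/s)"}


def decode_link_byte(output):
    """Decode a two-hex-digit value such as "82" into (width, speed)."""
    value = int(output.strip(), 16) & 0xFF
    width = value >> 4  # assumption: high hex digit is the lane count
    speed = LINK_SPEED.get(value & 0xF, "unknown")
    return width, speed


def is_x8(output):
    """True if the decoded lane count is 8, as for outputs 81 and 82."""
    return decode_link_byte(output)[0] == 8
```

Under that assumption, an output of 82 decodes to an x8 Gen 2 link, and 41 decodes to an x4 Gen 1 link, which the note above warns will significantly limit bandwidth.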
