Mellanox OFED Linux User's Manual
Contents
1. Installer progress output (progress bars not recoverable). Packages prepared and installed: ar_mgr, ibdump, infiniband-diags-compat, qperf, mxm, openmpi, bupc.
2. Installer progress output, continued (Mellanox Technologies, Rev 2.1-1.0.0, Installation). Packages prepared and installed: libcxgb3-devel, libcxgb4, libcxgb4-devel, libnes, libnes-devel-static.
3. Installer progress output, continued. Packages prepared and installed: ofed-scripts, libibverbs, libibverbs-devel, libibverbs-devel-static, libibverbs-utils, libmlx4.
4. 8.5.6 DOR Routing Algorithm 158
8.5.7 Torus-2QoS Routing Algorithm 158
8.6 Quality of Service Management in 166
8.6.1 Overview 166
8.6.2 Advanced QoS Policy File 166
8.6.3 Simple QoS Policy 167
8.6.4 Policy File Syntax Guidelines 168
8.6.5 Examples of Advanced Policy 168
8.6.6 Simple QoS Policy - Details and Examples 171
8.6.7 SL2VL Mapping and VL Arbitration 173
8.6.8 Deployment Example 174
8.7 QoS Configuration Examples 175
8.7.1 Typical HPC Example: MPI and 175
8.7.2 EDC SOA (2-tier): IPoIB and SRP 176
8.7.3 EDC (3-tier): IPoIB, RDS, SRP 177
8.8 Adaptive Routing 178
8.8.1 Overview 178
8.8.2 Installing the Adaptive Routing 179
8.8.3 Ru
5. Counter: Description
rx_lro_aggregated: Number of packets aggregated
rx_lro_flushed: Number of LRO flushes to the stack
rx_lro_no_desc: Number of times an LRO description was not found
rx_alloc_failed: Number of times preparing a receive descriptor failed
rx_csum_good: Number of packets received with good checksum
rx_csum_none: Number of packets received with no checksum indication
tx_chksum_offload: Number of packets transmitted with checksum offload
tx_queue_stopped: Number of times the transmit queue was suspended
tx_wake_queue: Number of times the transmit queue was resumed
tx_timeout: Number of transmitter timeouts
tx_tso_packets: Number of packets that were aggregated
Table 13 - Per Ring SW Statistics (where <i> is the ring, per configuration)
Counter: Description
rx<i>_packets: Total packets successfully received on ring i
rx<i>_bytes: Total bytes in successfully received packets on ring i
tx<i>_packets: Total packets successfully transmitted on ring i
tx<i>_bytes: Total bytes in successfully transmitted packets on ring i
4.20 Memory Window
Memory Window allows the application to have more flexible control over remote access to its memory. It is available only on physical functions / native machines. The two types of Memory Windows supported are type 1 and type 2B. Memory Windows are intended for situations where the application wants to grant
6. Table 39: ibcheckerrs Flags and Options 213
Table 40: mstflint Switches 215
Table 41: mstflint Commands 217
Document Revision History
Table 1 - Document Revision History
Release 2.1-1.0.0, December 2013
Added the following sections:
- Section 2.3.6, "Installation Logging", on page 41
- Section 4.6.2, "RoCE Time Stamping", on page 77, and its subsections
- Section 4.17, "PeerDirect", on page 105
- Section 4.18, "Inline-Receive", on page 106
- Section 4.19, "Ethernet Performance Counters", on page 107
- Section 4.20, "Memory Window", on page 111
- Section 4.1.2.1.1, "SRP Module Parameters", on page 42
- Section 4.1.2.1.2, "SRP Remote Ports Parameters", on page 42
- Section 4.1.2.2.1, "SRP sysfs Parameters", on page 43
- Section "srpd" on page 46
- Section 4.6.1.3, "Querying Time Stamping Capabilities via ethtool", on page 77
Updated the following sections:
- Section 1.5, "RDMA over Converged Ethernet (RoCE)", on page 25
- Section 2.3.3, "Installation Procedure", on page 32
- Section 4.13.2, "Setting Up SR-IOV", on page 90
- Section 5.3.1, "Compiling OpenMPI with MXM", on page 118
- Section 5.3.2, "Enabling MXM in OpenMPI", on page 119
- Section 5.3.4, "Configuring Multi-Rail Support", on page 120
- Section 4.8
7. 4.5.8 Quality of Service Tools 69
4.6 Ethernet Time-Stamping 74
4.6.1 Ethernet Time-Stamping Service 74
4.6.2 RoCE Time Stamping 77
4.7 Atomic Operations 79
4.7.1 Enhanced Atomic Operations 79
4.8 Ethernet Tunneling Over IPoIB Driver (eIPoIB) 80
4.8.1 Enabling the eIPoIB Driver 81
4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver 82
4.8.3 VLAN Configuration Over an eIPoIB 83
4.8.4 Setting Performance Tuning 84
4.9 Contiguous Pages 84
4.10 Shared Memory Region 85
4.11 XRC - eXtended Reliable Connected Transport Service for InfiniBand 86
4.12 Flow Steering 87
4.12.1 Enable/Disable Flow Steering 87
4.12.2 Flow Domains and 87
4.13 Single Root IO Virtualization (SR-IOV) 90
4.13.1
8. 5.5 ScalableUPC 121
5.5.1 Installing ScalableUPC 122
5.5.2 FCA Runtime 122
5.5.3 Various Executable Examples 123
Chapter 6 Working With VPI 124
6.1 Port Type 124
6.2 Auto Sensing 125
6.2.1 Enabling Auto Sensing 125
Chapter 7 Performance 126
7.1 General System Configurations 126
7.1.1 PCI Express (PCIe) Capabilities 126
7.1.2 Memory Configuration 126
7.1.3 Recommended BIOS Settings 126
7.2 Performance Tuning for Linux 129
7.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance 129
7.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance 129
7.2.3 Preserving Your Performance Settings after a Reboot 130
7.2.4 Tuning Power Management 130
7
9. Table 6: mlnxofedinstall Return Codes 31
Table 7: Buffer Values 84
Table 8: Parameters Used to Control Error Cases / Contiguity 85
Table 9: Flow Specific Parameters 89
Table 10: ethtool Supported Options 104
Table 11: Port IN Counters 107
Table 12: Port OUT Counters 108
Table 13: Port VLAN Priority Tagging (where <i> is in the range 0 to 7) 109
Table 14: Port Pause (where <i> is in the range 0 to 7) 109
Table 15: VPort Statistics (where <i> = <empty_string> is the PF, and ranges 1 to NumOfVf per VF) 110
Table 16: SW Statistics 111
Table 17: Ring SW Statistics (where <i> is the ring, per configuration) 111
Table 18: Useful MPI Links 116
Table 19: Runtime Parameters 122
Table 20: Recommended PCIe Configuration 126
Table 21: Recommended BIOS Settings for Intel Sandy Bridge Processors 127
Table 22: Recommended BIOS Settings for Intel Nehalem/West
10. 9.14 ibcheckerrs
Validates an IB port (or node) and reports errors in counters above threshold. Checks the specified port (or node) and reports errors that surpassed their predefined threshold. The port address is a LID unless the -G option is used to specify a GUID address. The predefined thresholds can be dumped using the -s option, and a user-defined threshold file (using the same format as the dump) can be specified using the -T <file> option.
Synopsis
ibcheckerrs [-h] [-b] [-v] [-G] [-T <threshold_file>] [-s] [-N | -nocolor] [-C ca_name] [-P ca_port] [-t timeout_ms] <lid|guid> [<port>]
Output Files
Table 35 lists the various flags of the command.
Table 35 - ibcheckerrs Flags and Options
-h(help): Optional. Print the help menu.
-b: Optional. Print in brief mode. Reduce the output to show only if errors are present, not what they are.
-v(erbose): Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-G(uid): Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-T <threshold_file>: Optional. Use specified threshold file.
-s: Optional. Show the predefined thresholds.
-N | -nocolor: Optional. Default if not specified: color mode. Use mono mode rather than color mode.
11. Table 37 - mstflint Commands
b[urn]: Burn Flash
q[uery]: Query miscellaneous Flash/firmware characteristics
v[erify]: Verify the entire Flash
bb: Burn Block. Burn the given image as is, without running any checks
sg: Set GUIDs
ri <out-file>: Read the firmware image on the Flash into the specified file
dc <out-file>: Dump Configuration. Print a firmware configuration file for the given image to the specified output file
e[rase] <addr>: Erase sector
rw <addr>: Read one DWORD from Flash
ww <addr> <data>: Write one DWORD to Flash
wwne <addr>: Write one DWORD to Flash without sector erase
wbne <addr> <size> <data>: Write a data block to Flash without sector erase
rb <addr> <size> [out-file]: Read a data block from Flash
swreset: SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.
Possible command return values are:
0 - successful completion
1 - error has occurred
7 - the burn command was aborted because firmware is current
Examples
1. Find Mellanox Technologies's ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE. > 1 d 15b3 6
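The return values listed above lend themselves to a small lookup in a wrapper script. The following is a minimal Python sketch, not taken from the manual: the helper names `describe_mstflint_status` and `needs_attention` are hypothetical, and only the three return values quoted above (0, 1, 7) are assumed to exist.

```python
# Meanings of the mstflint return values exactly as listed in the text above.
MSTFLINT_STATUS = {
    0: "successful completion",
    1: "error has occurred",
    7: "the burn command was aborted because firmware is current",
}


def describe_mstflint_status(code: int) -> str:
    """Return the manual's description for an mstflint return value.

    Hypothetical helper: values not listed in the manual are reported
    as unknown instead of being guessed.
    """
    return MSTFLINT_STATUS.get(code, f"unknown return value {code}")


def needs_attention(code: int) -> bool:
    """Treat 0 and 7 as benign; 7 only means the firmware was already current."""
    return code not in (0, 7)


if __name__ == "__main__":
    for status in (0, 1, 7, 3):
        print(status, "->", describe_mstflint_status(status))
```

Treating 7 as benign is a design choice of this sketch: an aborted burn because the firmware is already current usually does not warrant failing an automated update run.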
12. ibv_task_pingpong, ibv_cc_pingpong
4.15 Ethtool
ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices. It can be used to:
- Get identification and diagnostic information
- Get extended device statistics
- Control speed, duplex, autonegotiation and flow control for Ethernet devices
- Control checksum offload and other hardware offload features
- Control DMA ring sizes and interrupt moderation
The following are the ethtool supported options:
Table 6 - ethtool Supported Options
ethtool -i eth<x>: Checks driver and device information. For example:
  #> ethtool -i eth2
  driver: mlx4_en (MT_0DD0120009_CX3)
  version: 2.1.6 (Aug 2013)
  firmware-version: 2.30.3000
  bus-info: 0000:1a:00.0
ethtool -k eth<x>: Queries the stateless offload status.
ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off]: Sets the stateless offload status. TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) increase outbound throughput by reducing CPU overhead. They work by queuing up large buffers and letting the network interface card split them into separate packets. Large Receive Offload (LRO) increases inbound throughput of high-bandwidth network connections by reducing CPU overhead. It works
13. 1,1: IB (IB)
port_type_array=1,2: VPI (IB, Ethernet)
No port_type_array module parameter: ports are IB
Step 9. Reboot the server. If SR-IOV is not supported by the server, the machine might not come out of boot/load.
Step 10. Load the driver and verify that SR-IOV is supported. Run:
  lspci | grep Mellanox
  03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
  03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
  03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
  03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
  03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
  03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
Where:
- "03:00" represents the Physical Function
- "03:00.X" represents the Virtual Function connected to the Physical Function
4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup
To enable SR-IOV and Para Virtualization on the same setup:
Step 1. Create a bridge.
  vim /etc/sysconfig/network-scripts/ifcfg-bridge0
  DEVICE=bridge0
  TYPE=Bridge
  IPADDR=<ip_address>
  NETMASK=255.255.0.0
  BOOTPROTO=static
  ONBOOT=yes
  NM_CONTROLLED=no
  DELAY=0
14. rx_over_errors: Number of received frames that were dropped due to overflow
rx_crc_errors: Number of received frames with a bad CRC that are not runts, jabbers, or alignment errors
Table 7 - Port IN Counters
Counter: Description
rx_jabbers: Number of received frames with a length greater than MTU octets and a bad CRC
rx_in_range_length_error: Number of received frames with a length/type field value in the (decimal) range [1500:46] (42 is also counted for VLAN-tagged frames)
rx_out_range_length_error: Number of received frames with a length/type field value in the (decimal) range [1535:1501]
rx_lt_64_bytes_packets: Number of received 64-or-less-octet frames
rx_127_bytes_packets: Number of received 65-to-127-octet frames
rx_255_bytes_packets: Number of received 128-to-255-octet frames
rx_511_bytes_packets: Number of received 256-to-511-octet frames
rx_1023_bytes_packets: Number of received 512-to-1023-octet frames
rx_1518_bytes_packets: Number of received 1024-to-1518-octet frames
rx_1522_bytes_packets: Number of received 1519-to-1522-octet frames
rx_1548_bytes_packets: Number of received 1523-to-1548-octet frames
rx_gt_1548_bytes_packets: Number of received 1549-or-greater-octet frames
Table 8 - Port OUT Counters
Counter: Description
tx_packets: Total packet
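The size-bucket counters in the Port IN table partition received frames by length in octets. The following Python sketch illustrates that partition under stated assumptions: the function `rx_size_counter` is hypothetical, and the counter names are written with underscores on the assumption that the extraction dropped them.

```python
# Inclusive upper bounds (in octets) of the receive size buckets from the
# Port IN counters table, paired with the counter that tracks each bucket.
RX_SIZE_BUCKETS = [
    (64, "rx_lt_64_bytes_packets"),
    (127, "rx_127_bytes_packets"),
    (255, "rx_255_bytes_packets"),
    (511, "rx_511_bytes_packets"),
    (1023, "rx_1023_bytes_packets"),
    (1518, "rx_1518_bytes_packets"),
    (1522, "rx_1522_bytes_packets"),
    (1548, "rx_1548_bytes_packets"),
]


def rx_size_counter(frame_octets: int) -> str:
    """Return the name of the size counter a received frame would fall under.

    Hypothetical helper for illustration; it mirrors the ranges in the table
    and says nothing about how the driver itself computes them.
    """
    if frame_octets < 0:
        raise ValueError("frame length cannot be negative")
    for upper_bound, counter in RX_SIZE_BUCKETS:
        if frame_octets <= upper_bound:
            return counter
    # Anything of 1549 octets or more lands in the last bucket.
    return "rx_gt_1548_bytes_packets"


if __name__ == "__main__":
    for length in (60, 64, 65, 1500, 1519, 9000):
        print(length, "->", rx_size_counter(length))
```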
15. Set the port type as default (Ethernet).
During driver run time:
- Sense a link every 3 seconds if no link is sensed/detected
- If sensed, set the port type as sensed
7 Performance
7.1 General System Configurations
The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.
7.1.1 PCI Express (PCIe) Capabilities
Table 16 - Recommended PCIe Configuration
PCIe Generation: 3.0
Speed: 8GT/s
Width: x8 or x16
Max Payload size: 256
Max Read Request: 4096
For ConnectX3 based network adapters, 40GbE Ethernet adapters, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.
7.1.2 Memory Configuration
For high performance it is recommended to use the highest memory speed with the fewest DIMMs and to populate all memory channels for every CPU installed. For further information, please refer to your vendor's memory configuration instructions or memory configuration tool available online.
7.1.3 Recommended BIOS Settings
These performance optimizations may result in higher power consumption.
7.1.3.1 General
Set BIOS power management to Maximum Performance.
7.1.3.2 Intel Sandy Bridge Process
16. Optional. Use the specified port.
-R: Optional. Reset the counters.
-t <timeout_ms>: Optional. Override the default timeout for the solicited MADs [msec].
-V(ersion): Optional. Show version info.
<lid|guid> [[port] [reset_mask]]: Optional. LID or GUID.
Examples:
  perfquery -r 32 1            # read performance counters and reset
  perfquery -e -r 32 1         # read extended performance counters and reset
  perfquery -R 0x20 1          # reset performance counters of port 1 only
  perfquery -e -R 0x20 1       # reset extended performance counters of port 1 only
  perfquery -R -a 32           # reset performance counters of all ports
  perfquery -R 32 2 0x0fff     # reset only error counters of port 2
  perfquery -R 32 2 0xf000     # reset only non-error counters of port 2
1. Read local port's performance counters.
  > perfquery
  # Port counters: Lid 6 port 1
  PortSelect: 1
  CounterSelect: 0x1000
  LinkRecovers: 0
  (remaining counter lines not recoverable)
2. Read performance counters from LID 2, all ports.
3. Read then reset performance counters from LID 2, port 1.
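The two reset masks used in the examples above, 0x0fff for error counters and 0xf000 for non-error counters, split a 16-bit mask into two complementary halves. The short Python sketch below only checks that bit arithmetic; the function names are hypothetical, and no claim is made about which individual counter each bit selects.

```python
# Reset masks as they appear in the perfquery examples above.
ERROR_COUNTERS_MASK = 0x0FFF      # "reset only error counters"
NON_ERROR_COUNTERS_MASK = 0xF000  # "reset only non-error counters"


def masks_are_complementary(a: int, b: int, width: int = 16) -> bool:
    """True if the masks share no bit and together cover all `width` bits."""
    full = (1 << width) - 1
    return (a & b) == 0 and (a | b) == full


def format_mask(mask: int) -> str:
    """Show a 16-bit mask in hexadecimal and binary form."""
    return f"0x{mask:04x} = {mask:016b}"


if __name__ == "__main__":
    print(format_mask(ERROR_COUNTERS_MASK))
    print(format_mask(NON_ERROR_COUNTERS_MASK))
    print(masks_are_complementary(ERROR_COUNTERS_MASK, NON_ERROR_COUNTERS_MASK))
```

Because the two masks do not overlap, resetting with one and then the other touches every bit of the 16-bit mask exactly once.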
17. Burning a firmware image to a single InfiniBand node
MFT includes the following tools:
mlxburn - provides the following functions:
- Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format
- Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device
- Querying the firmware version loaded on an HCA board
- Displaying the VPD (Vital Product Data) of an HCA board
flint - This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file.
spark - This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
Debug utilities - A set of debug utilities (e.g., itrace, mstdump, isw, and i2c).
For additional details, please refer to the MFT User's Manual docs.
Footnote 1: OpenSM is disabled by default. See Chapter 8, "OpenSM - Subnet Manager", for details on enabling it.
1.4 Quality of Service
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over
18. PKey, QoS class, Service ID
To match a certain matching rule, a PR/MPR query has to match ALL of the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.
8.6.3 Simple QoS Policy Definition
A simple QoS policy definition consists of a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion: its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and the QoS Level has only one constraint, Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below.
8.6.4 Policy File Syntax Guidelines
Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability. Comments are started with the pound s
19. (Diagram: 2D torus with switches S, n, T, o, p along one ring; x axis from 0 to 5.)
For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring.
One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics.
Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn t
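The ring-breaking rule described above (one failed link leaves a usable chain, while two failed links split the ring into disjoint segments) can be sketched in a few lines of Python. This is an illustration only, not the torus-2QoS implementation: the functions `ring_segments` and `ring_is_routable` are hypothetical, and the ring is modeled simply as an ordered list of switch names with undirected links between neighbours.

```python
from typing import Iterable, List, Tuple


def ring_segments(ring: List[str], failed: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Split a 1D ring of switches into the segments left after link failures.

    `ring` lists the switches in ring order; the last switch links back to
    the first. `failed` holds undirected links as pairs of adjacent switches.
    """
    broken = {frozenset(link) for link in failed}
    n = len(ring)
    # Indices i for which the link ring[i] -- ring[i + 1] has failed.
    cuts = [i for i in range(n) if frozenset((ring[i], ring[(i + 1) % n])) in broken]
    if not cuts:
        return [list(ring)]  # ring is intact
    segments = []
    for k, cut in enumerate(cuts):
        start = (cut + 1) % n              # first switch after this failed link
        end = cuts[(k + 1) % len(cuts)]    # last switch before the next failed link
        length = (end - start) % n + 1
        segments.append([ring[(start + j) % n] for j in range(length)])
    return segments


def ring_is_routable(ring: List[str], failed: Iterable[Tuple[str, str]]) -> bool:
    """Per the text, a ring broken into disjoint segments is refused."""
    return len(ring_segments(ring, failed)) <= 1


if __name__ == "__main__":
    ring = ["m", "S", "n", "T", "o", "p"]
    print(ring_segments(ring, [("n", "T")]))              # one chain, still routable
    print(ring_segments(ring, [("n", "T"), ("T", "o")]))  # two disjoint segments
```

With the ring m-S-n-T-o-p-m from the text, failing n-T and T-o yields the segments T and o-p-m-S-n, which is the case the text says torus-2QoS reports and refuses to route.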
20. %{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
  gpg-pubkey-a9e4b643-520791ba gpg(Mellanox Technologies <support@mellanox.com>)
Step 6. Create a YUM repository configuration file called "/etc/yum.repos.d/mlnx_ofed.repo" with the following content:
  [mlnx_ofed]
  name=MLNX_OFED Repository
  baseurl=file:///<path to extracted MLNX_OFED package>
  enabled=1
  gpgkey=file:///<path to the downloaded key RPM-GPG-KEY-Mellanox>
  gpgcheck=1
Step 7. Check that the repository was successfully added.
  # yum repolist
  Loaded plugins: product-id, security, subscription-manager
  This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
  repo id      repo name                               status
  mlnx_ofed    MLNX_OFED Repository                    108
  rpmforge     RHEL 6Server - RPMforge.net - dag       4,597
  repolist: 8,351
2.5.2 Installing MLNX_OFED using the YUM Tool
After setting up the YUM repository for the MLNX_OFED package, perform the following:
Step 1. View the available package groups by invoking:
  # yum grouplist | grep MLNX_OFED
  MLNX_OFED ALL
  MLNX_OFED BASIC
  MLNX_OFED GUEST
  MLNX_OFED HPC
  MLNX_OFED HYPERVISOR
  MLNX_OFED VMA
  MLNX_OFED VMA-ETH
  MLNX_OFED VMA-VPI
Step 2. Install the desired group.
  # yum groupinstall 'MLNX_OFED ALL'
  Loaded plugins: product-id, security, subscription-manager
  This system is not registered to Red Hat Subscription Management
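The repository file from Step 6 above can also be generated rather than typed by hand. The following is a minimal Python sketch using the standard `configparser` module; the function `render_mlnx_ofed_repo` and the example paths are hypothetical, and the section name `mlnx_ofed` assumes the underscore was lost in extraction.

```python
import configparser
import io


def render_mlnx_ofed_repo(package_dir: str, key_path: str) -> str:
    """Render the text of a YUM .repo file like the one shown in Step 6.

    Both arguments are absolute local paths chosen by the caller; this
    sketch does not check that they exist.
    """
    for path in (package_dir, key_path):
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute path, got {path!r}")
    # interpolation=None keeps any '%' in a path from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser["mlnx_ofed"] = {
        "name": "MLNX_OFED Repository",
        "baseurl": f"file://{package_dir}",
        "enabled": "1",
        "gpgkey": f"file://{key_path}",
        "gpgcheck": "1",
    }
    buffer = io.StringIO()
    # Write key=value without spaces around '=', as in the manual's example.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(render_mlnx_ofed_repo("/mnt/mlnx_ofed", "/root/RPM-GPG-KEY-Mellanox"))
```

The result would still need to be written to the repository file named in Step 6 with suitable privileges, after which `yum repolist` can confirm the repository as in Step 7.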
21. XRC allows significant savings in the number of QPs and the associated memory resources required to establish all-to-all process connectivity in large clusters. It significantly improves the scalability of the solution for large clusters of multicore end-nodes by reducing the required resources.
For further details, please refer to "Annex A14, Supplement to InfiniBand Architecture Specification Volume 1.2.1".
A new API can be used by user space applications to work with the XRC transport. The legacy API is currently supported in both binary and source modes; however, it is deprecated. Thus we recommend using the new API. The new verbs to be used are:
- ibv_open_xrcd / ibv_close_xrcd
- ibv_create_srq_ex
- ibv_get_srq_num
- ibv_create_qp_ex
- ibv_open_qp
Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their appropriate man pages.
4.12 Flow Steering
Flow Steering is applicable to the mlx4 driver only.
Flow steering is a new model which steers network flows, based on flow specifications, to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and
22. -u eth5: Shows all of ethtool's steering rules.
When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in the kernel (v2.6.28).
MLX4 Driver Support
The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure. The following are the flow specific parameters:
Table 5 - Flow Specific Parameters
Columns: ether; tcp4/udp4; ip4
Mandatory: dst; src-ip/dst-ip
Optional: vlan; src-ip, dst-ip, src-port, dst-port, vlan; src-ip, dst-ip, vlan
RFS
RFS is an in-kernel logic responsible for load balancing between CPUs by attaching flows to CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic by implementing ndo_rx_flow_steer, which, in turn, calls the underlying flow steering mechanism with the RFS domain.
Enabling RFS requires enabling the 'ntuple' flag via ethtool. For example, to enable ntuple for eth0, run:
  ethtool -K eth0 ntuple on
RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steer
23. Port State of Port #2 on CA #0 (VPI): UP 4X FDR (InfiniBand)
Error Counter Check on CA #0 (VPI): PASS
(further installer output not recoverable)
After the installer completes, information about the Mellanox OFED installation, such as prefix, kernel version, and installation parameters, can be retrieved by running the command /etc/infiniband/info.
Installation Results
Software
- Most of the MLNX_OFED packages are installed under the "/usr" directory, except for the following packages, which are installed under the "/opt" directory: openshmem, bupc, fca and ibutils
- The kernel modules are installed under:
  /lib/modules/`uname -r`/updates on SLES and Fedora distributions
  /lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat-like distributions
  /lib/modules/`uname -r`/updates/dkms/ on Ubuntu
Firmware
- The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
  a. You run the installation script in default mode; that is, without the option '--without-fw-update'.
  b. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.
If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expans
24. Storage Data (Lustre OST): Min BW 30%
Administration
- MPI is assigned an SL via the command line:
  host1# mpirun -sl 0
- OpenSM QoS policy file. In the following policy file example, replace OST* and MDS* with the real port GUIDs.
  qos-ulps
    default :0 # default SL (for MPI)
    any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
    any, target-port-guid MDS1,MDS2 :2 # SL for Lustre MDS
  end-qos-ulps
- OpenSM options file
  qos_max_vls 8
  qos_high_limit 0
  qos_vlarb_high 2:1
  qos_vlarb_low 0:96,1:224
  qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
8.7.2 EDC SOA (2-tier): IPoIB and SRP
The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.
QoS Levels
- Application traffic: IPoIB (UD and CM) and SDP; isolated from storage; Min BW of 50%
- SRP: Min BW 50%; bottleneck at storage nodes
Administration
- OpenSM QoS policy file. In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.
  qos-ulps
    default
    ipoib
    sdp
    srp, target-port-guid SRPT1,SRPT2,SRPT3
    (SL assignments illegible in this excerpt)
  end-qos-ulps
- OpenSM options file
  qos_max_vls 8
  qos_high_limit 0
  qos_vlarb_high 1:32,2:32
  qos_vlarb_low 0:1
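The low-priority VL arbitration line in the HPC example above can be turned into bandwidth shares with simple arithmetic. The Python sketch below assumes the garbled options line reads as VL:weight pairs, that is `0:96,1:224`; the functions `parse_vlarb` and `bandwidth_shares` are hypothetical, and the sketch does not assert which traffic class is served by which VL.

```python
from fractions import Fraction
from typing import Dict


def parse_vlarb(entry: str) -> Dict[int, int]:
    """Parse a 'VL:weight,VL:weight' string into a {VL: weight} mapping."""
    weights: Dict[int, int] = {}
    for pair in entry.split(","):
        vl, weight = pair.split(":")
        weights[int(vl)] = int(weight)
    return weights


def bandwidth_shares(weights: Dict[int, int]) -> Dict[int, Fraction]:
    """Return each VL's weight as a fraction of the total weight."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("total weight is zero")
    return {vl: Fraction(weight, total) for vl, weight in weights.items()}


if __name__ == "__main__":
    shares = bandwidth_shares(parse_vlarb("0:96,1:224"))
    for vl, share in sorted(shares.items()):
        print(f"VL{vl}: {float(share):.0%}")
```

Under that reading the weights 96 and 224 sum to 320, giving shares of 30% and 70%; the 30% figure matches the minimum bandwidth quoted for Lustre OST data at the top of this excerpt.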
25. Table 21 - Adaptive Routing Manager Per-Switch Options File
Option: Description (Values)
ENABLE: <true|false>: Allows you to enable/disable the AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. This option can be changed on-the-fly. (Default: true)
AGEING_TIME: <usec>: Applicable to bounded AR mode only. Specifies how long there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value). In the per-switch options file this option refers to the particular switch only. This option can be changed on-the-fly. (Default: 30)
8.8.5.1.2 Example of Adaptive Routing Manager Options File
  ENABLE: true;
  LOG_FILE: /tmp/ar_mgr.log;
  LOG_SIZE: 100;
  MAX_ERRORS: 10;
  ERROR_WINDOW: 5;
  SWITCH 0x12345 {
    ENABLE: true;
    AGEING_TIME: 77;
  }
  SWITCH 0x0002c902004050f8 {
    AGEING_TIME: 44;
  }
  SWITCH 0xabcde {
    ENABLE: false;
  }
8.9 Congestion Control
8.9.1 Congestion Control Overview
Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation.
The Congestion Control mechanism controls traffic entry into a net
26. Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing.
In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case:
(Diagram: master spanning tree on the 6x5 torus; x axis from 0 to 5, y axis from 0 to 4.)
Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops, as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example.
8.5.7.3 Torus Topology Discovery
The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the
27. In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:
  dhclient -cf <client conf file> <IB network interface name>
Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:
  # The value indicates a hexadecimal number
  interface "ib1" {
    send dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
  }
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf:
  # The value indicates a hexadecimal number
  interface "ib1" {
    send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
  }
In order to use the configuration file, run:
  host1# dhclient -cf dhclient.conf ib1
4.3.3.2 Static IPoIB Configuration
If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the '-n' option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
- A static IPoIB configuration
- An IPoIB configuration based on an Ethernet configuration
See your Linux distribution documentation for additional information about configuring IP addresses.
The fol
28. to describe how the overall system works:
    1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application or ULP or service uses.
    2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class, Service-ID, PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
    3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.
    4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
    5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE and Lifetime.
    6. ULPs and programs (e.g. SDP) use CMA t
29. 1.2 Mellanox OFED Package

1.2.1 ISO Image

Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images or as a tarball, one per supported Linux distribution and CPU architecture, that includes source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that performs the necessary steps to accomplish the following:
• Discover the currently installed kernel
• Uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor's commercial stack
• Install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
• Identify the currently installed InfiniBand HCAs and perform the required firmware updates

1.2.2 Software Components

MLNX_OFED_LINUX contains the following software components:
• Mellanox Host Channel Adapter Drivers
    • mlx5, mlx4 (VPI), which is split into multiple modules:
        • mlx4_core (low-level helper)
        • mlx4_ib (IB)
        • mlx5_ib
        • mlx5_core
        • mlx4_en (Ethernet)
• Mid-layer core
    • Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
    • IPoIB, RDS, SRP Initiator and SRP
    NOTE: RDS was not tested by Mellanox Technologies.
• MPI
    • Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces
    • OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces
    • MP
30. 1.3.5 MPI

Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:
• Open MPI: an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University
Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.

1.3.6 InfiniBand Subnet Manager

All InfiniBand-compliant ULPs require a proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 8, "OpenSM Subnet Manager".

1.3.7 Diagnostic Utilities

Mellanox OFED includes the following two diagnostic packages for use by network and data center managers:
• ibutils: Mellanox Technologies diagnostic utilities
• infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools

1.3.8 Mellanox Firmware Tools

The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for:
• Generating a standard or customized Mellanox firmware image
• Querying for firmware information
31. 4.3.1 Introduction
4.3.2 IPoIB Mode Setting
4.3.3 IPoIB Configuration
4.3.4 Subinterfaces
4.3.5 Verifying IPoIB Functionality
4.3.6 Bonding IPoIB
4.4 Quality of Service InfiniBand
4.4.1 Quality of Service Overview
4.4.2 QoS Architecture
4.4.3 Supported Policy
4.4.4 CMA Features
4.4.5 OpenSM Features
4.5 Quality of Service Ethernet
4.5.1 Quality of Service Overview
4.5.2 Mapping Traffic to Traffic Classes
4.5.3 Plain Ethernet Quality of Service Mapping
4.5.4 Quality of Service Mapping
4.5.5 Raw Ethernet QP Quality of Service Mapping
4.5.6 Map Priorities with tc_wrap.py/mlnx_qos
4.5.7 Quality of Service Properties
32. 5.5.3 Various Executable Examples

The following are various executable examples.
> To run a ScalableUPC application without FCA support:
    % upcrun -np 128 -fca_enable 0 <executable filename>
> To run UPC applications with FCA enabled for any number of processes:
    % export GASNET_FCA_ENABLE_CMD_LINE=1 GASNET_FCA_NP_CMD_LINE=0
    % upcrun -np 64 <executable filename>
> To run a UPC application on 128 processes in verbose mode:
    % upcrun -np 128 -fca_enable 1 -fca_np 10 -fca_verbose 5 <executable filename>
> To run a UPC application and offload only Barrier and Broadcast to FCA:
    % upcrun -np 128 -fca_ops <executable filename>

6 Working With VPI

VPI allows ConnectX ports to be independently configured as either IB or Eth.

6.1 Port Type Management

ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded. Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices. Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart". Possible port types are:
• eth: Eth
33. 8.7.3 EDC (3-tier): IPoIB, RDS, SRP

The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.

QoS Levels
• Management traffic (ssh)
    • IPoIB management VLAN (partition A)
    • Min BW 10%
• Application traffic
    • IPoIB application VLAN (partition B)
    • Isolated from storage and database
    • Min BW of 30%
• Database Cluster traffic
    • RDS
    • Min BW of 30%
• SRP
    • Min BW 30%
    • Bottleneck at storage nodes

Administration
• OpenSM QoS policy file
In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.

    qos-ulps
        default                                   :0
        ipoib, pkey 0x8001                        :1
        ipoib, pkey 0x8002                        :2
        rds                                       :3
        srp, target-port-guid SRPT1,SRPT2,SRPT3   :4
    end-qos-ulps

• OpenSM options file

    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:96,3:96,4:96
    qos_vlarb_low 0:1
    qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

• Partition configuration file

    Default=0x7fff,ipoib : ALL=full;
    PartA=0x8001, sl=1, ipoib : ALL=full;

8.8 Adaptive Routing

8.8.1 Overview

Adaptive Routing is at beta stage. Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
• Free A
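The qos_vlarb_high weights in the EDC options file above line up with the minimum bandwidth figures listed under QoS Levels. The following Python sketch checks that arithmetic; it assumes the weights act as simple relative shares among the high-priority VLs, which ignores other arbitration settings, and the function name vlarb_shares is hypothetical.

```python
def vlarb_shares(vlarb: str) -> dict:
    """Map each VL in a 'vl:weight,vl:weight' string to its share of the total weight.

    Assumption: weights are treated as plain relative shares.
    """
    weights = {}
    for entry in vlarb.split(","):
        vl, weight = entry.split(":")
        # A VL may appear more than once in an arbitration table; sum its weights.
        weights[int(vl)] = weights.get(int(vl), 0) + int(weight)
    total = sum(weights.values())
    return {vl: w / total for vl, w in weights.items()}


if __name__ == "__main__":
    for vl, share in sorted(vlarb_shares("1:32,2:96,3:96,4:96").items()):
        print(f"VL{vl}: {share:.0%}")
```

With the weights from the example, VL1 receives one tenth of the total weight and VL2 through VL4 receive three tenths each.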
34. Mellanox Technologies
350 Oakmead Parkway, Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
Beit Mellanox
PO Box 586 Yokneam 20692
Israel
www.mellanox.com
Tel: +972 (0)74 723 7200
Fax: +972 (0)4 959 3245

Copyright 2014. Mellanox Technologies. All Rights Reserved.
Mellanox, Mellanox logo, BridgeX, ConnectX, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, MLNX-OS, PhyX, SwitchX, UFM, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd.
Connect-IB, ExtendX, FabricIT, Mellanox Open Ethernet, Mellanox Virtual Modular Switch, MetroX, MetroDX, ScalableHPC, Unbreakable-Link are trademarks of Mellanox Technologies, Ltd.
All other trademarks are property of their respective owners.

Document Number: 2877
Rev 2.1-1.0.0

Table of Contents
Table of Contents
List of Figures
List of Tables
Chapter 1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
1.2 Mellanox OFED Package
1.2.1 ISO Image
1.2.2 Software Compone
35. Step 2. Change the related interface (in the example below, bridge0 is created over eth5).

    DEVICE=eth5
    BOOTPROTO=none
    STARTMODE=on
    HWADDR=00:02:c9:2e:66:52
    TYPE=Ethernet
    NM_CONTROLLED=no
    ONBOOT=yes
    BRIDGE=bridge0

Step 3. Restart the service network.
Step 4. Attach a virtual NIC to VM.

    ifconfig -a
    eth6  Link encap:Ethernet  HWaddr 52:54:00:E7:77:99
          inet addr:13.195.15.5  Bcast:13.195.255.255  Mask:255.255.0.0
          inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:481 errors:0 dropped:0 overruns:0 frame:0
          TX packets:450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:22440 (21.9 KiB)  TX bytes:19232 (18.7 KiB)
          Interrupt:10 Base address:0xa000

4.13.4 Assigning a Virtual Function to a Virtual Machine

This section describes a mechanism for adding an SR-IOV VF to a Virtual Machine.

4.13.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server

Step 1. Run the virt-manager.
Step 2. Double click on the virtual machine and open its Properties.
Step 3. Go to Details > Add hardware > PCI host device.

[Figure: virt-manager "Add new virtual hardware" dialog, showing the hardware type list]
36. Switch | Affected/Relevant Commands | Description
<image> | burn, verify | Binary image file.
-qq | burn, query | Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
-nofs | burn | Burn image in a non-failsafe manner.
-skip_is | burn | Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant sector difference is detected.
-byte_mode | burn, write | Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
-s[ilent] | burn | Do not print burn progress messages.
-y[es] | All | Non-interactive mode: Assume the answer is "yes" to all questions.
-no | All | Non-interactive mode: Assume the answer is "no" to all questions.

Table 36 - mstflint Switches (Sheet 3 of 3)

Switch | Affected/Relevant Commands | Description
-vsd <string> | burn | Write this string of up to 208 characters to VSD upon a burn command.
-use_image_ps | burn | Burn vsd as it appears in the given image; do not keep existing VSD on Flash.
-dual_image | burn | Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
-v | | Print version info.
37. mlnx_affinity start
• Stop
    mlnx_affinity stop
• Restart
    mlnx_affinity restart

mlnx_affinity can also be started by driver load/unload.
> To enable mlnx_affinity by default:
• Add the line below to the /etc/infiniband/openib.conf file.
    RUN_AFFINITY_TUNER=yes

7.2.7.3 Tuning for Multiple Adapters

When optimizing system performance for more than one adapter, it is recommended to separate each adapter's core utilization so that there is no interleaving between interfaces. The following script can be used to assign each adapter's IRQs to a different set of cores.

    set_irq_affinity_cpulist.sh <cpu list> <interface>

<cpu list> can be either a comma-separated list of single core numbers (0,1,2,3) or core groups (0-3).

Example:
If the system has 2 adapters on the same NUMA node (0-7), each with 2 interfaces, run the following:

    /etc/init.d/irqbalancer stop
    set_irq_affinity_cpulist.sh 0-1 eth2
    set_irq_affinity_cpulist.sh 2-3 eth3
    set_irq_affinity_cpulist.sh 4-5 eth4
    set_irq_affinity_cpulist.sh 6-7 eth5

7.2.8 Tuning Multi-Threaded IP Forwarding

> To optimize NIC usage as IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
• For MLNX_OFED-2.0.x:
    options mlx4_en inline_thold=0
    options mlx4_core high_rate_steer=1
• For MLNX_EN-1.5.10:
    options mlx4_en num_lro=0 inline
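The multiple-adapter example in section 7.2.7.3 above gives each interface its own contiguous core range. The Python sketch below computes such a split for an arbitrary core range and interface list; the function name split_cores is hypothetical, and it only prints the set_irq_affinity_cpulist.sh command lines from the text rather than running them.

```python
def split_cores(first: int, last: int, interfaces: list) -> dict:
    """Give each interface a contiguous, non-overlapping core range such as '0-1'.

    Assumption: the cores first..last divide evenly among the interfaces.
    """
    cores = last - first + 1
    if not interfaces or cores % len(interfaces) != 0:
        raise ValueError("core count must divide evenly among interfaces")
    per_interface = cores // len(interfaces)
    plan = {}
    for index, name in enumerate(interfaces):
        low = first + index * per_interface
        high = low + per_interface - 1
        plan[name] = str(low) if low == high else f"{low}-{high}"
    return plan


if __name__ == "__main__":
    for name, cpus in split_cores(0, 7, ["eth2", "eth3", "eth4", "eth5"]).items():
        print(f"set_irq_affinity_cpulist.sh {cpus} {name}")
```

For cores 0 to 7 and the four interfaces of the example, the printed commands match the ones listed in the text.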
38. mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0000:0000:0007:3897
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

    Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c900:0101:d151
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

    Infiniband device 'mthca0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c900:0101:d152
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

2. List the status of specific ports of specific devices:

    > ibstatus mthca0:1 mlx4_0:2
    Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c900:0101:d151
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

    Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0000:0000:0007:3897
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

9.10 ibportstate

Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port. If the queried port is a switch port, then ibportstate can be used to:
• disable, enable or reset the port
• validate the port's link width and speed
39. scst.sourceforge.net/downloads.html
b. Untar scst-1.0.1.1.
c. Install scst-1.0.1.1 as follows:
    make && make install

B.2 How to Run

A. On an SRP Target machine:
1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).

Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0 (it is not required to have the lun numbers in ascending order, except that the first lun must always be 0).

Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the ib_srpt module but does not load scst nor its dev_handlers.

The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.

Example 1: Working with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
f. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
g. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
h. echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
Exa
40. torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

8.5.7.2 Multicast Routing

Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root switch and each "+" is a non-root switch:

[Figure: full-fabric spanning tree for the 2D 6x5 torus; x axis 0 to 5, y axis 0 to 4]

For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanni
41. 0 Examples:

    Default=0x7fff : ALL, SELF=full ;
    NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
    YetAnotherOne = 0x300 : SELF=full ;
    YetAnotherOne = 0x300 : ALL=limited ;
    ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
    # 0x123453, 0x123454 will be limited
    ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
    # 0x123456, 0x123457 will be limited
    ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
    ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
    ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

The following rule is equivalent to how OpenSM used to run prior to the partition manager:

    Default=0x7fff,ipoib:ALL=full;

8.5 Routing Algorithms

OpenSM offers six routing engines:
1. Min Hop Algorithm
Based on the minimum hops to each node where the path length is optimized.
2. UPDN Algorithm
Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
3. Fat-tree Routing Algorithm
This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any
42. For example:

    struct ibv_exp_device_attr attr;
    ibv_exp_query_device(context, &attr);
    if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_TIMESTAMP_MASK) {
        if (attr.timestamp_mask) {
            /* Time stamping is supported with mask attr.timestamp_mask */
        }
    }
    if (attr.comp_mask & IBV_EXP_DEVICE_WITH_HCA_CORE_CLOCK) {
        if (attr.hca_core_clock) {
            /* reporting the device's clock is supported. */
            /* attr.hca_core_clock is the frequency in MHZ */
        }
    }

4.6.2.2 Creating Time Stamping Completion Queue

To get time stamps, a suitable extended Completion Queue (CQ) must be created via a special call to the ibv_create_cq_ex verb.

    cq_init_attr.flags = IBV_CQ_TIMESTAMP;
    cq_init_attr.comp_mask = IBV_CQ_INIT_ATTR_FLAGS;
    cq = ibv_create_cq_ex(context, cqe, node, NULL, 0, &cq_init_attr);

This CQ cannot report SL or SLID information. The values of the sl and sl_id fields in struct ibv_wc_ex are invalid. Only the fields indicated by the wc_flags field in struct ibv_wc_ex contain a valid and usable value.

When using Time Stamping, several fields of struct ibv_wc_ex are not available, resulting in RoCE UD / RoCE traffic with VLANs failure.

4.6.2.3 Polling a Completion Queue

Polling a CQ for time stamp is done via the ibv_poll_cq_ex verb.

    ret = ibv_poll_cq_ex(cq, 1, &wc_ex, sizeof(wc_ex));
    if (ret > 0) {
        /* CQ returned a wc */
        if (wc_ex.wc_flags & IBV_WC_WITH_TIMESTAMP) {
            /* This wc contains a t
43. 13.7.3.3 RoCE Support

RoCE is supported on Virtual Functions, and VLANs may be used with it. For RoCE, the hypervisor GID table size is 16 entries, while the VFs share the remaining 112 entries. When the number of VFs is larger than 56, some of them will have a GID table with only a single entry, which is inadequate if the VF's Ethernet device is assigned an IP address. When setting num_vfs in the mlx4_core module parameter, it is important to check that the number of assigned IP addresses per VF does not exceed the limit for the GID table size.

4.14 CORE-Direct

4.14.1 CORE-Direct Overview

CORE-Direct provides a solution for off-loading MPI collective operations from the software library to the network. CORE-Direct accelerates MPI applications and solves the scalability issues in large-scale systems by eliminating the issues of operating system noise and jitter. It addresses the collective communication scalability problem by off-loading a sequence of data-dependent communications to the Host Channel Adapter (HCA). This solution provides the hooks needed to support computation and communication overlap. Additionally, it provides a means to reduce the effects of system noise and application skew on application scalability.

The relevant verbs to be used for CORE-Direct:
• ibv_create_qp_ex
• ibv_modify_cq
• ibv_query_device_ex
• ibv_post_task

Sample programs for reference
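The GID table limit described in the RoCE Support paragraph above can be checked with simple arithmetic before choosing num_vfs. The Python sketch below uses the 16 and 112 entry figures from the text; the even split with the remainder going to the first VFs is an assumption made for illustration, not the driver's documented allocation, and the function name gid_entries_per_vf is hypothetical.

```python
HYPERVISOR_GID_ENTRIES = 16   # from the text
SHARED_VF_GID_ENTRIES = 112   # from the text


def gid_entries_per_vf(num_vfs: int) -> list:
    """Return an assumed GID entry count for each VF.

    Assumption: the shared entries are split evenly, and any remainder
    goes one entry at a time to the lowest-numbered VFs.
    """
    if not 1 <= num_vfs <= SHARED_VF_GID_ENTRIES:
        raise ValueError("num_vfs must be between 1 and 112")
    base, extra = divmod(SHARED_VF_GID_ENTRIES, num_vfs)
    return [base + 1 if vf < extra else base for vf in range(num_vfs)]


if __name__ == "__main__":
    for num_vfs in (8, 56, 57):
        entries = gid_entries_per_vf(num_vfs)
        print(num_vfs, "VFs: smallest table has", min(entries), "entries")
```

Under this assumption, up to 56 VFs each keep at least two entries, and from 57 VFs onward some VFs are left with a single entry, which matches the threshold stated in the text.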
44. The example above shows two eIPoIB interfaces, where eth4 runs traffic over ib0, and eth5 runs traffic over ib1.

Figure 3: An Example of a Virtual Network
[Figure: a host with IPoIB instances ib0.2 and ib0.3 serving the virtual interfaces of KVM GUEST1 and KVM GUEST2 through eth0 and a bridge]

The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual Machines.

To display the services provided to the Virtual Machine interfaces:

    cat /sys/class/net/eth0/eth/vifs

Example:

    cat /sys/class/net/eth0/eth/vifs
    SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A

In the example above, the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no VLAN tag for that interface.

4.8.3 VLAN Configuration Over an eIPoIB Interface

The eIPoIB driver supports VLAN Switch Tagging (VST) mode, which enables the virtual machine interface to have no VLAN tag over it, thus allowing VLAN tagging to be handled by the Hypervisor.

> To attach a Virtual Machine interface to a specific isolated tag:
Step 1. Verify the VLAN tag to be used has the same pkey value that is already configured on that ib port.
    cat /sys/class/infiniband/mlx4_0/ports/<ib port>/pkeys/*
Step 2. Create a VLAN interface in the Hypervisor, over the eIPoIB interface.
    vconfig add <eIPoIB interface> <vlan tag>
Step 3. Attach
45. 2.5 Interrupt Moderation
7.2.6 Tuning for NUMA Architecture
7.2.7 IRQ Affinity
7.2.8 Tuning Multi-Threaded IP Forwarding
Chapter 8 OpenSM Subnet Manager
8.1 Overview
8.2 Description
8.2.1 opensm Syntax
8.2.2 Environment Variables
8.2.3 Signaling
8.2.4 Running opensm
8.3 osmtest Description
8.3.1 Syntax
8.3.2 Running osmtest
8.4 Partitions
8.4.1 File Format
8.5 Routing Algorithms
8.5.1 Effect of Topology Changes
8.5.2 Min Hop Algorithm
8.5.3 UPDN Algorithm
8.5.4 Fat-tree Routing Algorithm
8.5.5 LASH Routing Algorithm
46. [Figure: mapping of socket priorities (skprio) to user priorities (UP)]

4.5.8.3 Additional Tools

The tc tool compiled with the sch module is required to support kernel v2.6.32 or higher. This is a part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.
• mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
• tc_wrap.py (package: ofed-scripts) requires python >= 2.5

4.6 Ethernet Time Stamping

4.6.1 Ethernet Time Stamping Service

Time Stamping is currently supported in ConnectX-3/ConnectX-3 Pro adapter cards only.

Time stamping is the process of keeping track of the creation of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed on the PCI, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.

4.6.1.1 Enabling Time Stamping

Time stamping is off by default and should be enabled before use.
> To enable time stamping for a socket:
• Call setsockopt() with SO_TIMESTAMPING and with the following flags:
    SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
    SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software
    SOF_
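The setsockopt() call described above for enabling time stamping can be sketched in Python as follows. This is an illustration only: the numeric constants are assumptions taken from the Linux headers (linux/net_tstamp.h and the generic socket.h used on x86), only the two transmit flags named in the text are combined, and the helper names are hypothetical.

```python
import socket

# Assumed values from the Linux kernel headers (x86); verify on the target system.
SO_TIMESTAMPING = 37
SOF_TIMESTAMPING_TX_HARDWARE = 1 << 0
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1


def tx_timestamp_flags() -> int:
    """Combine the two transmit time-stamping flags named in the text."""
    return SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_TX_SOFTWARE


def enable_tx_timestamping(sock: socket.socket) -> None:
    """Request transmit time stamps on an existing socket (Linux only)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, tx_timestamp_flags())


if __name__ == "__main__":
    print(f"flags = {tx_timestamp_flags():#x}")
```

A real application would also need to request the remaining flags listed in the manual and read the time stamps back from the socket, which this sketch does not attempt.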
47. 4.1.2.5 Multiple Connections from Initiator InfiniBand Port to the Target

Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA.

In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command.

Also in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, there is need for a different initiator_ext value on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path.

If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:

    id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200

Notes:
1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRP_DAEMON_ENABLE to yes).

4.1.2.6 High Availability (HA)

Overview

High Availability w
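In the srp_daemon example above, the initiator_ext value ed2b400002c90200 is the low 64 bits of the dgid (the Target port GUID 0002c90200402bed) with the byte order reversed. The Python sketch below reproduces that relationship; the byte reversal is inferred from this one example rather than stated by the manual, and the function name is hypothetical.

```python
def initiator_ext_from_port_guid(port_guid_hex: str) -> str:
    """Return the 16-hex-digit initiator_ext for a Target port GUID.

    Assumption (inferred from the manual's example): the value is the
    8-byte port GUID with its bytes in reverse order.
    """
    raw = bytes.fromhex(port_guid_hex)
    if len(raw) != 8:
        raise ValueError("a port GUID is 8 bytes (16 hex digits)")
    return raw[::-1].hex()


if __name__ == "__main__":
    dgid = "fe800000000000000002c90200402bed"
    port_guid = dgid[16:]  # low 64 bits of the GID
    print(initiator_ext_from_port_guid(port_guid))
```

Applied to the dgid from the example, this yields the same initiator_ext value that the manual shows.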
48. B.2 How to Run
B.3 How to Unload/Shutdown
Appendix C mlx4 Module Parameters
C.1 mlx4_ib Parameters
C.2 mlx4_core Parameters
C.3 mlx4_en Parameters
Appendix D mlx5 Module Parameters
Appendix E Lustre Compilation over MLNX_OFED

List of Figures
Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards
Figure 2: I/O Consolidation Over InfiniBand
Figure 3: Example of a Virtual Network
Figure 4: QoS Manager
Figure 5: Example QoS Deployment on InfiniBand Subnet

List of Tables
Table 1: Document Revision History
Table 2: Abbreviations and Acronyms
Table 3: Glossary
Table 4: Reference Documents
Table 5: Software and Hardware Requirements
49. BIT    LOG LEVEL ENABLED
    0x01 - ERROR (error messages)
    0x02 - INFO (basic messages, low volume)
    0x04 - VERBOSE (interesting stuff, moderate volume)
    0x08 - DEBUG (diagnostic, high volume)
    0x10 - FUNCS (function entry/exit, very high volume)
    0x20 - FRAMES (dumps all SMP and GMP frames)
    0x40 - ROUTING (dump FDB routing information)
    0x80 - currently unused

Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

-h, --help
Display this usage info, then exit.

8.3.2 Running osmtest

To run osmtest in the default mode, simply enter:

    host1# osmtest

The default mode runs all the flows except for the Quality of Service flow (see Section 8.6).

After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the following command in order to generate the inventory file:

    host1# osmtest -f c

Immediately afterwards, run the following command to test opensm:

    host1# osmtest -f a

Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.

8.4 Partitions

OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/par
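The -vf value described above is a bit mask, so any value can be decoded into the log levels it enables. The following Python sketch does this using the bit assignments from the table; the function name decode_vf is hypothetical and is not part of osmtest.

```python
# Bit assignments as listed in the osmtest usage text; 0x80 is currently unused.
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_vf(mask: int) -> list:
    """Return the names of the log levels enabled by a -vf bit mask."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if mask & bit]


if __name__ == "__main__":
    for mask in (0x3, 0x0, 0xFF):
        print(f"-vf {mask:#x}: {decode_vf(mask)}")
```

Decoding 0x3 gives ERROR and INFO, which is the default the text describes when -vf is not specified.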
50. FCA Al bin burn
Step 4. Reboot your machine after the firmware burning is completed.

2.5 Installing MLNX_OFED using YUM

2.5.1 Setting up MLNX_OFED YUM Repository

Step 1. Download the tarball to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label><CPU arch>.tgz. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 2. Extract the MLNX_OFED tarball package to a shared location in your network.

    # tar xzf MLNX_OFED_LINUX-<MLNX_OFED version>-rhel6.4-x86_64.tgz

Step 3. Download and install Mellanox Technologies GPG-KEY. The key can be downloaded via the following link: http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox

    # wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
    --2013-08-20 13:52:30--  http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
    Resolving www.mellanox.com... 72.3.194.0
    Connecting to www.mellanox.com|72.3.194.0|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 1354 (1.3K) [text/plain]
    Saving to: "RPM-GPG-KEY-Mellanox"
    100%[=====>] 1,354  --.-K/s  in 0s
    2013-08-20 13:52:30 (247 MB/s) - "RPM-GPG-KEY-Mellanox" saved [1354/1354]

Step 4. Install the key.

    # sudo rpm --import RPM-GPG-KEY-Mellanox

Step 5. Check that the key was successfully imported.

    # rpm -q gpg-pubkey --qf '%{NAME}
51. ###########################################
    Preparing...            ###########################################
    libibumad-devel         ###########################################
    Preparing...            ###########################################
    libibumad-devel         ###########################################
    Preparing...            ###########################################
    libibumad-static        ###########################################
    Preparing...            ###########################################
    libibumad-static        ###########################################
    Preparing...            ###########################################
    libibmad                ###########################################
    Preparing...            ###########################################
    libibmad                ###########################################
    Preparing...
    libibmad-devel
    Preparing...
    libibmad-devel
    Preparing...
    libibmad-static
    Preparing...
    libibmad-static
    Preparing...
    ibsim
    Preparing...
    ibacm
    Preparing...
    librdmac
    Preparing...
    librdmac
    Preparing...
    librdmac
    Preparing...
    librdmac
    Preparing...
    librdmac
52. IB fabric.

Unicast Linear Forwarding Tables (LFT): A table that exists in every switch providing the port through which packets should be sent to each LID.

Virtual Protocol Interconnect (VPI): A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).

Related Documentation

Table 4 - Reference Documents

Document Name | Description
InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 | The InfiniBand Architecture Specification that is provided by IBTA.
IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002), Document PDF 8594996 | Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gb/s Operation.
Firmware Release Notes for Mellanox adapter devices | See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.
MFT User's Manual | Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.
MFT Release Notes | Release Notes for the Mellanox Firmware Tools
53. IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com > Products > InfiniBand/VPI SW/Drivers). The binary code is exported by the device as an expansion ROM image.

A.2 FlexBoot Package

The FlexBoot package is provided as a tarball (.tgz extension). Uncompress it using the command "tar zxf <package file name>". The tarball contains PXE binary files (with the .mrom extension) for the supported adapter devices. See the release notes file FlexBoot-<flexboot_version>_release_notes.txt for details.

The package includes the following files:
• dhcpd.conf: sample DHCP configuration file
• dhcp.patch: patch file for DHCP v3.1.3

A.3 Burning the Expansion ROM Image

A.3.4 Burning the Image on ConnectX-2 / ConnectX-3

This section is valid for ConnectX-2 devices with firmware versions 2.9.1000 or later, and ConnectX-3 firmware versions 2.30.3000 or later.

Prerequisites
1. Expansion ROM Image
The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file FlexBoot-<flexboot_version>_release_notes.txt.
2. Firmware Burning Tools
You need to install the Mellanox Firmware Tools (MFT) package (version 3.0.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Products
54. (MT47396 Infiniscale-III Mellanox Technologies):
    0x0006 023 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007 020 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008 024 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
    5 valid lids dumped
5. Dump all non-empty mlids of switch with Lid 3.
    > ibroute -M 3
    Multicast mlids [0xc000-0xc3ff] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
    MLid: 0xc000 0xc001 0xc002 0xc003 0xc020 0xc021 0xc022 0xc023 0xc024 0xc040 0xc041 0xc042
    12 valid mlids dumped
9.12 smpquery
Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.
Synopsis
    smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]
Table 33 lists the various flags of the command.
Table 33 - smpquery Flags and Options
    Flag | Optional/Mandatory | Default (If Not Specified) | Description
    -h/(--help) | Optional | | Print the help menu
    -d/(--debug) | Optional | | Raise the IB debug level. Ma
55. MXM
Step 1. Remove MXM v1.1:
    rpm -e mxm
Step 2. Remove the pre-compiled OpenMPI:
    rpm -e mlnx-openmpi_gcc
Step 3. Install the new MXM and compile the OpenMPI with it.
To run OpenMPI without MXM, run:
    mpirun -mca mtl ^mxm <...>
When upgrading to MXM v2.1, OpenMPI compiled with the previous versions of the MXM should be recompiled with MXM v2.1.
5.3.2 Enabling MXM in OpenMPI
MXM v2.1 is automatically selected by OpenMPI (up to v1.6) when the Number of Processes (NP) is higher than or equal to 128. To enable MXM for any NP, use the following OpenMPI parameter:
    -mca mtl_mxm_np <number>
From OpenMPI v1.7, MXM is selected when the number of processes is higher than or equal to 0, i.e. by default.
To activate MXM for any NP, run:
    mpirun -mca mtl_mxm_np 0 <other mpirun parameters>
5.3.3 Tuning MXM Settings
The default MXM settings are already optimized. To check the available MXM parameters and their default values, run the /opt/mellanox/mxm/bin/mxm_dump_config utility, which is part of the MXM RPM. MXM parameters can be modified in one of the following methods:
Modifying the default MXM parameters value as part of the mpirun:
    mpirun -x MXM_UD_RX_MAX_BUFFERS=128000 <...>
Modifying the default MXM parameters value from SHELL:
    export MXM_UD_RX_MAX_BUFFERS=128000
    mpirun <...>
5.3.4 Configuring Multi
56. Preparing... opensm-libs
    Preparing... opensm
    Preparing... opensm-devel
    Preparing... opensm-static
    Preparing... infiniband-diags
    Preparing... fca
IMPORTANT NOTE:
The FCA Manager and FCA MPI Runtime library are installed in the /opt/mellanox/fca directory. The FCA Manager will not be started automatically.
To start FCA Manager now, type:
    /etc/init.d/fca_managerd start
There should be a single process of FCA Manager running per fabric.
To start FCA Manager automatically after boot, type:
    /etc/init.d/fca_managerd install_service
Check /opt/mellanox/fca/share/doc/fca/RE
57. RH6.2x64
    root (hd0,0)
    kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1
7.2.5 Interrupt Moderation
Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.
To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.
    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: on TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 16
    rx-frames: 88
    rx-usecs-irq: 0
    rx-frames-irq: 0
    > ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0
    > ethtool -c eth1
    Coalesce parameters for eth1:
    Adaptive RX: off TX: off
    pkt-rate-low: 400000
    pkt-rate-high: 450000
    rx-usecs: 0
    rx-frames: 0
    rx-usecs-irq: 0
    rx-frames-irq: 0
7.2.6 Tuning for NUMA Architecture
7.2.6.1 Tuning for Intel Sandy Bridge Platform
The Intel Sandy Bridge processor has an
58. RPMs. Please wait...
    Removing OFED RPMs...
    Created /tmp/MLNX_OFED_LINUX-2.1-1.0.0-rhel6.2-x86_64.tgz
The mlnx_add_kernel_support.sh script can be executed directly from the mlnxofedinstall script. For further information, please see the '--add-kernel-support' option below.
2.3.2 Installation Script
Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure".
Usage
Options
    --hugepages-overcommit    Setting 80% of MAX_MEMORY as overcommit for huge page allocation
    -q                        Set quiet - no messages will be printed
    --with-fabric-collector   Install fabric-collector package
2.3.2.1 mlnxofedinstall Return Codes
Table 2 lists the mlnxofedinstall script return codes and their meanings.
Table 2 - mlnxofedinstall Return Codes
    Return Code | Meaning
    0   | The installation ended successfully
    1   | The installation failed
    2   | No firmware was found for the adapter device
    22  | Invalid parameter
    28  | Not enough free space
    171 | Not applicable to this system configuration. This can occur when the required hardware is not present on the system.
    172 | Prerequisites are not met. For example, missing the required software installed or the hardware is not configur
59. SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits. Path Bits are not implemented in OFED.
IV. Matching Rules: A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply. The matching expressions are defined for the following fields:
    - SRC and DST to lists of port groups
    - Service-ID to a list of Service-ID values or ranges
    - QoS-Class to a list of QoS-Class values or ranges
4.4.4 Features
The CMA interface supports Service-ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS-Class. The CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR.
4.4.4.1 IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.
4.4.4.2 SRP
The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service-ID in the PR/MPR by itself and uses that information in setting up the QP. SRP Service-ID is defined by the SRP target I/O Controller (it also complies with IBTA Service-ID rules). The Service-ID is reported by the I/O Controlle
60. Simula Research Laboratory.
Use the '-R lash -Q' option to activate the LASH algorithm.
Note: QoS support has to be turned on in order that SL/VL mappings are used.
Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.
For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order before the rest of the LASH algorithm runs.
8.5.6 DOR Routing Algorithm
The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a
61. Using port names defined in the topology file (tool option '-n'): This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the '-l' option.
ibdiagnet (of ibutils2) - IB Net Diagnostic
This version of ibdiagnet is included in the ibutils2 package, and it is run by default after installing Mellanox OFED. To use this ibdiagnet version, run:
    ibdiagnet
Please see ibutils2_release_notes.txt for additional information and known issues.
ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
Synopsis
    [-i|--device <dev-name>] [-p|--port <port-num>] [-g|--guid <GUID in hex>] [--vlr <file>] [-r|--routing] [-u|--fat_tree] [-o|--output_path <directory>] [--skip <stage>] [--skip_plugin <library name>] [-P|--counter <<PM>=<value>>] [--pm_pause_time <seconds>] [--ber_test] [--ber_use_data] [--ber_thresh <value>] [--extended_speeds <dev-type>] [--pm_per_lane] [--ls <2.5|5|10|14|25|FDR10>] [--lw <1x|4x|8x|12x>] [-w|--write
62. Window
    4.20.5 Deallocating Memory Window
Chapter 5 HPC Features
5.1 Shared Memory Access
    5.1.1 Mellanox ScalableSHMEM
    5.1.2 Running SHMEM with FCA
    5.1.3 Running ScalableSHMEM with MXM
    5.1.4 Running SHMEM with Contiguous
    5.1.5 Running ScalableSHMEM Application
5.2 Message Passing Interface
    5.2.1 Overview
    5.2.2 Prerequisites for Running
    5.2.3 MPI Selector - Which MPI Runs
    5.2.4 Compiling MPI Applications
5.3 MellanoX Messaging
    5.3.1 Compiling OpenMPI with MXM
    5.3.2 Enabling MXM in OpenMPI
    5.3.3 Tuning MXM Settings
    5.3.4 Configuring Multi-Rail Support
    5.3.5 Configuring MXM over the Ethernet Fabric
5.4 Fabric Collective Accelerator
63. a priority. Flow steering rules may be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses a different terminology from the flow attribute (ibv_flow_attr), defined by a combination of specifications (struct ibv_flow_spec_*).
4.12.1 Enable/Disable Flow Steering
Flow Steering is disabled by default and regular L2 steering is performed instead (B0 Steering). When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the flow steering table for the guest/master.
To enable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Set the parameter log_num_mgm_entry_size to -1 by writing the option:
    options mlx4_core log_num_mgm_entry_size=-1
Step 3. Restart the driver.
To disable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Remove the line "options mlx4_core log_num_mgm_entry_size=-1".
Step 3. Restart the driver.
4.12.2 Flow Domains and Priorities
Flow steering defines the concept of domain and priority. Each domain represents a user agent that can attach a flow. The domains are prioritized. A higher priority domain will always supersede a lower priority domain when their flow specifications overlap. Setting a lower priority value will result in higher priority. In addition to the domain, there is priority within each of the domains. Each domain can have at most 2^12 priorities in accordance with its needs. The following
64. affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.
The following command turns off the IRQ balancer:
    > /etc/init.d/irqbalance stop
The following command assigns the affinity of a single interrupt vector:
    > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity
Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.
IRQ Affinity Configuration
It is recommended to set each IRQ to a different core. For Sandy Bridge / AMD systems, set the IRQ affinity to the adapter's NUMA node.
For optimizing single-port traffic, run:
    set_irq_affinity_bynode.sh <numa node> <interface>
For optimizing dual-port traffic, run:
    set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
To show the current IRQ affinity settings, run:
    show_irq_affinity.sh <interface>
7.2.7.2 Auto Tuning Utility
MLNX_OFED 2.0.x introduces a new affinity tool called mlnx_affinity. This tool can automatically adjust your affinity settings for each network interface according to the system architecture.
Usage: Start
65. apply to impact this feature by changing the following environment variable:
    MLX4_STALL_NUM_LOOP=<integer> (default: 400)
Note: The default value is optimized for most applications. However, several applications might benefit from increasing/decreasing this value.
7.2.6.2 Tuning for AMD Architecture
On AMD architecture, there is a difference between a 2-socket system and a 4-socket system.
    With a 2-socket system, the PCIe adapter will be connected to socket 0 (nodes 0,1).
    With a 4-socket system, the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).
7.2.6.3 Recognizing NUMA Node Cores
To recognize NUMA node cores, run the following command:
    cat /sys/devices/system/node/node[X]/cpulist | cpumap
Example:
    cat /sys/devices/system/node/node1/cpulist
    1,3,5,7,9,11,13,15
    cat /sys/devices/system/node/node1/cpumap
    0000aaaa
7.2.6.3.1 Running an Application on a Certain NUMA Node
In order to run an application on a certain NUMA node, the process affinity should be set either in the command line or with an external tool. For example, if the adapter's NUMA node is 1 and NUMA 1 cores are 8-15, then an application should run with process affinity that uses cores 8-15 only.
To run an application, run the following commands:
    taskset -c 8-15 ib_write_bw -a
or:
    taskset 0xff00 ib_write_bw -a
7.2.7 IRQ Affinity
7.2.7.1 The
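The correspondence between a core list such as 8-15 and a hexadecimal mask such as 0xff00 (or between the cpumap value 0000aaaa and its core list) can be checked with a few lines of Python. This is only an illustrative sketch: the helper names cores_to_mask and mask_to_cores are hypothetical and are not part of MLNX_OFED.

```python
def cores_to_mask(cores):
    """Return an affinity bit mask in which bit i is set for every core i."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def mask_to_cores(mask):
    """Return the sorted list of core indices whose bits are set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Cores 8-15, as in the taskset example.
print(hex(cores_to_mask(range(8, 16))))
# The cpumap value 0000aaaa from the node1 example.
print(mask_to_cores(0x0000AAAA))
```

The same mask format applies to the value written to smp_affinity, where bit i selects processor core i.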
66.     bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
                                MASK_IS_SET(compare_add, bit_position), &new_carry);
        if (bit_add_res)
            atomic_response |= bit_position;
        carry = new_carry && !MASK_IS_SET(compare_add_mask, bit_position);
    return atomic_response;
Ethernet Tunneling Over IPoIB Driver (eIPoIB)
The eth_ipoib driver provides a standard Ethernet interface to be used as a Physical Interface (PIF) into the Hypervisor virtual network, and serves one or more Virtual Interfaces (VIF). This driver supports L2 Switching (Direct Bridging) as well as other L3 Switching modes (e.g. NAT). This document explains the configuration and driver behavior when configured in Bridging mode.
In a virtualization environment, a virtual machine can be exposed to the physical network by performing the following steps:
Step 1. Create a virtual bridge.
Step 2. Attach the para-virtualized interface created by the eth_ipoib driver to the bridge.
Step 3. Attach the Ethernet interface in the Virtual Machine to that bridge.
The diagram below describes the topology that was created after these steps:
[Figure: Virtual Interface(s) (vifX); Virtual Bridge(s) (vbrX, aka vSwitch); Bridge Uplink(s) (eIPoIB); IPoIB Uplink; InfiniBand Fabric]
The diagram shows how the traffic from the Virtual Machine goes to the virtual bridge in the Hypervisor and from the bridge to the eIPoIB interface. eIPoIB inter
67. bounded AR mode only. Specifies how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value). This option can be changed on-the-fly. Default: 30
MAX_ERRORS: <N>
ERROR_WINDOW: <N>
    When the number of errors exceeds 'MAX_ERRORS' send/receive errors or timeouts in less than 'ERROR_WINDOW' seconds, the AR Manager will abort, returning control back to the Subnet Manager. This option can be changed on-the-fly.
    Values for both options: [0-0xffff]
    MAX_ERRORS = 0: zero tolerance - abort configuration on first error. Default: 10
    ERROR_WINDOW = 0: mechanism disabled - no error checking. Default: 5
LOG_FILE: <full path>
    AR Manager log file. This option can be changed on-the-fly. Default: /var/log/armgr.log
LOG_SIZE: <size in MB>
    This option defines the maximal AR Manager log file size in MB. The log file will be truncated and restarted upon reaching this limit. This option cannot be changed on-the-fly.
    0: unlimited log file size. Default: 5
8.8.5.1.1 Per-switch AR Options
A user can provide per-switch configuration options with the following syntax:
    SWITCH <GUID> {
        <switch option 1>;
        <switch option 2>;
    }
The following are the per-switch options
68. by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. LRO is available in kernel versions 3.1 for untagged traffic. Note: LRO will be done whenever possible. Otherwise GRO will be done. Generic Receive Offload (GRO) is available throughout all kernels.
    ethtool -c eth<x>
        Queries interrupt coalescing settings.
    ethtool -C eth<x> adaptive-rx on|off
        Enables/disables adaptive interrupt moderation. By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.
    ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
        Sets the values for packet rate limits and for moderation time high and low values. Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.
Table 6 - ethtool Supported Options
    Options | Description
    ethtool -C eth<x> [rx-usecs N] [rx-frames N]
        Sets the interrupt coalescing settings when the adaptive moderation is disabled. Note: usec settings correspond to the time to wait after the last p
69. calculation (e.g. in case of host reboot). This option becomes very handy when the cluster size is thousands of nodes.
--lid_matrix_file, -M <file name>
    This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.
--lfts_file, -U <file name>
    This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.
--sadb_file, -S <file name>
    This option specifies the name of the SA DB dump file from where the SA database will be loaded.
--root_guid_file, -a <path to file>
    Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
--cn_guid_file, -u <path to file>
    Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
--io_guid_file, -G <path to file>
    Set the I/O nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
--port-shifting
    Attempt to shift port routes around to remove alignment problems in routing tables.
--scatter-ports <random seed>
    Randomize best port chosen for a route.
--max_reverse_hops, -H <hop_count>
    Set the max number of hops the wrong way around an I/O node is allowed to do (connectivity for I/O nodes on top switches).
--ids_guid_file, -m <path to file>
    Name of the ma
70. dimension as wired, and many switches/links in the fabric will not be placed into the torus.
8.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines. Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations. For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS.
Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated. Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the ra
71. each
    -i INTF, --interface=INTF   Interface name
    -a                          Show all interface's TCs
Get Current Configuration
Set ratelimit: 3Gbps for tc0, 4Gbps for tc1 and 2Gbps for tc2
Configure QoS: map UP 0,7 to tc0, 1,2,3 to tc1 and 4,5,6 to tc2. Set tc0,tc1 as ets and tc2 as strict. Divide ets 30% for tc0 and 70% for tc1.
    tc: 2 ratelimit: 2 Gbps, tsa: strict
4.5.8.2 tc and tc_wrap.py
The tc tool is used to set up the sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The tc_wrap.py tool will use either sysfs or the tc tool to configure the sk_prio to UP mapping.
Usage:
    tc_wrap.py -i <interface> [options]
Options:
    --version                             show program's version number and exit
    -h, --help                            show this help message and exit
    -u SKPRIO_UP, --skprio_up=SKPRIO_UP   maps sk_prio to UP. LIST is <=16 comma separated UP. Index of element is sk_prio.
    -i INTF, --interface=INTF             Interface name
Example: set skprio 0-2 to UP0, and skprio 3-7 to UP1 on eth4.
    [Command output: the skprio values listed under each UP]
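As a rough illustration of the -u list format (a comma-separated list in which the element index is the sk_prio and the element value is the UP), the following Python sketch builds the list for the mapping used in the example. The helper skprio_up_list is hypothetical and is not part of tc_wrap.py; treating unlisted sk_prio values as UP 0 is an assumption of the sketch, not documented behavior.

```python
def skprio_up_list(mapping, length=8):
    """Build the comma-separated UP list; index = sk_prio, value = UP.

    mapping: dict of {up: iterable of sk_prio values}.
    length:  number of sk_prio entries to emit (at most 16).
    """
    if not 0 < length <= 16:
        raise ValueError("the list holds between 1 and 16 entries")
    ups = [0] * length  # assumption: unlisted sk_prio values map to UP 0
    for up, skprios in mapping.items():
        for skprio in skprios:
            ups[skprio] = up
    return ",".join(str(up) for up in ups)


# sk_prio 0-2 -> UP 0, sk_prio 3-7 -> UP 1
print(skprio_up_list({0: range(0, 3), 1: range(3, 8)}))
```

The resulting string has one entry per sk_prio, which is the shape the -u option description calls for.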
72. encodes into a path SL which datelines the path crosses as follows:
    sl = 0;
    for (d = 0; d < torus_dimensions; d++)
        /* path_crosses_dateline(d) returns 0 or 1 */
        sl |= path_crosses_dateline(d) << d;
For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows:
    for (sl = 0; sl < 16; sl++)
        /* cdir(port) reports which torus coordinate direction a switch port
         * "points" in, and returns 0, 1, or 2 */
        sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport));
Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]:
[Figure: ASCII diagram of the 2D 6x5 torus]
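The two encodings above can be tried out in Python. This is a minimal sketch of the bit arithmetic only, not OpenSM code: the function names path_sl and sl2vl are chosen here for illustration, and the set of crossed datelines is passed in directly instead of being computed from a real path.

```python
def path_sl(crossed_datelines, torus_dimensions=3):
    """Encode which datelines a path crosses into the low SL bits."""
    sl = 0
    for d in range(torus_dimensions):
        crosses = 1 if d in crossed_datelines else 0
        sl |= crosses << d
    return sl


def sl2vl(sl, cdir_oport):
    """Pick the VL bit from the SL bit that matches the output port direction.

    cdir_oport is the torus coordinate direction (0, 1 or 2) in which the
    output port points.
    """
    return 0x1 & (sl >> cdir_oport)


sl = path_sl({0, 2})  # a path that crosses datelines 0 and 2
print(sl, [sl2vl(sl, cdir) for cdir in range(3)])
```

For this path, the SL carries one bit per crossed dateline, and each output port looks only at the bit for its own coordinate direction.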
73. fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus. Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of inter-switch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations.
Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and report when it has insufficient configuration for a torus with radix-4 dimensions.
In the event the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus
74. file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.
    1. Ports that are connected to the same remote switch are referenced as a 'port group'.
    2. The list of compute nodes (CNs) can be specified by the '-u' or '--cn_guid_file' OpenSM options.
8.5.4.1 Routing between non-CN Nodes
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.
[Figure: Spine1, Spine2 and Spine3 at the top; N1, Switch, N2, Switch, N3 below them; links going down to compute nodes]
To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a max_reverse_hop of 1, routes will be instantiated between N1<->N2 and N2<->N3. With
75. for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.
Note that the same VLs may be listed multiple times in the High or Low priority arbitration tables, and, further, it can be listed in both tables. The limit of high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes.
A high_limit value of 255 indicates that the byte limit is unbounded.
Note: If the 255 value is used, the low priority VLs may be starved.
A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.
Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64.
Below is an example of SL2VL and VL Arbitration configuration on subnet:
    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_lo
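The arithmetic behind the high limit and the weighting values can be sketched in Python. This is an illustration under stated assumptions, not OpenSM code: the function names are hypothetical, and the 64-byte credit unit is inferred from the statement that a 4KB packet requires 64 credits.

```python
import math

CREDIT_BYTES = 64        # inferred: 4096-byte packet / 64 credits
HIGH_LIMIT_UNIT = 4096   # high_limit is counted in 4K-byte units
UNBOUNDED = 255          # special value: no byte limit


def high_limit_bytes(high_limit):
    """Bytes of high-priority traffic allowed before a low-priority chance.

    Returns None when the limit is unbounded (255). A value of 0 means a
    single high-priority packet, which this sketch reports as 0 bytes.
    """
    if not 0 <= high_limit <= 255:
        raise ValueError("high_limit must be in the range 0-255")
    if high_limit == UNBOUNDED:
        return None
    return high_limit * HIGH_LIMIT_UNIT


def credits_per_packet(packet_bytes):
    """Number of 64-byte credits needed to send one packet."""
    return math.ceil(packet_bytes / CREDIT_BYTES)


print(high_limit_bytes(6))        # high limit used in the example
print(credits_per_packet(4096))   # one 4KB MTU packet
```

With weights that are multiples of 64, as in the example (64, 128, 192), each VL is granted a whole number of 4KB packets per arbitration turn.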
76. give a PKey with the value 0x8001.
Step 2. Create a child interface by running:
    host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child
Example:
    host1$ echo 1 > /sys/class/net/ib0/create_child
This will create the interface ib0.8001.
Step 3. Verify the configuration of this interface by running:
    host1$ ifconfig <subinterface>.<subinterface PKey>
Using the example of Step 2:
    host1$ ifconfig ib0.8001
    ib0.8001  Link encap:UNSPEC  HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
              BROADCAST MULTICAST  MTU:2044  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:128
              RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, you should follow the manual configuration procedure described in Section 4.3.3.3.
Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, be recognized (see Chapter 8, "OpenSM - Subnet Manager").
4.3.4.2 Removing a Subinterface
To remove a child interface (subinterface), run:
    echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child
Using the example of Step 2:
    echo 0x8001 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must
77. > InfiniBand/VPI Drivers > Firmware Tools.
Image Burning Procedure
To burn the composite image, perform the following steps:
1. Obtain the MST device name. Run:
    mst start
    mst status
The device name will be of the form: mt<dev_id>_pci{_cr0|conf0}. (1)
2. Create and burn the composite image. Run:
    flint -dev <mst device name> brom <expansion ROM image>
Example on Linux:
    flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom
Example on Windows:
    flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom
Removing the Expansion ROM Image
Remove the expansion ROM image. Run:
    flint -dev <mst device name> drom
When removing the expansion ROM image, you also remove FlexBoot from the boot device list.
A.4 Preparing the DHCP Server in Linux Environment
The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB.
A.4.1 Installing the DHCP Server
Install the DHCP client/server embedded within the Linux distribution.
A.4.2 Configuring the DHCP Server
A.4.2.1 For ConnectX Family Devices
When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions.
    (1) Depending on the OS, the device name ma
78. ibv_devinfo -l
    2 HCAs found:
        mthca0
        mlx4_0
2. Query the device mlx4_0 and print user-available information for its Port 2.
    > ibv_devinfo -d mlx4_0 -i 2
    hca_id: mlx4_0
        fw_ver:           2.5.944
        node_guid:        0000:0000:0007:3895
        sys_image_guid:   0000:0000:0007:3898
        vendor_id:        0x02c9
        vendor_part_id:   25418
        hw_ver:           0xA0
        board_id:         MT_04A0140005
        phys_port_cnt:    2
            port: 2
                state:        PORT_ACTIVE (4)
                max_mtu:      2048 (4)
                active_mtu:   2048 (4)
                sm_lid:       1
                port_lid:     1
                port_lmc:     0x00
9.8 ibdev2netdev
ibdev2netdev enables association between IB devices and ports and the associated net device. Additionally, it reports the state of the net device link.
Synopsis
    ibdev2netdev [-v] [-h]
Options
    -v   Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
    -h   Print help messages.
Example
    sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> eth5 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> ib0 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 2 (DOWN) ==> ib1 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 1 (DOWN) ==> eth2 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN) ==> S
79. in the overall job runtime. Implementation is simple and transparent during the job runtime.
FCA is disabled by default and must be configured prior to using it from the ScalableSHMEM.
To enable FCA by default in the ScalableSHMEM:
1. Edit the /opt/mellanox/openshmem/2.2/etc/openmpi-mca-params.conf file.
2. Set the scoll_fca_enable parameter to 1.
     scoll_fca_enable=1
3. Set the scoll_fca_np parameter to 0.
     scoll_fca_np=0
To enable FCA in the shmemrun command line, add the following:
     -mca scoll_fca_enable 1 -mca scoll_fca_enable_np 0
To disable FCA:
     -mca scoll_fca_enable 0 -mca coll_fca_enable 0
For more details on FCA installation and configuration, please refer to the FCA User Manual found on the Mellanox website.
5.1.3 Running ScalableSHMEM with MXM
The Mellanox Messaging (MXM) library provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
  - Multiple transport support, including RC, XRC and UD
  - Proper management of HCA resources and memory structures
  - Efficient memory registration
  - One-sided communication semantics
  - Connection management
  - Receive side tag matching
  - Intra-node shared memory communication
These enhancements significantly i
80. is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section.
Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.
8.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
     ipoib : <SL>
     ipoib, pkey 0x7fff : <SL>
     any, pkey 0x7fff : <SL>
8.6.6.2 SDP
An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:
     sdp : <SL>
     any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
8.6.6.3 RDS
Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes the default Service ID 0x00000000010648CA. The following two match rules are equivalent:
     rds : <SL>
     any, service-id 0x00000000010648CA : <SL>
8.6.6.4 SRP
Service ID for SRP varies from storage
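The Service ID layout described for SDP and RDS can be checked with a little shell arithmetic. The sketch below is only an illustration: the function names sdp_service_id and rds_service_id are hypothetical helpers, not part of OpenSM or MLNX_OFED, and the prefixes are taken directly from the text (0x000000000001PPPP for SDP, 0x000000000106PPPP for RDS).

```shell
# Hypothetical helpers: build a 64-bit Service ID from a TCP/IP port number.
# $1 is the port, decimal or 0x-prefixed hex, in the range 0..65535.

sdp_service_id() {
    # SDP prefix from the text: 0x000000000001PPPP
    printf '0x%016x\n' $(( 0x0000000000010000 + ($1) ))
}

rds_service_id() {
    # RDS prefix from the text: 0x000000000106PPPP
    printf '0x%016x\n' $(( 0x0000000001060000 + ($1) ))
}

sdp_service_id 80        # prints 0x0000000000010050
rds_service_id 0x48CA    # prints 0x00000000010648ca (the default RDS Service ID)
```

The RDS result matches the default Service ID quoted in the text, apart from the letter case of the hex digits.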
81. minimum. Default: 0
  cct
      Sets all the CC table entries to a specified value. The first entry will remain 0, whereas the last value will be set to the rest of the table.
      Values: <comma-separated list>
      Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
  ccti_timer
      Sets the given ccti timer for all SLs.
      Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
Table 25: Congestion Control Manager CC MGR Options File
  Option File: Description; Values
  max_errors, error_window
      When the number of send/receive errors or timeouts exceeds max_errors in less than error_window seconds, the CC MGR will abort and will allow OpenSM to proceed.
      Values:
        max_errors = 0: zero tolerance, abort configuration on first error
        error_window = 0: mechanism disabled, no error checking
      Default: 5
  cc_statistics_cycle
      Enables CC MGR to collect statistics from all nodes every cc_statistics_cycle [seconds].
      Default: 0. When the value is set to 0, no statistics are collected.
9 InfiniBand Fabric Diagnostic Utilities
9.1 Overview
The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric.
9.2 Utilities Usage
This section first describes common configuration interface and
82. modules/mlnx_en
Step 5. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
     host1$ cp /sbin/insmod /tmp/initrd_en/sbin/
Step 6. If you plan to give your Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.
     host1$ cp /sbin/ifconfig /tmp/initrd_en/sbin
Step 7. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded. The order of the following commands (for loading modules) is critical.
     echo "loading Mellanox ConnectX EN driver"
     /sbin/insmod lib/modules/mlnx_en/mlx4_core.ko
     /sbin/insmod lib/modules/mlnx_en/mlx4_en.ko
Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.
Step 9. Save the init file.
Step 10. Close initrd.
     host1$ cd /tmp/initrd_en
     host1$ find . | cpio -H newc -o > /tmp/new_initrd_en.img
     host1$ gzip /tmp/new_initrd_en.img
At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it properly.
A.9 iSCSI Boot
Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remo
83. of nodes. The CC table values are calculated based on this number.
      Values: [0-48K]
      Default: 0 (base the CCT calculation on the current subnet size)
Table 23: Congestion Control Manager Switch Options File
  Option File: Description; Values
  threshold
      Indicates how aggressive the congestion marking should be.
      Values: [0-0xf]. 0: no packet marking; 0xf: very aggressive.
      Default: 0xf
  marking_rate
      The mean number of packets between marking eligible packets with a FECN.
      Values: [0-0xffff]
      Default: 0xa
  packet_size
      Any packet less than this size [bytes] will not be marked with FECN.
      Values: [0-0x3fc0]
      Default: 0x200
Table 24: Congestion Control Manager CA Options File
  Option File: Description; Values
  port_control
      Specifies the Congestion Control attribute for this port.
      Values: 0: QP based congestion control; 1: SL/Port based congestion control.
      Default: 0
  ca_control_map
      An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified.
      Values: 0xffff
  ccti_increase
      Sets the CC Table Index (CCTI) increase.
      Default: 1
  trigger_threshold
      Sets the trigger threshold.
      Default: 2
  ccti_min
      Sets the CC Table Index (CCTI)
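As a rough illustration of the sixteen-bit ca_control_map value, the sketch below builds a mask from a list of SLs. It assumes that bit N of the map corresponds to SL N, which the text does not state explicitly; the function name ca_control_map is a hypothetical helper used only for this example.

```shell
# Hypothetical helper: compute a 16-bit ca_control_map from SL numbers (0..15).
# Assumption (not confirmed by the text): bit N of the map stands for SL N.
ca_control_map() {
    mask=0
    for sl in "$@"; do
        mask=$(( mask | (1 << sl) ))
    done
    printf '0x%04x\n' "$mask"
}

ca_control_map 0 3                                     # prints 0x0009
ca_control_map 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15   # prints 0xffff
```

Selecting all sixteen SLs reproduces the 0xffff value listed in the table.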
84. pages and disables the allocator.
      CONTIG: Forces the usage of the contiguous pages allocator. If contiguous pages are not available, the allocation fails.
  MLX_MR_MAX_LOG2_CONTIG_BSIZE
      Sets the maximum contiguous block size order.
      Values: 12-23
      Default: 23
  MLX_MR_MIN_LOG2_CONTIG_BSIZE
      Sets the minimum contiguous block size order.
      Values: 12-23
      Default: 12
4.10 Shared Memory Region
Shared Memory Region is only applicable to the mlx4 driver.
Shared Memory Region (MR) enables sharing an MR among applications by implementing the "Register Shared MR" verb, which is part of the IB spec.
Sharing an MR involves the following steps:
Step 1. Request to create a shared MR.
The application sends a request via the ibv_reg_mr API to create a shared MR. The application supplies the allowed sharing access to that MR. If the MR was created successfully, a unique MR ID is returned as part of the struct ibv_mr, which can be used by other applications to register with that MR.
The underlying physical pages must not be Least Recently Used (LRU) or Anonymous. To disable that, you need to turn on the IBV_ACCESS_ALLOCATE_MR bit as part of the sharing bits.
Usage:
  - Turn on, via ibv_reg_mr, one or more of the sharing access bits. The sharing bits are part of the ibv_reg_mr man page.
  - Turn on the IBV_ACCESS_ALLOCATE_MR bit.
Step 2. Request to re
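The block size parameters above are given as an "order". Reading the LOG2 in the variable names as a base-2 logarithm of the size in bytes (an interpretation, not something the text spells out), the allowed range 12-23 translates to 4 KB through 8 MB. The helper name contig_block_bytes below is hypothetical.

```shell
# Hypothetical helper: convert a contiguous block size order to bytes.
# Assumes the order is log2 of the block size in bytes; valid range is 12..23.
contig_block_bytes() {
    order=$1
    if [ "$order" -lt 12 ] || [ "$order" -gt 23 ]; then
        echo "order must be in the range 12..23" >&2
        return 1
    fi
    echo $(( 1 << order ))
}

contig_block_bytes 12    # prints 4096
contig_block_bytes 23    # prints 8388608
```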
85. provided policy on client requests. The overall flow for such requests is as follows:
  - The request is matched against the defined matching rules such that the QoS Level definition is found.
  - Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.
Figure 4: QoS Manager
There are two ways to define QoS policy:
  - Advanced: the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request and to enforce various QoS constraints on the requested PR/MPR.
  - Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs.
8.6.2 Advanced QoS Policy File
The QoS policy file has the following sections:
I) Port Groups (denoted by port-groups)
This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
  - Port GUID
  - Port name, which is a combination of NodeDescription and IB port number
  - PKey, which means that all the ports in the subnet that belong to the partition with a given PKey belong to this port group
  - Partition name, which means that all the ports in the subnet that belong to the partition with a given name belong to this port group
  - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SEL
86. the Contiguous Pages. To activate, set the below environment variables with values of PREFER_CONTIG or CONTIG:
  - For QP: MLX_QP_ALLOC_TYPE
  - For CQ: MLX_CQ_ALLOC_TYPE
The following are all the possible values that can be allocated to the buffer:
Table 3: Buffer Values
  Possible Value (1): Description
  ANON: Use current pages (ANON small ones). Default value.
  HUGE: Force huge pages.
  CONTIG: Force contiguous pages.
  PREFER_CONTIG: Try contiguous, fall back to ANON small pages.
  PREFER_HUGE: Try huge, fall back to ANON small pages.
  ALL: Try huge, fall back to contiguous; if that fails, fall back to ANON small pages.
  1. Values are NOT case sensitive.
Usage:
The application calls the ibv_reg_mr API, turning on the IBV_ACCESS_ALLOCATE_MR bit and setting the input address to NULL. Upon success, the address field of the struct ibv_mr will hold the address of the allocated memory block. This block will be freed implicitly when ibv_dereg_mr is called.
The following are environment variables that can be used to control error cases / contiguity:
Table 4: Parameters Used to Control Error Cases / Contiguity
  Parameters: Description
  MLX_MR_ALLOC_TYPE
      Configures the allocator type.
      ALL (Default): Uses all possible allocators and selects the most efficient allocator.
      ANON: Enables the usage of anonymous
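A minimal sketch of how the allocator variables might be set from a shell before launching an application follows. It assumes the variable and value names are written with underscores (MLX_QP_ALLOC_TYPE, PREFER_CONTIG, and so on), as the spacing in the table suggests; the validation function is_valid_alloc_type is a hypothetical helper and not part of any Mellanox library.

```shell
# Hypothetical helper: accept only the buffer values listed in Table 3.
# Matching is case-insensitive, following the table footnote.
is_valid_alloc_type() {
    value=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$value" in
        ANON|HUGE|CONTIG|PREFER_CONTIG|PREFER_HUGE|ALL) return 0 ;;
        *) return 1 ;;
    esac
}

# Request contiguous pages for QP and CQ buffers, falling back to ANON pages.
if is_valid_alloc_type "prefer_contig"; then
    export MLX_QP_ALLOC_TYPE=PREFER_CONTIG
    export MLX_CQ_ALLOC_TYPE=PREFER_CONTIG
fi
```

Any verbs application started from the same shell afterwards inherits both variables.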
87. the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.
2. srp_daemon extensions to ibsrpdm
  - To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num>, and to generate output suitable for 'echo', you may execute:
       host1$ srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
    To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'.
  - To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
  - Executing srp_daemon over a port without the -a option will only display the reachable targets via the port and to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
  - It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. See Section 4.1.2.5 for more details.
  - srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
  - A continuous background (daemon) operation, providing an automatic ongoing detection and co
88. thold=0
     options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
     set_irq_affinity_bynode.sh <numa node> <interface>
4. Forwarding from one interface to another:
     set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
5. Disable adaptive interrupt moderation and set status values, using:
     ethtool -C adaptive-rx off
8 OpenSM Subnet Manager
8.1 Overview
OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).
8.2 Description
opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet.
opensm also provides an experimental version of a performance manager.
opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
opensm attaches
89. to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.
By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.
8.2.1 opensm Syntax
     opensm [OPTIONS]
where OPTIONS are:
  --version
      Prints OpenSM version and exits.
  --config, -F <file-name>
      The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists).
  --create-config, -c <file-name>
      OpenSM will dump its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template.
  --guid, -g <GUID in hex>
      This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be
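Since both log files should contain the message "SUBNET UP" after a successful subnet setup, a quick check can be scripted. The sketch below is one possible way to do it; the function name subnet_is_up is a hypothetical helper, and the default log path is the one named in the text.

```shell
# Hypothetical helper: succeed if the given OpenSM log contains "SUBNET UP".
# $1 is optional; it defaults to the log path named in the text.
subnet_is_up() {
    log=${1:-/var/log/opensm.log}
    grep -q 'SUBNET UP' "$log" 2>/dev/null
}

if subnet_is_up; then
    echo "opensm reports SUBNET UP"
else
    echo "SUBNET UP not found (or log not readable)"
fi
```

A missing or unreadable log is treated the same as a log without the message, so the check never aborts a calling script.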
90. -v] [-a] [-n] [-D] [-G] [-M] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]
Output Files
Table 32 lists the various flags of the command.
Table 32: ibportstate Flags and Options
  Flag: Optional / Mandatory; Default (If Not Specified); Description
  -h(help): Optional. Print the help menu.
  -d(ebug): Optional. Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
  -a(ll): Optional. Show all LIDs in range, including invalid entries.
  -v(erbose): Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
  -V(ersion): Optional. Show version info.
  -n(o_dests): Optional. Do not try to resolve destinations.
  -D(irect): Optional. Use directed path address arguments. The path is a comma separated list of out ports. Examples: '0' = self port; '0,1,2,1,4' = out via port 1, then 2, ...
  -G(uid): Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
  -M(ulticast): Optional. Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
  -s <smlid>: Optional. Use <smlid> as the target LID for SM/SA queries.
  -C: Optional. Use the specified channel adapter o
91. value> > /sys/class/infiniband_srp/srp-mlx[hca number]-[port number]/add_target
See Section 4.1.2.3 for instructions on how the parameters in this echo command may be obtained.
Notes:
  - Execution of the above "echo" command may take some time.
  - The SM must be running while the command executes.
  - It is possible to include additional parameters in the echo command:
      max_cmd_per_lun: Default: 62
      max_sect (short for max_sectors): sets the request size of a command
      io_class: Default: 0x100, as in rev 16A of the specification (in rev 10 the default was 0xff00)
      tl_retry_count: a number in the range 2..7 specifying the IB RC retry count. Default: 2
      comp_vector: a number in the range 0..n-1 specifying the MSI-X completion vector. Some HCAs allocate multiple (n) MSI-X vectors per HCA port. If the IRQ affinity masks of these interrupts have been configured such that each MSI-X interrupt is handled by a different CPU, then the comp_vector parameter can be used to spread the SRP completion workload over multiple CPUs.
      cmd_sg_entries: a number in the range 1..255 that specifies the maximum number of data buffer descriptors stored in the SRP_CMD information unit itself. With allow_ext_sg=0, the parameter cmd_sg_entries defines the maximum S/G list length for a single SRP_CMD, and commands whose S/G list length exceeds this limit after S/G list collapsing will fail.
      initiator_ext: Please refer to Section 9 (Multiple Connecti
92. vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent:
     srp, target-port-guid 0x1234 : <SL>
     any, target-port-guid 0x1234 : <SL>
Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port GUID only) should be placed at the end of the qos-ulps match rules.
8.6.6.5 MPI
The SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on the MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section.
8.6.7 SL2VL Mapping and VL Arbitration
The OpenSM cached options file has a set of QoS related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
  - Max VLs: the maximum number of VLs that will be on the subnet
  - High limit: the limit of the High Priority component of the VL Arbitration table (IBA 7.6.9)
  - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template
  - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template
  - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15. (Note that VL15 used here means drop this SL.)
There are separate QoS configuration parameter sets for various target types: CAs, routers, switch external ports, and
93. 0002c9030000103b
     Guid: 0x0002c90300001038
     PortGuid: 0x0002c90300001039
     PartCap: 128
     DevId: 0x634a
     VendorId: 0x0002c9
9.13 perfquery
Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them or simply reset them.
Synopsis
     perfquery [-h] [-d] [-G] [-a] [-l] [-r] [-C <ca_name>] [-P <ca_port>] [-R] [-t <timeout_ms>] [-V] [<lid|guid> [[port] [reset_mask]]]
Output Files
Table 34 lists the various flags of the command.
Table 34: perfquery Flags and Options
  Flag: Optional / Mandatory; Default (If Not Specified); Description
  -h(help): Optional. Print the help menu.
  -d(ebug): Optional. Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
  -G(uid): Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
  -a: Optional. Apply query to all ports.
  -l: Optional. Loop ports.
  -r: Optional. Reset the counters after reading them.
  -C <ca_name>: Optional. Use the specified channel adapter or router.
  -P <ca_port>
94. 01, 0x10000000000005-0x1000000000FFFA
         port-guid: 0x1000000000FFFF
     end-port-group
8.6.6 Simple QoS Policy - Details and Examples
Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.
Match rules include:
  - Default match rule that is applied to a PR/MPR query that didn't match any of the other match rules
  - SDP
  - SDP application with a specific target TCP/IP port range
  - SRP with a specific target IB port GUID
  - RDS
  - IPoIB with a default PKey
  - IPoIB with a specific PKey
  - Any ULP/application with a specific Service ID in the PR/MPR query
  - Any ULP/application with a specific PKey in the PR/MPR query
  - Any ULP/application with a specific target IB port GUID in the PR/MPR query
Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as no reference to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file.
The shortest policy file in this case would be as follows:
     qos-ulps
         default : 0 #default SL
     end-qos-ulps
It is equivalent to the prev
95. f0b0-0x0007f0fb (0x00004c) - Jump addresses - OK
     0x0007f0fc-0x0007f2a7 (0x0001ac) - FW Configuration - OK
     FW image verification succeeded. Image is bootable.
9.16 ibv_asyncwatch
Display asynchronous events forwarded to userspace for an InfiniBand device.
Synopsis
     ibv_asyncwatch
Examples
1. Display asynchronous events.
     > ibv_asyncwatch
     mlx4_0: async event FD 4
9.17 ibdump
Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX® family adapters' InfiniBand ports.
The dump file can be loaded by the Wireshark tool for graphical traffic analysis.
The following describes a work flow for local HCA (adapter) sniffing:
  1. Run ibdump with the desired options.
  2. Run the application whose traffic you wish to analyze.
  3. Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode).
  4. Open Wireshark and load the generated file.
How to Get Wireshark:
Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details.
Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system.
Synopsis
     ibdump [options]
Output Files
     -d, --ib-dev=<dev>    use RDMA device <dev> (default: first device found). The relevant devices can be listed by running the 'ibv_devinfo' command.
     -i, --ib-port=<port>  use p
96. Table 35: ibcheckerrs Flags and Options
  Flag: Optional / Mandatory; Default (If Not Specified); Description
  -C <ca_name>: Optional. Use the specified channel adapter or router.
  -P <ca_port>: Optional. Use the specified port.
  -t <timeout_ms>: Optional. Override the default timeout for the solicited MADs [msec].
  <lid|guid>: Mandatory with -G flag. Use the specified port's or node's LID/GUID (with -G option).
  [<port>]: Mandatory without -G flag. Use the specified port.
Examples
1. Check the aggregated node counter for LID 0x2.
     > ibcheckerrs 2
     #warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
     #warn: counter LinkRecovers = 255 (threshold 10) lid 2 port 255
     #warn: counter LinkDowned = 12 (threshold 10) lid 2 port 255
     #warn: counter RcvErrors = 565 (threshold 10) lid 2 port 255
     #warn: counter XmtDiscards = 441 (threshold 100) lid 2 port 255
     Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
2. Check port counters for LID 2 Port 1.
     > ibcheckerrs -v 2 1
     Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK
3. Check the LID 2 Port 1 using the specified threshold file.
     > cat thresh1
     SymbolErrors=10
     LinkRecovers=10
     LinkDowned=10
     RcvErrors=10
     RcvRemotePhysErrors=100
     RcvSwRelayErrors=100
     XmtDiscards=100
     XmtConstr
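The warnings in the first example are raised when a counter is above its threshold. The sketch below mimics that comparison against a threshold file. It is not how ibcheckerrs is implemented: it assumes one NAME=NUMBER pair per line in the file and a strict "greater than" comparison, and the function name exceeds_threshold is a hypothetical helper.

```shell
# Hypothetical helper: exceeds_threshold FILE COUNTER VALUE
# Succeeds when VALUE is greater than the threshold listed for COUNTER in FILE.
# Assumes one NAME=NUMBER pair per line; an unknown counter never exceeds.
exceeds_threshold() {
    limit=$(sed -n "s/^$2=\([0-9][0-9]*\)\$/\1/p" "$1" | head -n 1)
    [ -n "$limit" ] && [ "$3" -gt "$limit" ]
}

# Example: write a small threshold file and test two counter values against it.
thresh=$(mktemp)
printf 'SymbolErrors=10\nXmtDiscards=100\n' > "$thresh"
exceeds_threshold "$thresh" SymbolErrors 65535 && echo "SymbolErrors above threshold"
exceeds_threshold "$thresh" XmtDiscards 100 || echo "XmtDiscards within threshold"
rm -f "$thresh"
```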
97. 34a
     04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
In the example above, 15b3 is Mellanox Technologies's vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn.
The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3.
2. Verify the ConnectX firmware using its ID (using the results of the example above):
     > mstflint -d 04:00.0 v
     ConnectX failsafe image. Start address: 0x80000. Chunk size 0x80000:
     NOTE: The addresses below are contiguous logical addresses. Physical addresses on flash may be different, based on the image start address and chunk size.
     0x00000038-0x000010db (0x0010a4) - BOOT2 - OK
     0x000010dc-0x00004947 (0x00386c) - BOOT2 - OK
     0x00004948-0x000052c7 (0x000980) - Configuration - OK
     0x000052c8-0x0000530b (0x000044) - GUID - OK
     0x0000530c-0x0000542f (0x000124) - Image Info - OK
     0x00005430-0x0000634f (0x000f20) - DDR - OK
     0x00006350-0x0000f29b (0x008f4c) - DDR - OK
     0x0000f29c-0x0004749b (0x038200) - DDR - OK
     0x0004749c-0x0005913f (0x011ca4) - DDR - OK
     0x00059140-0x0007a123 (0x020fe4) - DDR - OK
     0x0007a124-0x0007bdff (0x001cdc) - DDR - OK
     0x0007be00-0x0007eb97 (0x002d98) - DDR - OK
     0x0007eb98-0x0007f0af (0x000518) - Configuration - OK
     0x0007
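Each line of the verification listing gives a start address, an end address and a section size. The three values are related by size = end - start + 1, which can be confirmed with shell arithmetic. The function name section_size below is a hypothetical helper used only for this check.

```shell
# Hypothetical helper: section_size START END
# Prints the size in bytes of an inclusive address range, as a 6-digit hex number.
section_size() {
    printf '0x%06x\n' $(( ($2) - ($1) + 1 ))
}

section_size 0x00000038 0x000010db    # first BOOT2 section, prints 0x0010a4
section_size 0x00004948 0x000052c7    # Configuration section, prints 0x000980
```

Both results agree with the sizes reported in the listing for those sections.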
98. 4, Setting Performance Tuning
      Section 8.4.1, File Format
      Appendix C.2, mlx4_core Parameters
      Section 4.1.2.2, Manually Establishing an SRP Connection
      Section 4.1.2.3, SRP Tools - ibsrpdm, srp_daemon and srpd Service Script
      Section 4.1.2.4, Automatic Discovery and Connection to Targets
      Section 4.1.2.5, Multiple Connections from Initiator InfiniBand Port to the Target
      Section 4.1.2.6, High Availability (HA)
      Section 4.1.2.7, Shutting Down SRP
      Section 4.15, Ethtool
Table 1: Document Revision History
  Release; Date; Description
  2.0-3.0.0; October 2013
      Removed section Command Line Interface (CLI).
      Updated the following sections:
        Appendix E, Lustre Compilation over MLNX_OFED
  August 2013
      Updated the following sections:
        Section 1.3.4, ULPs
        Section 4.12, Flow Steering, and its subsections
        Section 1.3.3, Mid-layer Core
        Section 4.8, Ethernet Tunneling Over IPoIB Driver (eIPoIB)
        Section 8.2.1, opensm Syntax
        Appendix C, mlx4 Module Parameters
      Added the following sections:
        Section 1.5, RDMA over Converged Ethernet (RoCE)
        Section 4.5, Quality of Service Et
99. 4_ib.ko /tmp/initrd_ib/lib/modules/ib
     host1$ cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
     host1$ cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
     host1$ cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib
Step 5. IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:
     host1$ cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules
Step 6. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
     host1$ cp /sbin/insmod /tmp/initrd_ib/sbin/
Step 7. If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
     host1$ cp /sbin/ifconfig /tmp/initrd_ib/sbin
Step 8. If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step.
To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with.
Copy the DHCP client v3.1.3 file and all the relevant files as described below.
     host1$ cp <path to DHCP client v3.1.3>/dhclient /tmp/initrd_ib/sbin
     host1$ cp <path to DHCP client v3.1.3>/dhclient-script /tmp/initrd_ib/sbin
     host1$ mkdir -p /tmp/initrd_ib/var/state/d
100. 4X
     LinkWidthActive: 4X
     LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
     LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps
     LinkSpeedActive: 5.0 Gbps
Now change the enabled link speed:
     > ibportstate -C mlx4_0 -D 0 1 speed 2
     Initial PortInfo:
     # Port info: DR path slid 65535; dlid 65535; 0 port 1
     LinkSpeedEnabled: 2.5 Gbps
     After PortInfo set:
     # Port info: DR path slid 65535; dlid 65535; 0 port 1
     LinkSpeedEnabled: 5.0 Gbps (IBA extension)
Show the new configuration:
     > ibportstate -C mlx4_0 -D 0 1
     PortInfo:
     # Port info: DR path slid 65535; dlid 65535; 0 port 1
     LinkState: Initialize
     PhysLinkState: LinkUp
     LinkWidthSupported: 1X or 4X
     LinkWidthEnabled: 1X or 4X
     LinkWidthActive: 4X
     LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
     LinkSpeedEnabled: 5.0 Gbps (IBA extension)
     LinkSpeedActive: 5.0 Gbps
9.11 ibroute
Uses SMPs to display the forwarding tables (unicast: LinearForwardingTable or LFT; or multicast: MulticastForwardingTable or MFT) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.
Synopsis
     ibroute [-h] [-d] [-v
101. 5 Gbps or 5.0 Gbps
     LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps
     LinkSpeedActive: 5.0 Gbps
2. Query the status of two channel adapters using directed paths.
     > ibportstate -C mlx4_0 -D 0 1
     PortInfo:
     # Port info: DR path slid 65535; dlid 65535; 0 port 1
     LinkState: Initialize
     PhysLinkState: LinkUp
     LinkWidthSupported: 1X or 4X
     LinkWidthEnabled: 1X or 4X
     LinkWidthActive: 4X
     LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
     LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps
     LinkSpeedActive: 5.0 Gbps
     > ibportstate -C mthca0 -D 0 1
     PortInfo:
     # Port info: DR path slid 65535; dlid 65535; 0 port 1
     LinkState: Down
     PhysLinkState: Polling
     LinkWidthSupported: 1X or 4X
     LinkWidthEnabled: 1X or 4X
     LinkWidthActive: 4X
     LinkSpeedSupported: 2.5 Gbps
     LinkSpeedEnabled: 2.5 Gbps
     LinkSpeedActive: 2.5 Gbps
3. Change the speed of a port.
First query for the current configuration:
     > ibportstate -C mlx4_0 -D 0 1
     PortInfo:
     # Port info: DR path slid 65535; dlid 65535; 0 port 1
     LinkState: Initialize
     PhysLinkState: LinkUp
     LinkWidthSupported: 1X or 4X
     LinkWidthEnabled: 1X or
102. ADME.txt for quick start instructions.
     Preparing...     ##########
     dapl             ##########
     Preparing...     ##########
     dapl             ##########
     Preparing...     ##########
     dapl-devel       ##########
     Preparing...     ##########
     dapl-devel       ##########
     Preparing...     ##########
     dapl-devel       ##########
     Preparing...     ##########
     dapl-devel       ##########
     Preparing...     ##########
     dapl-utils       ##########
     Preparing...     ##########
     perftest         ##########
     Preparing...     ##########
     mstflint         ##########
     Preparing...     ##########
     mft              ##########
     Preparing H
103. CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.
4. LASH Routing Algorithm
Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.
5. DOR Routing Algorithm
Based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh.
6. Torus-2QoS Routing Algorithm
Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Additionally, it can route around multiple failed fabric links or a single failed fabric switch without introducing deadlocks, and without changing path SL values granted before the failure.
OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTR
104. CSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies.
4.2.2 iSER Initiator
The iSER initiator is controlled through the iSCSI interface available from the iscsi-initiator-utils package. Make sure iSCSI is enabled and properly configured on your system before proceeding with iSER. Target settings such as timeouts and retries are set the same as for any other iSCSI targets. If targets are set to auto-connect on boot and the targets are unreachable, it may take a long time to continue the boot process if timeouts and max retries are set too high.
Example for discovering and connecting targets over iSER:
iscsiadm -m discovery -o new -o old -t st -I iser -p <ip:port> -l
iSER also supports RoCE without any additional configuration required. To bond the RoCE interfaces, set the fail_over_mac option in the bonding driver.
4.3 IP over InfiniBand
4.3.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following capabilities:
• VLAN simulation over an InfiniBand network via child interfaces
• High Availability via Bonding
• Varies MTU values up to 4k in Da
105. ConnectX
Mellanox ConnectX FlexBoot v3.3.400
iPXE 1.0.0 Open Source Network Boot Firmware
net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: The socket is not connected]
Waiting for link-up on net0... ok
After configuring the IB/ETH port, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.
For ConnectX (InfiniBand):
Mellanox ConnectX FlexBoot v3.3.400
iPXE 1.0.0 Open Source Network Boot Firmware
net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
[Link:down, TX:0 TXE:0 RX:0 RXE:0]
[Link status: The socket is not connected]
Waiting for link-up on net0... ok
DHCP (net0 02:02:c9:0c:78:11)... ok
[The remaining console lines report the address and netmask assigned to net0, the next server, the boot filename (pxelinux.0), and the root path.]
Next, FlexBoot attempts to boot as directed by the DHCP server.
A.8 Diskless Machines
Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver. This can be achieved by adding the device driver module into the initrd image and loading it. The initrd image of some Linux distributions, such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux, cannot be edited prior or durin
106. Destination port
4  Unable to traverse the LFT data from source to destination
5  Failed to use Topology File
6  Failed to load required Package

ibv_devices
Lists InfiniBand devices available for use from userspace, including node GUIDs.
Synopsis
ibv_devices
Examples
1. List the names of all available InfiniBand devices
ibv_devices
    device    node GUID
    mthca0    0002c9000101d150
    mlx4_0    0000000000073895

ibv_devinfo
Queries InfiniBand devices and prints the information about them that is available for use from userspace.
Synopsis
ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]
Output Files
Table 29 lists the various flags of the command.
Table 29 - ibv_devinfo Flags and Options
Flag                             Optional/Mandatory   Default (If Not Specified)   Description
-d <device>, --ib-dev=<device>   Optional             First found device           Run the command for the provided IB device 'device'
-i <port>, --ib-port=<port>      Optional             All device ports             Query the specified device port <port>
-l, --list                       Optional             Inactive                     Only list the names of InfiniBand devices
-v, --verbose                    Optional             Inactive                     Print all available information about the InfiniBand device(s)
Examples
1. List the names of all available InfiniBand devices
>
107. F SM s port
II. QoS Setup (denoted by qos-setup)
This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
III. QoS Levels (denoted by qos-levels)
Each QoS Level defines a Service Level (SL) and a few optional fields:
• MTU limit
• Rate limit
• PKey
• Packet lifetime
When a path search is performed, it is done subject to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Level, it can also be explicitly referred to by any match rule.
IV. QoS Matching Rules (denoted by qos-match-rules)
Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, such that the first match takes precedence. Each rule names the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule. Queries can be matched by:
• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for destination port
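The first-match behaviour of the matching rules described above can be illustrated with a short sketch. This is not OpenSM code: the MatchRule type, the select_qos_level function, and the group and level names are hypothetical and exist only for illustration. A rule criterion left unset is assumed to match any query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchRule:
    """Hypothetical, simplified stand-in for one qos-match-rules entry."""
    qos_level: str
    source_group: Optional[str] = None       # None means "any source"
    destination_group: Optional[str] = None  # None means "any destination"


def select_qos_level(rules, src_groups, dst_groups, default="DEFAULT"):
    """Scan rules in file order and return the level of the first match.

    src_groups / dst_groups are the sets of port groups that the query's
    source and destination ports belong to. If no rule matches, the
    mandatory DEFAULT level is used.
    """
    for rule in rules:
        if rule.source_group is not None and rule.source_group not in src_groups:
            continue
        if (rule.destination_group is not None
                and rule.destination_group not in dst_groups):
            continue
        return rule.qos_level
    return default


# Illustrative policy: order matters, earlier rules win.
rules = [
    MatchRule("storage-level", source_group="storage"),
    MatchRule("compute-level", source_group="compute",
              destination_group="compute"),
]
```

A query whose source port belongs to both the storage and compute groups gets storage-level here, because that rule appears first; a query matching neither rule falls back to DEFAULT.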
108. Find the event_plugin_options option in the SM options file, and add the following:
conf_file <cc-mgr-options-file-name>
# Options string that would be passed to the plugin(s)
event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>
2. Run the SM with the new options file: opensm -F <options-file-name>
To turn CC OFF, set enable to FALSE in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.
For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" (Section 8.9.4). For further details on the list of CC Manager options, please refer to the IB spec.
8.9.4 Configuring Congestion Control Manager
Main Settings
To fine-tune the CC mechanism and CC Manager behavior, and to set the CC Manager main settings, perform the following:
• To enable or disable the Congestion Control mechanism on the fabric nodes, set the following parameter:
enable
The values are: <TRUE | FALSE>. The default is: True
• CC Manager configures CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive the CC mechanism is in its response to congestion. To manually modify CC Manager behavior by providing it with an arbitrary fabric size, set the following parameter:
num_hosts
The values are: [0-48K]. The default is o b
109. [Installation console output, continued: "Preparing..." progress bars for the packages srptools, rds-tools, rds-devel, ibutils2, ibutils, cc_mgr, and dump_pr]
110. [Installation console output, continued: "Preparing..." progress bars for the packages libmlx4, libmlx4-devel, libmlx5, libmlx5-devel, libcxgb3, and libcxgb3-devel]
111. [Installation console output, continued: "Preparing..." progress bars for the packages mlnxofed-docs, mpitests (mvapich2 1.9), mpitests (openmpi 1.6.5), and mpitests (openmpi 1.7.4)]
Device (06:00.0):
    06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Link Width: 8x
    PCI Link Speed: 5Gb/s
Installation finished successfully.
Attempting to perform Firmware update...
Querying Mellanox devices firmware ...
Device #1:
  Device:       0000:06:00.0
  Part Number:  MCX354A-FCB 1
  Description:  ConnectX-3 VPI adapter card; dual-port QSFP; FDR 56Gb/s and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6
  PSID:         MT_1090110019
  Versions:     Current       Available
    FW          2.30.7384     2.30.8000
    PXE         3.4.0146      3.4.0146
  Status:       Upd
112. I benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
• OpenSM: InfiniBand Subnet Manager
• Utilities
  - Diagnostic tools
  - Performance tests
• Firmware tools (MFT)
• Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
• Documentation
1.2.3 Firmware
The ISO image includes the following firmware items:
• Firmware images (.mlx format) for ConnectX-3/ConnectX-3 Pro/Connect-IB network adapters
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX-3 HCA devices
1.2.4 Directory Structure
The ISO image of MLNX_OFED_LINUX contains the following files and directories:
• mlnxofedinstall: the MLNX_OFED_LINUX installation script
• ofed_uninstall.sh: the MLNX_OFED_LINUX un-installation script
• <RPMS folders>: directory of binary RPMs for a specific CPU architecture
• firmware/: directory of the Mellanox IB HCA firmware images (including Boot-over-IB)
• src/: directory of the OFED source tarball
• mlnx_add_kernel_support.sh: script required to rebuild MLNX_OFED_LINUX for a customized kernel version on a supported Linux distribution
• docs/: directory of Mellanox OFED related documentation
1.3 Architecture
Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface wit
113. Mellanox Technologies
Mellanox OFED for Linux User Manual
Rev 2.1-1.0.0
www.mellanox.com
NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
114. <PCI device> dc > <ini device file>
Step 4. Edit the ini file that you found in the previous step, and add the following lines to the [HCA] section in order to support 63 VFs:
;; SRIOV enable
total_vfs = 63 (see Note 1)
num_pfs = 1
sriov_en = true
Note 1: Some servers might have issues accepting 63 Virtual Functions or more. In such a case, set total_vfs to any required value.
Step 5. Create a binary image using the modified ini file. Run:
mlxburn -fw <fw name>.mlx -conf <modified ini file> -wrimage <file name>.bin
The file <file name>.bin is a firmware binary file with SR-IOV enabled that has 63 VFs. It can be distributed to all machines and burnt using mstflint, which is part of the bundle, with the following command:
mstflint -dev <PCI device> -image <file name>.bin b
After burning the firmware, the machine must be rebooted. If the driver is only restarted, the machine may hang, and a reboot using power OFF/ON might be required.
4.13.7 Configuring Pkeys and GUIDs under SR-IOV
4.13.7.1 Port Type Management
Port Type management is static when enabling SR-IOV (the connectx_port_config script will not work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core. This parameter may be used to set the port type uniformly for all installed ConnectX HCAs, or it may specify an individual configuration for each HCA. This parameter should be specified as
115. PKeyTable. Default is not to allow both pkeys.
--qos, -Q
    This option enables QoS setup.
--qos_policy_file, -Y <QoS policy file>
    This option defines the optional QoS policy file. The default name is /etc/opensm/qos-policy.conf.
--congestion_control (EXPERIMENTAL)
    This option enables congestion control configuration.
--cc_key <key> (EXPERIMENTAL)
    This option configures the CCkey to use when configuring congestion control.
--stay_on_fatal, -y
    This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated guids, or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.
--daemon, -B
    Run in daemon mode. OpenSM will run in the background.
--inactive, -I
    Start SM in inactive rather than normal init SM state.
--perfmgr
    Start with PerfMgr enabled.
--perfmgr_sweep_time_s <sec>
    PerfMgr sweep interval in seconds.
--prefix_routes_file <path to file>
    This option specifies the prefix routes file. Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. Default file is /etc/opensm/prefix-routes.conf.
--consolidate_ipv6_snm_req
    Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.
--consolidate_ipv4_mask
    Use mask for IPv4 multicast groups multiplexing per MGID scope and P_Key.
--pid_file <path to file>
    Sp
116. Port Statistics (where <i>=<empty_string> is the PF, and ranges 1…NumOfVf per VF)
Counter                          Description
vport<i>_rx_unicast_packets      Unicast packets received successfully
vport<i>_rx_unicast_bytes        Unicast packet bytes received successfully
vport<i>_rx_multicast_packets    Multicast packets received successfully
vport<i>_rx_multicast_bytes      Multicast packet bytes received successfully
vport<i>_rx_broadcast_packets    Broadcast packets received successfully
vport<i>_rx_broadcast_bytes      Broadcast packet bytes received successfully
vport<i>_rx_dropped              Received packets discarded due to out-of-buffer condition
vport<i>_rx_errors               Received packets discarded due to receive error condition
vport<i>_tx_unicast_packets      Unicast packets sent successfully
vport<i>_tx_unicast_bytes        Unicast packet bytes sent successfully
vport<i>_tx_multicast_packets    Multicast packets sent successfully
vport<i>_tx_multicast_bytes      Multicast packet bytes sent successfully
vport<i>_tx_broadcast_packets    Broadcast packets sent successfully
vport<i>_tx_broadcast_bytes      Broadcast packet bytes sent successfully
vport<i>_tx_errors               Packets dropped due to transmit errors
Table 12 - SW Statistics
117. R: No constraints on output port selection.
Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches.
Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select the output port out of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. It especially suits a 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis.
If some switches do not support AR, they will slow down the AR Manager, as it may get timeouts on the AR-related queries to these switches.
8.8.2 Installing the Adaptive Routing
Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of the Mellanox OFED installation.
8.8.3 Running Subnet Manager with Adaptive Routing Manager
Adaptive Routing (AR) Manager can be enabled/disabled through the SM options file.
8.8.3.1 Enabling Adaptive Routi
118. Rail Support
Multi-Rail support enables the user to use more than one of the active ports on the card, making better use of the resources. It provides a combined throughput among the used ports.
To configure dual rail support:
Specify the list of ports you would like to use to enable multi-rail support:
-x MXM_RDMA_PORTS=cardName:portNum
or
-x MXM_IB_PORTS=cardName:portNum
5.3.5 Configuring MXM over the Ethernet Fabric
To configure MXM over the Ethernet fabric:
Step 1. Make sure the Ethernet port is active:
ibv_devinfo
ibv_devinfo displays the list of cards and ports in the system. Make sure, in the ibv_devinfo output, that the desired port has Ethernet in the link_layer field and that its state is PORT_ACTIVE.
Step 2. Specify the ports you would like to use, if there is a non-Ethernet active port in the card:
-x MXM_RDMA_PORTS=mlx4_0:1
or
-x MXM_IB_PORTS=mlx4_0:1
5.4 Fabric Collective Accelerator
The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) process to the server CPUs. As a system-wide solution, FCA does not require any additional hardware. The FCA manager creates a topology-based collective tree, and orchestrates an efficient collective operation using the CPUs in the servers that are part of the collective operation. FCA accelerates MPI collective operation performance by up to 100 tim
119. Results
2.3.5 Post-installation Notes
2.3.6 Installation Logging
2.4 Updating Firmware After Installation
2.5 Installing MLNX_OFED using YUM
2.5.1 Setting up MLNX_OFED YUM Repository
2.5.2 Installing MLNX_OFED using the YUM Tool
2.5.3 Updating Firmware After Installation
2.6 Uninstalling Mellanox OFED
2.7 Uninstalling Mellanox OFED using the YUM Tool
Chapter 3 Configuration Files
3.1 Persistent Naming for Network Interfaces
Chapter 4 Driver Features
4.1 SCSI RDMA Protocol
4.1.1 Overview
4.1.2 SRP Initiator
4.2 iSCSI Extensions for RDMA (iSER)
4.2.1 Overview
4.2.2 iSER Initiator
4.3 IP over InfiniBand
120. [Contents, continued; entries whose titles did not survive extraction are listed by number only]
9.9 ibstatus
9.10 ibportstate
9.11
9.12 smpquery
9.13 perfquery
9.14 ibcheckerrs
9.15
9.16 ibv_asyncwatch
9.17 ibdump
Appendix A Mellanox FlexBoot
A.1 Overview
A.2
A.3 Burning the Expansion ROM Image
A.4 Preparing the DHCP Server in Linux Environment
A.5 Subnet Manager - OpenSM
A.6 BIOS Configuration
A.7 Operation
A.8 Diskless Machines
A.9
Appendix B SRP Target
B.1 Prerequisites and Installation
121. See under docs/ folder of installed package.
Support and Updates Webpage
Please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.
1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers.
All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions.
Mellanox OFED is certified with the following products:
• Mellanox Messaging Accelerator (VMA) software: Socket acceleration library that performs OS bypass for standard socket-based applications.
• Mellanox Unified Fabric Manager (UFM) software: Powerful platform for managing demanding scale-out computing fabric environments, built on top of the OpenSM industry-standard routing engine.
• Fabric Collective Accelerator (FCA): FCA is a Mellanox MPI-integrated software package that utilizes CORE-Direct technology for implementing the MPI collectives communications
122. System Requirements
4.13.2 Setting Up SR-IOV
4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup
4.13.4 Assigning a Virtual Function to a Virtual Machine
4.13.5 Uninstalling SR-IOV Driver
4.13.6 Burning Firmware with SR-IOV
4.13.7 Configuring Pkeys and GUIDs under SR-IOV
4.14 CORE-Direct
4.14.1 CORE-Direct Overview
4.15 Ethtool
4.16 Dynamically Connected Transport
4.17 PeerDirect
4.18 Inline-Receive
4.18.1 Querying Inline-Receive Capability
4.18.2 Activating
4.19 Ethernet Performance Counters
4.20 Memory Window
4.20.1 Query Capabilities
4.20.2 Allocating Memory Window
4.20.3 Binding Memory Windows
4.20.4 Invalidating Memory
123. TC is always given the first chance to transmit. Only if the highest strict priority TC has nothing more to transmit will the next highest TC be considered. Non-strict priority TCs will be considered last to transmit.
This property is extremely useful for low-latency, low-bandwidth traffic: traffic that needs immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.
4.5.7.2 Minimal Bandwidth Guarantee (ETS)
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy.
If, for instance, TC0 is set to an 80% guarantee and TC1 to 20% (the TCs sum must be 100), then the BW left after servicing all strict priority TCs will be split according to this ratio.
Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.
4.5.7.3 Rate Limit
Rate limit defines a maximum bandwidth allowed for a TC. Please note that a 10% deviation from the requested values is considered acceptable.
4.5.8 Quality of Service Tools
4.5.8.1 mlnx_qos
mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates directly with the driver, and thus does not require setting up a DCBX daemon on the system.
The mlnx_qos tool enables the administrator of the system to:
• Inspect t
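The minimal-guarantee arithmetic of Section 4.5.7.2 can be made concrete with a small calculation. The sketch below is an illustration only, not how the adapter or driver implements ETS: the function name ets_split and the demand model (each TC offers a fixed amount of traffic) are assumptions. It divides the bandwidth left after the strict priority TCs by the guarantee percentages, and hands any share a TC does not use to the TCs that still have traffic.

```python
def ets_split(leftover_bw, guarantees, demand):
    """Split leftover bandwidth among TCs by minimal-guarantee percentages.

    leftover_bw: bandwidth left after strict priority TCs were served.
    guarantees:  {tc: percent}; the percentages must sum to 100.
    demand:      {tc: bandwidth the TC actually wants to send}.
    Returns {tc: allocated bandwidth}. Unused shares are redistributed,
    because a minimal guarantee imposes no maximum.
    """
    if sum(guarantees.values()) != 100:
        raise ValueError("TC guarantees must sum to 100")

    alloc = {tc: 0.0 for tc in guarantees}
    active = {tc for tc in guarantees if demand.get(tc, 0) > 0}
    remaining = float(leftover_bw)

    while active and remaining > 1e-9:
        weight = sum(guarantees[tc] for tc in active)
        if weight == 0:
            break
        satisfied = set()
        distributed = 0.0
        for tc in active:
            share = remaining * guarantees[tc] / weight
            give = min(share, demand[tc] - alloc[tc])
            alloc[tc] += give
            distributed += give
            if demand[tc] - alloc[tc] <= 1e-9:
                satisfied.add(tc)
        remaining -= distributed
        if not satisfied:
            break  # every active TC took its full share; nothing to hand on
        active -= satisfied
    return alloc


# TC0 has an 80% guarantee, TC1 has 20%; 10 units of bandwidth are left.
both_busy = ets_split(10, {0: 80, 1: 20}, {0: 10, 1: 10})
tc1_light = ets_split(10, {0: 80, 1: 20}, {0: 10, 1: 0.5})
```

With both TCs saturated the split follows the 80/20 ratio; when TC1 offers less than its share, the unused part goes to TC0, matching the example in the text.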
124. TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp as generated by the hardware
SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or fails, then do it in software
SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base
SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in software
SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
SOF_TIMESTAMPING_RAW/SYS determine how they are reported.
To enable time stamping for a net device:
An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:
Send side time stamping: enabled by ifreq.hwtstamp_config.tx_type, when possible.
/* possible values for hwtstamp_config->tx_type */
enum hwtstamp_tx_types {
    /*
     * No outgoing packet will need hardware time stamping;
     * should a packet arrive which asks for it, no hardware
     * time stamping will be done.
     */
    HWTSTAMP_TX_OFF,
    /*
     * Enables hardware time stamping for outgoing packets;
     * the sender of the packet decides which are to be
     * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
     * before sending the packet.
     */
    HWTSTAMP_TX_ON,
    /*
     * Enables time stamping for outgoing packets just as
     * HWTSTAMP_TX_ON does, but also enables time stamp insert
125. WN-going port groups, unless they are leaf switches.
• Switches of the same rank should have the same number of ports in each UP-going port group.
• Switches of the same rank should have the same number of ports in each DOWN-going port group.
• All the CAs have to be at the same tree level (rank).
If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules:
• Tree rank should be between two and eight (inclusively).
• All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks.
Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree.
Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.
The algorithm also dumps the compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering
126. You can use subscription-manager to register.
Setting up Group Process
Resolving Dependencies
Running transaction check
Package ar_mgr.x86_64 0:1.0-0.11.g22fff4a will be installed
rds-devel.x86_64 0:2.0.6mlnx-1
rds-tools.x86_64 0:2.0.6mlnx-1
srptools.x86_64 0:0.0.4mlnx3-OFED.2.0.2.6.7.11.ge863cb7
Complete!
2.5.3 Updating Firmware After Installation
Installing MLNX_OFED using the YUM tool does not automatically update the firmware. To update the firmware to the version included in the MLNX_OFED package, you can either:
• Run the mlnxofedinstall script with the --fw-update-only flag
or
• Update the firmware to the latest version available on the Mellanox Technologies Web site, as described in Section 2.4, "Updating Firmware After Installation".
2.6 Uninstalling Mellanox OFED
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.
2.7 Uninstalling Mellanox OFED using the YUM Tool
If MLNX_OFED was installed using the yum tool, then it can be uninstalled as follows:
yum groupremove <group name>
Note: The <group name> must be the same group name that was previously used to install MLNX_OFED.
3 Configuration Files
For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt at the following location: docs/readme_and_user_manual/MLNX
127. _OFED_configuration_files.txt
3.1 Persistent Naming for Network Interfaces
To avoid network interface renaming after boot or driver restart, use the /etc/udev/rules.d/70-persistent-net.rules file.
Example for Ethernet interfaces:
# PCI device 0x15b3:0x1003 (mlx4_core)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:50", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:51", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a1", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a2", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"
Example for IPoIB interfaces:
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="32", NAME="ib0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x1", ATTR{type}=="32", NAME="ib1"
4 Driver Features
4.1 SCSI RDMA Protocol
4.1.1 Overview
As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol off-load and RDMA f
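Returning to the persistent naming rules of Section 3.1: each rule follows a fixed pattern, so the file can be generated rather than typed by hand. The sketch below assumes standard udev match syntax (== to match, = to assign) and uses two hypothetical helper functions, ethernet_rule and ipoib_rule; it only builds the rule text and does not write to /etc/udev/rules.d.

```python
def ethernet_rule(mac, name, dev_id="0x0"):
    """Build one udev rule that pins an Ethernet interface name to a MAC.

    ATTR{type}=="1" selects Ethernet link-layer devices, as in the
    manual's example. The MAC is lower-cased to match sysfs output.
    """
    return (
        'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", '
        f'ATTR{{address}}=="{mac.lower()}", ATTR{{dev_id}}=="{dev_id}", '
        f'ATTR{{type}}=="1", KERNEL=="eth*", NAME="{name}"'
    )


def ipoib_rule(dev_id, name):
    """Build one udev rule that pins an IPoIB interface name to a port.

    ATTR{type}=="32" selects InfiniBand devices; dev_id distinguishes
    the ports ("0x0" for the first port, "0x1" for the second).
    """
    return (
        'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", '
        f'ATTR{{dev_id}}=="{dev_id}", ATTR{{type}}=="32", NAME="{name}"'
    )


rules_text = "\n".join([
    ethernet_rule("00:02:C9:FA:C3:50", "eth1"),
    ipoib_rule("0x0", "ib0"),
    ipoib_rule("0x1", "ib1"),
])
```

Review the generated text before placing it in the rules file, since a wrong MAC or dev_id silently leaves the interface with its kernel-assigned name.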
128. a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.
Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.
8.5.4.2 Activation through OpenSM
Use the '-R ftree' option to activate the fat-tree algorithm.
LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.
8.5.5 LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology-agnostic, deadlock-free routing within communication networks.
When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations, and groups these paths into virtual layers in such a way as to avoid deadlock.
Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link from HCA to switch does not need virtual layers, as deadlock will not arise between switch and HCA.
In more detail, the algorithm works as follows:
1. LASH determines the shortest path between all pairs of source/destination
129. a maximum of 4 outstanding SMPs.
--rereg_on_guid_migr
    This option, if enabled, forces OpenSM to send port info with the client reregister bit set to all nodes in the fabric when an alias GUID migrates from one physical port to another.
--aguid_inout_notice
    This option enables sending GID IN/OUT notices on Alias GUIDs register/delete request to registered clients.
--sm_assign_guid_func [uniq_count | base_port]
    Specifies the algorithm that SM will use when it comes to choosing SM-assigned alias GUIDs. The default is uniq_count.
--console, -q [off | local]
    This option activates the OpenSM console (default off).
--ignore_guids, -i <equalize-ignore-guids-file>
    This option provides the means to define a set of ports (by guid) that will be ignored by the link load equalization algorithm.
--hop_weights_file, -w <path to file>
    This option provides the means to define a weighting factor per port for customizing the least weight hops for the routing.
--port_search_ordering_file, -O <path to file>
    This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR). Moreover, this option provides the means to define non-default routing port order.
--dimn_ports_file, -O <path to file> (DEPRECATED)
    This option provides the means to define a mapping between ports and dimension (Order) for controlling Dime
130. accomplished in the same manner as would bonding of Ethernet interfaces, via the Linux Bonding Driver.

- Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.g. ifcfg-ib0).
- The only meaningful bonding policy in IPoIB is High Availability (bonding mode number 1, or active-backup).
- The bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence the only supported value is the default: 0 (or "none" in SLES11).

For a persistent bonding IPoIB network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:

- In the bonding master configuration file (e.g. ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520
  65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 4.3.2, "IPoIB Mode Setting", on page 56) and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do not set the MTU at all, performance of the interface might decrease.
- In the bonding slave configuration file (e.g. ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
- In the bonding slave configuration file (e.g. ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key).
- For RHEL users: in /etc/modprobe.d/bond.conf, add the following li
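The MTU rule above can be written down as a small helper. This is a sketch under assumptions: the function names are made up for illustration, and falling back to 2044 when the slaves are in mixed modes is an interpretation of the text (65520 is only valid when all slaves are in Connected mode), not something the manual states explicitly.

```python
CONNECTED_MTU = 65520  # valid only if every IPoIB slave is in Connected mode
DATAGRAM_MTU = 2044    # value given for slaves working in datagram mode


def bond_mtu(slave_modes):
    """Pick the bonding master MTU from the list of slave modes."""
    modes = [mode.lower() for mode in slave_modes]
    if not modes:
        raise ValueError("at least one IPoIB slave is required")
    if all(mode == "connected" for mode in modes):
        return CONNECTED_MTU
    # Assumption: any datagram-mode slave forces the smaller MTU.
    return DATAGRAM_MTU


def bond_master_lines(device, slave_modes):
    """Return the DEVICE and MTU lines for a hypothetical ifcfg-<device> file."""
    return [f"DEVICE={device}", f"MTU={bond_mtu(slave_modes)}"]
```

For example, bond_master_lines("bond0", ["connected", "connected"]) yields the MTU=65520 line quoted in the text.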
131. acket is sent/received before triggering an interrupt.

ethtool -a eth<x>                        Queries the pause frame settings.
ethtool -A eth<x> [rx on|off] [tx on|off]  Sets the pause frame settings.
ethtool -g eth<x>                        Queries the ring size values.
ethtool -G eth<x> [rx <N>] [tx <N>]      Modifies the ring size.
ethtool -S eth<x>                        Obtains additional device statistics.
ethtool -t eth<x>                        Performs a self-diagnostics test.
ethtool -s eth<x> msglvl [N]             Changes the current driver message level.
ethtool -T eth<x>                        Shows time stamping capabilities.
ethtool -l eth<x>                        Shows the number of channels.
ethtool -L eth<x> [rx <N>] [tx <N>]      Sets the number of channels.

4.16 Dynamically Connected Transport Service

Dynamically Connected transport (DCT) is currently at beta level. Please be aware that the content below is subject to change.

The Dynamically Connected transport (DCT) service is an extension to transport services that enables a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system-wide by having Reliable-type QPs dynamically connect to and disconnect from any remote node. DCT connections only stay connected while they are active. This results in a smaller memory footprint, less overhead to set up connections, and higher on-chip cache utilization, and hence increased performance. DCT is s
132. 5.5.1 Installing ScalableUPC

Mellanox ScalableUPC is installed as part of the MLNX_OFED package. Mellanox OFED 1.8.5 includes ScalableUPC Rev 2.2, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8.5, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as a source RPM as well, and can be downloaded from the Mellanox website.

Please note, the binary distribution of ScalableUPC is compiled with the following defaults:
- FCA support. FCA is disabled at runtime by default and must be configured prior to using it from ScalableUPC. For further information, please refer to the FCA User Manual.
- MXM support, enabled by default.

5.5.2 FCA Runtime Parameters

The following parameters can be passed to "upcrun" in order to change FCA support behavior:

Table 15 - Runtime Parameters

Parameter               Description
-fca_enable <0|1>       Disables/Enables FCA support at runtime (default: disable).
-fca_np <value>         Enables FCA support for collective operations if the number of processes in the job is greater than the fca_np value (default: 64).
-fca_verbose <level>    Sets the verbosity level for the FCA modules.
-fca_ops <op_list>      op_list: comma-separated list of collective op
133. addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.

9.2.1 Common Configuration, Interface and Addressing

Topology File (Optional)

An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this purpose, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.

For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).

To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option '-t <topology file name>'
2. Define the environment variable IBDIAG_TOPO_FILE

To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option '-s <local system name>'
2. Define the environment variable IBDIAG_SYS_NAME

9.2.2 InfiniBand Interface Definition

The diagnosti
134. against the peer port.

Synopsis

ibportstate [-d] [-e] [-v] [-V] [-D] [-G] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] <dest dr_path|lid|guid> <portnum> [<op> [<value>]]

Output Files

Table 31 lists the various flags of the command.

Table 31 - ibportstate Flags and Options

Flag           Optional/Mandatory   Default (If Not Specified)   Description
-h(help)       Optional                                          Print the help menu.
-d(ebug)       Optional                                          Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-e(rr_show)    Optional                                          Show send and receive errors (timeouts and others).
-v(erbose)     Optional                                          Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-V(ersion)     Optional                                          Show version info.
-D(irect)      Optional                                          Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...).
-G(uid)        Optional                                          Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'.
-s <smlid>     Optional                                          Use <smlid> as the target LID for SM/SA queries.
-C <ca_name>   Optional                                          Use the specified channel adapter or
135. age. Run the following command and compare the result to the value provided on the download page:

    host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

2.3 Installing Mellanox OFED

The installation script, mlnxofedinstall, performs the following:
- Discovers the currently installed kernel
- Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
- Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

2.3.1 Pre-installation Notes

- The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages. Pre-existing configuration files will be saved with the extension ".conf.rpmsave".
- If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).
- If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the "mlnx_add_kernel_support.sh" script located un
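The md5sum comparison above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the published digest is whatever value the download page shows (none is given in this excerpt).

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Same result as `md5sum <path>` prints in its first column."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def digests_match(computed, published):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return computed.strip().lower() == published.strip().lower()


# Small in-memory example so the helper can be exercised without an ISO image.
EXAMPLE_DIGEST = md5_of_stream(io.BytesIO(b"hello"))
```

In use, digests_match(md5_of_file(iso_path), published_value) should be true before the image is mounted and installed.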
136. ages:
- Fabric discovery
- Duplicated GUIDs detection
- Links in INIT state and unresponsive links detection
- Counters fetch
- Error counters check
- Routing checks
- Link width and speed checks
- Alias GUIDs check
- Subnet Manager check
- Partition keys check
- Nodes information

Return Codes
0 - Success
1 - Failure (with description)

9.4 ibdiagnet (of ibutils) - IB Net Diagnostic

This version of ibdiagnet is included in the ibutils package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils package, you need to specify the full path: /opt/bin/ibdiagnet

ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).

Synopsis

ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt] [-pm] [-pc] [-P <<PM>=<Value>>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-skip <ibdiag_check/s>] [-load_db <db_file>]

Options

-c <count>
-t <topo-file>
-s <sys-name>
-i <dev-index>
-p <port-num>
-o <out-dir>
-lw <1x|4x|12x>
-ls <2.5|5|10>
-pm
-pc
-P <<PM>=<Tr
137. aintErrors=100
RcvConstraintErrors=100
LinkIntegrityErrors=10
ExcBufOverrunErrors=10
VL15Dropped=100

> ibcheckerrs -v -T thresh1 2 1
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK

9.15 mstflint

Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.

If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.

To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.

Synopsis

mstflint [switches] <command> [parameters...]

Output Files

Table 36 lists the various switches of the utility, and Table 37 lists its commands.

Table 36 - mstflint Switches (Sheet 1 of 3)

Switch                Affected/Relevant Commands   Description
-h                                                 Print the help menu.
-hh                                                Print an extended help menu.
-d[evice] <device>    All                          Specify the device to which the Flash is connected.
-guid <GUID>          burn, sg                     GUID base value. 4 GUIDs are automatically assigned to the following values: guid -> node GUID; guid+1 -> port1; guid+2 -> port2; guid+3 -> sys
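The automatic assignment described for the -guid switch (base value, then base+1, +2 and +3) is simple arithmetic. The sketch below only illustrates that arithmetic and does not call mstflint. The sample base GUID is hypothetical, the "sys" key mirrors the source text, which is cut off after "sys", and wrapping at 64 bits is an assumption of this sketch.

```python
MASK64 = (1 << 64) - 1

# Offsets as listed in the table: guid, guid+1, guid+2, guid+3.
OFFSETS = (("node", 0), ("port1", 1), ("port2", 2), ("sys", 3))


def derive_guids(base_guid):
    """Return the four GUIDs derived from a hex base GUID string."""
    base = int(base_guid, 16)
    return {
        name: f"0x{(base + offset) & MASK64:016x}"
        for name, offset in OFFSETS
    }
```

For example, derive_guids("0x0002c90300001000") gives a node GUID ending in 1000 and port GUIDs ending in 1001 and 1002.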
138. ameter value in the attached rule.

When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule specifications. The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the Ethernet link type, since InfiniBand link type packets always include a QP number. For further information, please refer to the relevant man pages.

ibv_destroy_flow

    int ibv_destroy_flow(struct ibv_flow *flow_id)

Input parameters: ibv_destroy_flow requires struct ibv_flow, which is the return value of ibv_create_flow in case of success.
Output parameters: Returns 0 on success, or the value of errno on failure.
For further information, please refer to the ibv_destroy_flow man page.

Ethtool

The Ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool man page for all the ways to specify a flow.

Examples:

    ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2

All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).

    ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2

All packets that contain the above source IP address and destination port are to be steered into rx-ring 2. When the destination MAC is not given, the user's destination MAC is filled automatically.

    ethtool
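When many steering rules are generated from a script, it can help to build the ethtool argument list programmatically rather than by string concatenation. The helper below is a hypothetical sketch: it only assembles the arguments shown in the examples above ("flow-type", match fields, "loc", "action") and does not execute anything; the hyphenated keyword spellings follow the ethtool man page.

```python
def ethtool_steer_cmd(iface, flow_type, match, loc, action):
    """Build the argv list for an `ethtool -U` rule.

    `match` maps ethtool match keywords (e.g. "dst", "src-ip") to values;
    insertion order is preserved in the resulting command.
    """
    cmd = ["ethtool", "-U", iface, "flow-type", flow_type]
    for keyword, value in match.items():
        cmd += [keyword, str(value)]
    cmd += ["loc", str(loc), "action", str(action)]
    return cmd


# The two rules from the examples, expressed through the helper.
MAC_RULE = ethtool_steer_cmd("eth5", "ether", {"dst": "00:11:22:33:44:55"}, 5, 2)
TCP_RULE = ethtool_steer_cmd("eth5", "tcp4", {"src-ip": "1.2.3.4", "dst-port": 8888}, 5, 2)
```

The resulting list can be handed to subprocess.run on a host where the interface exists; here "loc" is the rule priority and "action" the target rx-ring, as in the text.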
139. ameters.

ScalableSHMEM contains support for the environment module system (http://modules.sf.net/). The modules configuration file can be found at: /opt/mellanox/openshmem/2.2/etc/shmem_modulefile

5.2 Message Passing Interface

5.2.1 Overview

Mellanox OFED for Linux includes the following Message Passing Interface (MPI) implementations over InfiniBand:
- Open MPI 1.4.6 & 1.6.1: an open source MPI-2 implementation by the Open MPI Project
- OSU MVAPICH2 1.7: an MPI-1 implementation by Ohio State University

These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 14 lists some useful MPI links.

Table 14 - Useful MPI Links

MPI Standard     http://www-unix.mcs.anl.gov/mpi
Open MPI         http://www.open-mpi.org
MVAPICH 2 MPI    http://mvapich.cse.ohio-state.edu/
MPI Forum        http://www.mpi-forum.org

This chapter includes the following sections:
- Section 5.2.2, "Prerequisites for Running MPI", on page 116
- Section 5.2.3, "MPI Selector - Which MPI Runs", on page 117
- Section 5.2.4, "Compiling MPI Applications", on page 118

5.2.2 Prerequisites for Running MPI

For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automati
140. an IB and Eth network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager".

1.5 RDMA over Converged Ethernet (RoCE)

RoCE allows InfiniBand (IB) transport applications to work over an Ethernet network. RoCE encapsulates the InfiniBand transport and the GRH headers in Ethernet packets bearing a dedicated ether type (0x8915). Thus, any VERB application that works in an InfiniBand fabric can work in an Ethernet fabric as well.

RoCE is enabled only for drivers that support VPI (currently, only mlx4).

When working with RDMA applications over the Ethernet link layer, the following points should be noted:
- The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API, but only the actions, such as joining a multicast group, that need to be taken when using the API.
- Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is displayed as zero when querying the port.
- With RoCE, the alternate path is not set for RC QP, and therefore APM is not supported.
- Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with the relevant values before establishing a connection. Hence, it is re
141. an options line in the file /etc/modprobe.d/mlx4_core.conf.

For example, to configure all HCAs to have Port1 as IB and Port2 as ETH, insert the following line:

    options mlx4_core port_type_array=1,2

To set HCAs individually, you may use a string of Domain:bus:device.function=x;y

For example, if you have a pair of HCAs whose PFs are 0000:04:00.0 and 0000:05:00.0, you may specify that the first will have both ports as IB and the second will have both ports as ETH, as follows:

    options mlx4_core port_type_array=0000:04:00.0-1;1,0000:05:00.0-2;2

Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.

4.13.7.2 Virtual Function InfiniBand Ports

Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments.

Sharing the same physical port(s) among multiple vHCAs is achieved as follows:
- Each vHCA port presents its own virtual GID table. The virtual GID table for the InfiniBand ports consists of a single entry (at index 0) that maps to a unique index in the physical GID table. The vHCA of the PF maps to physical GID index 0. To obtain GIDs for other vHCAs, alias GUIDs are reques
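The two forms of the options line can be generated from a small description of the desired port types. This is a sketch under stated assumptions: the numeric codes 1 (IB) and 2 (ETH) come from the example in the text, while the separators used for the per-device form ("-", ";" and ",") are an assumption, because the punctuation is lost in this copy of the manual. The function names are hypothetical.

```python
PORT_TYPES = {"ib": 1, "eth": 2}  # codes taken from the Port1=IB / Port2=ETH example
PREFIX = "options mlx4_core port_type_array="


def all_hcas_line(port1, port2):
    """Options line applying the same port types to every HCA."""
    return f"{PREFIX}{PORT_TYPES[port1]},{PORT_TYPES[port2]}"


def per_device_line(devices):
    """Options line for individually configured PFs.

    `devices` maps a PF address (domain:bus:device.function) to a
    (port1, port2) tuple. Separator characters are assumed, see above.
    """
    parts = [
        f"{address}-{PORT_TYPES[p1]};{PORT_TYPES[p2]}"
        for address, (p1, p2) in devices.items()
    ]
    return PREFIX + ",".join(parts)
```

Only PF addresses belong in the mapping, since the text states that VFs inherit their port types from the associated PF.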
142. ance int default

Appendix E: Lustre Compilation over MLNX_OFED

This procedure applies to RHEL/SLES OSs only.

To compile Lustre version 2.3.65 and higher:

    $ ./configure --with-o2ib=/usr/src/ofa_kernel/default/
    $ make rpms

To compile older Lustre versions:

    $ EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" ./configure --with-o2ib=/usr/src/ofa_kernel/default/
    $ EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" make rpms

For Lustre 2.1.3, due to a duplicate definition of the INVALID_UID macro, the following patch must be applied:

    --- lustre-2.1.3/lustre/include/lustre_cfg.h 2012-09-17 14:26:46.000000000 +0200
    +++ lustre-2.1.3/lustre/include/lustre_cfg.h.new 2013-09-07 10:45:07.121772824 +0200
    @@ -288,7 +288,9 @@
     #include <lustre/lustre_user.h>
    +#ifndef INVALID_UID
     #define INVALID_UID (-1)
    +#endif
143. and revoke remote access rights to a registered region in a dynamic fashion with less of a performance penalty
- grant different remote access rights to different remote agents and/or grant those rights over different ranges within a registered region

For further information, please refer to the InfiniBand specification document.

Memory Windows API cannot co-work with peer memory clients (PeerDirect).

4.20.1 Query Capabilities

Memory Windows are available if and only if the hardware supports them. To verify whether Memory Windows are available, run ibv_query_device. For example:

    struct ibv_device_attr device_attr;
    ibv_query_device(context, &device_attr);
    if (device_attr.device_cap_flags & IBV_DEVICE_MEM_WINDOW ||
        device_attr.device_cap_flags & IBV_DEVICE_MW_TYPE_2B) {
        /* Memory window is supported */
    }

4.20.2 Allocating Memory Window

Allocating a memory window is done by calling the ibv_alloc_mw verb:

    type_mw = IBV_MW_TYPE_2 / IBV_MW_TYPE_1
    mw = ibv_alloc_mw(pd, type_mw);

4.20.3 Binding Memory Windows

After being allocated, a memory window should be bound to a registered memory region. The memory region should have been registered using the IBV_ACCESS_MW_BIND access flag.

Binding a memory window of type 1 is done via the ibv_bind_mw verb:

    struct ibv_mw_bind mw_bind;
    ret = ibv_bind_mw(qp, mw, &mw_bind);

Binding m
144. 4. The UP is mapped to the TC as configured by the mlnx_qos tool, or by the lldpad daemon if DCBX is used.

With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.

4.5.5 Raw Ethernet QP Quality of Service Mapping

Applications open a Raw Ethernet QP using VERBs directly. The following is the RoCE QoS mapping flow:
1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP:
   - Sets attrs.ah_attrs.sl = up
   - Calls modify with av set in the mask
2. The UP is mapped to the TC as configured by the mlnx_qos tool, or by the lldpad daemon if DCBX is used.

When using Raw Ethernet QP mapping, the TOS/sk_prio to UP mapping is lost.

Performing the Raw Ethernet QP mapping forces the QP to transmit using the given UP. If packets with a VLAN tag are transmitted, the UP in the VLAN tag will be overwritten with the given UP.

4.5.6 Map Priorities with tc_wrap.py/mlnx_qos

A network flow that can be managed by QoS attributes is described by a User Priority (UP). A user's sk_prio is mapped to a UP, which in turn is mapped into a TC.

Indicating the UP

When the user uses sk_prio, it is mapped into a UP by the 'tc' tool. This is done by the tc_wrap.py tool, which gets a list of <= 16 comma-separated UPs and maps the sk_prio to the specified UP. For example, tc_wrap.py -i et
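The sk_prio to UP mapping described above takes a comma-separated list of at most 16 UPs. The sketch below models that list in plain Python. It assumes that position i in the list is the UP for sk_prio i, which the text implies but does not spell out, and that a UP is a 3-bit 802.1p priority (0 to 7). The function names are hypothetical and the sketch does not invoke tc_wrap.py.

```python
MAX_SK_PRIO_ENTRIES = 16
MAX_UP = 7  # a User Priority is a 3-bit 802.1p value


def parse_up_list(text):
    """Parse a comma-separated UP list such as "0,0,0,0,1,1,1,1"."""
    ups = [int(item) for item in text.split(",")]
    if len(ups) > MAX_SK_PRIO_ENTRIES:
        raise ValueError("at most 16 UPs may be listed")
    if any(up < 0 or up > MAX_UP for up in ups):
        raise ValueError("each UP must be between 0 and 7")
    return ups


def up_for_sk_prio(ups, sk_prio):
    """Return the UP for a given sk_prio (assumed: list index == sk_prio)."""
    return ups[sk_prio]
```

With the list "0,0,0,0,1,1,1,1", sk_prio values 0 to 3 map to UP 0 and values 4 to 7 map to UP 1 under this assumption.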
145. are also appropriate in case you wish to burn newer firmware that you have downloaded from the Mellanox Technologies Web site (http://www.mellanox.com > Downloads > Firmware).

Step 1. Start mst.

    host1# mst start

Step 2. Identify your target InfiniBand device for firmware update.

1. Get the list of InfiniBand device names on your machine.

    host1# mst status
    MST modules:
        MST PCI module loaded
        MST PCI configuration module loaded
        MST Calibre (I2C) module is not loaded
    MST devices:
    /dev/mst/mt25418_pciconf0  - PCI configuration cycles access. bus:dev.fn=02:00.0 addr.reg=88 data.reg=92 Chip revision is: A0
    /dev/mst/mt25418_pci_cr0   - PCI direct access. bus:dev.fn=02:00.0 bar=0xdef00000 size=0x100000 Chip revision is: A0
    /dev/mst/mt25418_pci_msix0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdeefe000 size=0x2000
    /dev/mst/mt25418_pci_uar   - PCI direct access. bus:dev.fn=02:00.0 bar=0xdc800000 size=0x800000

2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.

Step 3. Burn firmware.

Burn a firmware image from a .mlx file using the mlxburn utility (that is already installed on your machine). The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2.

    > flint -d /dev/mst/mt25418_pci_cr0 -i fw-25408-2_1_8000-MCX353A
146. are the domains at a descending order of priority:

User Verbs allows a user application QP to be attached into a specified flow when using the ibv_create_flow and ibv_destroy_flow verbs.

ibv_create_flow

    struct ibv_flow *ibv_create_flow(struct ibv_qp *qp, struct ibv_flow_attr *flow)

Input parameters:
- struct ibv_qp: the attached QP.
- struct ibv_flow_attr: attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers. The optional headers are detected by setting the size and num_of_specs fields. struct ibv_flow_attr can be followed by the optional flow headers structs:
    struct ibv_flow_spec_ib
    struct ibv_flow_spec_eth
    struct ibv_flow_spec_ipv4
    struct ibv_flow_spec_tcp_udp

For further information, please refer to the ibv_create_flow man page.

Be advised that from MLNX_OFED v2.0-3.0.0 and higher, the parameters (both the value and the mask) should be set in big-endian format.

Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter. The supported masks are:
- All-one mask: include the parameter value in the attached rule.
  Note: Since the VLAN ID in the Ethernet header is 12 bits long, the following parameter should be used: flow_spec_eth.mask.vlan_tag = htons(0x0fff).
- All-zero mask: ignore the par
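The mask semantics above (all-one includes a field, all-zero ignores it, and the VLAN ID uses a 12-bit mask in big-endian byte order) can be illustrated without any RDMA hardware. The sketch below is plain Python, not the verbs API: struct.pack("!H", ...) plays the role of htons for a 16-bit field, and the helper names are made up for illustration.

```python
import struct

VLAN_ID_MASK = 0x0FFF  # the VLAN ID occupies the low 12 bits of the 16-bit tag


def to_big_endian_u16(value):
    """Network (big-endian) byte representation of a 16-bit value."""
    return struct.pack("!H", value)


def field_matches(packet_value, rule_value, mask):
    """A field matches when packet and rule agree on every masked bit.

    mask == all ones  -> the full rule value must match
    mask == 0         -> the field is ignored
    """
    return (packet_value & mask) == (rule_value & mask)
```

For example, a tag of 0x2064 (priority bits set, VLAN ID 100) matches a rule for VLAN 100 under the 12-bit mask, because the priority bits fall outside the mask.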
147. ase on the CCT calculation on the current subnet size.

- The smaller the value of the parameter, the faster HCAs will respond to the congestion and throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter: marking_rate
  The values are: [0-0xffff]
  The default is: 0xa
- You can set the minimal packet size that can be marked with FECN. Any packet smaller than this size (in bytes) will not be marked with FECN. To do so, set the following parameter: packet_size
  The values are: [0-0x3fc0]
  The default is: 0x200
- When the number of errors exceeds 'max_errors' of send/receive errors or timeouts in less than 'error_window' seconds, the CC MGR will abort and will allow OpenSM to proceed. To do so, set the following parameters: max_errors, error_window
  The values are:
    max_errors = 0: zero tolerance; abort configuration on first error
    error_window = 0: mechanism disabled; no error checking
    [0-48K]
  The default is: 5

8.9.4.1 Congestion Control Manager Options File

Table 22 - Congestion Control Manager General Options File

Option File   Description                                                          Values
enable        Enables/disables the Congestion Control mechanism on the fabric nodes.   Values: <TRUE | FALSE>. Default: True
num_hosts     Indicates the number
148. ash gt gt Min number of 10 InfiniBand Fabric Diagnostic Utilities packets to be sent across each link default Enable verbose mode Provides a report of the fabric qualities Specifies the topology file name Specifies the local system name Meaningful only if a topology file is specified Specifies the index of the device of the port used to connect to the IB fabric in case of multiple devices on the local system Specifies the local device s port num used to connect to the IB fabric Specifies the directory where the output files will be placed default tmp Specifies the expected link width Specifies the expected link speed Dump all the fabric links pm Counters into ibdiagnet pm Reset all the fabric links pmCounters If any of the provided pm is greater then its provided value print it to screen skip lt skip option s gt Skip the executions of the selected checks Skip wt lt file name gt load_db lt file h help V version Vars Output Files options one or zero guids pm lo Write out the This flag is use changes from the named ibdiag ibn holds the IBNL f To use these fil variable named I The directory is directory provid name Load subne subnet discovery Note Some of and therefore wo These checks are status Prints the hel more can be specified dup guids gical state part ipoib all discovered topology into the given file ful if you later
149. ate required Found 1 device s requiring firmware update Device 1 Updating FW Done A restart is needed for updates to take effect Log File tmp MLNX OFED LINUX 2 1 0 0 9 10740 10gs fw update log In case your machine has the latest firmware no firmware update will occur and the installation script will print at the end of installation a message similar to the following Device 1 Device 0000 06 00 0 Part Number MCX354A FCB 1 Description ConnectX 3 VPI adapter card dual port OSFP FDR IB 56Gb s and 40GigE PCIe3 0 x8 8GT s RoHS R6 PSID MT 1090110019 Versions Current Available FW 2306 2 30 8000 PXE 3 4 0146 3 4 0146 Status Up to date 38 Mellanox Technologies Rev 2 1 1 0 0 In case your machine has an unsupported network adapter device no firmware update will occur and the error message below will be printed Please contact your hardware vendor for help on firmware updates P Error message Device 1 Device 0000 05 00 0 Part Number Description ESTEE MT_0DB0110010 Versions Current Available FW 2 92 WOOO N A Status No matching image found Step 4 In case the installation script performed firmware updates to your network adapter hardware it will ask you to reboot your machine Step 5 The script adds the following lines to etc security limits conf for the userspace com ponents such as MPI soft memlock unlimited hard memlock unlimited These settings unlimit the am
150. bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target

The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the 'fdisk -l' command.

3. Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than by just running /sys/class/infiniband_mad/umad<N>, where <N> is a digit.

srpd

The srpd service script allows automatic activation and termination of the srp_daemon utility on all system live InfiniBand ports.

srp_daemon

The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:
- Establish an SRP connection by itself (without the need to issue the "echo" command described in Section 4.1.2.2)
- Continue running in the background, detecting new targets and establishing SRP connections with them (daemon mode)
- Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N>, where <N> is a digit
- Enable High Availability operation (together with Device-Mapper Multipath)
- Have a configuration file that determines the targets to connect to

1. srp_daemon commands equivalent to ibsrpdm:
   "srp_daemon -a -o" is equivalent to "ibsrpdm"
   "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"

These srp_daemon commands can behave differently than
151. c login (i.e., password-less) onto the remote machines.

SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging in and running commands on remote computers and/or servers.

5.2.2.1 SSH Configuration

The following steps describe how to configure password-less access over SSH:

Step 1. Generate an ssh key on the initiator machine (host1).

    host1$ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/<username>/.ssh/id_rsa.
    Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
    The key fingerprint is:
    38 1b 29 d 4 08 00 4a 0e 50 0 05 44 e7 9 05 <username>@host1

Step 2. Check that the public and private keys have been generated.

    host1$ cd /home/<username>/.ssh/
    host1$ ls
    host1$ ls -la
    total 40
    drwx------ 2 root root 4096 Mar 5 04:57 .
    -rw------- 1 root root 1675 Mar 5 04:57 id_rsa
    -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub

Step 3. Check the public key.

    host1$ cat id_rsa.pub
    ssh-rsa AAAAB3NzaClyc2EAAAAB I WAAAQEA1zVY8VBHQh90kZN70A11bUQ74RXm4 zHeczyVxpYHaDPyDmqe zbYMKrCIVz d1 0bH ZkCOrpLYviU0oUHd3f vNTEMs0gcGg08PysU 12FyYjira2P1lxyg mkHLGGqVut fEMmABZ3wNCUg6J2X 3G ui
152. c tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option '-p <local port number>' (see below)
2. Define the environment variable IBDIAG_PORT_NUM

In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: '-i <index of local device>'
2. Define the environment variable IBDIAG_DEV_IDX

9.2.3 Addressing

This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies.

The following addressing modes can be used to define the IB ports:
- Using a Directed Route to the destination (tool option '-d'): This option defines a directed route of output port numbers from the local port to the destination.
- Using port LIDs (tool option '-l'): In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
153. ca0

    [root@lab104]# ibsrpdm -c -d /dev/infiniband/umad0
    id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
    [root@lab104]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

OR

You can edit /etc/infiniband/openib.conf to load the SRP driver and the SRP High Availability (HA) daemon automatically, that is, set "SRP_LOAD=yes" and "SRPHA_ENABLE=yes".

To set up and use the HA feature, you need the dm-multipath driver and the multipath tool. Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable/use the HA feature.

The following is an example of an SRP Target setup file:

    *********************** srpt.sh *******************************
    #!/bin/sh
    modprobe scst scst_threads=1
    modprobe scst_vdisk scst_vdisk_ID=100
    echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdi
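The target description printed by ibsrpdm and the string written to add_target carry the same five fields, so a small parser can validate a line before it is written. This is a sketch under assumptions: the "key=value" pairs separated by commas and the sysfs path layout are reconstructed (the punctuation is lost in this copy of the manual), and the helper names are hypothetical. Nothing here touches sysfs.

```python
REQUIRED_FIELDS = ("id_ext", "ioc_guid", "dgid", "pkey", "service_id")


def parse_target(line):
    """Split an ibsrpdm-style target line into a {field: value} dict."""
    fields = {}
    for item in line.strip().split(","):
        key, _, value = item.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def target_string(fields):
    """Rebuild the add_target string, failing if a required field is missing."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return ",".join(f"{name}={fields[name]}" for name in REQUIRED_FIELDS)


def add_target_path(hca, port):
    """sysfs file that receives the target string (assumed layout)."""
    return f"/sys/class/infiniband_srp/srp-{hca}-{port}/add_target"


EXAMPLE = ("id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,"
           "dgid=fe800000000000000002c90200226cf5,pkey=ffff,"
           "service_id=0002c90200226cf4")
```

On a real initiator, the string returned by target_string would be written to the path returned by add_target_path("mthca0", 1), which corresponds to the echo command in the example.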
154. cant bit
  NIC    Network Interface Card
  SW     Software
  VPI    Virtual Protocol Interconnect
  IPoIB  IP over InfiniBand
  PFC    Priority Flow Control
  PR     Path Record
  RDS    Reliable Datagram Sockets
  RoCE   RDMA over Converged Ethernet
Table 2 - Abbreviations and Acronyms (Sheet 2 of 2)
  Abbreviation / Acronym   Whole Word / Description
  SDP    Sockets Direct Protocol
  SL     Service Level
  SRP    SCSI RDMA Protocol
  MPI    Message Passing Interface
  EoIB   Ethernet over InfiniBand
  QoS    Quality of Service
  ULP    Upper Level Protocol
  VL     Virtual Lane
  vHBA   Virtual SCSI Host Bus Adapter
  uDAPL  User Direct Access Programming Library
Glossary
The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.
Table 3 - Glossary (Sheet 1 of 2)
  Channel Adapter (CA), Host Channel Adapter (HCA): An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).
  HCA Card: A network adapter card based on an InfiniBand channel adapter device.
  IB Devices: Integrated circuit implementing InfiniBand compliant communication.
  IB Cluster / Fabric / Subnet: A set of IB devices connected by IB cables.
  In-Band: A term assigned to admin
155. ce
To configure the GUID at index <n> on port <port_num>:
  cd /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids
  echo <your desired guid> > n
Example:
  cd /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids
  echo 0x002fffff8118 > 3
Notes:
  echo "0x0" means let the SM assign a value to that GUID.
  echo "0xffffffffffffffff" means delete that GUID.
  echo with any other value means request the SM to assign this GUID to this index.
Step 3. Read the administrative status of the GUID index. To read the administrative status of GUID index <m> on port <n>:
  cat /sys/class/infiniband/mlx4_0/iov/ports/<n>/admin_guids/<m>
Step 4. Check the operational state of a GUID:
  /sys/class/infiniband/mlx4_0/iov/ports/<n>/gids (where n = 1 or 2)
The values indicate which GIDs are actually configured on the firmware/hardware, and all the entries are read-only.
Step 5. Compare the value you read under the admin_guids directory at that index with the value under the gids directory, to verify that the change requested in Step 3 has been accepted by the SM and programmed into the hardware port GID table. If the value under admin_guids/<m> is different than the value under gids/<m>, the request is still in progress.
4.13.7.2.3 Partitioning IPoIB Communication using PKeys
PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by mapping a non de
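The admin_guids semantics described above (0x0, all ones, any other value) and the Step 5 comparison can be sketched in Python. This is a minimal illustration, not part of the driver: the function names are hypothetical, and it assumes that a gids entry is printed as a colon-separated 128-bit GID whose low 64 bits carry the GUID.

```python
import ipaddress

ALL_ONES = 0xFFFFFFFFFFFFFFFF  # value that means "delete that GUID"


def classify_admin_guid(value: int) -> str:
    """Describe what writing `value` to admin_guids/<index> requests."""
    if value == 0:
        return "sm-assigned"   # let the SM pick a GUID
    if value == ALL_ONES:
        return "delete"        # remove the GUID at this index
    return "request"           # ask the SM to assign this exact GUID


def request_in_progress(admin_guid: int, gid_text: str) -> bool:
    """Return True while the hardware GID does not yet match the request.

    Assumption: gid_text looks like 'fe80:0000:...:xxxx' and the GUID is
    the low 64 bits of that GID.
    """
    hw_guid = int(ipaddress.IPv6Address(gid_text)) & ALL_ONES
    return hw_guid != admin_guid
```

For example, `request_in_progress(0x0002c90300a1b2c3, "fe80:0000:0000:0000:0002:c903:00a1:b2c3")` reports that the request has completed, because the low 64 bits already match.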
156. ce levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources. The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
  - Up to 15 Virtual Lanes (VLs) carry traffic in a non-blocking manner.
  - Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served.
  - Packets carry a class of service marking in the range 0 to 15 in their header SL field.
  - Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL).
  - The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.
DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.
4.4.2 QoS Architecture
QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the chronology approach
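The table lookup VL = SL-to-VL-MAP(in-port, out-port, SL) mentioned above can be illustrated with a short Python sketch. The table contents and the function name are made up for the example; a real switch holds one such table per port pair, programmed by the SM.

```python
def sl_to_vl(table, in_port, out_port, sl):
    """Look up the output VL for a packet, given its ports and SL.

    `table` maps (in_port, out_port) to a list of 16 VL numbers,
    indexed by SL (0..15). This layout is an illustrative assumption.
    """
    if not 0 <= sl <= 15:
        raise ValueError("SL must be in the range 0..15")
    return table[(in_port, out_port)][sl]


# Example table: SLs 0-7 share VL0, SLs 8-15 share VL1.
example_table = {(1, 2): [0] * 8 + [1] * 8}
```

With this table, a packet entering on port 1 and leaving on port 2 with SL 9 is placed on VL 1.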
157. ce the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage. The user can override the node list manually. If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting: After ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.
At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.
UPDN Algorithm Usage
Activation through OpenSM: Use the '-R updn' option (instead of the old '-u') to activate
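The ranking stage described above (roots get rank 0, the remaining switches are ranked by BFS) can be sketched as follows. This is an illustrative Python sketch, not OpenSM code: the adjacency dictionary and node names are hypothetical, and `obeys_up_down` encodes the commonly stated form of the Up/Down rule (a path may not go up again after it has gone down), which this text refers to but does not spell out.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """Assign rank 0 to roots and BFS distance from the roots to the rest."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def obeys_up_down(path, rank):
    """Return False if the path moves toward the roots after moving away."""
    gone_down = False
    for here, there in zip(path, path[1:]):
        if rank[there] > rank[here]:
            gone_down = True              # hop away from the roots
        elif rank[there] < rank[here] and gone_down:
            return False                  # "up" after "down" is forbidden
    return True


fabric = {
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
}
ranks = rank_switches(fabric, ["spine1", "spine2"])
```

In this two-level example a leaf-spine-leaf path is legal, while a spine-leaf-spine path is not, which matches the remark that switches inside spine systems cannot reach each other by LID routing.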
158. ch (Down)
  sw417:/BXOFED-1.5.2-20101128-1524 # ibdev2netdev
  mlx4_0 port 1 ==> eth5 (Down)
  mlx4_0 port 1 ==> ib0 (Down)
  mlx4_0 port 2 ==> ib1 (Down)
  mlx4_1 port 1 ==> eth2 (Down)
  mlx4_1 port 2 ==> eth3 (Down)
9.9 ibstatus
Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.
Synopsis:
  ibstatus [-h] [<device name>[:<port>]]
Output Files: Table 30 lists the various flags of the command.
Table 30 - ibstatus Flags and Options
  Flag | Optional / Mandatory | Default (If Not Specified) | Description
  -h | Optional | | Print the help menu
  <device> | Optional | All devices | Print information for the specified device. May specify more than one device
  <port> | Optional, but requires specifying a device name | All ports of the specified device | Print information for the specified port only (of the specified device)
Examples:
1. List the status of all available InfiniBand devices and their ports.
  > ibstatus
  Infiniband device 'mlx4_0' port 1 status:
    default gid:  fe80:0000:0000:0000:0000:0000:0007:3896
    base lid:     0x3
    sm lid:       0x3
    state:        4: ACTIVE
    phys state:   5: LinkUp
    rate:         20 Gb/sec (4X DDR)
  Infiniband device
159. cling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.
Extending this argument shows that, in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches, on the condition that they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:
  [Figure: 6x6 2D torus switch diagram with axes x = 0..5 and y = 0..5; the layout is not recoverable from the extracted text]
Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.
As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent in a dimension that is not the last dimension routed by DOR; here the failed switches are O and T:
  [Figure: 6x6 2D torus switch diagram with axes x = 0..5 and y = 0..5; the layout is not recoverable from the extracted text]
In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T
160. commended working with RDMA-CM to establish a connection, as it takes care of filling the path record structure.
The GID table for each port is populated with N+1 entries, where N is the number of IP addresses that are assigned to all network devices associated with the port, including VLAN devices, alias devices and bonding masters. The only exception to this rule is a bonding master of a slave in a DOWN state. In that case, a matching GID to the IP address of the master will not be present in the GID table of the slave's port.
The first entry in the GID table (at index 0) for each port is always present and equal to the link-local IPv6 address of the net device that is associated with the port. Note that even if the link-local IPv6 address is not set, index 0 is still populated.
GID format can be of 2 types, IPv4 and IPv6. An IPv4 GID is an IPv4-mapped IPv6 address (1), while an IPv6 GID is the IPv6 address itself.
(1) For the IPv4 address A.B.C.D, the corresponding IPv4-mapped IPv6 address is ::ffff:A.B.C.D
2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.
2.1 Hardware and Software Requirements
Table 1 - Software and Hardware Requirements
  Requirements | Description
  Platforms | A serv
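The IPv4-mapped GID format from the footnote above (A.B.C.D becomes ::ffff:A.B.C.D) can be reproduced with the Python standard library. This is a small sketch for illustration only; the function name is hypothetical and the driver does not expose such a helper.

```python
import ipaddress


def ipv4_to_gid(addr: str) -> bytes:
    """Return the 16-byte GID for an IPv4 address (IPv4-mapped IPv6)."""
    v4 = ipaddress.IPv4Address(addr)          # validates the dotted quad
    mapped = ipaddress.IPv6Address(f"::ffff:{v4}")
    return mapped.packed                      # 16 raw bytes, network order


gid = ipv4_to_gid("192.168.1.10")
```

The resulting GID consists of ten zero bytes, two 0xff bytes, and the four bytes of the IPv4 address.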
161. d the following lines for the machine(s) you wish to boot from the iSCSI target:
  filename "";
  option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn";
The following is an example for configuring an IB/ETH device to boot from an iSCSI target:
  host host1 { filename
Appendix B: SRP Target Driver
The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many I/O modes on real or virtual devices in the back end:
  1. scst_vdisk: fileio and blockio modes. This allows turning software RAID volumes, LVM volumes, IDE disks, block devices and normal files into SRP LUNs.
  2. NULLIO mode allows measuring the performance without sending I/Os to real devices.
B.1 Prerequisites and Installation
  1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target. On distribution default kernels, you can run scst_vdisk in blockio mode to obtain good performance.
  2. Download and install the SCST driver. The supported version is 1.0.1.1.
    a. Download scst-1.0.1.1.tar.gz from http
162. d to same target port:
  a. Use only ports which have the same MinHop.
  b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC group.
  c. If none, prefer those which go through another NodeGuid.
  d. Fall back to the number of paths method if all go to the same node.
8.5.1 Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in the fabric switches, unless the -r (--reassign_lids) option is specified. This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.
If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.
In the case of using the file-based routing, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).
8.5.2 Min Hop Algorithm
The Min Hop algorithm is invoked by defau
163. default value for this option is 0.
  sg_tablesize: A number in the range 1..2048 specifying the maximum S/G list length the SCSI layer is allowed to pass to ib_srp. Specifying a value that exceeds cmd_sg_entries is only safe with partial memory descriptor list support enabled (allow_ext_sg=1).
  comp_vector: A number in the range 0..n-1 specifying the MSI-X completion vector. Some HCAs allocate multiple (n) MSI-X vectors per HCA port. If the IRQ affinity masks of these interrupts have been configured such that each MSI-X interrupt is handled by a different CPU, then the comp_vector parameter can be used to spread the SRP completion workload over multiple CPUs.
  tl_retry_count: A number in the range 2..7 specifying the IB RC retry count.
4.1.2.3 SRP Tools - ibsrpdm, srp_daemon and srpd Service Script
To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
  - Detect targets on the fabric reachable by the Initiator (for Step 1)
  - Output target attributes in a format suitable for use in the above "echo" command (Step 2)
  - A service script, srpd, which may be started at stack startup
The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages. Below, seve
164. der the docs directory.
On Redhat and SLES distributions with an errata kernel installed, there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.
Usage:
  mlnx_add_kernel_support.sh -m|--mlnx_ofed <path to MLNX_OFED directory> [--make-iso|--make-tgz]
    [--make-iso]                  Create MLNX_OFED ISO image.
    [--make-tgz]                  Create MLNX_OFED tarball. (Default)
    [-t|--tmpdir <local work dir>]
    [--kmp]                       Enable KMP format if supported.
    [-k|--kernel <kernel version>]  Kernel version to use.
    [-s|--kernel-sources <path to the kernel sources>]  Path to kernel headers.
    [-v|--verbose]
    [-n|--name]                   Name of the package to be created.
    [-y|--yes]                    Answer "yes" to all questions.
(1) The firmware will not be updated if you run the install script with the '--without-fw-update' option.
Example: The following command will create a MLNX_OFED_LINUX ISO image for RedHat 6.3 under the /tmp directory.
  MLNX_OFED_LINUX-2.1-1.0.0-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-2.1-1.0.0-rhel6.3-x86_64/ --make-tgz
  Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.2 under /tmp directory.
  All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
  Do you want to continue? [y/N]: y
  See log file /tmp/mlnx_ofed_iso.21642.log
  Building OFED
165. e spinlocks enabled. Used by applications that are single threaded and would like to save the overhead of taking spinlocks.
  MLX5_CQE_SIZE: 64 - completion queue entry size is 64 bytes (default); 128 - completion queue entry size is 128 bytes.
  MLX5_SCATTER_TO_CQE: Small buffers are scattered to the completion queue entry and manipulated by the driver. Valid for RC transport. Default is 1, otherwise disabled.
1.3.3 Mid-layer Core
Core services include the management interface (MAD), the connection manager (CM) interface, and the Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.
1.3.4 ULPs
IPoIB
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB prepends an encapsulation header to the IP datagrams and sends the result over the InfiniBand transport service. The transport service is Unreliable Datagram (UD) by default, but it may also be configured to be Reliable Connected (RC). The interface supports unicast, multicast and broadcast. For details, see Chapter 4.3, "IP over InfiniBand".
iSER
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be trans
166. e options
  Failed to interact with IB fabric
  Failed to use local device or local port
  Failed to use Topology File
  Failed to load required Package
ibdiagpath - IB diagnostic path
ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path.
The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted, as it may change.
ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in the -p option. Otherwise, an error is reported.
When ibd
167. eatures provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services.
Section 4.1.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target.
4.1.2 SRP Initiator
This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from http://www.t10.org.
The SRP Initiator supports:
  - Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
  - Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
  - Basic functionality, task management and limited error handling
4.1.2.1 Loading SRP Initiator
To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes". For the changes to take effect, run: /etc/init.d/openibd restart
Note: When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).
168. ecifies the file that contains the process ID of the opensm daemon. The default is /var/run/opensm.pid.
  --max_seq_redisc
    Specifies the maximum number of failed discovery loops done by the SM before completing the whole heavy sweep cycle.
  --mc_secondary_root_guid <GUID in hex>
    This option defines the guid of the multicast secondary root switch.
  --mc_primary_root_guid <GUID in hex>
    This option defines the guid of the multicast primary root switch.
  --guid_routing_order_no_scatter
    Don't use scatter for ports defined in the guid_routing_order file.
  --pr_full_world_queries_allowed
    This option allows OpenSM to respond to full World Path Record queries (path record for each pair of ports in a fabric).
  --enable_crashd
    This option causes OpenSM to run a Crash Daemon child process that allows backtrace dump in case of fatal terminating signals.
  --log_prefix <prefix text>
    Prefix to syslog messages from OpenSM.
  --verbose, -v
    This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.
  -V
    This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.
  -D <flags>
    This option sets the log verbosity level. A flags field must follow t
169. ed:
  torus|mesh x_radix[m|M|t|T] y_radix[m|M|t|T] z_radix[m|M|t|T]
Either torus or mesh must be the first keyword in the configuration, and sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology.
Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).
  xp_link sw0_GUID sw1_GUID
  yp_link sw0_GUID sw1_GUID
  zp_link sw0_GUID sw1_GUID
  xm_link sw0_GUID sw1_GUID
  ym_link sw0_GUID sw1_GUID
  zm_link sw0_GUID sw1_GUID
These keywords are used to seed the torus/mesh topology. For example, "xp_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the positive x direction, while "xm_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch.
In general, it is not necessary to configure both the positive and negative directions fo
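The statement above that "mesh 3T 4 5" and "torus 3 4M 5M" describe the same topology can be checked with a small parser. This Python sketch is an illustration under the rules quoted in this item (the first keyword sets the default, and an m/M or t/T suffix overrides it per dimension); it is not the torus-2QoS parser, and the function name is hypothetical.

```python
def parse_topology(line):
    """Parse 'torus|mesh x_radix y_radix z_radix' into (radix, looped) pairs."""
    words = line.split()
    if len(words) != 4 or words[0] not in ("torus", "mesh"):
        raise ValueError("expected: torus|mesh x_radix y_radix z_radix")
    default_looped = words[0] == "torus"
    dims = []
    for spec in words[1:]:
        suffix = spec[-1]
        if suffix in "mM":
            dims.append((int(spec[:-1]), False))    # open (mesh) dimension
        elif suffix in "tT":
            dims.append((int(spec[:-1]), True))     # looped (torus) dimension
        else:
            dims.append((int(spec), default_looped))
    return dims
```

Both spellings from the text parse to a looped x dimension of radix 3 and open y and z dimensions of radix 4 and 5.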
170. ed Compare and Swap (MskCmpSwap)
The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64-bit target data for the "compare" check, as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:
  | atomic_response = *va
  | if (!((compare_add ^ *va) & compare_add_mask)) then
  |     *va = (*va & ~(swap_mask)) | (swap & swap_mask)
  |
  | return atomic_response
The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.
4.7.1.2 Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:
  | bit_adder(ci, b1, b2, *co)
  | {
  |     value = ci + b1 + b2
  |     *co = !!(value & 2)
  |     return value & 1
  | }
  |
  | #define MASK_IS_SET(mask, attr) (!!((mask) & (attr)))
  | bit_position = 1
  | carry = 0
  | atomic_response = 0
  |
  | for i = 0 to 63
  | {
  |     if (i != 0)
  |         bit_position = bit_position << 1
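The two masked atomics described above can be modelled on plain 64-bit integers. The Python sketch below is an illustration, not HCA behavior: the function names are hypothetical, the operations are of course not atomic in Python, and for the fetch-and-add it assumes that a set bit in the boundary mask marks the most significant bit of a field, so the carry out of that bit is discarded.

```python
MASK64 = (1 << 64) - 1


def masked_cmp_swap(target, compare, compare_mask, swap, swap_mask):
    """Return (response, new_target) for a masked compare-and-swap."""
    response = target                          # the original value is returned
    if ((compare ^ target) & compare_mask) == 0:
        # Replace only the bits selected by swap_mask.
        target = (target & ~swap_mask & MASK64) | (swap & swap_mask)
    return response, target


def masked_fetch_add(target, addend, boundary_mask):
    """Return (original, sum) where carries stop at field boundaries."""
    result = 0
    carry = 0
    for i in range(64):
        bit = 1 << i
        total = carry + ((target >> i) & 1) + ((addend >> i) & 1)
        if total & 1:
            result |= bit
        # Assumption: a boundary bit ends the field, dropping its carry.
        carry = 0 if boundary_mask & bit else total >> 1
    return target, result
```

For instance, comparing only the low byte and swapping only the high byte of 0xAB12 leaves the low byte untouched, and adding 1 to 0x00FF with a boundary at bit 7 wraps the low byte to zero without disturbing the next field.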
171. ed correctly
  173  Failed to start the mst driver
2.3.3 Installation Procedure
Step 1. Login to the installation machine as root.
Step 2. Mount the ISO image on your machine:
  host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt
Step 3. Run the installation script:
  ./mlnxofedinstall
  Logs dir: /tmp/MLNX_OFED_LINUX-2.1-0.0.9.10740.logs
  This program will install the MLNX_OFED_LINUX package on your machine.
  Note that all other Mellanox, OFED, or Distribution IB packages will be removed.
  Do you want to continue? [y/N]: y
  Uninstalling the previous version of MLNX_OFED_LINUX
  /bin/rpm --nosignature -e --allmatches --nodeps libmverbs libmverbs.i686 libmverbs-devel libmverbs-devel.i686 libmqe libmqe.i686 libmqe-devel libmqe-devel.i686
  Starting MLNX_OFED_LINUX-2.1-0.0.9 installation ...
  Installing mlnx-ofa_kernel RPM
  Preparing...                ##################################################
  mlnx-ofa_kernel             ##################################################
  Installing kmod-mlnx-ofa_kernel 2.1 RPM
  Preparing...                ##################################################
  kmod-mlnx-ofa_kernel        ##################################################
  Installing mlnx-ofa_kernel-devel RPM
  Preparing...                ##################################################
  mlnx-ofa_kernel-devel       ##################################################
  Installing kmod kerne
172. emory window type 2B is done via the ibv_post_send verb and a specific Work Request (WR) with opcode IBV_WR_BIND_MW. Prior to binding, please make sure to update the existing rkey:
  ibv_inc_rkey(mw->rkey)
Invalidating Memory Window
Before rebinding a Memory Window of type 2, it must be invalidated using the ibv_post_send verb and a specific WR with opcode IBV_WR_LOCAL_INV.
Deallocating Memory Window
Deallocating a memory window is done using the ibv_dealloc_mw verb:
  ibv_dealloc_mw(mw)
5 HPC Features
5.1 Shared Memory Access
The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.
The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabilities provide an excellent low-level interface for PGAS applications.
A SHMEM program is of a single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run
173. engine, -R <engine name>
    This option chooses the routing engine(s) to use instead of the default Min Hop algorithm. Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. If all configured routing engines fail, OpenSM will always attempt to route with Min Hop, unless 'no_fallback' is included in the list of routing engines. Supported engines: updn, file, ftree, lash, dor, torus-2QoS.
  --do_mesh_analysis
    This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing.
  --lash_start_vl <vl number>
    Sets the starting VL to use for the lash routing algorithm. Defaults to 0.
  --sm_sl <sl number>
    Sets the SL to use to communicate with the SM/SA. Defaults to 0.
  --connect_roots, -z
    This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases this can violate the "pure" deadlock-free algorithm, so use it carefully.
  --ucast_cache, -A
    This option enables the unicast routing cache to prevent routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing
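The fallback behavior described for the routing engine list (try the engines in order, then fall back to Min Hop unless 'no_fallback' is listed) can be sketched as follows. This Python sketch only models the selection order: `try_engine` is a hypothetical callback standing in for an actual routing attempt, and the name "minhop" for the default engine is used here for illustration.

```python
def select_routing_engine(engine_list, try_engine):
    """Return the first engine that succeeds, or None if routing fails.

    engine_list: comma-separated names, e.g. "ftree,updn".
    try_engine:  callable taking an engine name and returning True on success.
    """
    names = [name.strip() for name in engine_list.split(",") if name.strip()]
    allow_fallback = "no_fallback" not in names
    for name in names:
        if name == "no_fallback":
            continue                      # a marker, not a routing engine
        if try_engine(name):
            return name
    if allow_fallback and try_engine("minhop"):
        return "minhop"
    return None
```

With the list "ftree,updn" on a fabric where only updn succeeds, the sketch returns "updn"; with "ftree,no_fallback" and a failing ftree, it returns None instead of trying Min Hop.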
174. equency select: Max performance
  Memory:
    Memory speed:          Max performance
    Memory channel mode:   Independent
    Node Interleaving:     Disabled / NUMA
    Channel Interleaving:  Enabled
    Thermal Mode:          Performance
(1) Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when hyper-threading is enabled.
7.1.3.4 AMD Processors
The following table displays the recommended BIOS settings in machines with AMD based processors.
Table 19 - Recommended BIOS Settings for AMD Processors
  BIOS Option | Values
  General:
    Operating Mode / Power profile | Maximum Performance
  Processor:
    C-States | Disabled
    Turbo mode | Disabled
    HPC Optimizations | Enabled
    CPU frequency select | Max performance
  Memory:
    Memory speed | Max performance
    Memory channel mode | Independent
    Node Interleaving | Disabled / NUMA
    Channel Interleaving | Enabled
    Thermal Mode | Performance
7.2 Performance Tuning for Linux
You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on differe
175. er lmc 1 priority smkey bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
  -l <LMC>
    This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e. multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.
  -p <PRIORITY>
    This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. Range goes from 0 (lowest priority) to 15 (highest).
  -k <SM Key>
    This option specifies the SM's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in a host byte order. It is fixed now, but you may need this option to interoperate with an old OpenSM running on a little endian machine.
  --reassign_lids, -r
    This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.
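The relationship between LMC and the number of LIDs per port (2^LMC, with LMC in the range 0-7) can be illustrated with a short Python sketch. The function name is hypothetical, and the alignment check reflects the InfiniBand convention that a port's base LID has its low LMC bits cleared, which is not stated in this section.

```python
def lids_for_port(base_lid, lmc):
    """Return the list of LIDs that address a port with the given LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    count = 1 << lmc                      # 2^LMC LIDs per port
    if base_lid % count != 0:
        # Assumption from the IB architecture: low LMC bits of the base LID are 0.
        raise ValueError("base LID must be a multiple of 2^LMC")
    return list(range(base_lid, base_lid + count))
```

With LMC = 2, a port with base LID 8 answers to LIDs 8 through 11, which is what lets the SM route each LID over a different path. With the default LMC = 0 the port has exactly one LID.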
176. er platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
      - MT27508 ConnectX-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
      - MT4113 Connect-IB (IB) (firmware: fw-Connect-IB)
    For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file.
  Required Disk Space for Installation | 1GB
  Device ID | For the latest list of device IDs, please visit the Mellanox website.
  Operating System | Linux operating system. For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
  Installer Privileges | The installation requires administrator privileges on the target machine.
2.2 Downloading Mellanox OFED
Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display. The following example shows a system with an installed Mellanox HCA:
  # lspci -v | grep Mellanox
  06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
  Subsystem: Mellanox Technologies Device 0024
Step 2. Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 3. Use the md5sum utility to confirm the file integrity of your ISO im
177. erations
  -fca_ops <+/->[op_list]   Enables/disables only the specified operations.
  -fca_ops <+/->            Enables/disables all operations.
By default, all operations are enabled. Allowed operation names are: barrier (br), bcast (bt), reduce (rc), allgather (ag).
Each operation can also be enabled/disabled via an environment variable:
  GASNET_FCA_ENABLE_BARRIER
  GASNET_FCA_ENABLE_BCAST
  GASNET_FCA_ENABLE_REDUCE
Note: All the operations are enabled by default.
5.5.2.1 Enabling FCA Operations through Environment Variables in ScalableUPC
This method can be used to control UPC FCA offload from the environment using the job scheduler srun utility. The valid values are: 1 - enable, 0 - disable.
To enable a specific operation with shell environment variables in ScalableUPC:
  % export GASNET_FCA_ENABLE_BARRIER=1
  % export GASNET_FCA_ENABLE_BCAST=1
  % export GASNET_FCA_ENABLE_REDUCE=1
5.5.2.2 Controlling FCA Offload in ScalableUPC using Environment Variables
To enable the FCA module under ScalableUPC:
  % export GASNET_FCA_ENABLE_CMD_LINE=1
To set the FCA verbose level:
  % export GASNET_FCA_VERBOSE_CMD_LINE=10
To set the minimal number of processes threshold to activate FCA:
  % export GASNET_FCA_NP_CMD_LINE=1
ScalableUPC contains a modules configuration file (http://modules.sf.net) which can be found at /opt/mellanox/bupc/2.2/etc/bupc_modulefile
178. erify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The following example shows how to verify the configuration:
         host1$ ifconfig ib0
         ib0   Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
               inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
               UP BROADCAST MULTICAST MTU:65520 Metric:1
               RX packets:0 errors:0 dropped:0 overruns:0 frame:0
               TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
               collisions:0 txqueuelen:128
               RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
     Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).
     4.3.4 Subinterfaces
     You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.
     This section describes how to:
     - Create a subinterface (Section 4.3.4.1)
     - Remove a subinterface (Section 4.3.4.2)
     4.3.4.1 Creating a Subinterface
     In the following procedure, ib0 is used as an example of an IB subinterface.
     To create a child interface (subinterface), follow this procedure:
     Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 1 will
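The PKey rule stated in Step 1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of Mellanox OFED; the function name effective_pkey is hypothetical, and the only rule it encodes is the one given above (the actual PKey is the chosen 16-bit value with the most significant bit set).

```python
def effective_pkey(value: int) -> int:
    """Return the PKey actually used for a chosen value (MSB forced on)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("PKey value must fit in 16 bits")
    # Bit 15 is the most significant bit of a 16-bit number.
    return value | 0x8000


if __name__ == "__main__":
    for chosen in (0, 1, 0x30):
        print(f"chosen {chosen:#06x} -> actual PKey {effective_pkey(chosen):#06x}")
```

Validating the range before setting the bit keeps an out-of-range input from silently turning into a different PKey.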
179. ernet
     - ib: Infiniband
     - auto: Link sensing mode. Detect port type based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.
     The port link type can be configured for each device in the system at run time using the /sbin/connectx_port_config script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.
     This utility also has a non-interactive mode:
         /sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]
     6.2 Auto Sensing
     Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet). For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet.
     6.2.1 Enabling Auto Sensing
     Upon driver start-up:
     1. Sense the adapter card's port type:
        If a valid cable or module is connected (QSFP, SFP+, or SFP with EEPROM in the cable/module):
        - Set the port type to the sensed link type (IB/Ethernet)
        Otherwise
180. es providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.
     MLNX_OFED v2.0 or later comes with a pre-installed version of FCA v2.x.
     FCA is built on the following main principles:
     - Topology-aware Orchestration: The MPI collective logical tree is matched to the physical topology. The collective logical tree is constructed to assure:
         - Maximum utilization of fast inter-core communication
         - Distribution of the results
     - Communication Isolation: Collective communications are isolated from the rest of the traffic in the fabric using a private virtual network (VLane), eliminating contention with other types of traffic.
     After MLNX_OFED installation, FCA can be found in the /opt/mellanox/fca folder. For further information on configuration instructions, please refer to the FCA User Manual.
     5.5 ScalableUPC
     Unified Parallel C (UPC) is an extension of the C programming language designed for high-performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared, partitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of paral
181. es setting of the usual linear routing table (LFT). Therefore, no action is required to clear the Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.
     8.8.4 Querying Adaptive Routing Tables
     When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid, thus the standard tools that query the LFT (e.g. smpquery, dump_lfts.sh, and others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the smparquery tool that is installed as a part of the Adaptive Routing Manager package. To see its usage details, run smparquery -h.
     8.8.5 Adaptive Routing Manager Options File
     The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following:
     1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the event_plugin_options option in the file:
            # Options string that would be passed to the plugin(s)
            event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
     2. Run Subnet Manager with the new options file:
            opensm -F <options-file-name>
     The AR Manager options file contains two types of parameters:
     1. General options: Options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric.
     2. Per-switch options: Options which describe specific switch behavior.
     Note the following: Adapt
182. etects the new virtual interface that is attached to the same bridge as the eIPoIB interface and creates a new IPoIB instance for it in order to send/receive data. As a result, a number of IPoIB interfaces (ibX.Y) are shown as being created/destroyed, and are being enslaved to the corresponding ethX interface to serve any active VIF in the system according to the set configuration. This process is done automatically by the ipoibd service.
     To see the list of IPoIB interfaces enslaved under an eth_ipoib interface:
         # cat /sys/class/net/ethX/eth/vifs
     For example:
         # cat /sys/class/net/eth5/eth/vifs
         SLAVE=ib0.1 MAC=9a:c2:1:d7:3b:63 VLAN=N/A
         SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
         SLAVE=ib0.3 MAC=52:54:00:60:55:89 VLAN=N/A
     Each ethX interface has at least one ibX.Y slave to serve the PIF itself. In the VIFs list of ethX, you will notice that ibX.1 is always created to serve applications running from the Hypervisor on top of the ethX interface directly.
     For InfiniBand applications that require native IPoIB interfaces (e.g. CMA), the original IPoIB interfaces ibX can still be used. For example, CMA and ethX drivers can co-exist and make use of IPoIB ports: CMA can use ib0, while the eth0.ipoib interface will use ibX.Y interfaces.
     To see the list of eIPoIB interfaces:
         # cat /sys/class/net/eth_ipoib_interfaces
     For example:
         # cat /sys/class/net/eth_ipoib_interfaces
         eth4 over IB port: ib0
         eth5 over IB port: ib1
183. f High Availability
     Initialization: Execute after each boot of the driver:
     1. Execute modprobe dm-multipath
     2. Execute modprobe ib_srp
     3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
     4. Execute for each port and each HCA:
            srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>
        This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.
     Now it is possible to access the SRP LUNs on /dev/mapper/.
     It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.
     It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the 'blacklist' section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed.
     Automatic Activation of High Availability
     - Set the value of SRP_DAEMON_ENABLE in /etc/infiniband/openib.conf to "yes". For the changes in openib.conf to take effect, run:
            /etc/init.d/openibd restart
     - Start the srpd service, run:
            service srpd start
     - From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/. It is possible that regular (non-SRP) LUNs may also be present; the SRP LUNs may be identified by their name.
     - It is possible to see the output o
184. f the SRP daemon in /var/log/srp_daemon.log
     Shutting Down SRP
     SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown.
     Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases:
     1. Without High Availability
        When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.
     2. After Manual Activation of High Availability
        If you manually activated SRP High Availability, perform the following steps:
        a. Unmount all SRP partitions that were mounted.
        b. Stop service srpd (Kill the SRP daemon instances).
        c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
        d. Run: multipath -F
     3. After Automatic Activation of High Availability
        If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
     4.2 iSCSI Extensions for RDMA (iSER)
     iSCSI Extensions for RDMA (iSER) is currently at beta level. Please be aware that the content below is subject to change.
     4.2.1 Overview
     iS
185. face is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath.
     4.8.1 Enabling the eIPoIB Driver
     Once the mlnx_ofed driver installation is completed, perform the following:
     Step 1. Open the /etc/infiniband/openib.conf file and include:
         E_IPOIB_LOAD=yes
     Step 2. Restart the InfiniBand drivers:
         /etc/init.d/openibd restart
     4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver
     When eth_ipoib is loaded, a number of eIPoIB interfaces are created, with the following default naming scheme: ethX, where X represents the ETH port available on the system.
     To check which eIPoIB interfaces were created:
         cat /sys/class/net/eth_ipoib_interfaces
     For example, on a system with a dual port HCA, the following two interfaces might be created: eth4 and eth5.
         cat /sys/class/net/eth_ipoib_interfaces
         eth4 over IB port: ib0
         eth5 over IB port: ib1
     These interfaces can be used to configure the network for the guest. For example, if the guest has a VIF that is connected to the Virtual Bridge br0, then enslave the eIPoIB interface to br0 by running:
         # brctl addif br0 ethX
     In a RHEL KVM environment, there are other methods to create/configure your virtual network (e.g. macvtap). For additional information, please refer to the Red Hat User Manual.
     The IPoIB daemon (ipoibd) d
186. fault full membership PKey to virtual index 0 and mapping the default PKey to a virtual pkey index other than zero.
     The below describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2. In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2 will not be able to communicate with each other, nor with Dom0.
     This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm-1s will have the same PKey and both vm-2s will have the same PKey (different from the vm-1's), and the Dom0's will have the default PKey (different from the vm's PKeys at index 0).
     OpenSM must be used to configure the physical PKey tables on both hosts.
     - The physical PKey table on both hosts (Dom0) will be configured by OpenSM to be:
           index 0 = 0xffff
           index 1 = 0xb000
           index 2 = 0xb030
     - The vm1's virt-to-physical PKey mapping will be:
           pkey_idx 0 = 1
           pkey_idx 1 = 0
     - The vm2's virt-to-phys PKey mapping will be:
           pkey_idx 0 = 2
           pkey_idx 1 = 0
       so that the default PKey will reside on the vms at index 1 instead of at index 0.
     The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs will all use different PKeys.
     To partition IPoIB communication using PKeys:
     Step 1. Create a file /etc/opensm/partitions.conf on the host on
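The effect of the mappings described above can be checked with a short Python sketch. It is an illustration only: the dictionary and function names are hypothetical, the index 0 entry of the physical table is assumed to be the default PKey 0xffff, and the Dom0 mapping is assumed to be the identity.

```python
# Physical PKey table configured by OpenSM on both hosts (index -> PKey).
# Assumption: index 0 holds the default PKey 0xffff.
PHYSICAL_PKEYS = {0: 0xFFFF, 1: 0xB000, 2: 0xB030}

# Virtual-to-physical PKey index mappings (virtual index -> physical index).
VIRT_TO_PHYS = {
    "dom0": {0: 0, 1: 1, 2: 2},  # assumption: Dom0 sees the physical table as is
    "vm1": {0: 1, 1: 0},
    "vm2": {0: 2, 1: 0},
}


def ipoib_pkey(domain: str) -> int:
    """Resolve the PKey at virtual index 0, which is the one IPoIB QPs use."""
    physical_index = VIRT_TO_PHYS[domain][0]
    return PHYSICAL_PKEYS[physical_index]


if __name__ == "__main__":
    for domain in ("dom0", "vm1", "vm2"):
        print(f"{domain}: IPoIB PKey {ipoib_pkey(domain):#06x}")
```

Because each domain resolves virtual index 0 to a different physical entry, the three IPoIB QPs end up on three different partitions, which is what isolates them from each other.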
187. ferred directly into and out of SCSI buffers without intermediate data copies. For further information, please refer to Chapter 4.2, "iSCSI Extensions for RDMA (iSER)".
     SRP
     SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 4.1, "SCSI RDMA Protocol" and Appendix B, "SRP Target Driver".
     uDAPL
     User Direct Access Programming Library (uDAPL) is a standard API that promotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative.
     This release of the uDAPL reference implementation package for both DAT 1.2 and 2.0 specification is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack.
     For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org
188. fo, speed_width_check, pkey, aguid
     --skip_plugin <library name>    Skip the load of the given library name. Applicable skip plugins: libibdiagnet_cable_diag_plugin, libibdiagnet_cable_diag_plugin-2.1.1
     --pc                            Reset all the fabric PM counters.
     -P, --counter <<PM>=<value>>    If any of the provided PM counters is greater than its provided value, print it.
     --pm_pause_time <seconds>       Specifies the seconds to wait between the first counters sample and the second counters sample. If the seconds given is 0, then no second counters sample will be done (default=1).
     --ber_test                      Provides a BER test for each port. Calculates the BER for each port and checks that no BER value has exceeded the BER threshold (default threshold = 10^-12).
     --ber_use_data                  Indicates that the BER test will use the received data for calculation.
     --ber_thresh <value>            Specifies the threshold value for the BER test. The reciprocal number of the BER should be provided. Example: for 10^-12, the value needs to be 1000000000000 or 0xe8d4a51000 (10^12). If the threshold given is 0, then all BER values for all ports will be reported.
     --extended_speeds <dev-type>    Collect and test port extended speeds counters. dev-type: sw, all.
     --pm_per_lane                   List all counters per lane (when available).
     --ls <2.5|5|10|14|25|FDR10>     Specifies the expected link speed
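Since the BER threshold is passed as the reciprocal of the desired BER, the value for a threshold of 10^-N is simply 10^N. A minimal Python sketch of that conversion follows; the helper name ber_thresh_value is hypothetical and is not part of ibdiagnet.

```python
def ber_thresh_value(exponent: int) -> int:
    """Reciprocal of a BER threshold of 10**-exponent, as an exact integer."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return 10 ** exponent


if __name__ == "__main__":
    value = ber_thresh_value(12)
    # Print the decimal and hexadecimal spellings of the same threshold value.
    print(value, hex(value))
```

Integer exponentiation is used instead of 1 / 1e-12 so the result is exact rather than a rounded floating-point number.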
189. g
     OSM_CACHE_DIR
     opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
     - guid2lid: stores the LID range assigned to each GUID
     8.2.3 Signaling
     When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.
     Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.
     8.2.4 Running opensm
     The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
     To run opensm in the default mode, simply enter:
         host1# opensm
     Note that opensm needs to be run on at least one machine in an IB subnet.
     By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.
     If a fatal, non-recoverable error occurs, opensm exits.
     8.2.4.1 Running O
190. g Capabilities via ethtool
     To display Time Stamping capabilities via ethtool:
     Show Time Stamping capabilities:
         ethtool -T eth<x>
     Example:
         ethtool -T eth0
         Time stamping parameters for p2p1:
         Capabilities:
             hardware-transmit     (SOF_TIMESTAMPING_TX_HARDWARE)
             software-transmit     (SOF_TIMESTAMPING_TX_SOFTWARE)
             hardware-receive      (SOF_TIMESTAMPING_RX_HARDWARE)
             software-receive      (SOF_TIMESTAMPING_RX_SOFTWARE)
             software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
             hardware-raw-clock    (SOF_TIMESTAMPING_RAW_HARDWARE)
         PTP Hardware Clock: none
         Hardware Transmit Timestamp Modes:
             off                   (HWTSTAMP_TX_OFF)
             on                    (HWTSTAMP_TX_ON)
         Hardware Receive Filter Modes:
             none                  (HWTSTAMP_FILTER_NONE)
             all                   (HWTSTAMP_FILTER_ALL)
     4.6.2 RoCE Time Stamping
     RoCE Time Stamping is currently at beta level. Please be aware that everything listed here is subject to change.
     RoCE Time Stamping allows you to stamp packets when they are sent to the wire or received from the wire. The time stamp is given in raw hardware cycles, but can be easily converted into hardware-referenced, nanoseconds-based time. Additionally, it enables you to query the hardware for the hardware time, and thus stamp other application events and compare times.
     4.6.2.1 Query Capabilities
     Time stamping is available if and only if the hardware reports it is capable of reporting it. To verify whether RoCE Time Stamping is available, run ibv_exp_query_device.
191. g the installation process. If you need to install Linux distributions over FlexBoot, please replace your initrd images with the images found at www.mellanox.com > Products > Adapter IB/VPI SW > FlexBoot (Download Tab).
     A.8.1 Case 1: InfiniBand Ports
     The IB driver requires loading the following modules in the specified order (see Section A.8.1.1 for an example):
     - ib_addr.ko
     - ib_core.ko
     - ib_mad.ko
     - ib_sa.ko
     - ib_cm.ko
     - ib_uverbs.ko
     - ib_ucm.ko
     - ib_umad.ko
     - iw_cm.ko
     - rdma_cm.ko
     - rdma_ucm.ko
     - mlx4_core.ko
     - mlx4_ib.ko
     - ib_mthca.ko
     - ipoib_helper.ko (this module is not required for all OS kernels. Please check the release notes)
     - ib_ipoib.ko
     A.8.1.1 Example: Adding an IB Driver to initrd (Linux)
     Prerequisites
     1. The FlexBoot image is already programmed on the HCA card.
     2. The DHCP server is installed and configured as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine.
     3. An initrd file.
     4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.
     Adding the IB Driver to the initrd File
     executed by users with expertise in the boot process. Improper application of this pro
     The fol
192. g the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
     - Multiple transport support, including RC and UD
     - Proper management of HCA resources and memory structures
     - Efficient memory registration
     - One-sided communication semantics
     - Connection management
     - Receive side tag matching
     - Intra-node shared memory communication
     These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.
     The latest MXM software can be downloaded from the Mellanox website.
     MLNX_OFED v2.0 or later comes with a pre-installed version of MXM v2.x and OpenMPI compiled with MXM v2.x.
     Compiling OpenMPI with MXM
     Step 1. Install MXM from RPM:
         % rpm -ihv mxm-x.y.z-1.x86_64.rpm
     MXM will be installed automatically in the /opt/mellanox/mxm folder.
     Step 2. Enter the OpenMPI source directory and run:
         % cd $OMPI_HOME
         % ./configure --with-mxm=/opt/mellanox/mxm <... other configure parameters>
         % make all && make install
     To check the version of MXM installed on your host, run:
         % rpm -qi mxm
     To upgrade MLNX_OFED v2.0 or later with a newer
193. ged packets sent by the VF are silently dropped. The default behavior is VGT.
     The feature may be controlled on the Hypervisor from userspace via iproute2/netlink:
         ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]
             ...
             [ vf NUM [ mac LLADDR ]
                      [ vlan VLANID [ qos VLAN-QOS ] ]
                      ...
                      [ spoofchk { on | off } ] ]
             ...
     Use:
         ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]
     where NUM = 0..max-vf-num, vlan_id = 0..4095 (4095 means "set VGT"), and qos = 0..7.
     For example:
     - ip link set dev eth2 vf 2 vlan <vlan_id> qos 3 sets VST mode for VF #2 belonging to PF eth2, with qos = 3
     - ip link set dev eth2 vf 2 vlan 4095 sets the mode for VF #2 back to VGT
     4.13.7.3.2 Additional Ethernet VF Configuration Options
     Guest MAC configuration
     By default, guest MAC addresses are configured to be all zeroes. In the mlnx_ofed guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, he/she should configure guest MACs before the guest driver comes up.
     The guest MAC may be configured by using:
         ip link set dev <PF device> vf <NUM> mac <LLADDR>
     For legacy guests, which do not generate random MACs, the administrator should always configure their MAC addresses via ip link, as above.
     Spoof checking
     Spoof checking is currently available only on upstream kernels newer than 3.1.
         ip link set dev <PF device> vf <NUM> spoofchk [on | off]
     4
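The value ranges given above (vlan_id 0..4095, qos 0..7) can be enforced before an ip link command is issued. The following Python sketch only builds the command string and does not execute anything; the function name vf_vlan_command is hypothetical, and the VLAN id 10 in the usage lines is an arbitrary example value.

```python
from typing import Optional


def vf_vlan_command(pf_device: str, vf: int, vlan_id: int,
                    qos: Optional[int] = None) -> str:
    """Build an 'ip link set ... vf ... vlan ...' command string.

    vlan_id 4095 switches the VF back to VGT mode, per the text above.
    """
    if vf < 0:
        raise ValueError("VF number must be non-negative")
    if not 0 <= vlan_id <= 4095:
        raise ValueError("vlan_id must be in 0..4095")
    command = f"ip link set dev {pf_device} vf {vf} vlan {vlan_id}"
    if qos is not None:
        if not 0 <= qos <= 7:
            raise ValueError("qos must be in 0..7")
        command += f" qos {qos}"
    return command


if __name__ == "__main__":
    print(vf_vlan_command("eth2", 2, 10, qos=3))  # VST with an example VLAN id
    print(vf_vlan_command("eth2", 2, 4095))       # back to VGT
```

Keeping validation separate from execution makes it easy to review the exact command before running it with administrator privileges.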
194. gister to a shared MR
     A new verb called ibv_reg_shared_mr is added to enable sharing an MR. To use this verb, the application supplies the MR ID that it wants to register for and the desired access mode to that MR. The desired access is validated against its given permissions, and upon successful creation, the physical pages of the original MR are shared by the new MR. Once the MR is shared, it can be used even if the original MR was destroyed.
     The request to share the MR can be repeated multiple times, and an arbitrary number of Memory Regions can potentially share the same physical memory locations.
     Usage:
     - Uses the "handle" field that was returned from ibv_reg_mr as the mr_handle
     - Supplies the desired "access mode" for that MR
     - Supplies the address field, which can be either NULL or any hint as the required output. The address and its length are returned as part of the ibv_mr struct.
       To achieve high performance, it is highly recommended to supply an address that is aligned as the original memory region address. Generally, it may be an alignment to a 4M address.
     For further information on how to use the ibv_reg_shared_mr verb, please refer to the ibv_reg_shared_mr man page and/or to the ibv_shared_mr sample program, which demonstrates a basic usage of this verb. Further information on the ibv_shared_mr sample program can be found in the ibv_shared_mr man page.
     4.11 XRC - eXtended Reliable Connected Transport Service for InfiniBand
195. > - specifies SL for this IPoIB MC group (default is 0)
     scope=<val> - specifies scope for this IPoIB MC group (default is 2 (link local))
     Note that values for rate, MTU, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048). To use 4K MTU, edit that entry to "mtu=5" (5 indicates 4K MTU to that specific partition).
     PortGUIDs list:
     - PortGUID: GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
     - full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.
     There are two useful keywords for PortGUID definition:
     - 'ALL' means all end ports in this subnet
     - 'SELF' means subnet manager's port
     An empty list means that there are no ports in this partition.
     Notes:
     - White space is permitted between delimiters.
     - The line can be wrapped after ':' after a Partition Definition and between.
     - A PartitionName does not need to be unique, but PKey does need to be unique. If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also next note).
     - It is possible to split a partition configuration in more than one definition, but then the PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions).
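The mtu values in a partition definition are encoded codes, not byte counts. As a hedged illustration, the Python sketch below maps byte sizes to codes: the entries for 2048 and 4096 come from the text above, while the entries for 256, 512 and 1024 are taken from the IBTA MTU enumeration and are not stated in this manual. The names IB_MTU_BYTES and mtu_code are hypothetical.

```python
# MTU code -> MTU size in bytes. Codes 4 and 5 are given in the text;
# codes 1-3 follow the IBTA enumeration (not stated in this manual).
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def mtu_code(mtu_bytes: int) -> int:
    """Return the code to write as mtu=<code> for a given MTU in bytes."""
    for code, size in IB_MTU_BYTES.items():
        if size == mtu_bytes:
            return code
    raise ValueError(f"{mtu_bytes} is not a valid InfiniBand MTU size")


if __name__ == "__main__":
    for size in (2048, 4096):
        print(f"{size} bytes -> mtu={mtu_code(size)}")
```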
196. h the hardware and with the kernel and user space. The application level also shows the versatility of markets that Mellanox OFED applies to.
     Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards (figure not reproduced)
     The following sub-sections briefly describe the various components of the Mellanox OFED stack.
     1.3.1 mlx4 VPI Driver
     mlx4 is the low-level driver implementation for the ConnectX family adapters designed by Mellanox Technologies. ConnectX family adapters can operate as an InfiniBand adapter or as an Ethernet NIC. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into the following modules:
     mlx4_core
     Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.
     mlx4_ib
     Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
197. h0 -u 1,5 maps sk_prio 0 of the eth0 device to UP 1 and sk_prio 1 to UP 5.
     Setting set_egress_map in VLAN maps the skb_priority of the VLAN to a vlan_qos. The vlan_qos represents a UP for the VLAN device.
     In RoCE, rdma_set_option with RDMA_OPTION_ID_TOS could be used to set the UP.
     When creating QPs, the sl field in the modify command represents the UP.
     Indicating the TC
     After mapping the skb_priority to a UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that, mlnx_qos should be used. mlnx_qos gets a list of mappings between UPs and TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0 and UPs 4-7 to TC1.
     4.5.7 Quality of Service Properties
     The different QoS properties that can be assigned to a TC are:
     - Strict Priority (see "Strict Priority")
     - Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
     - Rate Limit (see "Rate Limit")
     4.5.7.1 Strict Priority
     When setting a TC's transmission algorithm to be 'strict', this TC has absolute (strict) priority over other TC strict priorities coming before it (as determined by the TC number: TC 7 is the highest priority, TC 0 is the lowest). It also has an absolute priority over non-strict TCs (ETS).
     This property needs to be used with care, as it may easily cause starvation of other TCs.
     A higher strict priority
198. has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
     An IPoIB configuration can be based on DHCP (Section 4.3.3.1) or on a static configuration (Section 4.3.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3.3).
     4.3.3.1 IPoIB Configuration Based on DHCP
     Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:
     For RedHat:
         BOOTPROTO=dhcp
     For SLES:
         BOOTPROTO='dhcp'
     If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
         /etc/sysconfig/network-scripts/ on a RedHat machine
         /etc/sysconfig/network/ on a SuSE machine
     A patch for DHCP is required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory.
     Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a c
199. hat would be otherwise illegal, i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it.
     In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
     Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch. For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop because they cannot be used to construct a loop encir
200. hcp
         host1# touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
         host1# cp /bin/uname /tmp/initrd_ib/bin
         host1# cp /usr/bin/expr /tmp/initrd_ib/bin
         host1# cp /sbin/ifconfig /tmp/initrd_ib/bin
         host1# cp /bin/hostname /tmp/initrd_ib/bin
     Create a configuration file for the DHCP client (as described in Section 4.3.3.1) and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):
         dhclient.conf:
         # The value indicates a hexadecimal number
         # For a ConnectX device
         interface "ib0" {
         send dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
         }
     Step 10. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded.
     The order of the following commands (for loading modules) is critical.
         echo "loading ipv6"
         /sbin/insmod /lib/modules/ipv6.ko
         echo "loading IB driver"
         /sbin/insmod /lib/modules/ib/ib_addr.ko
         /sbin/insmod /lib/modules/ib/ib_core.ko
         /sbin/insmod /lib/modules/ib/ib_mad.ko
         /sbin/insmod /lib/modules/ib/ib_sa.ko
         /sbin/insmod /lib/modules/ib/ib_cm.ko
         /sbin/insmod /lib/modules/ib/ib_uverbs.ko
         /sbin/insmod /lib/modules/ib/ib_ucm.ko
         /sbin/insmod /lib/modules/ib/ib_umad.ko
         /sbin/insmod /lib/modules/ib/iw_cm.ko
         /sbin/insmod /lib/modules/ib/rdma_cm.ko
         /sbin/insmod
201. he -D option. A bit set/clear in the flags enables/disables a specific log level as follows:
         BIT     LOG LEVEL ENABLED
         ----    -----------------
         0x01    ERROR (error messages)
         0x02    INFO (basic messages, low volume)
         0x04    VERBOSE (interesting stuff, moderate volume)
         0x08    DEBUG (diagnostic, high volume)
         0x10    FUNCS (function entry/exit, very high volume)
         0x20    FRAMES (dumps all SMP and GMP frames)
         0x40    ROUTING (dump FDB routing information)
         0x80    currently unused
     Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
     --debug, -d <number>
     This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:
         OPT     Description
         ---     -----------
         -d0     Ignore other SM nodes
         -d1     Force single threaded dispatching
         -d2     Force log flushing after each log message
         -d3     Disable multicast support
         -d10    Put OpenSM in testability mode
     Without -d, no debug options are enabled.
     --help, -h
     Display this usage info, then exit.
     8.2.2 Environment Variables
     The following environment variables control opensm behavior:
     OSM_TMP_DIR
     Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/lo
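The log flags passed with -D form a bitmask, so a value is built by OR-ing the bits of the wanted levels. The Python sketch below composes and decodes such a value using the bit table above. It is an illustration only: the names LOG_FLAGS, encode_log_flags and decode_log_flags are hypothetical and are not part of OpenSM.

```python
# Bit -> log level name, copied from the table above (0x80 is unused).
LOG_FLAGS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def encode_log_flags(names):
    """OR together the bits of the named log levels."""
    by_name = {name: bit for bit, name in LOG_FLAGS.items()}
    flags = 0
    for name in names:
        flags |= by_name[name]
    return flags


def decode_log_flags(flags):
    """List the log levels enabled in a flags value, lowest bit first."""
    return [name for bit, name in sorted(LOG_FLAGS.items()) if flags & bit]


if __name__ == "__main__":
    default = encode_log_flags(["ERROR", "INFO"])
    print(hex(default), decode_log_flags(default))
```

For instance, adding DEBUG output to the default levels means passing the OR of 0x01, 0x02 and 0x08.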
202. he current QoS mappings and configuration. The tool will also display maps configured by the TC and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
    - Set UP to TC mapping
    - Assign a transmission algorithm to each TC (strict or ETS)
    - Set minimal BW guarantee to ETS TCs
    - Set rate limit to TCs
For unlimited ratelimit, set the ratelimit to 0.
Usage:
    mlnx_qos -i <interface> [options]
Options:
    --version                  show program's version number and exit
    -h, --help                 show this help message and exit
    -p LIST, --prio_tc=LIST    maps UPs to TCs. LIST is 8 comma separated TC numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1
    -s LIST, --tsa=LIST        Transmission algorithm for each TC. LIST is comma separated algorithm names for each TC. Possible algorithms: strict, ets. Example: ets,strict,ets sets TC0,TC2 to ETS and TC1 to strict. The rest are unchanged.
    -t LIST, --tcbw=LIST       Set minimal guaranteed %BW for ETS TCs. LIST is comma separated percents for each TC. Values set to TCs that are not configured to ETS algorithm are ignored, but must be present. Example: if TC0,TC2 are set to ETS, then 10,0,90 will set TC0 to 10% and TC2 to 90%. Percents must sum to 100.
    -r LIST, --ratelimit=LIST  Rate limit for TCs (in Gbps). LIST is a comma separated Gbps limit for each TC. Example: 1,8,8 will limit TC0 to 1Gbps, and TC1,TC2 to 8 Gbps
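The list arguments above follow simple positional rules that can be checked before invoking the tool. The sketch below, in Python, only restates the examples from the help text; the function names are hypothetical, and it assumes that "Percents must sum to 100" applies to the ETS entries only, which the help text does not state explicitly.

```python
def group_ups_by_tc(prio_tc):
    """Turn an 8-entry UP-to-TC list into {tc: [ups]} (hypothetical helper)."""
    if len(prio_tc) != 8:
        raise ValueError("expected exactly 8 TC numbers, one per UP")
    groups = {}
    for up, tc in enumerate(prio_tc):
        groups.setdefault(tc, []).append(up)
    return groups


def ets_bandwidth(tsa, tcbw):
    """Return {tc: percent} for the ETS TCs, ignoring non-ETS entries.

    Assumption: only the ETS entries have to add up to 100.
    """
    if len(tsa) != len(tcbw):
        raise ValueError("tsa and tcbw need one entry per TC")
    shares = {tc: bw for tc, (alg, bw) in enumerate(zip(tsa, tcbw)) if alg == "ets"}
    if sum(shares.values()) != 100:
        raise ValueError("ETS percents must sum to 100")
    return shares


if __name__ == "__main__":
    print(group_ups_by_tc([0, 0, 0, 0, 1, 1, 1, 1]))
    print(ets_bandwidth(["ets", "strict", "ets"], [10, 0, 90]))
```

The two calls in the main block mirror the help-text examples: UPs 0-3 land on TC0 and UPs 4-7 on TC1, and TC0 and TC2 receive 10% and 90% while the value for the strict TC1 is ignored.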
203. he multicast groups, their properties, and member host ports.
ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option.
In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.
After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
    - SM report
    - Number of nodes and systems
    - Hop-count information: maximal hop-count, an example path, and a hop-count histogram
    - All CA-to-CA paths traced
    - Credit loop report
    - mgid-mlid-HCAs multicast group and report
    - Partitions report
    - IPoIB report
Note: In case the IB fabric includes only one CA, CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.
Error Codes
    Failed to fully discover the fabric
    Failed to parse command lin
204. he system/network administrators.
The following is the general flow for mapping traffic to Traffic Classes:
    1. The application sets the required Type of Service (ToS).
    2. The ToS is translated into a Socket Priority (sk_prio).
    3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).
    4. The UP is mapped to TC by the network/system administrator.
    5. TCs hold the actual QoS parameters.
QoS can be applied on the following types of traffic. However, the general QoS flow may vary among them:
    - Plain Ethernet: Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver
    - RoCE: Applications use the RDMA API to transmit using QPs
    - Raw Ethernet QP: Applications use the VERBs API to transmit using a Raw Ethernet QP
4.5.3 Plain Ethernet Quality of Service Mapping
Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver. The following is the Plain Ethernet QoS mapping flow:
    1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
    2. ToS is translated into the sk_prio using a fixed translation:
           TOS 0  <=> sk_prio 0
           TOS 8  <=> sk_prio 2
           TOS 24 <=> sk_prio 4
           TOS 16 <=> sk_prio 6
    3. The Socket Priority is mapped to the UP:
       If the underlying device is a VLAN device, egress_map is used, controlled by the vconfig command. This i
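Steps 1 and 2 of the Plain Ethernet flow can be sketched in a few lines of Python. The table below copies the fixed ToS to sk_prio translation listed above; the function names are hypothetical, and the socket call relies on the standard IP_TOS socket option, which is assumed to be available on the host (as it is on Linux).

```python
import socket

# Fixed ToS -> sk_prio translation as listed in the mapping flow.
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def expected_sk_prio(tos):
    """Return the socket priority that the listed translation gives for a ToS."""
    return TOS_TO_SK_PRIO[tos]


def open_udp_socket_with_tos(tos):
    """Create a UDP socket and set its ToS (step 1 of the flow).

    Hypothetical helper: the caller owns the socket and must close it.
    """
    if tos not in TOS_TO_SK_PRIO:
        raise ValueError("ToS value is not part of the listed translation")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return sock


if __name__ == "__main__":
    for tos in sorted(TOS_TO_SK_PRIO):
        print("TOS", tos, "-> sk_prio", expected_sk_prio(tos))
```

Only the ToS values from the list are accepted here; how other ToS values translate is not covered by this section, so the sketch rejects them rather than guessing.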
205. hernet" on page 66 and its subsections
    Section 4.11, "XRC - eXtended Reliable Connected Transport Service for InfiniBand", on page 86
    Section 4.13.7, "Configuring Pkeys and GUIDs under SR-IOV", on page 97 and its subsections
    Section 4.15, "Ethtool", on page 103
    Appendix E, "Lustre Compilation over MLNX_OFED", page 240
2.0-2.0.5    April 2013    Initial release
About this Manual
This Preface provides general information concerning the scope and organization of this User's Manual.
Intended Audience
This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.
Common Abbreviations and Acronyms
Table 2 - Abbreviations and Acronyms (Sheet 1 of 2)
    Abbreviation / Acronym    Whole Word / Description
    B       (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
    b       (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
    FW      Firmware
    HCA     Host Channel Adapter
    HW      Hardware
    IB      InfiniBand
    iSER    iSCSI RDMA Protocol
    LSB     Least significant byte
    lsb     Least significant bit
    MSB     Most significant byte
    msb     Most signifi
206. hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.
Use the '-R dor' option to activate the DOR algorithm.
Torus-2QoS Routing Algorithm
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:
    - Routing that is free of credit loops
    - Two levels of QoS, assuming switches support 8 data VLs
    - Ability to route around a single failed switch and/or multiple failed links without introducing credit loops or changing path SL values
    - Very short run times, with good scaling properties as fabric size increases
Unicast Routing
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It
207. iagpath queries for the performance counters along the path between the source and destination ports; it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error. Moreover, the tool allows omitting the source node in LID-route addressing, in which case the local port on the machine running the tool is assumed to be the source.
Synopsis
    ibdiagpath {-n <[src-name,]dst-name> | -l <[src-lid,]dst-lid> | -d <p1,p2,p3,...>}
               [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>]
               [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>]
               [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]
Options
    -n <[src-name,]dst-name>   Names of the source and destination ports (as defined in the topology file; source may be omitted, in which case the local port is assumed to be the source)
    -l <[src-lid,]dst-lid>     Source and destination LIDs (source may be omitted, in which case the local port is assumed to be the source)
    -d <p1,p2,p3,...>          Directed route from the local node (which is the source) and the destination node
    -c <count>                 The minimal number of packets to be sent across each link (default = 100)
    -v                         Enable verbose mode
    -t <topo-file>             Specifies the topology file name
    -s <sys-name>              Specifies the loca
208. ign and terminated by EOL.
    - Any keyword should be the first non-blank in the line, unless it's a comment.
    - Keywords that denote section/subsection start have matching closing keywords.
    - Having a QoS Level named "DEFAULT" is a must; it is applied to PR/MPR requests that didn't match any of the matching rules.
    - Any section/subsection of the policy file is optional.
Examples of Advanced Policy File
As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here's an example of the shortest policy file:
    qos-levels
        qos-level
            name: DEFAULT
            sl: 0
        end-qos-level
    end-qos-levels
The port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, and there is no need to define them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.
The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the example; they explain the different keywords and their meaning.
    port-groups
        port-group # using port GUIDs
            name: Storage
            # "use" is just a description that is used for logging
            # Other than that, it is just a comment
            use: SRP Targets
            port-guid: 0x100000000000
209. imestamp timestamp wc_ex timestamp
Timestamp is given in raw hardware time.
Note: CQs that are opened with the ibv_create_cq_ex verb should always be polled with the ibv_poll_cq_ex verb.
4.6.2.4 Querying the Hardware Time
Querying the hardware for time is done via the ibv_query_values_ex verb. For example:
    ret = ibv_query_values_ex(context, IBV_VALUES_HW_CLOCK, &queried_values);
    if (!ret && (queried_values.comp_mask & IBV_VALUES_HW_CLOCK))
        queried_time = queried_values.hwclock;
To query the time in nanosecond resolution, use the IBV_VALUES_HW_CLOCK_NS flag along with the hwclock_ns field:
    ret = ibv_query_values_ex(context, IBV_VALUES_HW_CLOCK_NS, &queried_values);
    if (!ret && (queried_values.comp_mask & IBV_VALUES_HW_CLOCK_NS))
        queried_time_ns = queried_values.hwclock_ns;
Querying the Hardware Time is available only on physical functions / native machines.
4.7 Atomic Operations
Atomic Operations are applicable to the mlx4 driver only.
4.7.1 Enhanced Atomic Operations
ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations are the same as those for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).
4.7.1.1 Mask
210. ing support. RFS cannot function if LRO is enabled. LRO can be disabled via ethtool. of the rest.
The lowest priority domain serves the following users:
    - The mlx4 Ethernet driver attaches its unicast and multicast MAC addresses to its QP using L2 flow specifications
    - The mlx4 ipoib driver, when it attaches its QP to its configured GIDs
Fragmented UDP traffic cannot be steered. It is treated as 'other' protocol by hardware (from the first packet) and not considered as UDP traffic.
Note: We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes.
4.13 Single Root IO Virtualization (SR-IOV)
Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. In ConnectX-3 adapter cards, Mellanox adapters are capable of exposing 63 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals those of the Physical Function.
SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual
211. inipath-psm            ########################################
    Preparing...            ########################################
    infinipath-psm-devel    ########################################
    Preparing...            ########################################
    mvapich2                ########################################
    Preparing...            ########################################
    openshmem               ########################################
    Preparing...            ########################################
    hcoll                   ########################################
    Preparing...            ########################################
    libibprof               ########################################
    Preparing...            ########################################
    libvma                  ########################################
    Changing max locked memory to unlimited (in /etc/security/limits.conf)
    Please log out from the shell and login again in order to update this change
    Read more about this topic in the VMA's User Manual
    - VMA README.txt is installed at: /usr/share/doc/libvma-6.5.8-0/README.txt
    - Please refer to VMA journal for the latest changes: /usr/share/doc/libvma-6.5.8-0/journal.txt
    Preparing...            ########################################
212. integrated PCI express controller. Thus every PCIe adapter is connected directly to a NUMA node. On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected. In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.
To see if your system supports PCIe adapter's NUMA node detection:
    # cat /sys/class/net/[interface]/device/numa_node
    # cat /sys/devices/[PCI root]/[PCIe function]/numa_node
Example for supported system:
    # cat /sys/class/net/eth3/device/numa_node
    0
Example for unsupported system:
    # cat /sys/class/net/ib0/device/numa_node
    -1
7.2.6.1.1 Improving Application Performance on Remote NUMA Node
Verbs API applications that mostly use polling will see a performance impact when using the remote NUMA node. libmlx4 has a built-in enhancement that recognizes an application that is pinned to a remote NUMA node and activates a flow that improves the out-of-the-box latency and throughput. However, the NUMA node recognition must be enabled as described in section "Tuning for Intel Sandy Bridge Platform" on page 132.
In systems which do not support SLIT, the following environment variable should be applied:
    MLX4_LOCAL_CPUS=0x[bit mask of local NUMA node]
Example for a local NUMA node whose cores are 0-7:
    MLX4_LOCAL_CPUS=0xff
Additional modification can
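The MLX4_LOCAL_CPUS value is a bit mask with one bit per core of the local NUMA node. As a small illustration, the Python sketch below derives the mask from a list of core numbers and interprets the content of a numa_node file; both function names are hypothetical helpers, not part of libmlx4.

```python
def local_cpus_mask(cores):
    """Build the hexadecimal bit mask for a set of core numbers."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return hex(mask)


def numa_detection_supported(numa_node_text):
    """Interpret the content of a numa_node sysfs file.

    A value of -1 corresponds to the "unsupported system" example.
    """
    return int(numa_node_text.strip()) >= 0


if __name__ == "__main__":
    # Cores 0-7, as in the example for MLX4_LOCAL_CPUS.
    print("MLX4_LOCAL_CPUS=" + local_cpus_mask(range(8)))
```

A node that owns cores 8-15 instead would give local_cpus_mask(range(8, 16)), that is, 0xff00.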
213. inuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
    initrd /initrd-2.6.32-36.x86-64.img
Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.
Step 5 Install the MLNX_OFED driver for Linux that supports SR-IOV.
Step 6 Verify the HCA is configured to support SR-IOV.
    [root@selene ~]# mstflint -dev <PCI Device> dc
Verify in the [HCA] section the following fields appear:
    [HCA]
    num_pfs = 1
    total_vfs = <0-63>
    sriov_en = true
HCA parameters can be configured during firmware update using the mlnxofedinstall script and running the '--enable-sriov' installation parameter. If the current firmware version is the same as one provided with MLNX_OFED, run it in combination with the '--force-fw-update' parameter. This configuration option is supported only in HCAs whose configuration file (INI) is included in MLNX_OFED.
    Parameter    Recommended Value
    num_pfs      1 (Note: This field is optional and might not always appear.)
    total_vfs    63
    sriov_en     true
If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.com
Step 7 Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist, otherwise delete its contents.
Step 8 Insert an option line in the /etc/modprobe.d/mlx4_co
214. ion directly into Sync packets. In this case, transmitted Sync packets will not receive a time stamp via the socket error queue.
    HWTSTAMP_TX_ONESTEP_SYNC,
Note: for send side time stamping, currently only HWTSTAMP_TX_OFF and HWTSTAMP_TX_ON are supported.
Receive side time sampling: Enabled by ifreq.hwtstamp_config.rx_filter when:
    /* possible values for hwtstamp_config->rx_filter */
    enum hwtstamp_rx_filters {
        /* time stamp no incoming packet at all */
        HWTSTAMP_FILTER_NONE,
        /* time stamp any incoming packet */
        HWTSTAMP_FILTER_ALL,
        /* return value: time stamp all packets requested plus some others */
        HWTSTAMP_FILTER_SOME,
        /* PTP v1, UDP, any kind of event packet */
        HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
        /* PTP v1, UDP, Sync packet */
        HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
        /* PTP v1, UDP, Delay_req packet */
        HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
        /* PTP v2, UDP, any kind of event packet */
        HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
        /* PTP v2, UDP, Sync packet */
        HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
        /* PTP v2, UDP, Delay_req packet */
        HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
        /* 802.AS1, Ethernet, any kind of event packet */
        HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
        /* 802.AS1, Ethernet, Sync packet */
        HWTSTAMP_FILTER_PTP_V2_L2_SYNC,
        /* 802.AS1, Ethernet, Delay_req packet */
        HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ,
        /* PTP v2/802.AS1, a
215. ion ROM image.
In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
    -I- Querying device
    -E- Can't auto detect fw configuration file
2.3.5 Post-installation Notes
Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.
The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.
2.3.6 Installation Logging
While installing MLNX_OFED, the install log for each selected package will be saved in a separate log file. The path to the directory containing the log files will be displayed after running the installation script in the following format: "Logs dir: /tmp/MLNX_OFED_LINUX-<version>.<PID>.logs".
Example:
    Logs dir: /tmp/MLNX_OFED_LINUX-2.1-0.0.2.31701.logs
2.4 Updating Firmware After Installation
In case you ran the mlnxofedinstall script with the '--without-fw-update' option and now you wish to (manually) update firmware on your adapter card(s), you need to perform the following steps:
Note: If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image" on page 221.
The following steps
216. ious example of the shortest policy file, and it is also equivalent to not having a policy file at all.
Below is an example of a simple QoS policy with all the possible keywords:
    qos-ulps
        default                               : 0 # default SL
        sdp, port-num 30000                   : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
        sdp, port-num 10000-20000             : 0
        sdp                                   : 1 # default SL for any other application running on top of SDP
        rds                                   : 2 # SL for RDS traffic
        ipoib, pkey 0x0001                    : 0 # SL for IPoIB on partition with pkey 0x0001
        ipoib                                 : 4 # default IPoIB partition, pkey=0x7FFF
        any, service-id 0x6234                : 6 # match any PR/MPR query with a specific Service ID
        any, pkey 0x0ABC                      : 6 # match any PR/MPR query with a specific PKey
        srp, target-port-guid 0x1234          : 5 # SRP when SRP Target is located on a specified IB port GUID
        any, target-port-guid 0x0ABC-0xFFFFF  : 6 # match any PR/MPR query with a specific target port GUID
    end-qos-ulps
Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query didn't match any other rule.
All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query
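The matching order described above (rules tried in order of appearance, first match wins, "default" used only when nothing else matched) can be illustrated with a short Python sketch. This is not OpenSM's implementation: the rule table is a simplified, illustrative subset of the example policy, the query is modelled as a plain dictionary, and treating the bare "ipoib" and "sdp" entries as catch-alls for their ULP is an assumption made for brevity.

```python
DEFAULT_SL = 0  # the "default" rule, applied only when no other rule matches

# (ulp, field, (low, high) or None, sl), in order of appearance.
RULES = [
    ("sdp", "port_num", (30000, 30000), 0),
    ("sdp", None, None, 1),
    ("ipoib", "pkey", (0x0001, 0x0001), 0),
    ("ipoib", None, None, 4),
    ("any", "service_id", (0x6234, 0x6234), 6),
    ("any", "pkey", (0x0ABC, 0x0ABC), 6),
    ("srp", "target_port_guid", (0x1234, 0x1234), 5),
    ("any", "target_port_guid", (0x0ABC, 0xFFFFF), 6),
]


def select_sl(query):
    """Return the SL of the first rule that matches the query dictionary."""
    for ulp, field, value_range, sl in RULES:
        if ulp != "any" and query.get("ulp") != ulp:
            continue
        if field is not None:
            value = query.get(field)
            if value is None or not value_range[0] <= value <= value_range[1]:
                continue
        return sl
    return DEFAULT_SL


if __name__ == "__main__":
    print(select_sl({"ulp": "sdp", "port_num": 30000}))
    print(select_sl({"ulp": "sdp", "port_num": 5000}))
```

Because the specific SDP rule appears before the generic one, a query for port 30000 stops at the first entry, while any other SDP query falls through to the generic SDP rule.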
217. istration activities traversing the IB connectivity only.
    Local Identifier (LID): An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
    Local Device/Node: The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.
Table 3 - Glossary (Sheet 2 of 2)
    Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
    Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
    Multicast Forwarding Tables: A table that exists in every switch providing the list of ports to forward received multicast packets. The table is organized by MLID.
    Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
    Standby Subnet Manager: A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
    Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
    Subnet Manager (SM): One of several entities involved in the configuration and control of the an
218. ive Routing configuration file is case sensitive.
You can specify options for a nonexistent switch GUID. These options will be ignored until a switch with a matching GUID is added to the fabric.
The Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is executed at every heavy sweep of the Subnet Manager.
If the AR Manager fails to parse the options file, default settings for all the options will be used.
8.8.5.1 General AR Manager Options
Table 20 - Adaptive Routing Manager Options File
    Option File: ENABLE: <true|false>
    Description: Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as a device that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was updated to support AR, the AR Manager will need to be restarted (by restarting Subnet Manager) to allow it to configure AR on this switch. This option can be changed on-the-fly.
    Values: Default: true

    Option File: AR_MODE: <bounded|free>
    Description: Adaptive Routing Mode. free: no constraints on output port selection. bounded: the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets. This option can be changed on-the-fly.
    Values: Default: bounded

    Option File: AGEING_TIME: <usec>
    Description: Applicable to
219. l mft mlnx any RPM
    Preparing...            ########################################
    kmod-kernel-mft-mlnx    ########################################
    Installing knem-mlnx RPM
    Preparing...            ########################################
    knem-mlnx               ########################################
    Installing kmod-knem-mlnx-1.1.90mlnx2 RPM
    Preparing...            ########################################
    kmod-knem-mlnx          ########################################
    Installing ummunotify-mlnx RPM
    Preparing...            ########################################
    ummunotify-mlnx         ########################################
    Installing kmod-ummunotify-mlnx-1.0 RPM
    Preparing...            ########################################
    kmod-ummunotify-mlnx    ########################################
    Installing mpi-selector RPM
    Preparing...            ########################################
    mpi-selector            ########################################
    Installing user level RPMs
220. l system name. Meaningful only if a topology file is specified.
    -i <dev-index>      Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
    -p <port-num>       Specifies the local device's port number used to connect to the IB fabric
    -o <out-dir>        Specifies the directory where the output files will be placed (default = /tmp)
    -lw <1x|4x|12x>     Specifies the expected link width
    -ls <2.5|5|10>      Specifies the expected link speed
    -pm                 Dump all the fabric links, pm Counters into ibdiagnet.pm
    -pc                 Reset all the fabric links pmCounters
    -P <PM=<Trash>>     If any of the provided pm is greater than its provided value, print it to screen
    -h|--help           Prints the help page information
    -V|--version        Prints the version of the tool
    --vars              Prints the tool's environment variables and their values
Output Files
Table 28 - ibdiagpath Output Files
    Output File       Description
    ibdiagpath.log    A dump of all the application reports generated according to the provided flags
    ibdiagnet.pm      A dump of the Performance Counters values of the fabric links
Error Codes
    1 - The path traced is un-healthy
    2 - Failed to parse command line options
    3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the
221. lelism is fixed at program startup time, typically with a single thread of execution per processor.
In order to express parallelism, UPC extends ISO C 99 with the following constructs:
    - An explicitly parallel execution model
    - A shared address space
    - Synchronization primitives and a memory consistency model
    - Memory management primitives
The UPC language evolved from experiences with three other earlier languages that proposed parallel extensions to ISO C 99: AC, Split-C, and Parallel C Preprocessor (PCP). UPC is not a superset of these three languages, but rather an attempt to distill the best characteristics of each. UPC combines the programmability advantages of the shared memory programming paradigm and the control over data layout and performance of the message passing programming paradigm.
Mellanox ScalableUPC is based on the Berkeley UPC package (see http://upc.lbl.gov) and contains the following enhancements:
    - GasNet library used within UPC, integrated with Mellanox FCA, which offloads collective operations from UPC. For further information on FCA, please refer to the Mellanox website.
    - GasNet library contains an MXM conduit, which offloads from UPC all P2P operations as well as some synchronization routines. For further information on MXM, please refer to the Mellanox website.
Mellanox OFED 1.8 includes ScalableUPC 2.1, which is installed under /opt/mellanox/bupc.
If you have installed OFED 1.8, you do not need to downlo
222. lib/modules/ib/rdma_ucm.ko
    /sbin/insmod /lib/modules/ib/mlx4_core.ko
    /sbin/insmod /lib/modules/ib/mlx4_ib.ko
    /sbin/insmod /lib/modules/ib/ib_mthca.ko
The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release notes.
    /sbin/insmod /lib/modules/ib/ipoib_helper.ko
    /sbin/insmod /lib/modules/ib/ib_ipoib.ko
Step 11 In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
    /sbin/insmod /lib/modules/ib/ib_ipoib.ko lro=0
Step 12 Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example:
    /sbin/dhclient -cf /sbin/dhclient.conf ib1
Step 13 Save the init file.
Step 14 Close initrd.
    host1$ cd /tmp/initrd_ib
    host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
    host1$ gzip /tmp/new_initrd_ib.img
Step 15 At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_initrd_ib.img.gz. Copy it to the original initrd location and rename it properly.
A.8.2 Case II: Ethernet Ports
The Ethernet driver requires loading the following modules in the specified order (see the example below):
    - mlx4_core.ko
    - mlx4_en.ko
A.8.2.1 Example: Adding an Etherne
223. lient identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier; see Section A.4.2, "Configuring the DHCP Server", on page 222.
4.3.3.1.1 DHCP Server
In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of the Mellanox OFED for Linux installation.
The DHCP server must run on a machine which has loaded the IPoIB module.
To run the DHCP server from the command line, enter:
    dhcpd <IB network interface name> -d
Example:
    host1# dhcpd ib0 -d
4.3.3.1.2 DHCP Client (Optional)
A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux
224. line keywords apply to a new seed specification.
For maximum resiliency, no seed specification should share a switch with any other seed specification. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS can grant path SL values that are constant, regardless of which seed was used to initiate topology discovery.
portgroup_max_ports max_ports
This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails.
port_order p1 p2 p3 ...
This keyword specifies the order in which CA ports on a destination switch are visited when computing routes. When the fabric contains switches connected with multiple parallel links, routes are distributed in a round-robin fashion across such links, and so changing the order that CA ports are visited changes the distribution of routes across such links. This may be advantageous for some specific traffic patterns. The default is to visit CA ports in increasing port order on destination switches. Duplicate values in the list will be ignored.
EXAMPLE
    # Look for a 2D (since x radix is one) 4x5 torus.
    torus 1 4 5
    # y is radix-4 torus dimension, need both
    # ym_link and yp_link configuration.
    yp
225. link 0x200000 0x200005 # sw @ y=0,z=0 -> sw @ y=1,z=0
    ym_link 0x200000 0x20000f # sw @ y=0,z=0 -> sw @ y=3,z=0
    # z is not radix-4 torus dimension, only need one of
    # zm_link or zp_link configuration.
    zp_link 0x200000 0x200001 # sw @ y=0,z=0 -> sw @ y=0,z=1
    next_seed
    yp_link 0x20000b 0x200010 # sw @ y=2,z=1 -> sw @ y=3,z=1
    ym_link 0x20000b 0x200006 # sw @ y=2,z=1 -> sw @ y=1,z=1
    zp_link 0x20000b 0x20000c # sw @ y=2,z=1 -> sw @ y=2,z=2
    y_dateline -2 # Move the dateline for this seed
    z_dateline -1 # back to its original position.
    # If OpenSM failover is configured, for maximum resiliency
    # one instance should run on a host attached to a switch
    # from the first seed, and another instance should run
    # on a host attached to a switch from the second seed.
    # Both instances should use this torus-2QoS.conf to ensure
    # path SL values do not change in the event of SM failover.
    # port_order defines the order on which the ports would be
    # chosen for routing.
    pong 10 1A 25 29 AS 2Q 27 30
8.6 Quality of Service Management in OpenSM
8.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the '-Q' or '--qos' flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the
226. llanox OFED Overview
mlx4_en
A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet specific functions and plugs into the netdev mid-layer.
1.3.2 mlx5 Driver
mlx5 is the low level driver implementation for the Connect-IB adapters designed by Mellanox Technologies. Connect-IB operates as an InfiniBand adapter. The mlx5 driver is comprised of the following kernel modules:
mlx5_core
Acts as a library of common functions (e.g., initializing the device after reset) required by the Connect-IB adapter card.
mlx5_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
libmlx5
libmlx5 is the provider library that implements hardware specific user-space functionality. If there is no compatibility between the firmware and the driver, the driver will not load and a message will be printed in the dmesg.
The following are the libmlx5 environment variables:
    - MLX5_FREEZE_ON_ERROR: Causes the process to hang in a loop when a completion with error, which is not flushed with error or retry exceeded, occurs. Otherwise, disabled.
    - MLX5_POST_SEND_PREFER_BF: Configures every work request that can use blue flame to use blue flame. Otherwise, blue flame depends on the size of the message and inline indication in the packet.
    - MLX5_SHUT_UP_BF: Disables the blue flame feature. Otherwise, do not disable.
    - MLX5_SINGLE_THREADED: All spinlocks are disabled. Otherwis
227. lowing code lines are an excerpt from a sample IPoIB configuration file:
    # Static settings; all values provided by this file
    IPADDR_ib0=11.4.3.175
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on eth0; each '*' will be replaced with a corresponding octet from eth0
    LAN_INTERFACE_ib0=eth0
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on the first eth<n> interface that is found (for n=0,1,...);
    # each '*' will be replaced with a corresponding octet from eth<n>
    LAN_INTERFACE_ib0=
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    4.3.3.3 Manually Configuring IPoIB
    This manual configuration persists only until the next reboot or driver restart. To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:
    Step 1. To configure the interface, enter the ifconfig command with the following items:
        The appropriate IB interface (ib0, ib1, etc.)
        The IP address that you want to assign to the interface
        The netmask keyword
        The subnet mask that you want to assign to the interface
    The following example shows how to configure an IB interface:
        host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
    Step 2. (Optional) V
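The static values in the sample file are related by simple bit arithmetic: NETWORK is the address ANDed with the netmask, and BROADCAST is the address ORed with the inverted netmask. The following is a minimal sketch of that calculation, assuming plain dotted-decimal IPv4 input without leading zeros. The function name ipoib_net_settings is hypothetical and is not part of MLNX_OFED.

```shell
# Sketch only: ipoib_net_settings is a hypothetical helper, not an MLNX_OFED tool.
# Usage: ipoib_net_settings IFNAME IPADDR NETMASK
# Prints static IPoIB settings in the same KEY_ifname=value form as the sample file.
ipoib_net_settings() (
    ifname=$1; addr=$2; mask=$3
    set -f                 # disable globbing while splitting on dots
    IFS=.
    set -- $addr $mask     # $1..$4 = address octets, $5..$8 = netmask octets
    # Network address: address AND netmask, octet by octet.
    net="$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
    # Broadcast address: address OR inverted netmask, octet by octet.
    bcast="$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
    # The format string is reused for every (key, interface, value) triple.
    printf '%s_%s=%s\n' \
        IPADDR "$ifname" "$addr" \
        NETMASK "$ifname" "$mask" \
        NETWORK "$ifname" "$net" \
        BROADCAST "$ifname" "$bcast" \
        ONBOOT "$ifname" 1
)
```

Calling ipoib_net_settings ib0 11.4.3.175 255.255.0.0 reproduces the NETWORK_ib0 and BROADCAST_ib0 values of the static example above.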
228. lowing procedure modifies critical files used in the boot procedure. It must be cedure may prevent the diskless machine from booting.
    Step 1. Back up your current initrd file.
    Step 2. Make a new working directory and change to it:
        host1$ mkdir /tmp/initrd_ib
        host1$ cd /tmp/initrd_ib
    Step 3. Normally, the initrd image is zipped. Extract it using the following command:
        host1$ gzip -dc <initrd image> | cpio -id
    The initrd files should now be found under /tmp/initrd_ib.
    Step 4. Create a directory for the InfiniBand modules and copy them:
        host1$ mkdir -p /tmp/initrd_ib/lib/modules/ib
        host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
        host1$ cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
        host1$ cp infiniband/hw/mlx4/mlx
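The repeated cp commands in Step 4 can be expressed as a loop that stops at the first missing module, which is easier to audit than a long command list. The sketch below is an illustration only: copy_ib_modules is a hypothetical helper name, and the module list passed to it is whatever subset of the files above you need.

```shell
# Sketch only: copy_ib_modules is a hypothetical helper, not part of MLNX_OFED.
# Usage: copy_ib_modules SRC_DIR DEST_DIR MODULE_PATH...
# Copies each MODULE_PATH (relative to SRC_DIR) into DEST_DIR and fails
# on the first module that is missing or cannot be copied.
copy_ib_modules() (
    src=$1; dest=$2
    shift 2
    mkdir -p "$dest" || exit 1
    for mod in "$@"; do
        if [ ! -f "$src/$mod" ]; then
            echo "missing module: $src/$mod" >&2
            exit 1
        fi
        cp "$src/$mod" "$dest/" || exit 1
    done
)
```

For example, with the directories used in Step 4: copy_ib_modules "/lib/modules/$(uname -r)/updates/kernel/drivers" /tmp/initrd_ib/lib/modules/ib infiniband/core/ib_addr.ko infiniband/core/ib_core.ko net/mlx4/mlx4_core.ko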
229. lt if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'. The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
    -i <equalize-ignore-guids-file>
    --ignore-guids <equalize-ignore-guids-file>
    This option provides the means to define a set of ports (by GUIDs) that will be ignored by the link load equalization algorithm.
    LMC awareness routes based on a (remote) system or switch basis.
    8.5.3 8.5.3.1 UPDN Algorithm
    The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high pressure).
    The UPDN algorithm is based on the following main stages:
    1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Sin
230. --lw <1x|4x|8x|12x>: Specifies the expected link width.
    -w|--write_topo_file <file name>: Write out a topology file for the discovered topology.
    -t|--topo_file <file>: Specifies the topology file name.
    --out_ibnl_dir <directory>: The topology file custom system definitions (ibnl) directory.
    --screen_num_errs <num>: Specifies the threshold for printing errors to screen (default=5).
    --smp_window <num>: Max smp MADs on wire (default=8).
    --gmp_window <num>: Max gmp MADs on wire (default=128).
    --max_hops <max-hops>: Specifies the maximum hops for the discovery process (default=64).
    -V|--version: Prints the version of the tool.
    -h|--help: Prints help information (without plugins help, if exists).
    -H|--deep_help: Prints deep help information (including plugins help).
    Output Files
    Table 26 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.
    Table 26 - ibdiagnet (of ibutils2) Output Files
    Output File: Description
    ibdiagnet2.lst: Fabric links in LST format
    ibdiagnet2.sm: Subnet Manager
    ibdiagnet2.pm: Ports Counters
    ibdiagnet2.fdbs: Unicast FDBs
    ibdiagnet2.mcfdbs: Multicast FDBs
    ibdiagnet2.nodes_info: Information on nodes
    ibdiagnet2.db_csv: ibdiagnet internal database
    An ibdiagnet run performs the following st
231. machines direct hardware access to network resources, hence increasing its performance. In this chapter we will demonstrate setup and configuration of SR-IOV in a Red Hat Linux environment using the Mellanox ConnectX VPI adapter cards family.
    4.13.1 System Requirements
    To set up an SR-IOV environment, the following is required:
        MLNX_OFED Driver
        A server/blade with an SR-IOV-capable motherboard BIOS
        Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
        Mellanox ConnectX VPI Adapter Card family with SR-IOV capability
    4.13.2 Setting Up SR-IOV
    Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.
    Step 1. Enable "SR-IOV" in the system BIOS.
        [Figure: BIOS SETUP UTILITY, SR-IOV Supported: Enabled]
    Step 2. Enable "Intel Virtualization Technology".
        [Figure: BIOS SETUP UTILITY, Intel(R) Virtualization Tech: Enabled]
    Step 3. Install a hypervisor that supports SR-IOV.
    Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel. For example, to Intel systems, add:
        default=0
        timeout=5
        splashimage=(hd0,0)/grub/splash.xpm.gz
        hiddenmenu
        title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
        root (hd0,0)
        kernel /vml
232. mere Processors
    Table 23: Recommended BIOS Settings for AMD Processors
    Table 24: Adaptive Routing Manager Options File
    Table 25: Adaptive Routing Manager Pre-Switch Options
    Table 26: Congestion Control Manager General Options File
    Table 27: Congestion Control Manager Switch Options
    Table 28: Congestion Control Manager CA Options File
    Table 29: Congestion Control Manager CC MGR Options File
    Table 30: ibdiagnet (of ibutils2) Output Files
    Table 31: ibdiagnet (of ibutils) Output Files
    Table 32: ibdiagpath Output Files
    Table 33: ibv_devinfo Flags and Options
    Table 34: ibstatus Flags and Options
    Table 35: ibportstate Flags and Options
    Table 36: ibportstate Flags and Options
    Table 37: smpquery Flags and Options
    Table 38: perfquery Flags and Options
233. mple 2: working with scst_vdisk FILEIO mode (using the md0 device and the file 10G-file)
        a. modprobe scst
        b. modprobe scst_vdisk
        c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
        d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
        e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
        f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    2. Run:
        For all distributions except SLES 11: > modprobe ib_srpt
        For SLES 11: > modprobe -f ib_srpt
    For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:
        ib_srpt: no symbol version for scst_unregister
        ib_srpt: Unknown symbol scst_unregister
        ib_srpt: no symbol version for scst_register
        ib_srpt: Unknown symbol scst_register
        ib_srpt: no symbol version for scst_unregister_target_template
        ib_srpt: Unknown symbol scst_unregister_target_template
    B. On Initiator Machines
    On Initiator machines, manually perform the following steps:
    1. Run: modprobe ib_srp
    2. Run: ibsrpdm -c -d /dev/infiniband/umadX (to discover a new SRP target)
        umad0: port 1 of the first HCA
        umad1: port 2 of the first HCA
        umad2: port 1 of the second HCA
    3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
    4. fdisk -l (will show the newly discovered scsi disks)
    Example: Assume that you use port 1 of the first HCA in the system, i.e. mth
234. munications library that supports a unique set of parallel programming features, including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application. Mellanox ScalableSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED) and can also utilize the Mellanox Messaging libraries (MXM) as well as Mellanox Fabric Collective Accelerations (FCA), providing an unprecedented level of scalability for SHMEM programs running over InfiniBand. The latest ScalableSHMEM software can be downloaded from the Mellanox website.
    5.1.2 Running SHMEM with FCA
    The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager, and MPI nodes without requiring additional hardware. The FCA manager creates a topology-based collective tree and orchestrates an efficient collective operation using the switch-based CPUs on the MPI/ScalableSHMEM nodes. FCA accelerates MPI/ScalableSHMEM collective operation performance by up to 100 times, providing a reduction
235. ncrease the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.
    5.1.4 Running SHMEM with Contiguous Pages
    Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.
    To activate MLNX_OFED 2.0 and the contiguous pages allocator with SHMEM, run the following argument to enable compound pages with SHMEM:
        /opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5
    If using compound pages is not possible, the user will fall back to the regular hugepages mechanism.
    To force use of the compound pages allocator, run the following command:
        /opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5 -x MR_FORCE_CONTIG_PAGES=1
    For further information on Contiguous Pages, please refer to Section 4.9, "Contiguous Pages", on page 84.
    5.1.5 Running ScalableSHMEM Application
    The ScalableSHMEM framework contains the shmemrun utility, which launches the executable from a service node to compute nodes. This utility accepts the same command line parameters as mpirun from the OpenMPI package. For further information, please refer to the OpenMPI MCA parameters documentation at http://www.open-mpi.org/faq/?category=running. Run "shmemrun --help" to obtain ScalableSHMEM job launcher runtime par
236. nes: alias bond0 bonding
    For SLES users: It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
    It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
    Restarting openibd does not keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart
    4.4 Quality of Service InfiniBand
    4.4.1 Quality of Service Overview
    Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
    [Figure 2: I/O Consolidation Over InfiniBand]
    QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager". The basic need is to differentiate the servi
237. ng
    To enable Adaptive Routing, perform the following:
    1. Create the Subnet Manager options file. Run:
        opensm -c <options-file-name>
    2. Add 'armgr' to the 'event_plugin_name' option in the file:
        # Event plugin name(s)
        event_plugin_name armgr
    3. Run Subnet Manager with the new options file:
        opensm -F <options-file-name>
    Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf.
    To provide an alternative location, please perform the following:
    1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the 'event_plugin_options' option in the file:
        # Options string that would be passed to the plugin(s)
        event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
    2. Run Subnet Manager with the new options file:
        opensm -F <options-file-name>
    See an example of the AR Manager options file with all the default values in "Example of Adaptive Routing Manager Options File" on page 182.
    8.8.3.2 Disabling Adaptive Routing
    There are two ways to disable Adaptive Routing Manager:
    1. By disabling it explicitly in the Adaptive Routing configuration file.
    2. By removing the 'armgr' option from the Subnet Manager options file.
    The Adaptive Routing mechanism is automatically disabled once the switch receiv
238. ng the Scaling Governor
    If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance:
        freq_table
        acpi_cpufreq: this module is architecture dependent
    It is also recommended to disable the module cpuspeed; this module is also architecture dependent.
    To set the scaling mode to performance, use:
        echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
    To disable cpuspeed, use:
        service cpuspeed stop
    7.2.4.2 Kernel Idle Loop Tuning
    The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wake-up time but may result in higher power consumption. To tune the kernel idle loop, set the following options in the /etc/modprobe.d/mlx4.conf file:
        For MLNX_OFED 2.0.x: options mlx4_core enable_sys_tune=1
        For MLNX_EN 1.5.10: options mlx4_en enable_sys_tune=1
    7.2.4.3 OS Controlled Power Management
    Some operating systems can override the BIOS power management configuration and enable c-states by default, which results in a higher latency. To resolve the high latency issue, please follow the instructions below:
    1. Edit the /boot/grub/grub.conf file or any other bootloader configuration file.
    2. Add the following kernel parameters to the bootloader command: intel_idle.max_cstate=0 processor.max_cstate=1
    3. Reboot the system.
    Example: title
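The echo command shown for the scaling governor targets a single CPU (cpu7). To apply the same setting to every CPU that exposes a cpufreq interface, the write can be wrapped in a loop. The following is a sketch under that assumption: set_scaling_governor is a hypothetical helper name, and the optional second argument exists only so the function can be exercised against a scratch directory instead of the real sysfs tree.

```shell
# Sketch only: set_scaling_governor is a hypothetical helper, not part of MLNX_OFED.
# Usage: set_scaling_governor GOVERNOR [CPU_ROOT]
# Writes GOVERNOR to the scaling_governor file of every cpuN directory under
# CPU_ROOT (default: /sys/devices/system/cpu). Returns non-zero if any write fails.
set_scaling_governor() (
    governor=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    status=0
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the unexpanded pattern when no CPU exposes cpufreq.
        [ -e "$f" ] || continue
        echo "$governor" > "$f" || status=1
    done
    exit "$status"
)
```

On a real system this would be run as root, for example: set_scaling_governor performance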
239. ng tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop In addi tion if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus and if multicast traffic is confined to SL 0 or SL 8 recall that torus 2QoS uses SL bit 3 to differentiate QoS level then multicast traffic also cannot contribute to the ring credit loops that are otherwise possible in a torus Torus 2QoS uses these ideas to create a master spanning tree Every multicast group spanning tree will be constructed as a subset of the master tree with the same root as the master tree Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric However this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops In the presence of link or switch failures that result in a fabric for which torus 2QoS can generate credit loop free unicast routes it is also possible to generate a master spanning tree for multicast that retains the required properties For example consider that same 2D 6x5 torus with the link from 2 2 to 3 2 failed Torus 2QoS will generate the following master spanning tree 4 I l l l l I 3 l l l l I 2 l l l l l 1 l l l l l y 0 Mellanox Technologies 161 Rev 2 1 1 0 0 OpenSM Subnet Manager
240. nge 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.
    8.5.7.5 Operational Considerations
    Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement.
    If a change in fabric topology causes changes in the path SL values required to route without credit loops, in general all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing by running applications is not required when the fabric is routed with torus-2QoS.
    Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover, provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details.
    Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If no fallback was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that ca
241. nnection capability. See Section 4.1.2.4.
    4.1.2.4 Automatic Discovery and Connection to Targets
    Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
    To connect to all the existing Targets in the fabric, run "srp_daemon -e -o". This utility will scan the fabric once, connect to every Target it detects, and then exit. srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file.
    To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute "srp_daemon -e". This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric).
    To execute the SRP daemon as a daemon on all the ports:
        srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log.
        Start the srpd service script; run "service srpd start".
    It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRP_DAEMON_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 4.1.2.6). For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart
242. nning Subnet Manager with Adaptive Routing Manager 179 8 84 Querying Adaptive Routing Tables 180 8 8 5 Adaptive Routing Manager Options File 180 8 9 Congestion Control RR TU a ET Be 183 8 9 1 Congestion Control Overview 183 8 9 2 Running OpenSM with Congestion Control Manager 183 8 9 3 Configuring Congestion Control 183 8 9 4 Configuring Congestion Control Manager Main Settings 184 Chapter 9 InfiniBand Fabric Diagnostic Utilities 187 SOVEIMICW uen dudas k im akaqa sh bom ate eben aoe ed bstn 187 92 Ut litiesUsagez zov mnn aus ol e EIS ake 187 9 2 1 Common Configuration Interface and Addressing 187 9 2 2 InfiniBand Interface 187 9 2 3 Addressing veda eid upas S E Ede pe ER Bie 188 9 3 ibdiagnet of ibutils2 IB Net Diagnostic 188 9 4 ibdiagnet of ibutils IB Net Diagnostic 191 9 5 ibdiagpath IB diagnosticpath 194 9 67 ADV devices c satis he eaves ee e E gap Paced ata dd eren 196 OT ADV GOVAN TOs xu n anat at a BO e hu s V gi ciae ina 196 9 85 IDCEV2NEtd EVE i er on er er
243. nsion Order Routing (DOR)
    --honor_guid2lid, -x: This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE.
    --const_multicast: This option forces OpenSM to conserve previously built multicast trees.
    --log_file, -f <log-file-name>: This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output, use -f stdout.
    --log_limit, -L <size in MB>: This option defines the maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.
    --erase_log_file: This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.
    --Pconfig, -P <partition-config-file>: This option defines the optional partition configuration file. The default name is /etc/opensm/partitions.conf.
    --no_part_enforce, -N (DEPRECATED): This option disables partition enforcement on switch external ports.
    --part_enforce [both | in | out | off]: This option indicates the partition enforcement type (for switches). Enforcement type can be outbound only (out), inbound only (in), both, or disabled (off). Default is both.
    --allow_both_pkeys, -W: This option indicates whether both full and limited membership on the same partition can be configured in the
244. nt systems. The results are significantly dependent on the CPU and chipset efficiency.
    7.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
    The following changes are recommended for improving IPv4 traffic performance:
    Disable the TCP timestamps option for better CPU utilization:
        sysctl -w net.ipv4.tcp_timestamps=0
    Enable the TCP selective acks option for better throughput:
        sysctl -w net.ipv4.tcp_sack=1
    Increase the maximum length of processor input queues:
        sysctl -w net.core.netdev_max_backlog=250000
    Increase the TCP maximum and default buffer sizes using setsockopt():
        sysctl -w net.core.rmem_max=4194304
        sysctl -w net.core.wmem_max=4194304
        sysctl -w net.core.rmem_default=4194304
        sysctl -w net.core.wmem_default=4194304
        sysctl -w net.core.optmem_max=4194304
    Increase memory thresholds to prevent packet dropping:
        sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
        sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
    Enable low latency mode for TCP:
        sysctl -w net.ipv4.tcp_low_latency=1
    7.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
    The following changes are recommended for improving IPv6 traffic performance:
    Disable the TCP timestamps option for better CPU utilization:
        sysctl -w net.ipv4.tcp_timestamps=0
    Enable the TCP selective acks option for better CPU utilization:
        sysctl -w net.ipv4.tcp_sack=1
    7.2.3 Pre
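The IPv4 recommendations in 7.2.1 can be kept in one place and applied in a single pass, which makes it easier to review or revert them. The sketch below assumes the values listed above; the function names ipv4_tuning_settings and apply_settings and the SYSCTL override variable are hypothetical and are introduced here only for illustration.

```shell
# Sketch only: both function names and the SYSCTL variable are hypothetical.

# Print the IPv4 tuning values from section 7.2.1, one key=value per line.
ipv4_tuning_settings() {
    cat <<'EOF'
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=1
net.core.netdev_max_backlog=250000
net.core.rmem_max=4194304
net.core.wmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304
net.ipv4.tcp_rmem=4096 87380 4194304
net.ipv4.tcp_wmem=4096 65536 4194304
net.ipv4.tcp_low_latency=1
EOF
}

# Read key=value lines from stdin and pass each one to "sysctl -w".
# Set SYSCTL to another command name to preview the calls without applying them.
apply_settings() {
    while IFS= read -r setting; do
        [ -n "$setting" ] || continue
        ${SYSCTL:-sysctl} -w "$setting" || return 1
    done
}
```

Running "ipv4_tuning_settings | apply_settings" as root applies the values until the next reboot, in the same way as the individual sysctl -w commands.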
245. ntation of the SRP target.
    max_sect: A decimal number specifying the maximum number of 512-byte sectors to be transferred via a single SCSI command.
    max_cmd_per_lun: A decimal number specifying the maximum number of outstanding commands for a single LUN.
    io_class: A hexadecimal number specifying the SRP I/O class. Must be either 0xff00 (rev 10) or 0x0100 (rev 16a). The I/O class defines the format of the SRP initiator and target port identifiers.
    initiator_ext: A 16-digit hexadecimal number specifying the identifier extension portion of the SRP initiator port identifier. This data is sent by the initiator to the target in the SRP_LOGIN_REQ request.
    cmd_sg_entries: A number in the range 1..255 that specifies the maximum number of data buffer descriptors stored in the SRP_CMD information unit itself. With allow_ext_sg=0, the parameter cmd_sg_entries defines the maximum S/G list length for a single SRP_CMD, and commands whose S/G list length exceeds this limit after S/G list collapsing will fail.
    allow_ext_sg: Whether ib_srp is allowed to include a partial memory descriptor list in an SRP_CMD instead of the entire list. If a partial memory descriptor list has been included in an SRP_CMD, the remaining memory descriptors are communicated from initiator to target via an additional RDMA transfer. Setting allow_ext_sg to 1 increases the maximum amount of data that can be transferred between initiator and target via a single SCSI command. Since not all SRP target implementations support partial memory descriptor lists, the
246. nts s sie else i siasa enn eee nee CC 19 L23 Firmwares s t uyarkachu eat Ge ee 20 1 2 4 Directory Structure l tenet nnn nee 20 1 37 E Peg Geen PER nc pug 21 13 1 mlx4 VPE DHVeOE c bbs t ELEM Tm De ant ae Eu rs 21 1 3 2 mlx5 Dtivets ic eh an Set RU NUR EGER NEUEN ERN RUE ERR XR EU 29 1 3 3 M d layet Core vee rV PUTET bh B Weg huayu s NT 23 1 354 CUTS osa le n M OS ED A 23 1 3 5 MP iie ite Kea Pe Tee pe Ces etre p ape VR ATE NRI AERE ua A NEL gU 24 1 3 6 InfiniBand Subnet 24 1 3 7 Diagnostic Utilities dc e e n 24 1 3 8 Mellanox Firmware eens 24 1 4 Quality of Service een 25 1 5 over Converged Ethernet RoCE 25 Chapter 2 27 2 1 Hardware and Software Requirements 27 2 2 Downloading Mellanox OFED 27 2 3 Installing Mellanox OFED 28 2 3 1 Pre installation Notes u las eect eect een eens 28 23 2 Installation Scripts 22 coxa ssd see s OR Paes Ce DUTIES 29 2 3 3 Installation Procedure 11 on sd kids nee 32 2 3 4 Installation
247. ny layer, any kind of event packet: HWTSTAMP_FILTER_PTP_V2_EVENT
    PTP v2/802.AS1, any layer, Sync packet: HWTSTAMP_FILTER_PTP_V2_SYNC
    PTP v2/802.AS1, any layer, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_DELAY_REQ
    Note: for receive-side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.
    4.6.1.2 Getting Time Stamping
    Once time stamping is enabled, the time stamp is placed in the socket ancillary data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached. It can be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the original outgoing packet data, including all headers prepended down to and including the link layer, the scm_timestamping control message, and a sock_extended_err control message with ee_errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.
    When time stamping is enabled, VLAN stripping is disabled. For more information, please refer to Documentation/networking/timestamping.txt in kernel.org.
    4.6.1.3 Querying Time Stampin
248. o establish an RC connection provide the CMA with the target IP and port number. ULPs might also provide a QoS Class. The CMA then creates a Service ID for the ULP and passes this ID and the optional QoS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.
    PathRecord and MultiPathRecord Enhancement for QoS
    As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided. A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for the existing access layer and ULPs by not introducing a new set of PR/MPR attributes.
    4.4.3 Supported Policy
    The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:
    I. Port Group
    A set of CAs, Routers, or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.
    II. Fabric Setup
    Defines how the SL2VL and VLArb tables should be set up. In OFED, this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).
    III. QoS-Levels Definition
    This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds
249. o uniform the mapping of device function numbers to InfiniBand device numbers as defined for other module parameters (e.g. num_vfs and probe_vf). For example, to map mlx4_15 to device function number 04:00.0, in the current version we use "options mlx4_ib dev_assign_str=04:00.0-15", as opposed to the previous version, in which we used "options mlx4_ib dev_assign_str=04:00.0-f".
    C.2 mlx4_core Parameters
    set_4k_mtu: (Obsolete) attempt to set 4K MTU to all ConnectX ports (int)
    debug_level: Enable debug tracing if > 0 (int)
    msi_x: 0 - don't use MSI-X, 1 - use MSI-X, >1 - limit number of MSI-X irqs to msi_x (non-SRIOV only) (int)
    enable_sys_tune: Tune the cpu's for better performance (default 0) (int)
    block_loopback: Block multicast loopback packets if > 0 (default: 1) (int)
    num_vfs: Either a single value (e.g. '5') to define a uniform num_vfs value for all devices functions, or a string to map device function numbers to their num_vfs values (e.g. '0000:04:00.0-5,002b:1c:0b.a-15'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the num_vfs value (e.g. 15). (string)
    probe_vf: Either a single value (e.g. '3') to define a uniform number of VFs to probe by the pf driver for all devices functions, or a string to map device function numbers to their probe_vf values (e.g. '0000:04:00.0-3,002b:1c:0b.a-13'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the probe_vf value (e.g. 13). (string)
250. log_num_mgm_entry_size: log mgm size, that defines the num of qp per mcg, for example: 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12. To activate device managed flow steering when available, set to -1 (int)
    high_rate_steer: Enable steering mode for higher packet rate (default off) (int)
    fast_drop: Enable fast packet drop when no receive WQEs are posted (int)
    enable_64b_cqe_eqe: Enable 64 byte CQEs/EQEs when the FW supports this if non-zero (default: 1) (int)
    log_num_mac: Log2 max number of MACs per ETH port (1-7) (int)
    log_num_vlan: (Obsolete) Log2 max number of VLANs per ETH port (0-7) (int)
    log_mtts_per_seg: Log2 number of MTT entries per segment (0-7) (default: 0) (int)
    port_type_array: Either a pair of values (e.g. '1,2') to define a uniform port1/port2 types configuration for all devices functions, or a string to map device function numbers to their pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1'). Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A. If only a single port is available, use the N/A port type for port2 (e.g. '1,4').
    log_num_qp: log maximum number of QPs per HCA (default: 19) (int)
    log_num_srq: log maximum number of SRQs per HCA (default: 16) (int)
    log_rdmarc_per_qp: log n
    Remaining mlx4_core parameters listed: log_num_cq, log_num_mcg, log_num_mpt, log_num_mtt, enable_qos, internal_err_reset
    mlx4_en Parameters listed: inline_thold, udp_rss, pfctx, pfcrx
251. ons
    To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
        Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
        Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.
    4.1.2.2.1 SRP sysfs Parameters
    Interface for making ib_srp connect to a new target. One can request ib_srp to connect to a new target by writing a comma-separated list of login parameters to this sysfs attribute. The supported parameters are:
    id_ext: A 16-digit hexadecimal number specifying the eight-byte identifier extension in the 16-byte SRP target port identifier. The target port identifier is sent by ib_srp to the target in the SRP_LOGIN_REQ request.
    ioc_guid: A 16-digit hexadecimal number specifying the eight-byte I/O controller GUID portion of the 16-byte target port identifier.
    dgid: A 32-digit hexadecimal number specifying the destination GID.
    pkey: A four-digit hexadecimal number specifying the InfiniBand partition key.
    service_id: A 16-digit hexadecimal number specifying the InfiniBand service ID used to establish communication with the SRP target. How to find out the value of the service ID is specified in the docume
    Further parameters listed: max_sect, max_cmd_per_lun, io_class, initiator_ext, cmd_sg_entries, allow_ext_sg, sg_tablesize, comp_vector
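Because the login parameters have fixed hexadecimal lengths, it can help to validate them before composing the comma-separated string that is written to the sysfs attribute. The sketch below checks only the digit counts stated above (16 for id_ext, ioc_guid and service_id, 32 for dgid, four for pkey). The function names is_hex and srp_target_login are hypothetical, and the values in the usage note are placeholders rather than real GUIDs.

```shell
# Sketch only: is_hex and srp_target_login are hypothetical helpers.

# is_hex LENGTH VALUE: succeeds if VALUE is exactly LENGTH hexadecimal digits.
is_hex() {
    [ "${#2}" -eq "$1" ] || return 1
    case $2 in
        *[!0-9a-fA-F]*) return 1 ;;
    esac
    return 0
}

# srp_target_login ID_EXT IOC_GUID DGID SERVICE_ID [PKEY]
# Prints the comma-separated login string; PKEY defaults to ffff.
srp_target_login() {
    id_ext=$1; ioc_guid=$2; dgid=$3; service_id=$4; pkey=${5:-ffff}
    is_hex 16 "$id_ext" && is_hex 16 "$ioc_guid" && is_hex 32 "$dgid" &&
        is_hex 16 "$service_id" && is_hex 4 "$pkey" || {
        echo "invalid SRP login parameter" >&2
        return 1
    }
    printf 'id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s\n' \
        "$id_ext" "$ioc_guid" "$dgid" "$pkey" "$service_id"
}
```

With placeholder values, srp_target_login 0123456789abcdef 0123456789abcdef fe800000000000000123456789abcdef 0123456789abcdef prints a string that could then be written to the add_target attribute of the chosen port.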
252. order to clean old scsi_hosts representing targets that no longer exist.
Constraints between parameters:
  dev_loss_tmo, fast_io_fail_tmo and reconnect_delay cannot all be disabled or negative values.
  reconnect_delay must be a positive number.
  fast_io_fail_tmo must be smaller than the SCSI block device timeout.
  fast_io_fail_tmo must be smaller than dev_loss_tmo.
4.1.2.1.2 SRP Remote Ports Parameters
Several SRP remote ports parameters are modifiable online on an existing connection.
To modify dev_loss_tmo to 600 seconds:
  echo 600 > /sys/class/srp_remote_ports/port-xxx/dev_loss_tmo
To modify fast_io_fail_tmo to 15 seconds:
  echo 15 > /sys/class/srp_remote_ports/port-xxx/fast_io_fail_tmo
To modify reconnect_delay to 10 seconds:
  echo 10 > /sys/class/srp_remote_ports/port-xxx/reconnect_delay
4.1.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically.
Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:
  echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0
253. orks using the Device Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices.
Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.
Operation
When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA).
When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).
4.1.2.7 Manual Activation o
254. ors
The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors.
Table 17 - Recommended BIOS Settings for Intel Sandy Bridge Processors
  BIOS Option                                  Values
  General    Operating Mode / Power profile    Maximum Performance
  Processor  C-States                          Disabled
             Turbo mode                        Enabled
             Hyper-Threading (a)               HPC: disabled; Data Centers: enabled
             CPU frequency select              Max performance
  Memory     Memory speed                      Max performance
             Memory channel mode               Independent
             Node Interleaving                 Disabled / NUMA
             Channel Interleaving              Enabled
             Thermal Mode                      Performance
  a. Hyper-Threading can increase the message rate for multi-process applications by providing more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when hyper-threading is enabled.
7.1.3.3 Intel Nehalem/Westmere Processors
The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.
Table 18 - Recommended BIOS Settings for Intel Nehalem/Westmere Processors
  BIOS Option                                  Values
  General    Operating Mode / Power profile    Maximum Performance
  Processor  C-States                          Disabled
             Turbo mode                        Disabled
             Hyper-Threading                   Disabled (recommended for latency and message rate sensitive applications)
             CPU fr
255. ort: port of IB device (default 1)
  -w, --write <file>: dump file name (default "sniffer.pcap"); "-" stands for stdout, which enables piping to tcpdump or tshark
  -o, --output <file>: alias for the -w option. Do not use; kept for backward compatibility
Examples
1. Run ibdump
Appendix A: Mellanox FlexBoot
A.1 Overview
Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet.
Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products.
FlexBoot is based on the open source project iPXE, available at http://www.ipxe.org.
FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other service.
For an InfiniBand port, Mellanox FlexBoot implements a network driver with
256. orts/1/pkey_idx/0
  echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
  echo 2 > 0000:02:00.2/ports/1/pkey_idx/0
vm1 pkey index 0 will be mapped to physical pkey index 1, and vm2 pkey index 0 will be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1 mapped to the default pkey.
Step d. On Host2 do the following:
  cd /sys/class/infiniband/mlx4_0/iov
  echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
  echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
  echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
  echo 2 > 0000:03:00.2/ports/1/pkey_idx/0
Step e. Once the VMs are running, you can check the VM's virtualized PKey table by doing (on the VM):
  cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]
Step 3. Start up the VMs (and bind VFs to them).
Step 4. Configure IP addresses for ib0 on the host and on the guests.
4.13.7.3 Ethernet Virtual Function Configuration when Running SR-IOV
4.13.7.3.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
When running ETH ports on VGT, the ports may be configured to simply pass through packets as is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging). In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag inserted, and incoming packets will have the VLAN tag removed. Any vlan tag
257. ount of memory that can be pinned by a user space application. If desired, tune the value unlimited to a specific amount of RAM.
Step 6. For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 8, "OpenSM - Subnet Manager".
Step 7. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as:
  HCA firmware version
  Kernel architecture
  Driver version
  Number of active HCA ports along with their states
  Node GUID
Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.
  # hca_self_test.ofed
  ---- Performing Adapter Device Self Test ----
  Number of CAs Detected ................. 1
  PCI Device Check ....................... PASS
  Kernel Arch ............................ x86_64
  Host Driver Version .................... MLNX_OFED_LINUX-2.1-1.0.0 (OFED-2.1-1.0.0): 3.0.76-0.11-default
  Host Driver RPM Check .................. PASS
  Firmware on CA #0 VPI .................. v2.30.8000
  Firmware Check on CA #0 (VPI) .......... PASS
  Host Driver Initialization ............. PASS
  Number of CA Ports Active .............. 1
  Port State of Port #1 on CA #0 (VPI) ... UP 4X FDR (InfiniBand)
258. ow
  Mellanox ConnectX FlexBoot v3.3.400
  iPXE 1.0.0 - Open Source Network Boot Firmware
  net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
    [Link:down, TX:0 TXE:0 RX:0 RXE:0]
    [Link status: The socket is not connected]
  Waiting for link-up on net0... ok
Placing Client Identifiers in /etc/dhcpd.conf
The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server.
  host host1 {
      next-server 11.4.3.7;
      filename "pxelinux.0";
      fixed-address 11.4.3.130;
      option dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
  }
A.5 Subnet Manager - OpenSM
This section applies to ports configured as InfiniBand only.
FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM - Subnet Manager" on page 137.
To use OpenSM caching for large InfiniBand clusters (> 100 nodes), it is recommended to use the OpenSM options described in Section 8.2.1, "opensm Syntax", page 137.
A.6 BIOS Configuration
The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add
259. p file with set of the IDs which will be used by the Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line)
--guid_routing_order_file, -X <path to file>
  Set the order port guids will be routed for the MinHop and Up/Down routing algorithms to the guids provided in the given file (one to a line)
--torus_config <path to file>
  This option defines the file name for the extra configuration info needed for the torus-2QoS routing engine. The default name is /etc/opensm/torus-2QoS.conf
--once, -o
  This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.
--sweep, -s <interval>
  This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.
--timeout, -t <milliseconds>
  This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
--retries <number>
  This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.
--maxsmps, -n <number>
  This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to
260. penSM As Daemon
OpenSM can also run as a daemon. To run OpenSM in this mode, enter:
  host1# /etc/init.d/opensmd start
osmtest Description
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory with all the object fields, and match it to a pre-saved one. See Section 8.3.2.
osmtest has the following test flows:
  Multicast Compliancy test
  Event Forwarding test
  Service Record registration test
  RMPP stress test
  Small SA Queries stress test
8.3.1 Syntax
  osmtest [OPTIONS]
where OPTIONS are:
-f, --flow
  This option directs osmtest to run a specific flow:
    Flow  Description
    c     create an inventory file with all nodes, ports and paths
    a     run all validation tests (expecting an input inventory)
    v     only validate the given inventory file
    s     run service registration, deregistration, and lease test
    e     run event forwarding test
    f     flood the SA with queries according to the stress mode
    m     multicast flow
    q     QoS info: dump VLArb and SLtoVL tables
    t     run trap 64/65 flow (this flow requires running of external tool)
  Default: all flows except QoS
-w, --wait
  This option specifies the wait time for trap 64/65 in seconds. It is used only
261. pes 't1,t2' for all devices, or a list of BDF to port_type_array 'bb:dd.f-t1;t2,...' (string). Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A. If only a single port is available, use the N/A port type for port2 (e.g. '1,4').
probe_vf
  If absent or zero: no VFs will be used by the PF driver.
  If its value is a single number in the range of 0-63: the Physical Function driver will use probe_vf VFs, and this will be applied to all ConnectX HCAs on the host.
  If its format is a string: the string specifies the probe_vf parameter separately per installed HCA. The string format is "bb:dd.f-v,bb:dd.f-v,...", where bb:dd.f = bus:device.function of the PF of the HCA, and v = number of VFs to use in the PF driver for that HCA.
  For example:
    probe_vf=5 - the PF driver will probe 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
    probe_vf=00:04.0-5,00:07.0-8 - the PF driver will probe 5 VFs on the HCA positioned in BDF 00:04.0, and 8 for the one in 00:07.0.
  Note: PFs not included in the above list will not use any of their VFs in the PF driver.
The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per a single VM. However, the number of VFs varies upon the working mode requirements.
The protocol types are:
  Port 1 = IB, Port 2 = Ethernet
  port_type_array=2,2 (Ethernet, Ethernet)
  port type
262. r a given coordinate, either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases both the positive and negative coordinate directions must be specified.
Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration.
x_dateline position
y_dateline position
z_dateline position
In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior.
The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinate radix-1 for that dimension. By default, the common switch in a torus seed is taken as the origin of the coordinate system used to describe switch location. The position parameter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed.
next_seed
If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and date
263. r in the ServiceEntries DMA attribute, and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
4.4.5 OpenSM Features
The QoS-related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 8) can be split into two main parts:
I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.
II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first the request is matched against the defined match rules, such that the target QoS Level definition is found. Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.
4.5 Quality of Service Ethernet
4.5.1 Quality of Service Overview
Quality of Service (QoS) is a mechanism of assigning a priority to a network flow (socket, rdma_cm connection) and managing its guarantees, limitations, and its priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a 2/3-stage process. The TC is assigned the QoS attributes, and the different flows behave accordingly.
4.5.2 Mapping Traffic to Traffic Classes
Mapping traffic to TCs consists of several actions which are user controllable, some controlled by the application itself and others by t
264. r router
  <ca_name>
  -P <ca_port>                       Optional   Use the specified port
Table 32 - ibportstate Flags and Options
  Flag                               Optional/Mandatory   Default (If Not Specified)   Description
  -t <timeout_ms>                    Optional                                          Override the default timeout for the solicited MADs [msec]
  <dest dr_path | lid | guid>        Optional                                          Destination's directed path, LID, or GUID
  <startlid>                         Optional                                          Starting LID in an MLID range
  <endlid>                           Optional                                          Ending LID in an MLID range
Examples
1. Dump all Lids with valid out ports of the switch with Lid 2.
  > ibroute 2
  Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
    Lid     Out Port   Destination Info
    0x0002  000        (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003  021        (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006  007        (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007  021        (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
    0x0008  008        (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
  5 valid lids dumped
2. Dump all Lids with valid out ports of the switch with Lid 2.
  > ibroute 2
  Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
    Lid     Out Des
265. ral usage scenarios for these utilities are presented.
ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
  a. To detect all targets reachable by the SRP initiator via the default umad device (/sys/class/infiniband_mad/umad0), execute the following command:
       ibsrpdm
     This command will output information on each SRP Target detected, in human-readable form. Sample output:
       IO Unit Info:
         port LID:        0103
         port GID:        fe800000000000000002c90200402bd5
         change ID:       0002
         max controllers: 0x10
         controller[ 1]
           GUID:      0002c90200402bd4
           vendor ID: 0002c9
           device ID: 005a44
           IO class:  0100
           ID:        LSI Storage Systems SRP Driver 200400a0b81146a1
           service entries: 1
             service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1
  b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:
       ibsrpdm -d <umad device>
2. Assistance in creating an SRP connection
  a. To generate output suitable for utilization in the "echo" command of Section 4.1.2.2, add the "-c" option to ibsrpdm:
       ibsrpdm -c
     Sample output:
       id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
  b. To establish a connection with an SRP Target using the output from the "ibsrpdm -c" example above, execute the following command:
       echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402
266. re.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical function driver (probe_vf). For example:
  options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1
If the fields in the example above do not appear in the [HCA] section, SR-IOV is not supported in the used INI. If SR-IOV is supported but not enabled, it is sufficient to set "sriov_en = true" in the INI to enable it.
  Parameter         Recommended Value
  num_vfs
    If absent or zero: no VFs will be available.
    If its value is a single number in the range of 0-63: the driver will enable the num_vfs VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
    If its format is a string: the string specifies the num_vfs parameter separately per installed HCA. The string format is "bb:dd.f-v,bb:dd.f-v,...", where bb:dd.f = bus:device.function of the PF of the HCA, and v = number of VFs to enable for that HCA.
    For example:
      num_vfs=5 - the driver will enable 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
      num_vfs=00:04.0-5,00:07.0-8 - the driver will enable 5 VFs on the HCA positioned in BDF 00:04.0, and 8 on the one in 00:07.0.
    Note: PFs not included in the above list will not have SR-IOV enabled.
  port_type_array
    Specifies the protocol type of the ports. It is either one array of 2 port ty
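The two accepted forms of num_vfs (a single number, or a per-HCA string) can be illustrated with a small parser. This is only a sketch: the function name is hypothetical, and the exact separators (`bb:dd.f-v` entries joined by commas) are an assumption about the string format described above.

```python
def parse_vf_param(value: str):
    """Parse a num_vfs/probe_vf style value.

    Returns an int when a single number (0-63) applies to all HCAs, or a
    dict mapping a PF's bus:device.function to its VF count when the
    per-HCA string form is used. The separators are assumed, not confirmed.
    """
    value = value.strip()
    if value.isdigit():
        count = int(value)
        if not 0 <= count <= 63:
            raise ValueError("a single value must be in the range 0-63")
        return count
    per_hca = {}
    for entry in value.split(","):
        bdf, sep, count = entry.strip().rpartition("-")
        if not sep or not bdf or not count.isdigit():
            raise ValueError(f"malformed entry: {entry!r}")
        per_hca[bdf] = int(count)
    return per_hca


if __name__ == "__main__":
    print(parse_vf_param("5"))
    print(parse_vf_param("00:04.0-5,00:07.0-8"))
```

A PF that does not appear in the returned dictionary corresponds to the note above: no SR-IOV is enabled for it.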
267. riority
  rx_novlan_bytes               Total bytes in successfully received packets with no VLAN priority
  tx_prio_<i>_packets           Total packets successfully transmitted with priority i
  tx_prio_<i>_bytes             Total bytes in successfully transmitted packets with priority i
  tx_novlan_packets             Total packets successfully transmitted with no VLAN priority
  tx_novlan_bytes               Total bytes in successfully transmitted packets with no VLAN priority
Table 10 - Port Pause (where <i> is in the range 0...7)
  Counter                           Description
  rx_pause_prio_<i>                 The total number of PAUSE frames received from the far-end port
  rx_pause_duration_prio_<i>        The total time in microseconds that the far-end port was requested to pause transmission of packets
  rx_pause_transition_prio_<i>      The number of receiver transitions from XON state (non-paused) to XOFF state (paused)
  tx_pause_prio_<i>                 The total number of PAUSE frames sent to the far-end port
  tx_pause_duration_prio_<i>        The total time in microseconds that transmission of packets has been paused
  tx_pause_transition_prio_<i>      The number of transmitter transitions from XON state (non-paused) to XOFF state (paused)
Table 11 - V
268. router
  -P <ca_port>                     Optional              Use the specified port
  -t <timeout_ms>                  Optional              Override the default timeout for the solicited MADs [msec]
  <dest dr_path | lid | guid>      Optional              Destination's directed path, LID, or GUID
  <portnum>                        Optional              Destination's port number
  <op> [<value>]                   Optional   query      Define the allowed port operations: enable, disable, reset, speed, and query
In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:
1. The first ACTIVE port that is found.
2. If not found, the first port that is UP (physical link state is LinkUp).
Examples
1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID, 3 in this case) to obtain additional link information using ibportstate.
  > ibstatus mlx4_0:1
  Infiniband device 'mlx4_0' port 1 status:
    default gid:  fe80:0000:0000:0000:0000:0000:9289:3895
    base lid:     0x3
    sm lid:       0x3
    state:        2: INIT
    phys state:   5: LinkUp
    rate:         20 Gb/sec (4X DDR)
  > ibportstate -C mlx4_0 3 1 query
  PortInfo:
  # Port info: Lid 3 port 1
  LinkState:.......................Initialize
  PhysLinkState:...................LinkUp
  LinkWidthSupported:..............1X or 4X
  LinkWidthEnabled:................1X or 4X
  LinkWidthActive:.................4X
  LinkSpeedSupported:..............2
269. rt GID tables.
  ports/<n>/gids/<n>, where 0 <= n <= 127 (the physical port gids)
  ports/<n>/admin_guids/<n>, where 0 <= n <= 127 (allows examining or changing the administrative state of a given GUID)
  ports/<n>/pkeys/<n>, where 0 <= n <= 126 (displays the contents of the physical pkey table)
"<pci_id>" directories, one for Dom0 and one per guest. Here, you may see the mapping between virtual and physical pkey indices, and the virtual to physical gid 0. Currently, the GID mapping cannot be modified, but the pkey virtual-to-physical mapping can. These directories have the structure:
  <pci_id>/port/<m>/gid_idx/0, where m = 1..2 (this is read-only)
and
  <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126
For instructions on configuring pkey_idx, please see below.
4.13.7.2.2 Configuring an Alias GUID (under ports/<n>/admin_guids)
Step 1. Determine the GUID index of the PCI Virtual Function that you want to pass through to a guest. For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially need to see which GUID index is used for this function. To do so:
  cat /sys/class/infiniband/iov/0000:02:00.3/port/<port_num>/gid_idx/0
The value returned will present which guid index to modify on Dom0.
Step 2. Modify the physical GUID table via the admin_guids sysfs interfa
270. s leaf switches going down, or one or more of these nodes coming back after being down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below).
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would perform up after a down step was used.
2. Once MinHop matrices exist, each switch is visited and, for each target LID, a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned LIDs is selected.
If LMC > 0, more checks are added. Within each group of LIDs assigne
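The two stages described above can be sketched on a toy topology. This is an illustration under simplifying assumptions, not OpenSM's implementation: the graph is a plain adjacency map, hop counts come from an ordinary breadth-first search from each target (without the up/down direction tracking), and port choice only balances a per-port counter among equal-cost candidates. All names are hypothetical.

```python
from collections import deque


def min_hops(adjacency: dict, target: str) -> dict:
    """Stage 1: hop count from every node to `target` via plain BFS."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def pick_port(candidates: list, port_load: dict):
    """Stage 2: among equal-MinHop ports, take the least-loaded one.

    Ties are broken by the lower port number; the chosen port's counter
    of assigned target LIDs is incremented.
    """
    port = min(candidates, key=lambda p: (port_load[p], p))
    port_load[port] += 1
    return port


if __name__ == "__main__":
    # Four switches in a square: A-B, A-C, B-D, C-D.
    fabric = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
    print(min_hops(fabric, "D"))

    load = {1: 0, 2: 0}
    # Two LIDs reachable through ports 1 and 2 at equal cost.
    print(pick_port([1, 2], load), pick_port([1, 2], load))
```

Successive LIDs with the same candidate ports alternate between them, which is the load-spreading effect the counter is meant to achieve.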
271. s per-VLAN mapping.
   If the underlying device is not a VLAN device, the tc command is used. In this case, even though the tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio to UP mapping.
   Mapping the sk_prio to the UP is done by using:
     tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7
4. The UP is mapped to the TC as configured by the mlnx_qos tool, or by the lldpad daemon if DCBX is used.
Note: Socket applications can use setsockopt (SK_PRIO, value) to directly set the sk_prio of the socket. In this case, the ToS to sk_prio fixed mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS.
Note: In case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.
4.5.4 RoCE Quality of Service Mapping
Applications use the RDMA-CM API to create and use QPs. The following is the RoCE QoS mapping flow:
1. The application sets the ToS of the QP using the rdma_set_option option (RDMA_OPTION_ID_TOS, value).
2. ToS is translated into the Socket Priority (sk_prio) using a fixed translation:
     TOS 0  <=> sk_prio 0
     TOS 8  <=> sk_prio 2
     TOS 24 <=> sk_prio 4
     TOS 16 <=> sk_prio 6
3. The Socket Priority is mapped to the User Priority (UP) using the tc command. In case of a VLAN device, the parent real device is used for the purpose of this mapping.
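The fixed ToS translation and the configurable sk_prio to UP stage can be written out as a short lookup. This is a sketch: the function name is hypothetical, and treating the `-u` list as "entry k is the UP for sk_prio k" is an assumption about how the comma-separated list is interpreted.

```python
# Fixed ToS -> sk_prio translation listed in the mapping flow.
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def tos_to_up(tos: int, up_by_sk_prio=(0, 1, 2, 3, 4, 5, 6, 7)) -> int:
    """Map a ToS value to a User Priority through the socket priority.

    `up_by_sk_prio` stands in for the list passed with -u; the identity
    list is used by default. Indexing it by sk_prio is an assumption.
    """
    if tos not in TOS_TO_SK_PRIO:
        raise ValueError(f"ToS {tos} has no fixed sk_prio translation")
    sk_prio = TOS_TO_SK_PRIO[tos]
    return up_by_sk_prio[sk_prio]


if __name__ == "__main__":
    for tos in (0, 8, 24, 16):
        print(tos, "->", tos_to_up(tos))
    # A non-identity list: send sk_prio 4 to UP 1.
    print(tos_to_up(24, up_by_sk_prio=(0, 0, 0, 0, 1, 1, 1, 1)))
```

Only four ToS values are covered by the fixed translation, which is why setting the socket priority directly gives access to more priorities.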
272. s successfully transmitted
  tx_bytes                      Total bytes in successfully transmitted packets
  tx_multicast_packets          Total multicast packets successfully transmitted
  tx_broadcast_packets          Total broadcast packets successfully transmitted
  tx_errors                     Number of frames that failed to transmit
  tx_dropped                    Number of transmitted frames that were dropped
  tx_lt_64_bytes_packets        Number of transmitted 64 or less octet frames
  tx_127_bytes_packets          Number of transmitted 65 to 127 octet frames
  tx_255_bytes_packets          Number of transmitted 128 to 255 octet frames
  tx_511_bytes_packets          Number of transmitted 256 to 511 octet frames
  tx_1023_bytes_packets         Number of transmitted 512 to 1023 octet frames
Table 8 - Port OUT Counters
  Counter                       Description
  tx_1518_bytes_packets         Number of transmitted 1024 to 1518 octet frames
  tx_1522_bytes_packets         Number of transmitted 1519 to 1522 octet frames
  tx_1548_bytes_packets         Number of transmitted 1523 to 1548 octet frames
  tx_gt_1548_bytes_packets      Number of transmitted 1549 or greater octet frames
Table 9 - Port VLAN Priority Tagging (where <i> is in the range 0...7)
  Counter                       Description
  rx_prio_<i>_packets           Total packets successfully received with priority i
  rx_prio_<i>_bytes             Total bytes in successfully received packets with priority i
  rx_novlan_packets             Total packets successfully received with no VLAN p
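To make the size ranges of the transmit histogram counters concrete, the following sketch maps a frame length in octets to the counter that would account for it. The counter spellings with underscores are an assumption about how the names above are written, and the helper itself is purely illustrative.

```python
import bisect

# Inclusive upper bound of each size bucket, in octets.
_UPPER_BOUNDS = [64, 127, 255, 511, 1023, 1518, 1522, 1548]
_COUNTERS = [
    "tx_lt_64_bytes_packets",
    "tx_127_bytes_packets",
    "tx_255_bytes_packets",
    "tx_511_bytes_packets",
    "tx_1023_bytes_packets",
    "tx_1518_bytes_packets",
    "tx_1522_bytes_packets",
    "tx_1548_bytes_packets",
    "tx_gt_1548_bytes_packets",
]


def tx_size_counter(frame_octets: int) -> str:
    """Return the name of the size counter covering `frame_octets`."""
    if frame_octets < 0:
        raise ValueError("frame length cannot be negative")
    # bisect_left finds the first bucket whose upper bound is >= the length.
    return _COUNTERS[bisect.bisect_left(_UPPER_BOUNDS, frame_octets)]


if __name__ == "__main__":
    for size in (64, 65, 1500, 1522, 9000):
        print(size, tx_size_counter(size))
```

A standard 1500-octet frame falls in the 1024 to 1518 bucket, while jumbo frames are all counted by the last counter.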
273. sbin/vendor_pre_uninstall.sh
  Removing OFED Software installations
  Running /bin/rpm -e --allmatches kernel-ib kernel-ib-devel libibverbs libibverbs-devel libibverbs-devel-static libibverbs-utils libmlx4 libmlx4-devel libibcm libibcm-devel libibumad libibumad-devel libibumad-static libibmad libibmad-devel libibmad-static librdmacm librdmacm-utils librdmacm-devel ibacm opensm-libs opensm-devel perftest compat-dapl compat-dapl-devel dapl dapl-devel dapl-devel-static dapl-utils srptools infiniband-diags-guest ofed-scripts opensm-devel
  warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
  Running /tmp/2818-ofed_vendor_post_uninstall.sh
Step 3. Restart the server.
4.13.6 Burning Firmware with SR-IOV
The following procedure explains how to create a binary image with SR-IOV enabled that has 63 VFs. However, the number of VFs varies according to the working mode requirements.
To burn the firmware:
Step 1. Verify you have MFT installed in your machine.
Step 2. Enter the firmware directory according to the HCA type (e.g. ConnectX-3). The path is: /mlnx_ofed/firmware/<device>/<FW version>
Step 3. Find the ini file that contains the HCA's PSID. Run:
  ibv_devinfo | grep board_id
  board_id: MT_1090120019
If such an ini file cannot be found in the firmware directory, you may want to dump the configuration file using mstflint. Run:
  mstflint -dev <
274. se all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will continue to remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored.
In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath.
To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.
8.5.7.6 Torus-2QoS Configuration File Syntax
The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignor
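The lexical rules for the configuration file (blank lines skipped, comment lines skipped, tokens separated by whitespace) can be sketched as a tiny tokenizer. It assumes "#" is the comment character, and the sample keywords and values in the usage below are hypothetical; it does not interpret keywords or discard trailing tokens the way the routing engine would.

```python
def tokenize_config(text: str) -> list:
    """Split configuration text into per-line token lists.

    Blank lines and lines whose first non-whitespace character is '#'
    are dropped; every other line is split on runs of whitespace.
    """
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        rows.append(stripped.split())
    return rows


if __name__ == "__main__":
    sample = """
    # hypothetical example values, for illustration only
    torus 4 4 4

    x_dateline -1
    """
    for tokens in tokenize_config(sample):
        print(tokens)
```

Each returned row starts with the keyword token, so a caller can dispatch on `tokens[0]` and read only as many following tokens as that keyword expects.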
275. serving Your Performance Settings after a Reboot
To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:
  <sysctl name1>=<value1>
  <sysctl name2>=<value2>
  <sysctl name3>=<value3>
  <sysctl name4>=<value4>
For example, "Tuning the Network Adapter for Improved IPv4 Traffic Performance" on page 129 lists the following setting to disable the TCP timestamps option:
  sysctl -w net.ipv4.tcp_timestamps=0
In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:
  net.ipv4.tcp_timestamps=0
7.2.4 Tuning Power Management
Check that the output CPU frequency for each core is equal to the maximum supported, and that all core frequencies are consistent.
  Check the maximum supported CPU frequency:
    cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
  Check that core frequencies are consistent:
    cat /proc/cpuinfo | grep "cpu MHz"
  Check that the output frequencies are the same as the maximum supported. If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in section "Recommended BIOS Settings" on page 126 to verify that power state is disabled.
  Check the current CPU frequency to see whether it is configured to the maximum available frequency:
    cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
7.2.4.1 Setti
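The consistency check on core frequencies can also be scripted. The sketch below parses text in the format of /proc/cpuinfo and reports whether all "cpu MHz" values agree within a tolerance; the function names and the 1 MHz tolerance are illustrative assumptions, and the sample text is made up.

```python
def cpu_mhz_values(cpuinfo_text: str) -> list:
    """Extract the 'cpu MHz' value of every core from cpuinfo-style text."""
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            values.append(float(value))
    return values


def frequencies_consistent(values: list, tolerance_mhz: float = 1.0) -> bool:
    """True when at least one value exists and all lie within the tolerance."""
    return bool(values) and max(values) - min(values) <= tolerance_mhz


if __name__ == "__main__":
    # On a live system the text would come from open("/proc/cpuinfo").read().
    sample = (
        "processor\t: 0\ncpu MHz\t\t: 2600.000\n"
        "processor\t: 1\ncpu MHz\t\t: 1200.000\n"
    )
    mhz = cpu_mhz_values(sample)
    print(mhz, frequencies_consistent(mhz))
```

A result of False, as in the sample, suggests some cores are being scaled down, which is the situation the BIOS and power management checks above are meant to catch.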
276. shell (e.g., logout and login again). Other packages, such as environment modules, provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality.

The MPI selector functionality can be invoked in one of two ways:
1. The mpi-selector-menu command. This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended for all users.
2. The mpi-selector command. This command is a CLI equivalent of mpi-selector-menu, allowing for the same functionality but without the interactive menus and prompts. It is suitable for scripting.

5.2.4 Compiling MPI Applications

Compiling MVAPICH Applications
Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html. To review the default configuration of the installation, check the default configuration file:
  /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf

Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps.

5.3 MellanoX Messaging

MellanoX Messaging (MXM) provides enhancements to parallel communication libraries by fully utilizin
277. sk3 3" > /proc/scsi_tgt/groups/Default/devices
modprobe ib_srpt
echo "add mgmt" > /proc/scsi_tgt/trace_level
echo "add mgmt_dbg" > /proc/scsi_tgt/trace_level
echo "add out_of_mem" > /proc/scsi_tgt/trace_level
# ----- End srpt.sh -----

B.3 How to Unload/Shutdown

1. Unload ib_srpt:
   modprobe -r ib_srpt
2. Unload scst and its dev_handlers first:
   modprobe -r scst_vdisk scst
3. Unload ofed:
   /etc/rc.d/openibd stop

Appendix C: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
  options mlx4_core parameter=<value>
and/or
  options mlx4_ib parameter=<value>
and/or
  options mlx4_en parameter=<value>
The following sections list the available mlx4 parameters.

C.1 mlx4_ib Parameters

sm_guid_assign:     Enable SM alias GUID assignment if sm_guid_assign > 0 (Default: 1) (int)
dev_assign_str (1): Map device function numbers to IB device numbers. Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for IB device numbers (e.g. 1). Max supported devices: 32 (string)

1. In the current version, this parameter is using decimal number to describe the InfiniBand device and not hexadecimal number as it was in previous versions in order t
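The "options <module> parameter=<value>" lines described in Appendix C can be generated with a small helper such as the one below. It is a sketch only: the function name is hypothetical, the module names assume the underscore spelling (mlx4_core, mlx4_ib, mlx4_en), and the value used in the example is illustrative rather than a recommendation.

```python
def modprobe_option_lines(settings):
    """Render {module: {parameter: value}} as modprobe 'options' lines."""
    lines = []
    for module, params in settings.items():
        rendered = " ".join(f"{name}={value}" for name, value in params.items())
        lines.append(f"options {module} {rendered}")
    return lines


if __name__ == "__main__":
    # Illustrative value only: disables SM alias GUID assignment.
    for line in modprobe_option_lines({"mlx4_ib": {"sm_guid_assign": 0}}):
        print(line)
```

The resulting lines are meant to be appended to /etc/modprobe.conf; the driver must be reloaded for new values to take effect.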
278. sociated with the same counter set. If multiple QPs share the same counter, its value represents the cumulative total.

- ConnectX-3 supports 127 different counters, which are allocated as follows:
  - Counters reserved for the PF: 2 counters for each port
  - 2 counters reserved for the VF: 1 counter for each port
  - All other counters, if they exist, are allocated on demand
- RoCE counters are available only through sysfs, located under:
  /sys/class/infiniband/mlx4_*/ports/*/counters/
  /sys/class/infiniband/mlx4_*/ports/*/counters_ext/
- The Physical Function can also read the Virtual Functions' port counters through sysfs, located under:
  /sys/class/net/eth*/vf*_statistics/

To display the network device Ethernet statistics, you can run:
  ethtool -S <devname>

Table 7 - Port IN Counters

Counter                 Description
rx_packets              Total packets successfully received.
rx_bytes                Total bytes in successfully received packets.
rx_multicast_packets    Total multicast packets successfully received.
rx_broadcast_packets    Total broadcast packets successfully received.
rx_errors               Number of receive packets that contained errors preventing them from being deliverable to a higher-layer protocol.
rx_dropped              Number of receive packets which were chosen to be discarded even though no errors had been detected to prevent their being deliverable to a higher-layer protocol.
rx_length_errors        Number of received frames that were dropped due to an error in frame length
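Counter values such as those in Table 7 can be collected programmatically from the text printed by ethtool -S. The sketch below parses "name: value" lines into a dictionary; the sample text and its numbers are made up for illustration, and the "NIC statistics:" header line is an assumption about the usual ethtool output layout.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' counter lines into a {name: int} dictionary."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip header lines and anything without an integer value.
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


# Made-up sample in the assumed layout of `ethtool -S <devname>`.
SAMPLE_OUTPUT = """NIC statistics:
     rx_packets: 1200
     rx_bytes: 98304
     rx_errors: 0
     rx_dropped: 3
"""

if __name__ == "__main__":
    counters = parse_ethtool_stats(SAMPLE_OUTPUT)
    print(counters["rx_dropped"], "packets dropped on receive")
```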
279. switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:

  qos_ca_  - QoS configuration parameters set for CAs
  qos_rtr_ - parameters set for routers
  qos_sw0_ - parameters set for switches' port 0
  qos_swe_ - parameters set for switches' external ports

Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):

  qos_ca_max_vls 15
  qos_ca_high_limit 0
  qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
  qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
  qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
  qos_swe_max_vls 15
  qos_swe_high_limit 0
  qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
  qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
  qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15 or
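The VL arbitration rules above (VL in 0-14, weight in 0-255, one credit per 64 bytes, weight 0 skipped) can be checked with a short validator. This sketch assumes the table is written as comma-separated "VL:weight" pairs; the function names are hypothetical and the code is not part of OpenSM.

```python
def parse_vlarb(table):
    """Parse 'VL:weight,VL:weight,...' into a list of (vl, weight) pairs."""
    entries = []
    for item in table.split(","):
        vl_text, weight_text = item.split(":")
        vl, weight = int(vl_text), int(weight_text)
        if not 0 <= vl <= 14:
            raise ValueError(f"VL {vl} outside 0-14")
        if not 0 <= weight <= 255:
            raise ValueError(f"weight {weight} outside 0-255")
        entries.append((vl, weight))
    return entries


def active_entries(entries):
    """Entries with weight 0 are skipped during arbitration."""
    return [(vl, weight) for vl, weight in entries if weight > 0]


def credit_bytes(weight):
    """Each weight unit is one credit of 64 bytes."""
    return weight * 64


if __name__ == "__main__":
    table = parse_vlarb("0:4,1:0,2:0")
    for vl, weight in active_entries(table):
        print(f"VL{vl}: up to {credit_bytes(weight)} bytes per turn")
```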
280. switches.
Note: LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.

2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.

3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.

Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available.

In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies; it is topology agnostic and fares well in the face of faults.

It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path.

The algorithm was developed by
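The layer (SL) assignment in step 2 can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the opensm implementation: a route is a list of switch names, a channel is a directed link between consecutive switches, and a dependency is an ordered pair of consecutive channels. A route joins the first layer whose channel dependency graph stays acyclic; otherwise a new layer is opened. The balancing in step 3 is not modelled.

```python
from collections import defaultdict


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph of (a, b) edges."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    state = {}  # node -> "open" while on the DFS stack, "done" afterwards

    def visit(node):
        state[node] = "open"
        for nxt in graph.get(node, ()):
            if state.get(nxt) == "open":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in list(graph))


def route_dependencies(route):
    """Channel dependencies of one route: pairs of consecutive channels."""
    channels = list(zip(route, route[1:]))
    return set(zip(channels, channels[1:]))


def assign_layers(routes):
    """Return, for each route, the index of the layer it was placed in."""
    layers = []      # one set of dependency edges per layer
    assignment = []
    for route in routes:
        deps = route_dependencies(route)
        for index, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps  # in-place update of the layer's graph
                assignment.append(index)
                break
        else:
            # No existing layer can take the route: open a new one.
            layers.append(set(deps))
            assignment.append(len(layers) - 1)
    return assignment


if __name__ == "__main__":
    # Four two-hop routes around a hypothetical ring A-B-C-D.
    ring_routes = [["A", "B", "C"], ["B", "C", "D"],
                   ["C", "D", "A"], ["D", "A", "B"]]
    print(assign_layers(ring_routes))
```

In the ring example the first three routes share a layer, while the fourth would close a dependency cycle and is therefore pushed to a second layer.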
281. t Driver to initrd (Linux)

Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1, on page 57, and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.

Adding the Ethernet Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it:
  host1$ mkdir /tmp/initrd_en
  host1$ cd /tmp/initrd_en
Step 3. Normally, the initrd image is zipped. Extract it using the following command:
  host1$ gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_en.
Step 4. Create a directory for the ConnectX EN modules and copy them:
  host1$ mkdir -p /tmp/initrd_en/lib/modules/mlnx_en
  host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
  host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
  host1$ cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib
282. tagram mode, up to 64k in Connected mode
- Uses any ConnectX IB ports (one or two)
- Inserts IP/UDP/TCP checksum on outgoing packets
- Calculates checksum on received packets
- Supports net device TSO through ConnectX LSO capability to defragment large datagrams to MTU quantas
- Dual operation mode: datagram and connected
- Large MTU support through connected mode

IPoIB also supports the following software-based enhancements:
- Giant Receive Offload
- NAPI
- Ethtool support

IPoIB Mode Setting

IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Datagram mode, except for the Connect-IB adapter card, which uses IPoIB with Connected mode as default.

For better scalability and performance, we recommend using the Datagram mode. However, the mode can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting SET_IPOIB_CM=yes.

The SET_IPOIB_CM parameter is set to "auto" by default, to enable the Connected mode for the Connect-IB card and Datagram mode for all other ConnectX cards.

After changing the mode, you need to restart the driver by running:
  /etc/init.d/openibd restart
To check the current mode used for outgoing connections, enter:
  cat /sys/class/net/ib<n>/mode

4.3.3 IPoIB Configuration

Unless you have run the installation script mlnxofedinstall with the flag -n, then IPoIB
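Editing the SET_IPOIB_CM entry of /etc/infiniband/openib.conf can be scripted. The sketch below works on the file's text rather than on the file itself, so it can be tried safely; the function name is hypothetical, the set of accepted values ("yes", "no", "auto") is an assumption beyond the "yes" and "auto" named above, and the ONBOOT line in the sample is only a placeholder for other content of the file.

```python
def set_ipoib_cm(conf_text, value):
    """Return conf_text with SET_IPOIB_CM set to the given value."""
    if value not in ("yes", "no", "auto"):  # assumed set of legal values
        raise ValueError("value must be yes, no or auto")
    lines = conf_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if line.strip().startswith("SET_IPOIB_CM="):
            lines[index] = f"SET_IPOIB_CM={value}"
            replaced = True
    if not replaced:
        # The key was absent: append it instead.
        lines.append(f"SET_IPOIB_CM={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "ONBOOT=yes\nSET_IPOIB_CM=auto\n"
    print(set_ipoib_cm(sample, "yes"), end="")
```

After writing the modified text back, the driver still has to be restarted with /etc/init.d/openibd restart for the new mode to apply.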
283. te iSCSI Target and load from it the kernel and initrd (Linux). There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.

If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver, as described in Section A.8.

A.9.1 Configuring an iSCSI Target in Linux Environment

Prerequisites

Step 1. Make sure that an iSCSI Target is installed on your server side. You can download and install an iSCSI Target from the following location:
  http://sourceforge.net/projects/iscsitarget/files/iscsitarget/
Step 2. Dedicate a partition on your iSCSI Target on which you will later install the operating system.
Step 3. Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:
  Lun 0 Path=/dev/sda5,Type=fileio
Example of an iSCSI Target iqn line:
  Target iqn.2007-08.7.3.4.10:iscsiboot
Step 4. Start your iSCSI Target. Example:
  host1# /etc/init.d/iscsitarget start

Configuring the DHCP Server to Boot From an iSCSI Target

Configure DHCP as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP". Edit your DHCP configuration file (/etc/dhcpd.conf) and ad
284. ted from the SM. These GIDs are mapped to vHCAs as follows: vHCA number x is assigned the GID (GUID) at index x of the physical GID table.

Each vHCA port presents its own virtual PKey table. The virtual PKey table presented to a VF is a mapping of selected indexes of the physical PKey table. The host admin can control which PKey indexes are mapped to which virtual indexes using a sysfs interface (see the section on page 98). The physical PKey table may contain both full and partial memberships of the same PKey, to allow different membership types in different virtual tables.

Each vHCA port has its own virtual port state. A vHCA port is up if the following conditions apply:
- The physical port is up
- The virtual GID table contains the GIDs requested by the host admin
- The SM has acknowledged the requested GIDs since the last time that the physical port went up

Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask.

To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs "iov" sub-tree has been added under the PF InfiniBand device.

4.13.7.2.1 SRIOV sysfs Administration Interfaces on the Hypervisor

Administration of GUIDs and PKeys is done via the sysfs interface in the Hypervisor (Dom0). This interface is under:
  /sys/class/infiniband/<infiniband device>/iov
Under this directory, the following subdirectories can be found:
- ports: The actual (physical) port resource tables. Po
285. tem image GUID.
Note: Port2 guid will be assigned even for a single port HCA; the HCA ignores this value.

Switch:            -guids <GUIDs>
Affected commands: burn, sg
Description:       4 GUIDs must be specified here. The specified GUIDs are assigned the following values, respectively: node, port1, port2 and system image GUID.
                   Note: Port2 guid must be specified even for a single port HCA; the HCA ignores this value. It can be set to 0x0.

Table 36 - mstflint Switches (Sheet 2 of 3)

Switch:            -mac <MAC>
Affected commands: burn, sg
Description:       MAC address base value. Two MACs are automatically assigned to the following values: <MAC> -> port1, <MAC>+1 -> port2.
                   Note: This switch is applicable only for Mellanox Technologies Ethernet products.

Switch:            -macs <MACs>
Affected commands: burn, sg
Description:       Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively.
                   Note: This switch is applicable only for Mellanox Technologies Ethernet products.

Switch:            -blank_guids
Affected commands: burn
Description:       Burn the image with blank GUIDs and MACs (where applicable). These values can be set later using the sg command (see Table 37 below).

Switch:            -clear_semaphore
Affected commands: No commands allowed
Description:       Force clear the Flash semaphore on the device. No command is allowed when this switch is used.
                   Warning: May result in system instability or Flash corruption if the device or another application is currently using the Flash.

Switch:            -i[mage]
286. the UPDN algorithm.

Use '-a <root_guid_file>' for adding an UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.

Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.

8.5.4 Fat-tree Routing Algorithm

The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling non-constant K, cases where not all leafs (CAs) are present, and any Constant Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks.

If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:
- Tree rank should be between two and eight (inclusively)
- Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all
- Switches of the same rank should have the same number of DO
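The guid list file rules (one guid per line, malformed lines discarded) can be sketched as a small reader. The accepted format here, "0x" followed by up to 16 hexadecimal digits, is an assumption about what counts as a valid line; OpenSM's own parser may accept other spellings.

```python
import re

# Assumed valid form: 0x followed by 1-16 hexadecimal digits.
GUID_PATTERN = re.compile(r"^0x[0-9a-fA-F]{1,16}$")


def read_root_guids(text):
    """Return the guids found in text, one per line, skipping bad lines."""
    guids = []
    for line in text.splitlines():
        candidate = line.strip()
        if GUID_PATTERN.match(candidate):
            guids.append(int(candidate, 16))
        # Lines with an invalid format are silently discarded.
    return guids


if __name__ == "__main__":
    sample = "0x0002c902fffff00a\nnot-a-guid\n0x000b8cffff004016\n"
    for guid in read_root_guids(sample):
        print(f"0x{guid:016x}")
```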
287. the new VLAN interface to the same bridge that the virtual machine interface is already attached to:
  brctl addif <br-name> <interface-name>
For example, to create the VLAN tag 3 with pkey 0x8003 over that port in the eIPoIB interface eth4, run:
  vconfig add eth4 3
  brctl addif br2 eth4.3

4.8.4 Setting Performance Tuning

- Use 4K MTU over OpenSM:
    Default=0xffff, ipoib, mtu=5 : ALL=full;
  For further information, please refer to Section 8.4.1, "File Format", on page 149.
- Use MTU for 4K (4092 Bytes):
  - In UD mode, the maximum MTU value is 4092 Bytes.
  - Make sure that all interfaces (including the guest interface and its virtual bridge) have the same MTU value (MTU 4092 Bytes). For further information on MTU settings, please refer to the Hypervisor User Manual.
- Tune the TCP/IP stack using sysctl (dom0/domu):
    /sbin/sysctl_perf_tuning
- Other performance tuning for the KVM environment, such as vCPU pinning and NUMA tuning, may apply. For further information, please refer to the Hypervisor User Manual.

4.9 Contiguous Pages

Contiguous Pages improves performance by allocating user memory regions over physically contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers to
288. the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem, and periodically communicate with other PEs to exchange information on which the next communication phase depends.

The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).

SHMEM routines support remote data transfer through:
- "put" operations: data transfer to a different PE
- "get" operations: data transfer from a different PE, and remote pointers, allowing direct references to data objects owned by another PE

Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object.

SHMEM libraries implement active messaging. The sending of data involves only one CPU, where the source processor puts the data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this.

5.1.1 Mellanox ScalableSHMEM

The ScalableSHMEM programming library is a one-side com
289. tination Port Info
0x0002 000  Switch portguid 0x0002c902fffff00a  MT47396 Infiniscale III Mellanox Technologies
0x0003 021  Switch portguid 0x000b8cffff004016  MT47396 Infiniscale III Mellanox Technologies
0x0006 007  Channel Adapter portguid 0x0002c90300001039  sw137 HCA-1
0x0007 021  Channel Adapter portguid 0x0002c9020025874a  sw157 HCA-1
0x0008 008  Channel Adapter portguid 0x0002c902002582cd  sw136 HCA-1
5 valid lids dumped

3. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2:
> ibroute 2 3 7
Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale III Mellanox Technologies):
Lid    Out  Destination Port Info
0x0003 021  Switch portguid 0x000b8cffff004016  MT47396 Infiniscale III Mellanox Technologies
0x0006 007  Channel Adapter portguid 0x0002c90300001039  sw137 HCA-1
0x0007 021  Channel Adapter portguid 0x0002c9020025874a  sw157 HCA-1
3 valid lids dumped

4. Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016:
> ibroute -G 0x000b8cffff004016
Unicast lids [0x0-0x8] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale III Mellanox Technologies):
Lid    Out  Destination Port Info
0x0002 023  Switch portguid 0x0002c902fffff00a  MT47396 Infiniscale III Mellanox Technologies
0x0003 000  Switch portguid 0x000b8cffff004016
290. tional Destination's directed path, LID, or GUID lid | guid>

Examples:

1. Query PortInfo by LID, with port modifier:
> smpquery portinfo 1 1
# Port info: Lid 1 port 1
Mkey:............................0x0000000000000000
GidPrefix:.......................0xfe80000000000000
Lid:.............................0x0001
SMLid:...........................0x0001
CapMask:.........................0x251086a
        IsSM
        IsTrapSupported
        IsAutomaticMigrationSupported
        IsSLMappingSupported
        IsSystemImageGUIDsupported
        IsCommunicatonManagementSupported
        IsVendorClassSupported
        IsCapabilityMaskNoticeSupported
        IsClientRegistrationSupported

2. Query SwitchInfo by GUID:
LifeTime:........................18
StateChange:.....................0
LidsPerPort:.....................0
PartEnforceCap:..................32
InboundPartEnf:..................1
FilterRawInbound:................1
EnhancedPort0:...................0

3. Query NodeInfo by direct route:
> smpquery -D nodeinfo 0
# Node info: DR path slid 65535; dlid 65535; 0
BaseVers:........................1
ClassVers:.......................1
NodeType:........................Channel Adapter
NumPorts:........................2
SystemGuid:......................0x
291. titions.conf. To change this filename, you can use opensm with the '--Pconfig' or '-P' flags.

The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed.

The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end-ports are assigned partial membership.

8.4.4 File Format

Notes: Line content following a '#' character is a comment and is ignored by the parser.

General File Format
  <Partition Definition>:<PortGUIDs list> ;

Partition Definition:
  [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where:
  PartitionName           string, will be used with logging. When omitted, an empty string will be used.
  PKey                    P_Key value for this partition. Only the low 15 bits will be used. When omitted, the P_Key will be autogenerated.
  flag                    used to indicate IPoIB capability of this partition.
  defmember=full|limited  specifies default membership for the port guid list. Default is limited.

Currently recognized flags are:
  ipoib       indicates that this partition may be used for IPoIB; as a result, an IPoIB-capable MC group will be created.
  rate=<val>  specifies rate for this IPoIB MC group (default is 3 (10GBps)).
  mtu=<val>   specifies MTU for this IPoIB MC group (default is 4 (2048)).
  sl=<val
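The partition definition syntax can be made concrete with a small parser. This is a sketch under stated assumptions, not OpenSM's parser: it assumes '#' starts a comment, ':' separates the definition from the port list, ',' separates fields and ';' ends the entry, and it handles a single-line entry only. It applies two rules from the text: only the low 15 bits of the PKey are used, and defmember defaults to limited. The function name and the sample line are illustrative.

```python
def parse_partition_definition(line):
    """Parse one '<definition>:<port list>;' partition entry."""
    line = line.split("#", 1)[0].strip().rstrip(";")
    definition, _, port_list = line.partition(":")
    fields = [field.strip() for field in definition.split(",")]
    name, _, pkey_text = fields[0].partition("=")
    pkey_text = pkey_text.strip()
    result = {
        "name": name.strip(),
        # Only the low 15 bits of the P_Key are used.
        "pkey": int(pkey_text, 0) & 0x7FFF if pkey_text else None,
        "flags": {},
        "defmember": "limited",  # documented default
        "ports": port_list.strip(),
    }
    for field in fields[1:]:
        key, _, value = field.partition("=")
        key, value = key.strip(), value.strip()
        if key == "defmember":
            result["defmember"] = value
        elif key:
            result["flags"][key] = value or True
    return result


if __name__ == "__main__":
    entry = parse_partition_definition(
        "Default=0xffff, ipoib, mtu=5 : ALL=full;")
    print(entry)
```

A PKey of None in the result corresponds to the "autogenerated" case in the text.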
292. to the list of boot devices: "MLNX FlexBoot <ver>" for a ConnectX device. The priority of this list can be modified through BIOS setup.

A.7 Operation

A.7.1 Prerequisites

- Make sure that your client is connected to the server(s).
- The FlexBoot image is already programmed on the adapter card (see Section A.3).
- For InfiniBand ports only: start the Subnet Manager as described in Section A.5.
- The DHCP server should be configured and started (see Section 4.3.3.1, "IPoIB Configuration Based on DHCP", on page 57).
- Configure and start at least one of the services iSCSI Target (see Section A.9) and/or TFTP (see Section A.6).

A.7.2 Starting Boot

Boot the client machine and enter BIOS setup to configure MLNX FlexBoot to be the first on the boot device priority list (see Section A.6).

Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.

If MLNX FlexBoot/iPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by the Subnet Manager.

In case sensing the port protocol fails, the port will be configured as an InfiniBand port.

For
293. topo file file name>] [-t|--topo_file <file>] [--out_ibnl_dir <directory>] [--screen_num_errs <num>] [--smp_window <num>] [--gmp_window <num>] [--max_hops <max-hops>] [-V|--version] [-h|--help] [-H|--deep_help]

Options

-i|--device <dev-name>       Specifies the name of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).
-p|--port <port-num>         Specifies the local device's port number used to connect to the IB fabric.
-g|--guid <GUID in hex>      Specifies the local port GUID value of the port used to connect to the IB fabric. If the GUID given is 0, then ibdiagnet displays a list of possible port GUIDs and waits for user input.
--vlr <file>                 Specifies the opensm-path-records.dump file path (src-dst to SL mapping generated by an SM plugin). ibdiagnet will use this mapping for MADs sending and credit loop check (if the -r option is selected).
-r|--routing                 Provides a report of the fabric qualities.
-u|--fat_tree                Indicates that UpDown credit loop checking should be done against automatically determined roots.
-o|--output_path <directory> Specifies the directory where the output files will be placed (default: /var/tmp/ibdiagnet2/).
--skip <stage>               Skips the execution of the given stage. Applicable skip stages: all | dup_guids | dup_node_desc | lids | links | sm | pm | nodes_in
294. uSWXeubZmbXcMrP wAIWByfH8ajwo6A5SWioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorX NCOZBe4kTnUqm63nQ2zi1qVMdL9FrCmalxIOu9 SQUAjwONevaMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrod ZlfCQ <username>@host1

Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine:
  host1$ cat id_rsa.pub | xargs ssh host2 \
    "echo >> /home/<username>/.ssh/authorized_keys2"
  <username>@host2's password:    // Enter password
  host1$
For a local machine, simply add the key to authorized_keys2:
  host1$ cat id_rsa.pub >> authorized_keys2
Step 5. Test:
  host1$ ssh host2 uname
  Linux

5.2.3 MPI Selector - Which MPI Runs

Mellanox OFED contains a simple mechanism for system administrators and end-users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details.

Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new
295. ual addresses is unexpected.

Connect-IB supports Inline-Receive on both the requestor and the responder sides. Since data is copied at the poll CQ verb, Inline-Receive on the requestor side is possible only if the user chooses IBV_SIGNAL_ALL_WR.

4.18.1 Querying Inline-Receive Capability

A user application can use the ibv_exp_query_device function to get the maximum possible Inline-Receive size. To get the size, the application needs to set the IBV_EXP_DEVICE_ATTR_INLINE_RECV_SZ bit in the ibv_exp_device_attr comp_mask.

4.18.2 Activating Inline-Receive

To activate the Inline-Receive, you need to set the required message size in the max_inl_recv field in the ibv_exp_qp_init_attr struct when calling the ibv_exp_create_qp function. The value returned by the same field is the actual Inline-Receive size applied.

Setting the message size may affect the WQE/CQE size.

4.19 Ethernet Performance Counters

Counters are used to provide information about how well an operating system, an application, a service, or a driver is performing. The counter data helps determine system bottlenecks and fine-tune the system and application performance. The operating system, network, and devices provide counter data that an application can consume to provide users with a graphical view of how well the system is performing.

The counter index is a QP attribute given in the QP context. Multiple QPs may be as
296. ueries
  -s3 - Multi-MAD (RMPP) Path Record SA queries
Without -s, stress testing is not performed.

-M, --Multicast_Mode
This option specifies the length of the Multicast test:
  OPT   Description
  -M1 - Short Multicast Flow (default) - single mode
  -M2 - Short Multicast Flow - multiple mode
  -M3 - Long Multicast Flow - single mode
  -M4 - Long Multicast Flow - multiple mode
Single mode: Osmtest is tested alone, with no other apps that interact with OpenSM MC.
Multiple mode: Could be run with other apps using MC with OpenSM.
Without -M, default flow testing is performed.

-t, --timeout
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.

-l, --log_file
This option defines the log to be the given file. By default, the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout.

-v, --verbose
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.

-V
This option sets the maximum verbosity level and forces log flushing. The -V is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.

-vf
This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows
297. umber of RDMARC buffers per QP (default: 4) (int)
log maximum number of CQs per HCA (default: 16) (int)
log maximum number of multicast groups per HCA (default: 13) (int)
log maximum number of memory protection table entries per HCA (default: 19) (int)
log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs for register all of the host memory limited to 30)) (int)
Enable Quality of Service support in the HCA (default: off) (bool)
Reset device on internal errors if non-zero (default: 1, in SRIOV mode default is 0) (int)
Threshold for using inline data (int). Default and max value is 104 bytes. Saves a PCI read operation transaction; packets smaller than the threshold size will be copied to the hw buffer directly.
Enable RSS for incoming UDP traffic (uint). On by default. Once disabled, no RSS for incoming UDP traffic will be done.
Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)

Appendix D: mlx5 Module Parameters

The mlx5_ib module supports a single parameter used to select the profile, which defines the number of resources supported. The parameter name for selecting the profile is prof_sel. The supported values for profiles are:
- 0: for medium resources, medium performance
- 1: for low resources
- 2: for high perform
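Several of the parameters above are given as a "log maximum number". Assuming the logarithm is base 2 (an assumption; the text does not state the base), the resulting limits can be computed as in the sketch below. The dictionary keys are descriptive labels taken from the parameter descriptions, not module parameter names.

```python
def resource_limit(log_value):
    """Limit implied by a base-2 'log maximum number' setting (assumed base)."""
    return 1 << log_value


# Default values quoted in the descriptions above.
DEFAULT_LOGS = {
    "RDMARC buffers per QP": 4,
    "CQs per HCA": 16,
    "multicast groups per HCA": 13,
}

if __name__ == "__main__":
    for label, log_value in DEFAULT_LOGS.items():
        print(f"{label}: log {log_value} -> {resource_limit(log_value)}")
```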
298. upported only in mlx5 and is at beta level.

4.17 PeerDirect

PeerDirect uses an API between IB CORE and peer memory clients (e.g. GPU cards) to provide access to an HCA to read/write peer memory for data buffers. As a result, it allows RDMA-based (over InfiniBand/RoCE) applications to use peer device computing power and the RDMA interconnect at the same time, without copying the data between the P2P devices. For example, PeerDirect is being used for GPUDirect RDMA.

A detailed description of that API exists under the MLNX_OFED installation; please see docs/readme_and_user_manual/PEER_MEMORY_API.txt.

4.18 Inline-Receive

When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE. Using Inline-Receive saves a PCIe read transaction, since the HCA does not need to read the scatter list; therefore it improves performance in case of short receive messages.

On poll CQ, the driver copies the received data from the WQE/CQE to the user's buffers. Therefore, apart from querying the Inline-Receive capability and activating Inline-Receive, the feature is transparent to the user application.

Note: When Inline-Receive is active, the user application must provide a valid virtual address for the receive buffers, to allow the driver to move the inline-received message to these buffers. The validity of these addresses is not checked; therefore the result of providing non-valid virt
299. use the PKey value with the most significant bit set (e.g. 0x8000 in the example above).

4.3.5 Verifying IPoIB Functionality

To verify your configuration and your IPoIB functionality, perform the following steps:

Step 1. Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:
  host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
  host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0
Step 2. Enter the ping command from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command:
  host1# ping -c 5 11.4.3.176
  PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
  64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
  64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
  64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
  64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
  64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
  --- 11.4.3.176 ping statistics ---
  5 packets transmitted, 5 received, 0% packet loss, time 3999ms
  rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2

4.3.6 Bonding IPoIB

To create an interface configuration script for the ibX and bondX interfaces, you should use the standard syntax (depending on your OS). Bonding of IPoIB interfaces is
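The ping check in Step 2 can be automated by parsing the summary lines of the ping output. The sketch below assumes the usual Linux ping summary format with its standard punctuation; the function names are hypothetical, and the sample text reuses the figures from the example above.

```python
import re


def parse_ping_summary(output):
    """Extract packet counts and average RTT (ms) from ping output text."""
    counts = re.search(r"(\d+) packets transmitted, (\d+) received", output)
    rtt = re.search(
        r"min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", output)
    if counts is None:
        raise ValueError("no ping summary found")
    return {
        "transmitted": int(counts.group(1)),
        "received": int(counts.group(2)),
        "avg_ms": float(rtt.group(2)) if rtt else None,
    }


def link_ok(summary):
    """True when at least one packet was sent and none were lost."""
    return 0 < summary["transmitted"] == summary["received"]


SAMPLE_PING = (
    "5 packets transmitted, 5 received, 0% packet loss, time 3999ms\n"
    "rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2\n"
)

if __name__ == "__main__":
    summary = parse_ping_summary(SAMPLE_PING)
    print(summary, "OK" if link_ok(summary) else "FAILED")
```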
300. [Figure: virtual machine "Add Hardware" dialog, "Physical Host Device" selection]
Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).
Step 5. If the Virtual Machine is up, reboot it; otherwise, start it.
Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
    lspci | grep Mellanox
    00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly; therefore, it is not necessary to add it.

4.13.5 Uninstalling SR-IOV Driver
To uninstall the SR-IOV driver, perform the following:
Step 1. For Hypervisors, detach all the Virtual Functions (VFs) from all the Virtual Machines (VMs), or stop the Virtual Machines that use the Virtual Functions. Please be aware that stopping the driver when there are VMs that use the VFs will cause the machine to hang.
Step 2. Run the script below. Please be aware that uninstalling the driver deletes all of the driver's files but does not unload the driver.
    [root@swl022 ~]# /usr/sbin/ofed_uninstall.sh
    This program will uninstall all OFED packages on your machine.
    Do you want to continue?[y/N]:y
    Running /usr
301. vel-static         ##########################################
Preparing...            ##########################################
libipathverbs
Preparing...            ##########################################
libipathverbs           ##########################################
Preparing...            ##########################################
libipathverbs-devel     ##########################################
Preparing...            ##########################################
libipathverbs-devel     ##########################################
Preparing...            ##########################################
libibcm                 ##########################################
Preparing...            ##########################################
libibcm                 ##########################################
Preparing...            ##########################################
libibcm-devel           ##########################################
Preparing...            ##########################################
libibcm-devel           ##########################################
Preparing...            ##########################################
libibumad               ##########################################
Preparing...            ##########################################
libibumad               ##########################################
302. w 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64

In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is defined as a high-priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low-priority VLs with different weights, while VL4 is effectively turned off.

8.6.8 Deployment Example
Figure 5 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

Figure 5: Example QoS Deployment on InfiniBand Subnet

8.7 QoS Configuration Examples
The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.

8.7.1 Typical HPC Example: MPI and Lustre
Assignment of QoS Levels
- MPI
    - Separate from I/O load
    - Min BW of 70%
- Storage Control (Lustre MDS)
    - Low latency
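
As a rough illustration of the low-priority VL weights listed at the start of this excerpt (VL:weight pairs 0:0, 1:64, 2:128, 3:192, 4:0, 5:64, 6:64, 7:64), the sketch below parses such a list and computes each VL's relative share of the total weight. It assumes that, under weighted arbitration, a VL's share of low-priority bandwidth is roughly proportional to its weight; the function names are hypothetical and are not part of OpenSM.

```python
from fractions import Fraction


def parse_vlarb(table: str) -> dict:
    """Parse a 'VL:weight,VL:weight,...' string into {vl: weight}."""
    entries = {}
    for item in table.split(","):
        vl, weight = item.split(":")
        entries[int(vl)] = int(weight)
    return entries


def weight_shares(entries: dict) -> dict:
    """Return each VL's fraction of the total weight (assumed proportional share)."""
    total = sum(entries.values())
    if total == 0:
        raise ValueError("all weights are zero")
    return {vl: Fraction(weight, total) for vl, weight in entries.items()}


if __name__ == "__main__":
    low = parse_vlarb("0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64")
    for vl, share in sorted(weight_shares(low).items()):
        state = "off" if share == 0 else f"{float(share):.1%}"
        print(f"VL{vl}: weight {low[vl]:3d} -> {state}")
```

A weight of 0, as for VL4 here, yields a zero share, which matches the statement that VL4 is effectively turned off.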
303. want to check for current state of the fabric. A directory [...] is also created by this option, and [...] files required to load this topology. [...] you will need to set the environment [...] IBDM_IBNL_PATH to that directory [...] located in /tmp or in the output [...] by the -o flag. [...] data from the given .db file and skip [...] stage. [...] the checks require actual subnet discovery [...] should not run when load_db is specified: duplicated/zero GUIDs, link state, SMs. [...] page information. Prints the version of the tool. Prints the tool's environment variables and their values.

Table 27: ibdiagnet (of ibutils) Output Files

Output File        Description
ibdiagnet.log      A dump of all the application reports generated according to the provided flags
ibdiagnet.lst      List of all the nodes, ports and links in the fabric
ibdiagnet.fdbs     A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs   A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks    In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm       List of all the SMs (state and priority) in the fabric
ibdiagnet.pm       A dump of the PM counter values of the fabric links
ibdiagnet.pkey     A dump of the existing partitions and their member host ports
ibdiagnet.mcg      A dump of t
304. when running f t the trap 64/65 flow. Default: 10 sec.

-d, --debug
    This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:
    OPT   Description
    -d0   Ignore other SM nodes
    -d1   Force single-threaded dispatching
    -d2   Force log flushing after each log message
    -d3   Disable multicast support
-m, --max_lid
    This option specifies the maximal LID number to be searched for during inventory file build. Default: 100.
-g
    This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-r
    This option displays a menu of possible local port GUID values with which osmtest could bind.
-i, --inventory
    This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.
-s, --stress
    This option runs the specified stress test instead of the normal test suite. Stress test options are as follows:
    OPT   Description
    -s1   Single-MAD response SA queries
    -s2   Multi-MAD (RMPP) response SA q
305. which OpenSM runs) containing lines:
    Default=0x7fff,ipoib : ALL=full;
    Pkey1=0x3000,ipoib : ALL=full;
    Pkey3=0x3030,ipoib : ALL=full;
This will cause OpenSM to configure the physical Port PKey tables on all physical ports on the network as follows:

    pkey idx   pkey value
    0          0xFFFF
    1          0xB000
    2          0xB030

(The most significant bit indicates whether a PKey is a full PKey.)
The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.

Step 2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
Step a. Check the PCI ID for the Physical Function and the Virtual Functions:
    lspci | grep Mel
Step b. Assuming that on Host1 the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0", on Host1 do the following:
    cd /sys/class/infiniband/mlx4_0/iov
    0000:02:00.0 0000:02:00.1 0000:02:00.2 ... (1)
(1) 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical function. 0000:02:00.X contain the virtual-to-physical mapping tables for the virtual functions.
Do not touch the Dom0 mapping table (under <nnnn>:<nn>:00.0). Modify only tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that vm1 uses VF 0000:02:00.1 and vm2 uses VF 0000:02:00.2.
Step c. Configure the virtual-to-physical PKey mapping for the VMs:
    echo 0 > 0000:02:00.1/ports/1/pkey_idx/1
    echo 1 > 0000:02:00.1/p
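
The relationship between the configured PKey values and the values that appear in the port PKey table (for example, 0x3000 becoming 0xB000) can be illustrated with a short sketch. It relies only on what the text states, namely that the most significant bit of the 16-bit PKey marks full membership; the function names are hypothetical and are not part of OpenSM or MLNX_OFED.

```python
FULL_MEMBER_BIT = 0x8000  # most significant bit of a 16-bit PKey


def full_member_pkey(pkey: int) -> int:
    """Return the 16-bit PKey value with the full-membership bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKey must fit in 16 bits")
    return pkey | FULL_MEMBER_BIT


def is_full_member(pkey: int) -> bool:
    """True if the most significant bit of the PKey is set."""
    return bool(pkey & FULL_MEMBER_BIT)


if __name__ == "__main__":
    # Partition values taken from the example lines in the text.
    for name, value in [("Default", 0x7FFF), ("Pkey1", 0x3000), ("Pkey3", 0x3030)]:
        print(f"{name}: 0x{value:04X} -> 0x{full_member_pkey(value):04X}")
```

Run on the three partitions from the example, the sketch reproduces the table values 0xFFFF, 0xB000 and 0xB030.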
306. work and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).

8.9.2 Running OpenSM with Congestion Control Manager
Congestion Control (CC) Manager can be enabled/disabled through the SM options file. To do so, perform the following:
1. Create the file. Run:
    opensm -c <options-file-name>
2. Find the "event_plugin_name" option in the file, and add "ccmgr" to it:
    # Event plugin name(s)
    event_plugin_name ccmgr
3. Run the SM with the new options file:
    opensm -F <options-file-name>

Once Congestion Control is enabled on the fabric nodes, to completely disable Congestion Control you will need to actively turn it off. Running the SM without the CC Manager is not sufficient, as the hardware still continues to function in accordance with the previous CC configuration.
For further information on how to turn off CC, please refer to Section 8.9.3, "Configuring Congestion Control Manager", on page 183.

8.9.3 Configuring Congestion Control Manager
Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and CC Manager behavior by modifying some of the options. To do so, perform the following:
1.
307. 4.1.2.1.1 SRP Module Parameters
When loading the SRP module, the following parameters can be set (viewable by the "modinfo ib_srp" command):

cmd_sg_entries
    Default number of gather/scatter entries in the SRP command (default is 12, max 255).
allow_ext_sg
    Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false).
topspin_workarounds
    Enable workarounds for Topspin/Cisco SRP target bugs.
reconnect_delay
    Time between successive reconnect attempts of the SRP initiator to a disconnected target until the dev_loss_tmo timer expires (if enabled); after that, the SCSI target will be removed.
fast_io_fail_tmo
    Number of seconds between the observation of a transport layer error and failing all I/O. Increasing this timeout allows more tolerance to transport errors; however, doing so increases the total failover time in case of serious transport failure.
    Note: the fast_io_fail_tmo value must be smaller than the value of reconnect_delay.
dev_loss_tmo
    Maximum number of seconds that the SRP transport should insulate transport layer errors. After this time has been exceeded, the SCSI target is removed. Normally it is advised to set this to -1 (disabled), which will never remove the scsi_host. In deployments where different SRP targets are connected and disconnected frequently, it may be required to enable this timeout in
308. y be superseded with a prefix.
The value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID, all separated by colons and represented in hexadecimal digits.

Extracting the Port GUID - Method I
To obtain the port GUID:
Step 1. Start mst.
    host1# mst start
    host1# mst status
The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
Step 2. Obtain the Port GUID using the device name. The device name will be of the form /dev/mst/mt<dev_id>_pci{_cr0|conf0}.
    flint -d <MST_DEVICE_NAME> q
Example with ConnectX-2 QDR device:
    Image type:   ConnectX
    FW Version:   2.9.1000
    Rom Info:     type=PXE version=3.4.142 devid=26428 proto=VPI
    Device ID:    26428
    Description:  Node             Port1            Port2            Sys image
    GUIDs:        0002c9030005cffa 0002c9030005cffb 0002c9030005cffc 0002c9030005cffd
    MACs:                          0002c905cffa     0002c905cffb
    Board ID:     (MT_0DD0110009)
    VSD:
    PSID:         MT_0DD0110009
Assuming that FlexBoot is connected via Port 1, then the Port GUID is 00:02:c9:03:00:00:10:39.

Extracting the Port GUID - Method II
An alternative method for obtaining the port GUID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 8 bytes can be captured from the boot session as shown in the figure bel
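
The composition of the client identifier described in this excerpt (the fixed prefix followed by the 8-byte port GUID, colon-separated hexadecimal) can be sketched as follows. This is only an illustration of the string format; the helper names are hypothetical, and the example GUID is the Port GUID value quoted in the text.

```python
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"  # prefix given in the text


def normalize_guid(guid: str) -> str:
    """Convert a GUID such as '0x0002c90300001039' to '00:02:c9:03:00:00:10:39'."""
    text = guid.strip().lower()
    if text.startswith("0x"):
        text = text[2:]
    digits = text.replace(":", "")
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not an 8-byte hexadecimal GUID: {guid!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


def client_identifier(port_guid: str) -> str:
    """Join the fixed prefix and the colon-separated port GUID."""
    return f"{CLIENT_ID_PREFIX}:{normalize_guid(port_guid)}"


if __name__ == "__main__":
    print(client_identifier("00:02:c9:03:00:00:10:39"))
```

The helper accepts the GUID either as a plain 16-digit hexadecimal string, as flint prints it, or already colon-separated.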
309. y be used several times for higher debug levels (-ddd or -d -d -d)

Flag              Optional/Mandatory   Description
-e(rr_show)       Optional             Show send and receive errors (timeouts and others)
-v(erbose)        Optional             Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-D(irect)         Optional             Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)
-G(uid)           Optional             Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
-s <smlid>        Optional             Use <smlid> as the target LID for SM/SA queries
-V(ersion)        Optional             Show version info
-C <ca_name>      Optional             Use the specified channel adapter or router
-P <ca_port>      Optional             Use the specified port

Table 33: smpquery Flags and Options

Flag              Optional/Mandatory   Description
-t <timeout_ms>   Optional             Override the default timeout for the solicited [...] (msec)
<op>              Mandatory            Supported operations:
                                           nodeinfo <addr>
                                           nodedesc <addr>
                                           portinfo <addr> [<portnum>]
                                           switchinfo <addr>
                                           pkeys <addr> [<portnum>]
                                           sl2vl <addr> [<portnum>]
                                           vlarb <addr> [<portnum>]
                                           guids <addr>
                                           mepi <addr> [<portnum>]
destdr path       Op