        Mellanox OFED Linux User's Manual
         Contents
2.
    4.7 Atomic Operations
    4.7.1 Enhanced Atomic Operations
    4.8 Ethernet Tunneling Over IPoIB Driver (eIPoIB)
    4.8.1 Enabling the eIPoIB
    4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver
    4.8.3 VLAN Configuration Over an eIPoIB
    4.8.4 Setting Performance Tuning
    4.9 Contiguous Pages
    4.10 Shared Memory Region
    4.11 XRC - eXtended Reliable Connected Transport Service for InfiniBand
    4.12 Flow Steering
    4.12.1 Enable/Disable Flow Steering
    4.12.2 Flow Domains and
    4.13 Single Root IO Virtualization (SR-IOV)
    4.13.1 System Requirements
    4.13.2 Setting Up SR-IOV
    4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup
    4.13.4 Assigning a Virtual Function to a Virtual Machine
    4.13.5 Uninstal
3.
    Table 21: Congestion Control Manager CA Options File
    Table 22: Congestion Control Manager CC MGR Options File
    Table 23: ibdiagnet (of ibutils2) Output Files
    Table 24: ibdiagnet (of ibutils) Output Files
    Table 25: ibdiagpath Output Files
    Table 26: ibv_devinfo Flags and Options
    Table 27: ibstatus Flags and Options
    Table 28: ibportstate Flags and Options
    Table 29: ibportstate Flags and Options
    Table 30: smpquery Flags and Options
    Table 31: perfquery Flags and Options
    Table 32: ibcheckerrs Flags and Options
    Table 33: mstflint Switches
    Table 34: mstflint Commands

Document Revision History

Table 1 - Document Revision History

    Release      Date            Description
    2.0-3.0.0    October 2013    Updated the following sections: Appendix E, Lu
4.
    Category    BIOS Option             Values
    Memory      Memory speed            Max performance
                Memory channel mode     Independent
                Node Interleaving       Disabled / NUMA
                Channel Interleaving    Enabled
                Thermal Mode            Performance

7.2 Performance Tuning for Linux

You can use the Linux sysctl command to modify the default system network parameters set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results depend significantly on CPU and chipset efficiency.

7.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance

The following changes are recommended for improving IPv4 traffic performance:

•   Disable the TCP timestamps option for better CPU utilization:
        sysctl -w net.ipv4.tcp_timestamps=0

•   Enable the TCP selective acks option for better throughput:
        sysctl -w net.ipv4.tcp_sack=1

•   Increase the maximum length of processor input queues:
        sysctl -w net.core.netdev_max_backlog=250000

•   Increase the TCP maximum and default buffer sizes using setsockopt():
        sysctl -w net.core.rmem_max=4194304
        sysctl -w net.core.wmem_max=4194304
        sysctl -w net.core.rmem_default=4194304
        sysctl -w net.core.wmem_default=4194304
        sysctl -w net.core.optmem_max=4194304

•   Increase memory thresholds to prevent packet dropping:
        sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
        sysctl -w net.ipv4.tcp_wmem="4096 65536
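The IPv4 settings above can also be kept as a configuration fragment rather than typed one at a time. The following is a minimal sketch, not an MLNX_OFED tool: the function name print_ipv4_tuning is hypothetical, and the sketch covers only the settings whose values are given in full above (net.ipv4.tcp_wmem is left out because its last value is not shown).

```shell
#!/bin/sh
# Sketch only (assumption: not part of MLNX_OFED).
# Prints the IPv4 tuning values listed above in "key = value" form,
# which is the format read by "sysctl -p <file>".
print_ipv4_tuning() {
    cat <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.core.rmem_default = 4194304
net.core.wmem_default = 4194304
net.core.optmem_max = 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
EOF
}
```

As root, the output could be redirected to a file of your choosing and loaded with sysctl -p followed by that file name. As noted above, results vary between systems, so it is worth measuring before and after applying the values.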
5.
    [Installation progress output. Packages shown: dapl-devel-static, dapl-utils, perftest, mstflint]
6.
    B.1 Prerequisites and Installation
    B.2 How to Run
    B.3 How to Unload/Shutdown
    Appendix C: mlx4 Module Parameters
    C.1 mlx4_ib Parameters
    C.2 mlx4_core Parameters
    C.3 mlx4_en Parameters
    Appendix D: mlx5 Module Parameters
    Appendix E: Lustre Compilation over MLNX_OFED

List of Figures

    Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards
    Figure 2: Consolidation Over InfiniBand
    Figure 3: An Example of a Virtual Network
    Figure 4: QoS Manager
    Figure 5: Example QoS Deployment on InfiniBand Subnet

List of Tables

    Table 1: Document Revision History
    Table 2: Abbreviations and Acronyms
    Table 3: Glossary
7.
    Possible Value(1)    Description
    ANON                 Use current pages ANON small ones (default value)
    HUGE                 Force huge pages
    CONTIG               Force contiguous pages
    PREFER_CONTIG        Try contiguous, fall back to ANON small pages
    PREFER_HUGE          Try huge, fall back to ANON small pages
    ALL                  Try huge, fall back to contiguous if failed, fall back to ANON small pages

    1. Values are NOT case sensitive.

Usage:

The application calls the ibv_reg_mr API with the IBV_ACCESS_ALLOCATE_MR bit turned on and the input address set to NULL. Upon success, the address field of the struct ibv_mr holds the address of the allocated memory block. This block is freed implicitly when ibv_dereg_mr() is called.

The following environment variables can be used to control error cases / contiguity:

Table 4 - Parameters Used to Control Error Cases / Contiguity

    Parameter                       Description
    MLX_MR_ALLOC_TYPE               Configures the allocator type.
                                    • ALL (Default) - Uses all possible allocators and selects the most efficient one.
                                    • ANON - Enables the usage of anonymous pages and disables the allocator.
                                    • CONTIG - Forces the usage of the contiguous pages allocator. If contiguous pages are not available, the allocation fails.
    MLX_MR_MAX_LOG2_CONTIG_BSIZE    Sets the maximum contiguous block size order.
                                    • Values: 12-23
                                    • Default: 23
    MLX_MR_MIN_LOG2_CONTIG_BSIZE    Sets the minimum contiguous block size order.
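Because these parameters are plain environment variables, they can be set in the shell that launches the application. The sketch below is illustrative only: the helper names valid_mr_alloc_type and valid_contig_order are hypothetical, the accepted values are the ones listed in Table 4, and the chosen order of 20 is an arbitrary example inside the documented 12-23 range.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of MLNX_OFED.

# Returns 0 if $1 is one of the MLX_MR_ALLOC_TYPE values listed in Table 4.
valid_mr_alloc_type() {
    case "$1" in
        ALL|ANON|CONTIG) return 0 ;;
        *) return 1 ;;
    esac
}

# Returns 0 if $1 is a whole number within the documented range 12..23.
valid_contig_order() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 12 ] && [ "$1" -le 23 ]
}

# Example: force the contiguous pages allocator with a maximum block
# size order of 20 (an arbitrary value chosen for illustration).
alloc_type=CONTIG
max_order=20
if valid_mr_alloc_type "$alloc_type" && valid_contig_order "$max_order"; then
    export MLX_MR_ALLOC_TYPE="$alloc_type"
    export MLX_MR_MAX_LOG2_CONTIG_BSIZE="$max_order"
fi
```

Validating before exporting is a design choice of the sketch: with CONTIG the allocation fails when contiguous pages are unavailable, so it helps to rule out a mistyped value as the cause first.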
8.
    5.0 Gbps

    > ibportstate -C mthca0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:.......................Down
    PhysLinkState:...................Polling
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps
    LinkSpeedEnabled:................2.5 Gbps
    LinkSpeedActive:.................2.5 Gbps

3. Change the speed of a port.

    First query for current configuration:
    > ibportstate -C mlx4_0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
    LinkSpeedActive:.................5.0 Gbps

    Now change the enabled link speed:
    > ibportstate -C mlx4_0 -D 0 1 speed 2
    ibportstate -C mlx4_0 -D 0 1 speed 2
    Initial PortInfo:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkSpeedEnabled:................2.5 Gbps

    After PortInfo set:
    # Port info: DR path slid 65535; dlid 65535; 0 port 1
    LinkSpeedEnabled:................5.0 Gbps (IBA extension)

    Show the new configuration:
    > ibportstate -C mlx4_0 -D 0 1
    PortInfo:
    # Port info: DR path slid 65535; dlid
9.
    5.3.3 Tuning MXM Settings
    5.3.4 Configuring Multi-Rail Support
    5.3.5 Configuring MXM over the Ethernet Fabric
    5.4 Fabric Collective Accelerator
    5.5 ScalableUPC
    5.5.1 Installing ScalableUPC
    5.5.2 Runtime Parameters
    5.5.3 Various Executable Examples
    Chapter 6: Working With VPI
    6.1 Port Type
    6.2 Auto Sensing
    6.2.1 Enabling Auto Sensing
    Chapter 7: Performance
    7.1 General System
    7.1.1 PCI Express (PCIe) Capabilities
    7.1.2 Memory Configuration
    7.1.3 Recommended BIOS Settings
    7.2 Performance Tuning for Linux
    7.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
    7.2.2 Tuning the Network Adapter for Improved IPv6 Traffic
10.
    --no_part_enforce, -N (DEPRECATED)
        This option disables partition enforcement on switch external ports.

    --part_enforce [both, in, out, off]
        This option indicates the partition enforcement type (for switches).
        Enforcement type can be outbound only (out), inbound only (in), both or
        disabled (off). Default is both.

    --allow_both_pkeys, -W
        This option indicates whether both full and limited membership
        on the same partition can be configured in the PKeyTable.
        Default is not to allow both pkeys.

    --qos, -Q
        This option enables QoS setup.

    --qos_policy_file, -Y <QoS policy file>
        This option defines the optional QoS policy file.
        The default name is /etc/opensm/qos-policy.conf.

    --congestion_control (EXPERIMENTAL)
        This option enables congestion control configuration.

    --cc_key <key> (EXPERIMENTAL)
        This option configures the CCkey to use when configuring
        congestion control.

    --stay_on_fatal, -y
        This option will cause SM not to exit on fatal initialization
        issues: if SM discovers duplicated guids or 12x link with
        lane reversal badly configured.
        By default, the SM will exit on these errors.

    --daemon, -B
        Run in daemon mode - OpenSM will run in the background.

    --inactive, -I
        Start SM in inactive rather than normal init SM state.

    --perfmgr
        Start with PerfMgr enabled.

    --perfmgr_sweep_time_s <sec>
        PerfMgr sweep interval in seconds.

    --prefix_routes_file <path to file>
        This op
11.
    [Installation progress output. Packages shown: libibcm, libibcm-devel, libibumad, libibumad-devel, libibumad-static]
12.
    Valid port types: 1-ib, 2-eth, 3-auto (string)

    •   log maximum number of QPs per HCA (default: 19) (int)
    •   log maximum number of SRQs per HCA (default: 16) (int)
    •   log number of RDMARC buffers per QP (default: 4) (int)
    •   log maximum number of CQs per HCA (default: 16) (int)
    •   log maximum number of multicast groups per HCA (default: 13) (int)
    •   log maximum number of memory protection table entries per HCA (default: 19) (int)
    •   log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs for register all of the host memory limited to 30)) (int)
    •   Enable Quality of Service support in the HCA (default: off) (bool)
    •   Reset device on internal errors if non-zero (default 1, in SRIOV mode default is 0) (int)

    •   Threshold for using inline data (int). Default and max value is 104 bytes. Saves a PCI read operation transaction: packets smaller than the threshold size are copied to the hw buffer directly.
    •   Enable RSS for incoming UDP traffic (uint). On by default. Once disabled, no RSS for incoming UDP traffic will be done.
    •   Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
    •   Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)

Appendix D: mlx5 Module Parameters

The mlx5_ib module supports a single parameter used to select the profile which defines the number of resources supported. The parameter name for selecting the pro
13.
    Note: If sensing the port protocol fails, the port will be configured as an InfiniBand port.

For ConnectX:

    Mellanox ConnectX FlexBoot v3.3.400
    iPXE 1.0.0+ -- Open Source Network Boot Firmware

    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
      [Link:down, TX:0 TXE:0 RX:0 RXE:0]
      [Link status: The socket is not connected]
    Waiting for link-up on net0... ok

After configuring the IB/ETH port, the client attempts to connect to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.

For ConnectX (InfiniBand):

    Mellanox ConnectX FlexBoot v3.3.400
    iPXE 1.0.0+ -- Open Source Network Boot Firmware

    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
      [Link:down, TX:0 TXE:0 RX:0 RXE:0]
      [Link status: The socket is not connected]
    Waiting for link-up on net0... ok
    DHCP (net0 02:02:c9:0c:78:11)... ok
    net0: 11.3.12.2/255.255.255.0
    Next server: 11.3.12.121
    Filename: pxelinux.0
    Root path: /tftpboot/
    tftp://11.3.12.121/pxelinux.0

Next, FlexBoot attempts to boot as directed by the DHCP server.

A.8 Command Line Interface (CLI)

A.8.1 Invoking the CLI

When the boot process begins, the computer starts its Power On Self Test (POST) sequence. Shortly after completion of the POST, the user will be prompted to press CTRL-B to invoke the Mellanox FlexBoot CLI. The user has a few seconds to press CTRL-B before the message disa
14.
    /* PTP v2/UDP, Sync packet */
    HWTSTAMP_FILTER_PTP_V2_L4_SYNC
    /* PTP v2/UDP, Delay_req packet */
    HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ
    /* 802.AS1, Ethernet, any kind of event packet */
    HWTSTAMP_FILTER_PTP_V2_L2_EVENT
    /* 802.AS1, Ethernet, Sync packet */
    HWTSTAMP_FILTER_PTP_V2_L2_SYNC
    /* 802.AS1, Ethernet, Delay_req packet */
    HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ
    /* PTP v2/802.AS1, any layer, any kind of event packet */
    HWTSTAMP_FILTER_PTP_V2_EVENT
    /* PTP v2/802.AS1, any layer, Sync packet */
    HWTSTAMP_FILTER_PTP_V2_SYNC
    /* PTP v2/802.AS1, any layer, Delay_req packet */
    HWTSTAMP_FILTER_PTP_V2_DELAY_REQ

    Note: For receive side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.

4.6.2 Getting Time Stamping

Once time stamping is enabled, the time stamp is placed in the socket ancillary data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached. It can be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the original outgoing packet data including all headers prepended down to and including the link layer, the scm_timestamping control message and a sock_extended_err control message with ee
15.
    [Installation progress output. Packages shown: libcxgb4-devel, libnes, libnes-devel-static, libipathverbs, libipathverbs-devel, libibcm]
16.
    •   Applications use the RDMA API to transmit using QPs
    •   Raw Ethernet QP - Applications use the VERBs API to transmit using a Raw Ethernet QP

4.5.3 Plain Ethernet Quality of Service Mapping

Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver.

The following is the Plain Ethernet QoS mapping flow:

1.  The application sets the ToS of the socket using setsockopt(IP_TOS, value).

2.  ToS is translated into the sk_prio using a fixed translation:
        TOS 0  <=> sk_prio 0
        TOS 8  <=> sk_prio 2
        TOS 24 <=> sk_prio 4
        TOS 16 <=> sk_prio 6

3.  The Socket Priority is mapped to the UP:
    •   If the underlying device is a VLAN device, the egress map is used, controlled by the vconfig command. This is a per-VLAN mapping.
    •   If the underlying device is not a VLAN device, the tc command is used. In this case, even though the tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as an sk_prio to UP mapping.
        Mapping the sk_prio to the UP is done by using tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7

4.  The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.

    Note: Socket applications can use setsockopt(SK_PRIO, value) to directly set the sk_prio of the socket. In this case the fixed ToS to sk_prio mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS.
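The fixed ToS to sk_prio translation in step 2 can be written down as a small lookup, which is handy when checking which priority a given ToS setting will land on. This is a sketch for illustration only: the function name tos_to_sk_prio is hypothetical, and since the text does not say how other ToS values are handled, the sketch simply rejects them.

```shell
#!/bin/sh
# Sketch only: tos_to_sk_prio is a hypothetical helper, not an MLNX_OFED tool.
# Prints the sk_prio for one of the four ToS values listed above.
# Any other value is rejected (assumption: the document lists no other mapping).
tos_to_sk_prio() {
    case "$1" in
        0)  echo 0 ;;
        8)  echo 2 ;;
        24) echo 4 ;;
        16) echo 6 ;;
        *)  echo "unmapped ToS value: $1" >&2; return 1 ;;
    esac
}
```

For example, tos_to_sk_prio 24 prints 4. The resulting sk_prio is then mapped to a UP by the egress map or by tc, as described in step 3.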
17.
    Note: The "mlnx_add_kernel_support.sh" script can be executed directly from the mlnxofedinstall script. For further information, please see the "--add-kernel-support" option below.

2.3.2 Installation Script

Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure", on page 28.

Usage

    /mnt/mlnxofedinstall [OPTIONS]

Options

    -c|--config <packages config_file>      Example of the configuration file can be found under docs
    -n|--net <network config_file>          Example of the network configuration file can be found under docs
    -k|--kernel-version <kernel version>    Use provided kernel version instead of 'uname -r'
    -p|--print-available                    Print available packages for current platform and create corresponding ofed.conf file
    --without-32bit                         Skip 32-bit libraries installation
    --without-depcheck                      Skip Distro's libraries check
    --without-fw-update                     Skip firmware update
    --fw-update-only                        Update firmware. Skip driver installation
    --force-fw-update                       Force firmware update
    --force                                 Force installation
    --all|--hpc|--basic|--msm               Install all, hpc, basic or Mellanox Subnet manager packages correspondingly
    --vma ...                               Install packages required by VMA to support ...
    --vma-eth                               Install packages required by VMA to work over Ethernet
    --with-vma                              Set confi
18.
    [root@selene ~]# mstflint -dev <PCI Device> dc

    Verify that the following fields appear in the [HCA] section (1, 2):

    [HCA]
    num_pfs = 1
    total_vfs = 5
    sriov_en = true

    HCA parameters can be configured during firmware update using the mlnxofedinstall script with the "--enable-sriov" and "--total-vfs <0-63>" installation parameters.

    1. If the fields in the example above do not appear in the [HCA] section, SR-IOV is not supported in the INI in use.
    2. If SR-IOV is supported but not enabled, it is sufficient to set "sriov_en = true" in the INI to enable it.

    If the current firmware version is the same as the one provided with MLNX_OFED, run it in combination with the "--force-fw-update" parameter.

    Note: This configuration option is supported only for HCAs whose configuration file (INI) is included in MLNX_OFED.

    Parameter    Recommended Value
    num_pfs      1
                 Note: This field is optional and might not always appear.
    total_vfs    63
    sriov_en     true

    If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.com

Step 7. Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist, otherwise delete its contents.

Step 8. Insert an "option" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical funct
19.
    # Arbitration tables on various nodes in the fabric.
    # However, this is not supported in OFED - the section is parsed
    # and ignored. SL2VL and VLArb tables should be configured in the
    # OpenSM options file (by default - /var/cache/opensm/opensm.opts).
    end-qos-setup

    qos-levels

        # Having a QoS Level named "DEFAULT" is a must - it is applied to
        # PR/MPR requests that didn't match any of the matching rules.
        qos-level
            name: DEFAULT
            use: default QoS Level
            sl: 0
        end-qos-level

        # the whole set: SL, MTU_Limit, Rate_Limit, PKey, Packet Lifetime
        qos-level
            name: WholeSet

8.6.6 Simple QoS Policy - Details and Examples

Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.

Match rules include:

•   Default match rule that is applied to a PR/MPR query that didn't match any of the other match rules
•   SDP
•   SDP application with a specific target TCP/IP port range
•   SRP with a specific target IB port GUID
•   RDS
•   IPoIB with a default PKey
•   IPoIB with a specific PKey
•   Any ULP/application with a specific Service ID in the PR/MPR query
•   Any ULP/application with a specific PKey in the PR/MPR query
•   Any ULP/application with a specific ta
20.
    Default: 0x100 as in rev 16A of the specification.
    (In rev 10 the default was 0xff00.)

    initiator_ext - Please refer to Section 9 (Multiple Connections...)

•   To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
    •   Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
    •   Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.

4.1.2.3 SRP Tools - ibsrpdm and srp_daemon

To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:

•   Detect targets on the fabric reachable by the Initiator (for Step 1)
•   Output target attributes in a format suitable for use in the above "echo" command (Step 2)

The utilities can be found under /usr/sbin/, and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages.

Below, several usage scenarios for these utilities are presented.

ibsrpdm

ibsrpdm is used for the following tasks:

1.  Detecting reachable targets
    a.  To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:

            ibsrpdm

        This command will output information on each SRP Targ
21.
    [Figure: Add Hardware dialog, hardware type selection (Physical Host Device)]

Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).

Step 5. If the Virtual Machine is up, reboot it, otherwise start it.

Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:

    lspci | grep Mellanox
    00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)

Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly, therefore it is not necessary to add it.

4.13.5 Uninstalling SR-IOV Driver

To uninstall the SR-IOV driver, perform the following:

Step 1. For Hypervisors, detach all the Virtual Functions (VF) from all the Virtual Machines (VM), or stop the Virtual Machines that use the Virtual Functions.

    Please be aware that stopping the driver while VMs are using the VFs will cause the machine to hang.

Step 2. Run the script below. Please be aware that uninstalling the driver deletes all of the driver's files but does not unload the driver.

    [root@swl022 ~]# /usr/sbin/ofed_uninstall.sh
    This program will uninstall all OFED packages on your machine.
    Do you want to
22.
    Values: 12-23
    Default: 12

4.10 Shared Memory Region

    Note: Shared Memory Region is only applicable to the mlx4 driver.

Shared Memory Region (MR) enables sharing MR among applications by implementing the "Register Shared MR" verb, which is part of the IB spec.

Sharing MR involves the following steps:

Step 1. Request to create a shared MR

The application sends a request via the ibv_reg_mr API to create a shared MR. The application supplies the allowed sharing access to that MR. If the MR is created successfully, a unique MR ID is returned as part of the struct ibv_mr, which other applications can use to register with that MR.

The underlying physical pages must not be Least Recently Used (LRU) or Anonymous. To disable that, you need to turn on the IBV_ACCESS_ALLOCATE_MR bit as part of the sharing bits.

Usage:

•   Turn on, via ibv_reg_mr, one or more of the sharing access bits. The sharing bits are described in the ibv_reg_mr man page.
•   Turn on the IBV_ACCESS_ALLOCATE_MR bit.

Step 2. Request to register to a shared MR

A new verb called ibv_reg_shared_mr is added to enable sharing an MR. To use this verb, the application supplies the MR ID that it wants to register for and the desired access mode to that MR. The desired access is validated against its given permissions and, upon successful creation, the physical pages of the original MR a
23.    Socket applications can use setsockopt(SK_PRIO, value) to directly set the sk_prio.

In case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.

RoCE Quality of Service Mapping

Applications use the RDMA-CM API to create and use QPs.
The following is the RoCE QoS mapping flow:

1. The application sets the ToS of the QP using the rdma_set_option option (RDMA_OPTION_ID_TOS, value).
2. ToS is translated into the Socket Priority (sk_prio) using a fixed translation:
    TOS 0  -> sk_prio 0
    TOS 8  -> sk_prio 2
    TOS 24 -> sk_prio 4
    TOS 16 -> sk_prio 6
3. The Socket Priority is mapped to the User Priority (UP) using the tc command. In case of a VLAN device, the parent real device is used for the purpose of this mapping.
4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.

    Note: With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.

4.5.5 Raw Ethernet QP Quality of Service Mapping

Applications open a Raw Ethernet QP using VERBs directly.
The following is the Raw Ethernet QP QoS mapping flow:

1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP:
    - Sets qp_attrs.ah_attrs.sl = up
    - Calls modify_qp with IB_QP_AV set in the mask
2. The UP is mapped to the TC as configured by the mlnx
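The fixed ToS to sk_prio translation in the RoCE mapping flow above can be sketched as a simple lookup. This is only an illustration of the table, not driver code: the names TOS_TO_SK_PRIO and tos_to_sk_prio are hypothetical, and rejecting any other ToS value is an assumption based on the note that RoCE supports only the 4 predefined ToS values.

```python
# Illustrative sketch of the fixed ToS -> sk_prio translation listed above.
# TOS_TO_SK_PRIO and tos_to_sk_prio are hypothetical names, not part of MLNX_OFED.
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def tos_to_sk_prio(tos: int) -> int:
    """Return the socket priority for one of the four predefined ToS values."""
    if tos not in TOS_TO_SK_PRIO:
        # Assumption: values outside the predefined set are treated as invalid here.
        raise ValueError(f"ToS {tos} is not one of {sorted(TOS_TO_SK_PRIO)}")
    return TOS_TO_SK_PRIO[tos]


if __name__ == "__main__":
    for tos in sorted(TOS_TO_SK_PRIO):
        print(f"TOS {tos} -> sk_prio {tos_to_sk_prio(tos)}")
```

The resulting sk_prio is still only an intermediate value: it is mapped to a UP by tc, and the UP is then mapped to a TC.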
24.    mlnx_add_kernel_support.sh -m|--mlnx_ofed <path to MLNX_OFED directory> [--make-iso|--make-tgz]
    [--make-iso]                                       Create MLNX_OFED ISO image.
    [--make-tgz]                                       Create MLNX_OFED tarball. (Default)
    [-t|--tmpdir <local work dir>]
    [--kmp]                                            Enable KMP format if supported.
    [-k|--kernel <kernel version>]                     Kernel version to use.
    [-s|--kernel-sources <path to the kernel sources>] Path to kernel headers.
    [-v|--verbose]
    [-n|--name]                                        Name of the package to be created.
    [-y|--yes]                                         Answer "yes" to all questions.

    1. The firmware will not be updated if you run the install script with the '--without-fw-update' option.

Example

The following command will create a MLNX_OFED_LINUX tarball (TGZ) for RedHat 6.3 under the /tmp directory.

    # MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m <path>/MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64/ --make-tgz
    Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.3 under /tmp directory.
    All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to continue?[y/N]:y
    See log file /tmp/mlnx_ofed_iso.1380.log
    Building OFED RPMs. Please wait...
    Removing OFED RPMs...
    Created /tmp/MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64.tgz

Install newly created MLNX_OFED package:

    # cd /tmp
    # tar xzf MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64.tgz
    # MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64/mlnxofedinstall
25.    mstflint -dev <PCI device> dc > <ini device file>
Step 4. Edit the ini file that you found in the previous step, and add the following lines to the [HCA] section in order to support 63 VFs (1):
    ;; SRIOV enable
    total_vfs = 63
    num_pfs = 1
    sriov_en = true

    1. Some servers might have issues accepting 63 Virtual Functions or more. In such case, please set the number of "total_vfs" to any required value.

Step 5. Create a binary image using the modified ini file. Run:
    mlxburn -fw <fw name>.mlx -conf <modified ini file> -wrimage <file name>.bin

The file <file name>.bin is a firmware binary file with SR-IOV enabled that has 63 VFs. It can be spread across all machines and can be burnt using mstflint, which is part of the bundle, using the following command:
    # mstflint -dev <PCI device> -image <file name>.bin b

    Note: After burning the firmware, the machine must be rebooted. If the driver is only restarted, the machine may hang and a reboot using power OFF/ON might be required.

4.13.7 Configuring Pkeys and GUIDs under SR-IOV

4.13.7.1 Port Type Management

Port Type management is static when enabling SR-IOV (the connectx_port_config script will not work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core. This parameter may be used to set the port type uniformly for all installed ConnectX HCAs, or it may speci
26.    Table 22 - ibv_devinfo Flags and Options

Flag                             Optional/Mandatory   Default (If Not Specified)   Description
-d <device>, --ib-dev=<device>   Optional             First found device           Run the command for the provided IB device 'device'
-i <port>, --ib-port=<port>      Optional             All device ports             Query the specified device port <port>
-l, --list                       Optional             Inactive                     Only list the names of InfiniBand devices
-v, --verbose                    Optional             Inactive                     Print all available information about the InfiniBand device(s)

Examples

1. List the names of all available InfiniBand devices.
    > ibv_devinfo -l
    2 HCAs found:
        mthca0
        mlx4_0

2. Query the device mlx4_0 and print user-available information for its Port 2.
    > ibv_devinfo -d mlx4_0 -i 2
    hca_id: mlx4_0
        fw_ver:             2.5.944
        node_guid:          0000:0000:0007:3895
        sys_image_guid:     0000:0000:0007:3898
        vendor_id:          0x02c9
        vendor_part_id:     25418
        hw_ver:             0xA0
        board_id:           MT_04A0140005
        phys_port_cnt:      2
            port: 2
                state:      PORT_ACTIVE (4)
                max_mtu:    2048 (4)
                active_mtu: 2048 (4)
                sm_lid:     1
                port_lid:   1
                port_lmc:   0x00

9.8 ibdev2netdev

ibdev2netdev shows the association between IB devices and ports and their net devices. Additionally, it reports the state of the net device link.

Synopsis
    ibdev2netdev [-v] [-h]
27.    reset performance counters of port 1 only   reset extended performance counters of port 1 only  reset performance counters of all ports      reset only error counters of port 2         reset only non error counters of port 2    1  Read local port s performance counters      gt  perfquery         Port counters  Lid 6 port 1    196 Mellanox Technologies      Rev 2 0 3 0 0      2  Read performance counters from LID 2  all ports     3  Read then reset performance counters from LID 2  port 1     Mellanox Technologies 197                 Rev 2 0 3 0 0 InfiniBand Fabric Diagnostic Utilities       RevCons traint Errors aaa 0  Mink integr TE v BST    8 Pia 0  ECB    OM STAINTON aaa 0  WESC Jo       0          na sasana 0            0  XMEPKE S S Senet 0  REVERIE S  ee a aa aso as s 0    9 144 ibcheckerrs    Validates an IB port  or node  and reports errors in counters above threshold     Check specified port  or node  and report errors that surpassed their predefined threshold  Port  address is lid unless  G option is used to specify a GUID address  The predefined thresholds can  be dumped using the  s option  and a user defined threshold file  using the same format as the  dump  can be specified using the  t  lt file gt  option     Synopsis    ibcheckerrs   h    b    v    G         threshold file gt     s    N   nocolor    C ca name             port    t timeout ms    lid guid       port       Output Files    Table 28 lists the various flags of the command     Table 28   ibchecker
28.   Direct Bridging, as well as other L3 Switching modes (e.g. NAT).
This document explains the configuration and driver behavior when configured in Bridging mode.
In a virtualization environment, a virtual machine can be exposed to the physical network by performing the following steps:

Step 1. Create a virtual bridge.
Step 2. Attach the para-virtualized interface created by the eth_ipoib driver to the bridge.
Step 3. Attach the Ethernet interface in the Virtual Machine to that bridge.

The diagram below describes the topology that was created after these steps:

[Figure: Hypervisor containing Virtual Interface(s) (vX), Virtual Bridge(s) (vbrX, aka vSwitch), Bridge Uplink(s) (pif), eIPoIB, and the IPoIB Uplink to the InfiniBand Fabric]

The diagram shows how the traffic from the Virtual Machine goes to the virtual bridge in the Hypervisor and from the bridge to the eIPoIB interface. The eIPoIB interface is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath.

4.8.1 Enabling the eIPoIB Driver

Once the mlnx_ofed driver installation is completed, perform the following:
Step 1. Open the /etc/infiniband/openib.conf file and include:
    E_IPOIB_LOAD=yes
Step 2. Restart the InfiniBand drivers.
    /etc/init.d/openibd restart

4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver

When
29.   ERROR_WINDOW: 5;

    SWITCH 0x12345 {
        ENABLE: true;
        AGEING_TIME: 77;
    }

    SWITCH 0x0002c902004050f8 {
        AGEING_TIME: 44;
    }

    SWITCH 0xabcde {
        ENABLE: false;
    }

8.9 Congestion Control

8.9.1 Congestion Control Overview

Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of Mellanox OFED installation.

The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).

8.9.2 Running OpenSM with Congestion Control Manager

Congestion Control (CC) Manager can be enabled/disabled through the SM options file. To do so, perform the following:

1. Create the file. Run:
    opensm -c <options-file-name>
2. Find the 'event_plugin_name' option in the file, and add 'ccmgr' to it:
    # Event plugin name(s)
    event_plugin_name ccmgr
3. Run the SM with the new options file:
    opensm -F <options-file-name>

    Note: Once the Congestion Control is enabled on the fabric nodes, to completely disable Congestion Control, 
30.   If, for instance, TC0 is set to 80% guarantee and TC1 to 20% (the TCs sum must be 100), then the BW left after servicing all strict priority TCs will be split according to this ratio.

Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.

Rate Limit

Rate limit defines a maximum bandwidth allowed for a TC. Please note that a 10% deviation from the requested values is considered acceptable.

Quality of Service Tools

mlnx_qos

mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates directly with the driver and thus does not require setting up a DCBX daemon on the system.

The mlnx_qos tool enables the administrator of the system to:
    - Inspect the current QoS mappings and configuration
      The tool will also display maps configured by the TC and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
    - Set UP to TC mapping
    - Assign a transmission algorithm to each TC (strict or ETS)
    - Set minimal BW guarantee to ETS TCs
    - Set rate limit to TCs
      For unlimited ratelimit, set the ratelimit to 0.

Usage:
    mlnx_qos -i <interface> [options]

Options:
    --version             show program's version number and exit
    -h, --help            show this help message and exit
    -p LIST, --prio_tc=LIST
                          maps UPs to TCs. LIST is 8 comma separated TC 
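To make the UP to TC list format concrete, here is a minimal sketch of how such a LIST could be interpreted: position i in the list is the TC assigned to UP i. It uses the example list 0,0,0,0,1,1,1,1 given elsewhere in this manual. The function name parse_prio_tc is hypothetical, and the 0-7 range check on TC numbers is an assumption; this is not the mlnx_qos implementation.

```python
# Illustrative sketch only: interprets a UP-to-TC LIST of the kind passed to
# "mlnx_qos -p". parse_prio_tc is a hypothetical helper, not part of mlnx_qos.
def parse_prio_tc(list_arg: str) -> dict:
    """Map each User Priority (0-7) to the Traffic Class at that list position."""
    tcs = [int(item) for item in list_arg.split(",")]
    if len(tcs) != 8:
        raise ValueError("LIST must contain exactly 8 comma separated TC numbers")
    for tc in tcs:
        # Assumption: TC numbers run from 0 (lowest) to 7 (highest).
        if not 0 <= tc <= 7:
            raise ValueError(f"TC number {tc} is out of range 0-7")
    return {up: tc for up, tc in enumerate(tcs)}


if __name__ == "__main__":
    # UPs 0-3 map to TC0 and UPs 4-7 map to TC1.
    print(parse_prio_tc("0,0,0,0,1,1,1,1"))
```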
31.   Optional Use  lt smlid gt  as the target LID for SM SA  queries    V ersion  Optional Show version info    C Optional Use the specified channel adapter or router    lt ca_name gt     P   ca port     Optional Use the specified port                        Rev 2 0 3 0 0      Table 26   smpquery Flags and Options                                  Optional   Detault A   Flag  If Not Description  Mandatory Specified    t Optional Override the default timeout for the solicited   lt timeout ms           msec    gt    lt op gt  Mandatory Supported operations   nodeinfo  lt addr gt   nodedesc  lt addr gt   portinfo  lt addr gt    lt portnum gt    switchinfo  lt addr gt   pkeys  lt addr gt    lt portnum gt    512 1  lt addr gt    lt portnum gt    vlarb  lt addr gt    lt portnum gt    guids  lt addr gt   mepi  lt addr gt    lt portnum gt     lt destdr_path   Optional Destination   s directed path  LID  or GUID    lid   guid gt   Examples    1  Query PortInfo by LID  with port modifier      gt  smpquery portinfo 1 1    Port info  Lid 1 port 1           0  0000000000000000             M RI ENTRE STET 0xfe80000000000000  Dude qu e d Mee RE UE 0x0001  SM esas Gos Gia e UP           0x0001                 ee 0x251086a  IsSM  IsTrapSupported    IsAutomaticMigrationSupported  IsSLMappingSupported  IsSystemImageGUIDsupported  IsCommunicatonManagement Supported  IsVendorClassSupported  IsCapabilityMaskNoticeSupported       IsClientRegistrationSupported              COGS    e ee sns E A E 0x0000 
32.   S  binterfaces   uu y ge peret ree        AN pig ond Pa S IRE  53  4 3 5 Verifying IPoIB Functionality                                         54  4 3 6 Bonding IPoIB       00    ett ene en eens 55  4 4 Quality of Service InfiniBand                                         56  4 4 1 Quality of Service Overview                                          56  442 QoS Architecture si  sie             es cca eka            CAN e 57  4 43     Supported Policy            ERR ee ee ed 57  4344  CMA Features ia  ar eet eee sg eed tcd 58  4 4 5  vOpenSM  Features se eS e               CADRE CASCO I Rn 59  4 5 Quality of Service Ethernet                                           59  4 5 1 Quality of Service Overview                                          59  4 5 2 Mapping Traffic to Traffic   1                                               59  4 5 3 Plain Ethernet Quality of Service Mapping                               59  4 5 4          Quality of Service Mapping                                     60  4 5 5 Raw Ethernet QP Quality of Service Mapping                            61  4 5 6 Map Priorities with tc wrap py mlnx qos                                61  4 5 7 Quality of Service Properties sese us luu yy a eee 62  4 5 8 Quality of Service Tools                                             62  4 6 Time Stamping Service                                             66  4 6 1 Enabling Time Stamping                                             67  4 6 2 Getting Time Stamping           
33.   Sample programs for reference:
    - ibv_task_pingpong
    - ibv_cc_pingpong

4.15 Ethtool

ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices. It can be used to:
    - Get identification and diagnostic information
    - Get extended device statistics
    - Control speed, duplex, autonegotiation and flow control for Ethernet devices
    - Control checksum offload and other hardware offload features
    - Control DMA ring sizes and interrupt moderation

The following are the ethtool supported options:

Table 6 - ethtool Supported Options

Options: ethtool -i eth<x>
Description: Checks driver and device information. For example:
    #> ethtool -i eth2
    driver: mlx4_en (MT_0DD0120009_CX3)
    version: 2.1.6 (Aug 2013)
    firmware-version: 2.30.3000
    bus-info: 0000:1a:00.0

Options: ethtool -k eth<x>
Description: Queries the stateless offload status.

Options: ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off]
Description: Sets the stateless offload status.
    TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) increase outbound throughput by reducing CPU overhead. They work by queuing up large buffers and letting the network interface card split them into separate packets.
    Large Receive Offload (LRO) increases inbound throughput of high-bandwidth network connections by r
34.   The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add "MLNX FlexBoot <ver>" to the list of boot devices for a ConnectX device. The priority of this list can be modified through BIOS setup.

A.7 Operation

A.7.1 Prerequisites

    - Make sure that your client is connected to the server(s).
    - The FlexBoot image is already programmed on the adapter card (see Section A.2).
    - For InfiniBand ports only: Start the Subnet Manager as described in Section A.4.
    - The DHCP server should be configured and started (see Section 4.3.3.1, "IPoIB Configuration Based on DHCP").
    - Configure and start at least one of the services iSCSI Target (see Section A.10) and/or TFTP (see Section A.5).

A.7.2 Starting Boot

Boot the client machine and enter BIOS setup to configure "MLNX FlexBoot" to be the first on the boot device priority list (see Section A.6).

    Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.

If MLNX FlexBoot/iPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by the Subnet Manager.
35.   a   run all validation tests  expecting an input  inventory           only validate the given inventory file           run service registration  deregistration  and lease  test   e   run event forwarding test        flood the SA with queries according to the stress mode   m   multicast flow   q   QoS info  dump VLArb and SLtoVL tables   t   run trap 64 65 flow  this flow requires running of    external tool   Default   all flows except QoS   w    wait This option specifies the wait time for trap 64 65 in  seconds  It is used only when running  f t   the trap 64   65 flow Default   10 sec   d    debug This option specifies a debug option  These options  are not normally needed  The number following  d  selects the debug option to enable as follows   OPT Description   d0 Ignore other SM nodes   d1 Force single threaded dispatching   d2 Force log flushing after each log message   d3 Disable multicast support   m    max lid This option specifies the maximal LID number to be  searched for during inventory file build  Default   100    Gy    G  This option specifies the local port GUID value with  which OpenSM should bind   OpenSM may be bound to  1 port at a time  If GUID given is 0  OpenSM displays a  list of possible port GUIDs and waits for user input   Without  g  OpenSM tries to use the default port    O  VORE This option displays a menu of possible local port GUID  values with which osmtest could bind   i    inventory This option specifies the name of the inventory file  No
36.   [Figure: master spanning tree diagram on the 2D torus grid, with x and y axes; layout not recoverable]

Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example.

8.5.7.3 Torus Topology Discovery

The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus. Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of interswitch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations.

Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and repor
37.   mlnx_qos should be used. mlnx_qos gets a list of a mapping between UPs to TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1.

Quality of Service Properties

The different QoS properties that can be assigned to a TC are:
    - Strict Priority (see "Strict Priority")
    - Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
    - Rate Limit (see "Rate Limit")

Strict Priority

When setting a TC's transmission algorithm to be 'strict', this TC has absolute (strict) priority over other strict priority TCs coming before it (as determined by the TC number: TC 7 is highest priority, TC 0 is lowest). It also has an absolute priority over non-strict TCs (ETS).

This property needs to be used with care, as it may easily cause starvation of other TCs.

A higher strict priority TC is always given the first chance to transmit. Only if the highest strict priority TC has nothing more to transmit will the next highest TC be considered.

Non-strict priority TCs will be considered last to transmit.

This property is extremely useful for low-latency, low-bandwidth traffic, that is, traffic that needs to get immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.

Minimal Bandwidth Guarantee (ETS)

After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy.
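The minimal guarantee split can be illustrated with a small calculation. The sketch below only computes the guaranteed share of each ETS TC from the bandwidth left after strict priority traffic. The function name, the bandwidth figures, and the use of Mb/s are hypothetical; only the rule that the guarantees must sum to 100 comes from this manual. It does not model how an unused share is handed to other TCs.

```python
# Illustrative arithmetic only; split_ets_bandwidth is a hypothetical helper
# and the bandwidth numbers below are made up for the example.
def split_ets_bandwidth(link_bw, strict_bw_used, guarantees):
    """Return the minimal guaranteed bandwidth per ETS TC.

    link_bw        -- total bandwidth on the wire (any unit, e.g. Mb/s)
    strict_bw_used -- bandwidth consumed by strict priority TCs
    guarantees     -- dict of TC number -> guarantee in percent (must sum to 100)
    """
    if sum(guarantees.values()) != 100:
        raise ValueError("ETS guarantees must sum to 100")
    # Bandwidth left after servicing all strict priority TCs.
    remaining = max(link_bw - strict_bw_used, 0)
    return {tc: remaining * pct / 100 for tc, pct in guarantees.items()}


if __name__ == "__main__":
    # Assumed example: 10000 Mb/s link, 2000 Mb/s used by strict TCs,
    # TC0 guaranteed 80% and TC1 guaranteed 20% of what is left.
    print(split_ets_bandwidth(10000, 2000, {0: 80, 1: 20}))
```

Because these are minimal guarantees, a TC may exceed its share whenever another TC leaves part of its own share unused.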
38.   threshold 10  lid 2 port 255   threshold 10  lid 2 port 255    warn   threshold 100  lid 2 port 255    Error check on lid 2  MT47396 Infiniscale III Mellanox Technologies  port all      warn  counter LinkRecovers     warn  counter LinkDowned   12     warn  counter RcvErrors   565    counter XmtDiscards   441       FAILED    2  Check port counters for LID 2 Port 1      gt  ibcheckerrs  v 2 1  Error check on lid 2  MT47396 Infiniscale III Mellanox Technologies  port 1  OK    3  Check the LID2 Port 1 using the specified threshold file      gt  cat threshl  SymbolErrors 10  LinkRecovers 10  LinkDowned 10  RevErrors 10  RcvRemotePhysErrors 100  RcvSwRelayErrors 100  XmtDiscards 100  XmtConstraintErrors 100  RcvConstraintErrors 100  LinkIntegrityErrors 10  ExcBufOverrunErrors 10  VL15Dropped 100    Mellanox Technologies 199    Rev 2 0 3 0 0               gt  ibcheckerrs  v  T threshl 2 1  Error check on lid 2  MT47396 Infiniscale III Mellanox Technologies  port 1  OK    9 15 mstflint    Queries and burns a binary firmware image file on non volatile  Flash  memories of Mellanox  InfiniBand and Ethernet network adapters  The tool requires root privileges for Flash access         If you purchased a standard Mellanox Technologies network adapter card  please down       load the firmware image from www mellanox com  gt  Downloads  gt  Firmware  If you   purchased a non standard card from a vendor other than Mellanox Technologies  please         contact your vendor     To run mstflint 
39.   to the event_plugin_options option in the file:
    # Options string that would be passed to the plugin(s)
    event_plugin_options armgr --conf_file <ar-mgr-options-file-name>

2. Run Subnet Manager with the new options file:
    opensm -F <options-file-name>

The AR Manager options file contains two types of parameters:

1. General options: Options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric.
2. Per-switch options: Options which describe specific switch behavior.

Note the following:
    - The Adaptive Routing configuration file is case sensitive.
    - You can specify options for a nonexistent switch GUID. These options will be ignored until a switch with a matching GUID is added to the fabric.
    - The Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is executed at every heavy sweep of the Subnet Manager.
    - If the AR Manager fails to parse the options file, default settings for all the options will be used.

8.8.5.1 General AR Manager Options

Table 13 - Adaptive Routing Manager Options File

Option File: ENABLE
Values: <true|false>
Description: Enable/disable Adaptive Routing on fabric switches.
    Note that if a switch was identified by AR Manager as a device that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was upd
40.  00.1/ports/1/pkey_idx/1
    echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
    echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
    echo 2 > 0000:02:00.2/ports/1/pkey_idx/0

vm1 pkey index 0 will be mapped to physical pkey index 1, and vm2 pkey index 0 will be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1 mapped to the default pkey.

Step d. On the Host, do the following:
    cd /sys/class/infiniband/mlx4_0/iov
    echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
    echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
    echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
    echo 2 > 0000:03:00.2/ports/1/pkey_idx/0

Step e. Once the VMs are running, you can check the VM's virtualized PKey table by running the following on the VM:
    cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]

Step 3. Start up the VMs (and bind VFs to them).
Step 4. Configure IP addresses for ib0 on the host and on the guests.

4.13.7.3 Ethernet Virtual Function Configuration when Running SR-IOV

4.13.7.3.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)

When running ETH ports on VFs, the ports may be configured to simply pass through packets as is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging).

In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag insert
41.  2 0 3 0 0      Unicast lids  0x3 0x7  of switch Lid 2 guid 0x0002c902fffff00a  MT47396 Infiniscale III  Mellanox Technologies      Lid Out Destination    Port Info  0  0003 021    Switch portguid 0x000b8cffff004016   MT47396 Infiniscale III Mellanox  Technologies      0x0006 007    Channel Adapter portguid 0x0002c90300001039   sw137 HCA 1    0x0007 021    Channel Adapter portguid 0x0002c9020025874a   sw157 HCA 1    3 valid lids dumped    4  Dump all Lids with valid out ports of the switch with portguid                      004016      gt  ibroute  G 0x000b8cffff004016    Unicast lids  0x0 0x8  of switch Lid 3 guid 0x000b8cffff004016  MT47396 Infiniscale III  Mellanox Technologies      Lid Out Destination    Port Info  0x0002 023    Switch portguid 0x0002c902fffff00a   MT47396 Infiniscale III Mellanox  Technologies   0x0003 000    Switch portguid 0x000b8cffff004016   MT47396 Infiniscale III Mellanox  Technologies     0  0006 023    Channel Adapter portguid 0x0002c90300001039   sw137 HCA 1   0  0007 020    Channel Adapter portguid 0x0002c9020025874a   sw157 HCA 1   0x0008 024    Channel Adapter portguid 0x0002c902002582cd   sw136 HCA 1   5 valid lids dumped                5  Dump all non empty mlids of switch with Lid 3        ibroute  M 3    Multicast mlids  0xc000 0xc3ff  of switch Lid 3 guid 0x000b8cffff004016  MT47396  Infiniscale III Mellanox Technologies      0 1 2   Mocs        anons    oso sO lage  MLid  0xc000  0xc001  0xc002  0xc003  0xc020  0xc021  0xc022  0xc023  0
42.  2QoS can generate credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, consider that same 2D 6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:

[Figure: master spanning tree diagram on the 2D 6x5 torus, with x and y axes; layout not recoverable]

Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing. In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case:

[Figure: master spanning tree diagram for the failed-switch case; layout not recoverable]
43.  end-qos-ulps

    OpenSM options file:
        qos_max_vls 8
        qos_high_limit 0
        qos_vlarb_high 1:32,2:96,3:96,4:96
        qos_vlarb_low 0:1
        [one further options line is unreadable in the source]

    Partition configuration file:
        Default=0x7fff,ipoib : ALL=full;
        PartA=0x8001, sl=1, ipoib : ALL=full;

8.8 Adaptive Routing

8.8.1 Overview

    Note: Adaptive Routing is at beta stage.

Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
    - Free AR: No constraints on output port selection.
    - Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.

Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches.

Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select an output port out of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. It especially suits 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis.

    Note: If some switches do not support AR, they will slow down the AR Manager as it may 
44.  3.1 IPoIB Configuration Based on DHCP

Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:

For RedHat:
    BOOTPROTO=dhcp

For SLES:
    BOOTPROTO='dhcp'

    Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
        /etc/sysconfig/network-scripts/ on a RedHat machine
        /etc/sysconfig/network/ on a SuSE machine

    Note: A patch for DHCP is required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory.

Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.

The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier (see Section A.3.2, "Configuring the DHCP Server").

4.3.3.1.1 DHCP Server

In order for the DHCP server to provide config
45.  • struct ibv_flow_attr: attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers. The optional headers are detected by setting the size and num_of_specs fields.

struct ibv_flow_attr can be followed by the optional flow header structs:

    struct ibv_flow_spec_ib
    struct ibv_flow_spec_eth
    struct ibv_flow_spec_ipv4
    struct ibv_flow_spec_tcp_udp

For further information, please refer to the ibv_create_flow man page.

Note: Be advised that from MLNX_OFED v2.0-3.0.0 and higher, the parameters (both the value and the mask) should be set in big-endian format.

Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter. The supported masks are:

• All-one mask: include the parameter value in the attached rule.
  Note: Since the VLAN ID in the Ethernet header is 12 bits long, the following parameter should be used: flow_spec_eth.mask.vlan_tag = htons(0x0fff).
• All-zero mask: ignore the parameter value in the attached rule.

When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule specifications. The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the Ethernet link type, since InfiniBand link type packets always include a QP number.

For further information, please refer to the relevant man pages.

ibv_destroy_flow
int ibv
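The big-endian requirement and the 12-bit VLAN mask described in this excerpt can be illustrated with a small Python sketch. It does not call the verbs library: the helper names (to_network_u16, vlan_match_fields) are hypothetical and only show what the bytes of a value/mask pair look like once they are placed in network byte order, as htons() would do on a little-endian host.

```python
import struct
from typing import Tuple

VLAN_ID_BITS = 12                        # VLAN ID width stated in the text
VLAN_ID_MASK = (1 << VLAN_ID_BITS) - 1   # 0x0fff, the mask quoted in the text


def to_network_u16(value: int) -> bytes:
    """Pack a 16-bit value in network (big-endian) byte order."""
    return struct.pack("!H", value)


def vlan_match_fields(vlan_id: int) -> Tuple[bytes, bytes]:
    """Return (value, mask) bytes for matching a VLAN ID (hypothetical helper)."""
    if not 0 <= vlan_id <= VLAN_ID_MASK:
        raise ValueError("VLAN ID must fit in 12 bits")
    return to_network_u16(vlan_id), to_network_u16(VLAN_ID_MASK)


value, mask = vlan_match_fields(100)
print(value.hex(), mask.hex())  # 0064 0fff
```

An all-zero mask in this representation is simply to_network_u16(0), which corresponds to ignoring the field.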
46.  Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet.

opensm also provides an experimental version of a performance manager.

opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.

opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.

By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file registers only general major events, whereas the second file includes details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

8.2.1 opensm Syntax

    opensm [OPTIONS]

where OPTIONS are:

--version
    Prints OpenSM ver
47.  BER for each port and check that no BER value has exceeded the BER threshold (default threshold: 10^-12).

--ber_use_data
    Indicates that the BER test will use the received data for calculation.

--ber_thresh <value>
    Specifies the threshold value for the BER test. The reciprocal number of the BER should be provided. Example: for 10^-12, the value needs to be 1000000000000 or 0xe8d4a51000 (10^12). If the threshold given is 0, then all BER values for all ports will be reported.

--extended_speeds <dev-type>
    Collect and test port extended speeds counters (dev-type: sw | all).

--pm_per_lane
    List all counters per lane (when available).

--ls <2.5|5|10|14|25|FDR10>
    Specifies the expected link speed.

--lw <1x|4x|8x|12x>
    Specifies the expected link width.

-w|--write_topo_file <file name>
    Write out a topology file for the discovered topology.

-t|--topo_file <file>
    Specifies the topology file name.

--out_ibnl_dir <directory>
    The topology file custom system definitions (ibnl) directory.

--screen_num_errs <num>
    Specifies the threshold for printing errors to screen (default: 5).

--smp_window <num>
    Max smp MADs on wire (default: 8).

--gmp_window <num>
    Max gmp MADs on wire (default: 128).

--max_hops <max-hops>
    Specifies the maximum hops for the discovery process (default: 64).

-V|--version
    Prints the version of the tool.

-h|--help
    Prints help information (withou
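Because the BER threshold option takes the reciprocal of the BER rather than the BER itself, it is easy to pass the wrong number. The following Python sketch shows the conversion under that reading of the text; the function name ber_thresh_value is hypothetical and is not part of the tool.

```python
from fractions import Fraction


def ber_thresh_value(ber: Fraction) -> int:
    """Return the reciprocal of a bit error rate as an integer.

    Hypothetical helper: the text states that the threshold option
    expects the reciprocal of the BER, e.g. 10^12 for a BER of 10^-12.
    """
    if ber <= 0:
        raise ValueError("BER must be positive")
    reciprocal = 1 / ber
    if reciprocal.denominator != 1:
        raise ValueError("reciprocal of the BER is not an integer")
    return reciprocal.numerator


default = ber_thresh_value(Fraction(1, 10**12))
print(default, hex(default))  # 1000000000000 0xe8d4a51000
```

Exact fractions are used instead of floats so that the reciprocal of 10^-12 is exactly 10^12, matching both the decimal and the hexadecimal form given in the example.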
48.  ERROR             Causes the process to hang in a loop when completion with error which is not flushed with  error or retry exceeded occurs       Otherwise disabled  e MLX5 POST SEND PREFER BF      Configures every work request that can use blue flame will use blue flame         Otherwise   blue flame depends on the size of the message and inline indication in the  packet      MLXS SHUT UP BF     Disables blue flame feature    Otherwise   do not disable  e MLX5 SINGLE THREADED    Allspinlocks are disabled    Otherwise   spinlocks enabled      Used by applications that are single threaded and would like to save the overhead of taking  spinlocks       MLX5        SIZE    64 completion queue entry size is 64 bytes  default       128  completion queue entry size is 128 bytes    20 Mellanox Technologies      Rev 2 0 3 0 0      MLX5 SCATTER TO CQE       Small buffers are scattered to the completion queue entry and manipulated by the driver   Valid for RC transport       Default is 1  otherwise disabled     13 33  Mid layer Core    Core services include  management interface  MAD   connection manager  CM  interface  and  Subnet Administrator  SA  interface  The stack includes components for both user mode and  kernel applications  The core services run in the kernel and expose an interface to user mode for  verbs  CM and management     1 34 ULPs    IPoIB    The IP over IB  IPoIB  driver is a network interface implementation over InfiniBand  IPoIB  encapsulates IP datagrams over an 
49.  MULTICAST  MTU:2044  Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:128
    RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, you should follow the manual configuration procedure described in Section 4.3.3.3.

Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, is recognized (see Chapter 8, "OpenSM - Subnet Manager").

Removing a Subinterface

To remove a child interface (subinterface), run:

    echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child

Using the example of Step 2:

    echo 0x8001 > /sys/class/net/ib0/delete_child

Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8000 in the example above).

Verifying IPoIB Functionality

To verify your configuration and your IPoIB functionality, perform the following steps:

Step 1. Verify the IPoIB functionality by using the ifconfig command.
The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:

    host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
    host2$ ifconfig ib0 11.4.3.176 netmask 255.255.0.0

Step 2. Enter the ping command
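The rule that the PKey written to delete_child must have its most significant bit set can be checked mechanically. The Python sketch below only builds the command string and does not touch sysfs; the helper names are hypothetical, and the sysfs path is the one shown in the removal example.

```python
PKEY_MSB = 0x8000  # most significant bit of the 16-bit PKey


def pkey_with_msb(pkey: int) -> int:
    """Return the PKey with its most significant bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKey must be a 16-bit value")
    return pkey | PKEY_MSB


def delete_child_command(parent: str, pkey: int) -> str:
    """Build the shell command that removes a child interface (sketch only)."""
    return "echo 0x{:04x} > /sys/class/net/{}/delete_child".format(
        pkey_with_msb(pkey), parent
    )


print(delete_child_command("ib0", 0x0001))
# echo 0x8001 > /sys/class/net/ib0/delete_child
```

Passing either 0x0001 or 0x8001 yields the same command, since setting an already-set bit changes nothing.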
50.  Mellanox FCA which off loads from   UPC collective operations    For further information on FCA  please refer to the Mellanox website      GasNet library contains MXM conduit which offloads from UPC all P2P operations as    well as some synchronization routines  For further information on MXM  please refer to  the Mellanox website         Mellanox OFED 1 8 includes ScalableUPC 2 1  which is installed under    opt mellanox bupc   p If you have installed OFED 1 8  you do not need to download and install ScalableUPC     Mellanox ScalableUPC is distributed as source RPM as well and can be downloaded from the  Mellanox website     Installing ScalableUPC  Mellanox ScalableUPC is installed as part of MLNX OFED package         Mellanox OFED 1 8 5 includes ScalableUPC Rev 2 2  which is installed under      opt mellanox bupc     If you have installed OFED 1 8 5  you do not need to download and install ScalableUPC     Mellanox ScalableUPC is distributed as source RPM as well and can be downloaded from the  Mellanox website     106 Mellanox Technologies      Rev 2 0 3 0 0      Please note  the binary distribution of ScalableUPC is compiled with the following defaults              support  FCA is disabled at runtime by default and must be configured prior to  using it from the ScalableUPC  For further information  please refer to FCA User Man   ual       MXM support enabled by default    5 5 2 Runtime Parameters    The following parameters can be passed to  upcrun  in order to change FCA
51.  --port_search_ordering_file, -O <path to file>
    This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR). Moreover, this option provides the means to define a non-default routing port order.

--dimn_ports_file, -O <path to file> (DEPRECATED)
    This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR).

--honor_guid2lid, -x
    This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE.

--const_multicast
    This option forces OpenSM to conserve previously built multicast trees.

--log_file, -f <log file name>
    This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output, use -f stdout.

--log_limit, -L <size in MB>
    This option defines the maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.

--erase_log_file, -e
    This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.

--Pconfig, -P <partition config file>
    This option defines the optional partition configuration file. The default name is /etc/opensm/partitions.conf
52.  Performance         113    Mellanox Technologies 5 J      Rev 2 0 3 0 0      7 2 3 Preserving Your Performance Settings after a Reboot                     114   7 2 4 Tuning Power Management                                          114   7 2 5 Interrupt                             cece ee 116   7 2 6 Tuning for NUMA Architecture                                      116   TAT IRQUATffinity u hate a ashe At tei teeta wkd er eee a Saves 118   7 2 8 Tuning Multi Threaded IP                                                        120  Chapter 8 OpenSM     Subnet Manager                                      121  Bul  OVerVIew   2   edet e e        RO                     Dida ete de 121  8 2   opensm Description       ge Ig eR eR e Ue ee ee Side oe 121  8 2 1   opensm Syntax  i e ete et rem dete cow ocolos 121   8 2 2 Environment Variables         llle 129   823 Signaling        EPIO SERRE LOL ae ales                      130   8 2 4  Running opensm  i lec e RESELLER RO OE E SEED 130   8 3 osmtest Description                                               130  9 9  T  3S yntax  csset AIR Rete    deese      este nic 131   8 32       Running         NE wo uet Sas BRO eR CEA pets 133   8 4                     ete ees 133  8 4 1  File Format  saa               tates eee ees sua ae ees 133   8 5 Routing Algorithms         llle nes 136  8 5 1 Effect of Topology Changes                                         137   8 5 2 Min Hop Algorithm                         uama uda E e      137   8 5 
53.  QP based congestion control     1   SL Port based congestion con   trol    Default  0       Mellanox Technologies 169             Table 17   Congestion    Control Manager CA Options File      Rev 2 0 3 0 0 OpenSM     Subnet Manager         Option File    Desctiption    Values       ca_control_map    An array of sixteen bits  one for each SL  Each  bit indicates whether or not the corresponding  SL entry is to be modified     Values  Oxffff                      ccti_increase Sets the CC Table Index  CCTI  increase  Default  1  trigger_threshold Sets the trigger threshold  Default  2  ccti min Sets the CC Table Index  CCTI  minimum  Default  0  cct Sets all the CC table entries to a specified value    Values   lt comma separated  The first entry will remain 0  whereas last value   list  will be set to the rest of the table  Default  0  When the value is set to 0   the CCT calculation is  based on the number of  nodes   ccti timer Sets for all SL s the given ccti timer  Default  0          When the value is set to 0   the CCT calculation is  based on the number of  nodes        Table 18   Congestion    Control Manager CC MGR Options File       Option File    Desctiption    Values       max errors  error window    When number of errors exceeds  max_errors  of  send receive errors or timeouts in less than   error window  seconds  the CC MGR will abort  and will allow OpenSM to proceed     Values     max errors   0  zero tollerance      abort configuration on first error        er
54.  QoS policy file has the following sections:

I) Port Groups (denoted by port-groups)

This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:

• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to the partition with a given PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to the partition with a given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)

II) QoS Setup (denoted by qos-setup)

This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).

III) QoS Levels (denoted by qos-levels)

Each QoS Level defines a Service Level (SL) and a few optional fields:

• MTU limit
• Rate limit
• Packet lifetime

When a path search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Leve
55.  Rev 2 0 3 0 0      Options     v Enable verbose mode  Adds additional information such as  Device ID  Part Number   Card Name  Firmware version  IB port state      h Print help messages                                               Example    Sw417   BXOFED 1 5 2 20101128 1524   ibdev2netdev  v   mlx4 0      26428   MT1006X00034  FALCON QDR fw 2 7 9288 port 1  ACTIVE     gt  eth5  Down   mlx4 0  MT26428   MT1006X00034  FALCON QDR fw 2 7 9288 port 1  ACTIVE     gt  ib0  Down   mlx4 0  MT26428   MT1006X00034  FALCON QDR fw 2 7 9288 port 2  DOWN    gt  1101  Down   mlx4 1  MT26448   MT1023X00777  Hawk Dual Port fw 2 7 9400 port 1  DOWN    gt  eth2  Down   mlx4 1  MT26448   MT1023X00777  Hawk Dual Port fw 2 7 9400 port 2  DOWN zz   eth3  Down   sw417   BXOFED 1 5 2 20101128 1524   ibdev2netdev    1  4 0 port 1    gt  eth5  Down   1  4 0 port 1    gt  1  0  Down     lx4 1 port 1      eth2  Down   lx4 1 port 2      eth3  Down        m  m  mlx4 0 port 2    gt  ibl  Down   m  m    9 9 ibstatus    Displays basic information obtained from the local InfiniBand driver  Output includes LID   SMLID  port state  port physical state  port width and port rate     Synopsis    ibstatus   h     device name gt    lt port gt        Output Files    Table 23 lists the various flags of the command     Table 23   ibstatus Flags and Options             a Default  Flag T  If Not Description  y Specified    h Optional Print the help menu   lt device gt  Optional All devices Print information for the spe
56.  Skprio  4  tos  24   Skprio  5  Skprio  6  tos  16   Skprio  7  Skprio  8  Skprio  9  Skprio  10  Skprio  11  Skprio  12  Skprio  13  Skprio  14  Skprio  15  up  7  tc  1 ratelimit  4 Gbps  tsa  ets  bw  70   tie 1  up  2  Wos 9  tc  2 ratelimit  2 Gbps  tsa  strict  up  4  bigs 5  up  6    4 5 8 2 tc and tc wrap py  The  tc  tool is used to setup sk prio to UP mapping  using the mgprio queue discipline     In kernels that do not support               such as 2 6 34   an alternate mapping is created in sysfs   The      wrap py  tool will use either the sysfs      the  tc  tool to configure the sk prio to UP    mapping     66 Mellanox Technologies      Rev 2 0 3 0 0      Usage     tc_wrap py  i  lt interface gt   options     Options       version show program s version number and exit    h    help show this help message and exit    u SKPRIO UP    skprio up SKPRIO UP maps sk prio to UP  LIST is  lt  16 comma separated  UP  index of element is sk prio     i INTF    interface INTF Interface name    Example  set skprio 0 2 to UPO  and skprio 3 7 to UP1 on eth4          UP 0  Skprio  0  Skprio  1  Skprio  2  tos  8   Skprio  7  Skprio  8  Skprio  9  Skprio  10  Skprio  11  Skprio  12  Skprio  13  Skprio  14  Skprio  15  UP 1  Skprio  3  Skprio  4  tos  24   Skprio  5  Skprio  6  tos  16   UP 2  Us 3  UP 4  Us 5  0     7    4 5 8 3 Additional Tools    tc tool compiled with the sch_mqprio module is required to support kernel v2 6 32 or higher   This is a part of iproute2 package v2 
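The LIST argument of the -u option described in this excerpt is positional: element i of the comma-separated list is the UP assigned to sk_prio i, with at most 16 elements. The Python sketch below builds such a list from priority ranges. The function name and the (first, last, up) tuple format are hypothetical, and defaulting unlisted priorities to UP 0 is an assumption made only for this illustration.

```python
from typing import Iterable, Tuple

MAX_SKPRIO = 15  # the text allows at most 16 list elements (sk_prio 0..15)
MAX_UP = 7       # UP 0..7 appear in the sample output


def skprio_up_list(ranges: Iterable[Tuple[int, int, int]]) -> str:
    """Build a comma-separated UP list indexed by sk_prio (sketch)."""
    ranges = list(ranges)
    highest = max(last for _, last, _ in ranges)
    if highest > MAX_SKPRIO:
        raise ValueError("sk_prio must be in the range 0..15")
    mapping = [0] * (highest + 1)  # assumption: unlisted sk_prio map to UP 0
    for first, last, up in ranges:
        if not 0 <= up <= MAX_UP:
            raise ValueError("UP must be in the range 0..7")
        for skprio in range(first, last + 1):
            mapping[skprio] = up
    return ",".join(str(up) for up in mapping)


# sk_prio 0-2 to UP 0 and sk_prio 3-7 to UP 1, as in the example in the text
print(skprio_up_list([(0, 2, 0), (3, 7, 1)]))  # 0,0,0,1,1,1,1,1
```

The resulting string is what would be passed as the value of -u; how the tool treats priorities beyond the end of the list is not covered by this sketch.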
57.  Write a data block to Flash without sector erase.

rb <addr> <size> [out-file]
    Read a data block from Flash.

swreset
    SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.

Possible command return values are:

    0 - successful completion
    1 - error has occurred
    7 - the burn command was aborted because firmware is current

Examples

1. Find Mellanox Technologies's ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR, or Ethernet ports at 10GigE:

    > /sbin/lspci -d 15b3:634a
    04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)

In the example above, 15b3 is Mellanox Technologies's vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn.

Note: The PCI Device IDs of Mellanox Technologies' devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3.

2. Verify the ConnectX firmware using its ID (using the results of the example above):

    > mstflint -d 04:00.0 v
    ConnectX failsafe image. Start address: 80000. Chunk size 80000:

    NOTE: The addresses below are contiguous logical addresses. Physical addresses on flash may be different, based on the
58.  be unique, but PKey does need to be unique.

• If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also the next note).

• It is possible to split a partition configuration into more than one definition, but then the PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions).

Examples:

    Default=0x7fff : ALL, SELF=full ;
    NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
    YetAnotherOne = 0x300 : SELF=full ;
    YetAnotherOne = 0x300 : ALL=limited ;
    ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
    # 0x123453, 0x123454 will be limited
    ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
    # 0x123456, 0x123457 will be limited
    ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
    ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
    ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

The following rule is equivalent to how OpenSM used to run prior to the partition manager:

    Default=0x7fff,ipoib:ALL=full;

8.5 Routing Algorithms

OpenSM offers six routing engines:

1. "Min Hop Algorithm"
   Based on the minimum hops to each node where the path length is optimized.
2. "UPDN Algorithm"
   Based on the minimum hops to ea
59.  before the rest of the LASH algorithm runs.

8.5.6 DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports: one port on one end of the cable, and the other port on the other end, continuing along the mesh dimension.

Use the "-R dor" option to activate the DOR algorithm.

8.5.7 Torus-2QoS Routing Algorithm

Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:

• Fr
60.  by their LIDs, or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted as it may change.

ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in the "-p" option. Otherwise, an error is reported.

Note: When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error.

Moreover, the tool allows omitting the source node in LID-route addressing, in which case the local port on the machine running the tool is assumed to be the source.

Synopsis

    ibdiagpath {-n <[src-name,]dst-name>|-l <[src-lid,]dst-lid>|-d <p1,p2,p3,...>}
               [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>]
               [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>]
               [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]
61.  device numbers (e.g. 1). Max supported devices - 32 (string)

1. In the current version, this parameter uses a decimal number to describe the InfiniBand device, and not a hexadecimal number as in previous versions, in order to make the mapping of device function numbers to InfiniBand device numbers uniform with the one defined for other module parameters (e.g. num_vfs and probe_vf).

For example, to map mlx4_15 to device function number 04:00.0 in the current version we use "options mlx4_ib dev_assign_str=04:00.0-15", as opposed to the previous version in which we used "options mlx4_ib dev_assign_str=04:00.0-f".

C.2 mlx4_core Parameters

set_4k_mtu:       (Obsolete) attempt to set 4K MTU to all ConnectX ports (int)
debug_level:      Enable debug tracing if > 0 (int)
msi_x:            0 - don't use MSI-X, 1 - use MSI-X, >1 - limit number of MSI-X irqs to msi_x (non-SRIOV only) (int)
enable_sys_tune:  Tune the cpu's for better performance (default 0) (int)
block_loopback:   Block multicast loopback packets if > 0 (default: 1) (int)
num_vfs:          Either single value (e.g. '5') to define uniform num_vfs value for all devices functions or a string to map device function numbers to their num_vfs values (e.g. '0000:04:00.0-5,002b:1c:0b.a-15'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for num_vfs value (e.g. 15). (string)
probe_vf:         Either single value (e.g. '3') to define uniform number of VFs to probe by the pf driver for all devices functions or a st
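The two accepted forms of the num_vfs value (a single uniform number, or a per-function map with hexadecimal device function numbers and decimal counts) can be sketched as a small Python parser. This is only an illustration of the format as read from the parameter description, not the driver's own parsing logic. The function name, the "*" key used for the uniform case, and the punctuation of the map string (':' and '.' inside the device function, '-' before the count, ',' between entries) are assumptions.

```python
import re
from typing import Dict

# Assumed entry shape: [dddd:]bb:dd.f-N (hex device function, decimal N)
_ENTRY = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-9a-fA-F])-(\d+)$"
)


def parse_num_vfs(value: str) -> Dict[str, int]:
    """Parse a num_vfs-style value into {device function: count} (sketch)."""
    value = value.strip()
    if value.isdigit():
        # Single value: uniform count for all device functions.
        return {"*": int(value)}
    result = {}
    for entry in value.split(","):
        match = _ENTRY.match(entry.strip())
        if match is None:
            raise ValueError("malformed entry: {!r}".format(entry))
        domain, bus, dev, fn, count = match.groups()
        key = "{}:{}:{}.{}".format(domain or "0000", bus, dev, fn).lower()
        result[key] = int(count)  # the count is decimal, per the text
    return result


print(parse_num_vfs("5"))
print(parse_num_vfs("0000:04:00.0-5,002b:1c:0b.a-15"))
```

Note that the count after the dash is read as decimal, so "15" means fifteen VFs, which matches the footnote about decimal device numbers in dev_assign_str.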
62.  down SRP.

2. After Manual Activation of High Availability

If you manually activated SRP High Availability, perform the following steps:

a. Unmount all SRP partitions that were mounted.
b. Kill the SRP daemon instances.
c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
d. Run: multipath -F

3. After Automatic Activation of High Availability

If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.

4.2 iSCSI Extensions for RDMA (iSER)

Note: iSCSI Extensions for RDMA (iSER) is currently at beta level. Please be aware that the content below is subject to change.

4.2.1 Overview

iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies.

4.2.2 iSER Initiator

The iSER initiator is controlled through the iSCSI interface available from the iscsi-initiator-utils package.

Target settings such as timeouts and retries are set the same as for any other iSCSI targets.

Note: If targets are set to auto connect on boot, and targets are unreachable, it may take a long time to continue th
63.  e Aet de DA babu ss Fd eso oid gae Due i        13  Table4  Reference Documents                              s 14  Table 5  Software and Hardware Requirements                                         24  Table 6  mlnxofedinstall Return Codes                                               27  Table 7s  Buffer Valles s susse                    ately EUR ame          74  Table 8    Parameters Used to Control Error Cases   Contiguity                             75  Table 9  Flow Specific Parameters                                                   79  Table 10   ethtool Supported Options                                                  94  Table  Useful MPELinks  s atiy aypuspa CET CHE CRT HE CERT PCR RKA 99  Table 12  Runtime Parameters                                                      105  Table 13  Recommended PCIe Configuration                                          110  Table 14  Recommended BIOS Settings for Intel Sandy Bridge Processors                   111  Table 15  Recommended BIOS Settings for Intel   Nehalem Westmere Processors            112  Table 16  Recommended BIOS Settings for AMD Processors                             112  Table 17    Adaptive Routing Manager Options File                                      165  Table 18  Adaptive Routing Manager Pre Switch Options File                             166  Table 19  Congestion Control Manager General Options File                              169  Table 20  Congestion Control Manager Switch Options            
64.  errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.

Note: When time stamping is enabled, VLAN stripping is disabled. For more information, please refer to Documentation/networking/timestamping.txt in kernel.org.

4.7 Atomic Operations

Note: Atomic Operations are applicable to the mlx4 driver only.

4.7.1 Enhanced Atomic Operations

ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations are the same as those for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).

4.7.1.1 Masked Compare and Swap (MskCmpSwap)

The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64-bit target data for the "compare" check as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:

    atomic_response = *va
    if (!((compare_add ^ *va) & compare_add_mask)) then
        *va = (*va & ~(swap_mask)) | (swap & swap_mask)

    return atomic_response

The additional operands a
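The MskCmpSwap pseudocode can be exercised with a plain Python model. This is a sketch of the semantics only: a dictionary stands in for the 64-bit target memory, nothing here is atomic, and no verbs call is involved. The function and parameter names are illustrative, and the XOR in the comparison reflects reading the compare step as "the masked bits of the compare operand equal the masked bits of the target".

```python
MASK64 = (1 << 64) - 1  # operands are 64-bit values


def masked_cmp_swap(memory, va, compare, compare_mask, swap, swap_mask):
    """Model of Masked Compare and Swap on memory[va]; returns the old value."""
    original = memory[va] & MASK64
    # Compare only the bits selected by compare_mask.
    if ((compare ^ original) & compare_mask) == 0:
        # Replace only the bits selected by swap_mask.
        memory[va] = ((original & ~swap_mask) | (swap & swap_mask)) & MASK64
    return original


mem = {0: 0x1122334455667788}
old = masked_cmp_swap(mem, 0, compare=0x88, compare_mask=0xFF,
                      swap=0xAA00, swap_mask=0xFF00)
print(hex(old), hex(mem[0]))  # 0x1122334455667788 0x112233445566aa88
```

In this usage the low byte is compared (0x88 matches) and only the second byte is swapped, showing that the compare and swap portions can differ. With both masks set to all ones the operation behaves like an ordinary compare and swap.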
65.  eth ipoib is loaded  number of eIPoIB interfaces are created  with the following default  naming scheme  ethx  where X represents the ETH port available on the system     Too check which eIPoIB interfaces were created     cat  sys class net eth ipoib interfaces    Mellanox Technologies 73      Rev 2 0 3 0 0 Driver Features      For example          system with dual                       following two interfaces might be created   eth4 and eth5    cat  sys class net eth ipoib interfaces   eth4 over IB port  ib0   eth5 over IB port  ibl    These interfaces can be used to configure the network for the guest  For example  if the guest has  a VIF that is connected to the Virtual Bridge bro  then enslave the eIPoIB interface to bro by  running      brctl addif br0 ethX      In RHEL KVM environment  there are other methods to create configure your virtual net     work  e g  macvtap   For additional information  please refer to the Red Hat User Manual        The IPoIB daemon  ipoibd  detects the new virtual interface that is attached to the same bridge as  the eIPoIB interface and creates a new IPoIB instances for it in order to send receive data  As a  result  number of IPoIB interfaces  ibX Y  are shown as being created destroyed  and are being  enslaved to the corresponding ethx interface to serve any active VIF in the system according to  the set configuration  This process is done automatically by the ipoibd service      gt  To see the list of IPoIB interfaces enslaved under et
66.  for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.

--retries <number>
    This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.

--maxsmps, -n <number>
    This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.

--rereg_on_guid_migr
    This option, if enabled, forces OpenSM to send port info with the client reregister bit set to all nodes in the fabric when an alias GUID migrates from one physical port to another.

--aguid_inout_notice
    This option enables sending GID IN/OUT notices on Alias GUIDs register/delete requests to registered clients.

--sm_assign_guid_func [uniq_count | base_port]
    Specifies the algorithm that the SM will use when choosing SM-assigned alias GUIDs. The default is uniq_count.

--console, -q [off | local]
    This option activates the OpenSM console (default: off).

--ignore_guids, -i <equalize-ignore-guids-file>
    This option provides the means to define a set of ports (by GUID) that will be ignored by the link load equalization algorithm.

--hop_weights_file, -w <path to file>
    This option provides the means to define a weighting factor per port for customizing the least weight hops for the routing.
67. -h|--help      Prints the help page information
    -V|--version   Prints the version of the tool
    --vars         Prints the tool's environment variables and their values

Output Files

Table 21 - ibdiagpath Output Files

    Output File       Description
    ibdiagpath.log    A dump of all the application reports generated according to the provided flags
    ibdiagnet.pm      A dump of the Performance Counters values of the fabric links

Error Codes

    1 - The path traced is unhealthy
    2 - Failed to parse command line options
    3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
    4 - Unable to traverse the LFT data from source to destination
    5 - Failed to use Topology File
    6 - Failed to load required Package

9.6 ibv_devices

Lists InfiniBand devices available for use from userspace, including node GUIDs.

Synopsis

    ibv_devices

Examples

1. List the names of all available InfiniBand devices.

    > ibv_devices
    device    node GUID
    mthca0    0002c9000101d150
    mlx4_0    0000000000073895

9.7 ibv_devinfo

Queries InfiniBand devices and prints the information about them that is available for use from userspace.

Synopsis

    ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]

Output Files

Table 22 lists the various flags of the command.

Table 22 - ibv_devinfo Flags and Options                            Optional
68. image start address and chunk size

    /0x00000038-0x000010db (0x0010a4)/ (BOOT2) - OK
    /0x000010dc-0x00004947 (0x00386c)/ (BOOT2) - OK
    /0x00004948-0x000052c7 (0x000980)/ (Configuration) - OK
    /0x000052c8-0x0000530b (0x000044)/ (GUID) - OK
    /0x0000530c-0x0000542f (0x000124)/ (Image Info) - OK
    /0x00005430-0x0000634f (0x000f20)/ (DDR) - OK
    /0x00006350-0x0000f29b (0x008f4c)/ (DDR) - OK
    /0x0000f29c-0x0004749b (0x038200)/ (DDR) - OK
    /0x0004749c-0x0005913f (0x011ca4)/ (DDR) - OK
    /0x00059140-0x0007a123 (0x020fe4)/ (DDR) - OK
    /0x0007a124-0x0007bdff (0x001cdc)/ (DDR) - OK
    /0x0007be00-0x0007eb97 (0x002d98)/ (DDR) - OK
    /0x0007eb98-0x0007f0af (0x000518)/ (Configuration) - OK
    /0x0007f0b0-0x0007f0fb (0x00004c)/ (Jump addresses) - OK
    /0x0007f0fc-0x0007f2a7 (0x0001ac)/ (FW Configuration) - OK

    FW image verification succeeded. Image is bootable.

9.16 ibv_asyncwatch

Display asynchronous events forwarded to userspace for an InfiniBand device.

Synopsis

    ibv_asyncwatch

Examples

1. Display asynchronous events.

    > ibv_asyncwatch
    mlx4_0: async event FD 4

9.17 ibdump

Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX family adapters' InfiniBand ports. The dump file can be loaded by the Wireshark tool for graphical traffic analysis.

The following describ
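69. MkeyLeasePeriod:.................0
    LocalPort:.......................1
    LinkWidthEnabled:................1X or 4X
    LinkWidthSupported:..............1X or 4X
    LinkWidthActive:.................4X

2. Query SwitchInfo by GUID.

    LifeTime:........................18
    StateChange:.....................0
    LidsPerPort:.....................0
    PartEnforceCap:..................32
    InboundPartEnf:..................1
    OutboundPartEnf:.................1
    FilterRawInbound:................1
    FilterRawOutbound:...............1
    EnhancedPort0:...................0

3. Query NodeInfo by direct route.

    > smpquery -D nodeinfo 0
    # Node info: DR path slid 65535; dlid 65535; 0
    BaseVers:........................1
    ClassVers:.......................1
    NodeType:........................Channel Adapter
    NumPorts:........................2
    SystemGuid:......................0x0002c9030000103b
    Guid:............................0x0002c90300001038
    PortGuid:........................0x0002c90300001039
    PartCap:.........................128
    DevId:...........................0x634a
    Revision:........................0x000000a0
    LocalPort:.......................1
    VendorId:........................0x0002c9

9.13 perfquery

Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them or simply reset them.

Synopsis

    perfquery [-h] [-d] [-G] [-a] [-l] [-r] [-C <ca_name>] [-P <ca_port>] [-R] [-t <timeout_ms>] [-V] [<lid|guid> [[port] [reset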
70. of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure.

The following are the flow specific parameters:

Table 5 - Flow Specific Parameters

    Flow types:  ether; tcp4/udp4; ip4
    Mandatory:   dst; src-ip/dst-ip
    Optional:    vlan; src-ip, dst-ip, src-port, dst-port, vlan; src-ip, dst-ip, vlan

RFS

RFS is in-kernel logic responsible for load balancing between CPUs by attaching flows to the CPUs that are used by the flows' owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic by implementing ndo_rx_flow_steer, which, in turn, calls the underlying flow steering mechanism with the RFS domain.

Enabling RFS requires enabling the "ntuple" flag via ethtool. For example, to enable ntuple for eth0, run:

    ethtool -K eth0 ntuple on

RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steering support.

RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.

All of the rest

The lowest priority domain serves the following users:

  - The mlx4 Ethernet driver, which attaches its unicast and multicast MAC addresses to its QP using L2 flow specifications
  - The mlx4 IPoIB driver, when it attaches its QP to its configured GIDs

Fragmented UDP tra
71. set up partitions with appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.

3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.

4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.

5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR including SL, MTU, RATE and Lifetime.

6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection provide the CMA with the target IP and port number. ULPs might also provide a QoS Class. The CMA then creates a Service ID for the ULP and passes this ID and the optional QoS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.

PathRecord and MultiPathRecord Enhancement for QoS:

As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided.

A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for existing access layer and ULPs by not in
72. support behavior.

Table 8 - Runtime Parameters

    Parameter                 Description

    -fca_enable <0|1>         Disables/Enables FCA support at runtime (default: disable).

    -fca_np <value>           Enables FCA support for collective operations if the number of
                              processes in the job is greater than the fca_np value (default: 64).

    -fca_verbose <level>      Sets verbosity level for the FCA modules.

    -fca_ops <+/->[op_list]   op_list: comma separated list of collective operations.
                              -fca_ops <+/->[op_list] enables/disables only the specified operations.
                              -fca_ops <+/-> enables/disables all operations.
                              By default all operations are enabled. Allowed operation names are:
                              barrier (br), bcast (bt), reduce (rc), allgather (ag).
                              Each operation can also be enabled/disabled via an environment variable:
                              GASNET_FCA_ENABLE_BARRIER, GASNET_FCA_ENABLE_BCAST, GASNET_FCA_ENABLE_REDUCE.
                              Note: All the operations are enabled by default.

5.5.2.1 Enabling FCA Operations through Environment Variables in ScalableUPC

This method can be used to control UPC FCA offload from the environment using the job scheduler srun utility. The valid values are: 1 - enable, 0 - disable.

To enable a specific operation with shell environment variables in ScalableUPC:

    export GASNET_FCA_ENABLE_BARRIER=1
    export GASNET_FCA_ENABLE_BCAST=1
    export GASNET_FCA_ENABLE_REDUCE=1

5.5.2.2 Controlling FCA Offload in ScalableUPC using Environment Variables

To enab
73. the "last" packet is sent/received before triggering an interrupt.

    ethtool -a eth<x>
        Queries the pause frame settings.

    ethtool -A eth<x> [rx on|off] [tx on|off]
        Sets the pause frame settings.

    ethtool -g eth<x>
        Queries the ring size values.

    ethtool -G eth<x> [rx <N>] [tx <N>]
        Modifies the rings size.

    ethtool -S eth<x>
        Obtains additional device statistics.

    ethtool -t eth<x>
        Performs a self diagnostics test.

    ethtool -s eth<x> msglvl [N]
        Changes the current driver message level.

Dynamically Connected Transport Service

Note: Dynamically Connected transport (DCT) is currently at alpha level. Please be aware that the content below is subject to change.

Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having RC type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in a smaller memory footprint, less overhead to set connections and higher on-chip cache utilization, and hence increased performance. DCT is supported only in mlx5 and is at alpha level.

5 HPC Features

5.1 Shared Memo
74. the virtual interfaces at the Virtual Machines.

To display the services provided to the Virtual Machine interfaces:

    cat /sys/class/net/eth0/eth/vifs

Example:

    cat /sys/class/net/eth0/eth/vifs
    SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A

In the example above, the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no VLAN tag for that interface.

4.8.3 VLAN Configuration Over an eIPoIB Interface

The eIPoIB driver supports VLAN Switch Tagging (VST) mode, which enables the virtual machine interface to have no VLAN tag over it, thus allowing VLAN tagging to be handled by the Hypervisor.

To attach a Virtual Machine interface to a specific isolated tag:

Step 1. Verify that the VLAN tag to be used has the same pkey value that is already configured on that ib port.

    cat /sys/class/infiniband/mlx4_0/ports/<ib port>/pkeys/*

Step 2. Create a VLAN interface in the Hypervisor, over the eIPoIB interface.

    vconfig add <eIPoIB interface> <vlan tag>

Step 3. Attach the new VLAN interface to the same bridge that the virtual machine interface is already attached to.

    brctl addif <br-name> <interface-name>

For example, to create the VLAN tag 3 with pkey 0x8003 over that port in the eIPoIB interface eth4, run:

    vconfig add eth4 3
    brctl addif br2 eth4.3

4.8.4 Setting Performance Tuning

  - Use larger IPoIB RX/TX rings (dom0). Reload the IPoIB drive
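Step 1 of the VLAN procedure in section 4.8.3 (checking that a pkey matching the VLAN tag is configured on the port) can be scripted. The Python sketch below assumes that a VLAN tag corresponds to the full-membership pkey carrying the same low bits, which is consistent with the "VLAN tag 3, pkey 0x8003" example but is not stated as a general rule in the text. The helper names are hypothetical, and the list of pkey strings stands in for values read from the port's pkeys directory.

```python
FULL_MEMBER_BIT = 0x8000  # top bit of a 16-bit pkey marks full membership


def pkey_for_vlan(vlan_tag):
    """Assumed mapping: full-membership pkey whose low bits equal the VLAN tag."""
    if not 1 <= vlan_tag <= 4094:
        raise ValueError("VLAN tag must be in the range 1..4094")
    return FULL_MEMBER_BIT | vlan_tag


def vlan_has_pkey(vlan_tag, configured_pkeys):
    """Return True if the pkey assumed for vlan_tag appears in the configured list."""
    wanted = pkey_for_vlan(vlan_tag)
    return any(int(pkey, 16) == wanted for pkey in configured_pkeys)


# Illustrative values only; real entries come from the sysfs pkeys files.
pkeys = ["0xffff", "0x8003", "0x0000"]
print(hex(pkey_for_vlan(3)))
print(vlan_has_pkey(3, pkeys))
print(vlan_has_pkey(4, pkeys))
```

If the check fails, the pkey has to be configured on the port before the VLAN interface is created in Step 2.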
75. transfer from a different PE, and remote pointers, allowing direct references to data objects owned by another PE.

Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object.

SHMEM libraries implement active messaging. The sending of data involves only one CPU, where the source processor puts the data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this.

5.1.1 Mellanox ScalableSHMEM

The ScalableSHMEM programming library is a one-sided communications library that supports a unique set of parallel programming features including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application.

Mellanox ScalableSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED), and also has the ability to utilize Mellanox Messaging libraries (MXM) as well as Mellanox Fabric Collective Accelerations (FCA), providing an unprecedented level of scalability
76. value indicates a hexadecimal number

    interface "ib1" {
        send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
    }

In order to use the configuration file, run:

    host1# dhclient -cf dhclient.conf ib1

4.3.3.2 Static IPoIB Configuration

If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the "-n" option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:

  - A static IPoIB configuration
  - An IPoIB configuration based on an Ethernet configuration

See your Linux distribution documentation for additional information about configuring IP addresses.

The following code lines are an excerpt from a sample IPoIB configuration file:

    # Static settings; all values provided by this file
    IPADDR_ib0=11.4.3.175
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on eth0; each '*' will be replaced with a corresponding octet
    # from eth0.
    LAN_INTERFACE_ib0=eth0
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on the first eth<n> interface that is found (for n=0,1,...);
    # each '*' will be replaced with a corresponding octet from eth<n>.
    L
77. with multiple parallel links, routes are distributed in a round-robin fashion across such links, and so changing the order that CA ports are visited changes the distribution of routes across such links. This may be advantageous for some specific traffic patterns.

The default is to visit CA ports in increasing port order on destination switches. Duplicate values in the list will be ignored.

EXAMPLE

    # Look for a 2D (since x radix is one) 4x5 torus.
    torus 1 4 5
    # y is radix-4 torus dimension, need both
    # ym_link and yp_link configuration.
    yp_link 0x200000 0x200005 # sw @ y=0,z=0 -> sw @ y=1,z=0
    ym_link 0x200000 0x20000f # sw @ y=0,z=0 -> sw @ y=3,z=0
    # z is not radix-4 torus dimension, only need one of
    # zm_link or zp_link configuration.
    zp_link 0x200000 0x200001 # sw @ y=0,z=0 -> sw @ y=0,z=1
    next_seed
    yp_link 0x20000b 0x200010 # sw @ y=2,z=1 -> sw @ y=3,z=1
    ym_link 0x20000b 0x200006 # sw @ y=2,z=1 -> sw @ y=1,z=1
    zp_link 0x20000b 0x20000c # sw @ y=2,z=1 -> sw @ y=2,z=2
    y_dateline -2 # Move the dateline for this seed
    z_dateline -1 # back to its original position.
    # If OpenSM failover is configured, for maximum resiliency
    # one instance should run on a host attached to a switch
    # from the first seed, and another instance should run
    # on a host attached to a switch from the second seed.
    # Both instances should use this torus-2QoS.conf to ensure
    # path SL values do not change in the event of SM failover.
    # port_order defi
78. you must know the device location on the PCI bus. See Example 1 for details.

Synopsis

    mstflint [switches...] <command> [parameters...]

Output Files

Table 29 lists the various switches of the utility, and Table 30 lists its commands.

Table 29 - mstflint Switches (Sheet 1 of 3)

    Switch               Affected/Relevant Commands    Description

    -h                                                 Print the help menu.

    -hh                                                Print an extended help menu.

    -d[evice] <device>   All                           Specify the device to which the Flash is connected.

    -guid <GUID>         burn, sg                      GUID base value. 4 GUIDs are automatically assigned
                                                       to the following values:
                                                         guid   -> node GUID
                                                         guid+1 -> port1
                                                         guid+2 -> port2
                                                         guid+3 -> system image GUID
                                                       Note: Port2 guid will be assigned even for a single
                                                       port HCA; the HCA ignores this value.

    -guids <GUIDs...>    burn, sg                      4 GUIDs must be specified here. The specified GUIDs
                                                       are assigned the following values, respectively:
                                                       node, port1, port2, and system image GUID.
                                                       Note: Port2 guid must be specified even for a single
                                                       port HCA; the HCA ignores this value. It can be set
                                                       to 0x0.

Table 29 - mstflint Switches (Sheet 2 of 3)

    Switch               Affected/Relevant Commands    Description

    -mac <MAC>           burn, sg                      MAC address base value. Two MACs are automatically
                                                       assigned to the following values:
                                                         mac   -> port1
                                                         mac+1 -> po
79. 0

Step 4. Attach a virtual NIC to VM.

    ifconfig -a
    ...
    eth6      Link encap:Ethernet  HWaddr 52:54:00:E7:77:99
              inet addr:13.195.15.5  Bcast:13.195.255.255  Mask:255.255.0.0
              inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:481 errors:0 dropped:0 overruns:0 frame:0
              TX packets:450 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:22440 (21.9 KiB)  TX bytes:19232 (18.7 KiB)
              Interrupt:10 Base address:0xa000

Step 5. Add the MAC 52:54:00:E7:77:99 to the /sys/class/net/eth5/fdb table on HV.

Before:

    cat /sys/class/net/eth5/fdb
    33:33:00:00:02:02
    33 33      2   66 52
    01:00:5e:00:00:01
    33:33:00:00:00:01

    echo "52:54:00:E7:77:99" > /sys/class/net/eth5/fdb

After:

    cat /sys/class/net/eth5/fdb
    52:54:00:e7:77:99
    33:33:00:00:02:02
    33 33   2 2   66 52
    01:00:5e:00:00:01
    33:33:00:00:00:01

4.13.4 Assigning a Virtual Function to a Virtual Machine

This section describes a mechanism for adding an SR-IOV VF to a Virtual Machine.

4.13.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server

Step 1. Run the virt-manager.

Step 2. Double click on the virtual machine and open its Properties.

Step 3. Go to Details -> Add hardware -> PCI host device.

[Screenshot: virt-manager "Add new virtual hardware" dialog, Adding Virtual Hardware]
80. qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14), and a weighting value (values 0-255), indicating the number of 64 byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15 or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.

Note that the same VL may be listed multiple times in the High or Low priority arbitration tables, and, further, it can be listed in both tables. The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes.

A high_limit value of 255 indicates that the byte limit is unbounded.

Note: If the 255 value is used, the low priority VLs may be starved.

A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.

Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 6
81. 3 UPDN Algorithm
        8.5.4 Fat-tree Routing Algorithm
        8.5.5 LASH Routing Algorithm
        8.5.6 DOR Routing Algorithm
        8.5.7 Torus-2QoS Routing Algorithm
    8.6 Quality of Service Management in OpenSM
        8.6.1 Overview
        8.6.2 Advanced QoS Policy File
        8.6.3 Simple QoS Policy
        8.6.4 Policy File Syntax Guidelines
        8.6.5 Examples of Advanced Policy
        8.6.6 Simple QoS Policy - Details and Examples
        8.6.7 SL2VL Mapping and VL Arbitration
        8.6.8 Deployment Example
    8.7 QoS Configuration Examples
        8.7.1 Typical HPC Example: MPI and Lustre
        8.7.2 EDC SOA (2-tier): IPoIB and SRP
        8.7.3 EDC (3-tier): IPoIB, RDS, SRP
    8.8 Adaptive Routing
        8.8.1 Overview
82. 4 Supplement to InfiniBand Architecture Specification Volume 1.2.1

A new API can be used by user space applications to work with the XRC transport. The legacy API is currently supported in both binary and source modes; however, it is deprecated. Thus we recommend using the new API.

The new verbs to be used are:

  - ibv_open_xrcd/ibv_close_xrcd
  - ibv_create_srq_ex
  - ibv_get_srq_num
  - ibv_create_qp_ex
  - ibv_open_qp

Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their appropriate man pages.

4.12 Flow Steering

Note: Flow Steering is applicable to the mlx4 driver only.

Flow steering is a new model which steers network flows, based on flow specifications, to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules can be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses an opposed terminology of a flow attribute (ibv_flow_attr), defined by a combination of specifications (struct ibv_flow_spec).

4.12.1 Enable/Disable Flow Steering

Flow Steering is disabled by default and regular L2 steering
83. 4 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64.

Below is an example of SL2VL and VL Arbitration configuration on a subnet:

    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.

8.6.8 Deployment Example

Figure 5 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

Figure 5: Example QoS Deployment on InfiniBand Subnet

    [Figure labels: Traffic class: SDP, Service level: 2, Policy: min 20% BW; Traffic class: Partition A, Service level: 0, Policy: min 40; App A Server; Service Access Points; Traffic class: SRP, Service Level: 1, Policy: mi
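The arithmetic behind the example configuration above (64-byte credits, a high limit counted in 4 KB units, and weights that should be multiples of the per-packet credit count) can be checked with a short script. The following Python sketch is illustrative only: the function names are hypothetical, and the "VL:weight" comma-separated entry format is assumed from the qos_*_vlarb_* values in the example.

```python
CREDIT_BYTES = 64        # one weight unit is a 64-byte credit
HIGH_LIMIT_BYTES = 4096  # high_limit is counted in 4 KB units
UNBOUNDED = 255          # high_limit value meaning "no byte limit"


def parse_vlarb(entries):
    """Parse "VL:weight" pairs such as "0:0,1:64" into (vl, weight) tuples."""
    pairs = []
    for item in entries.split(","):
        vl, weight = (int(field) for field in item.split(":"))
        if not 0 <= vl <= 14 or not 0 <= weight <= 255:
            raise ValueError("entry out of range: " + item)
        pairs.append((vl, weight))
    return pairs


def high_limit_bytes(high_limit):
    """Bytes the high-priority table may send per burst, or None if unbounded.

    A high_limit of 0 is a special case (a single packet) and is not modelled here.
    """
    return None if high_limit == UNBOUNDED else high_limit * HIGH_LIMIT_BYTES


def credits_per_packet(mtu_bytes):
    """Number of 64-byte credits one MTU-sized packet consumes (rounded up)."""
    return -(-mtu_bytes // CREDIT_BYTES)


def misaligned(pairs, mtu_bytes):
    """Return entries whose weight is not a whole number of MTU-sized packets."""
    per_packet = credits_per_packet(mtu_bytes)
    return [(vl, weight) for vl, weight in pairs if weight % per_packet]


low = parse_vlarb("0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64")
print(high_limit_bytes(6))       # 24576 bytes, i.e. 6 x 4 KB = 24 KB
print(credits_per_packet(4096))  # 64
print(misaligned(low, 4096))     # []
```

An empty list from misaligned() means every weight in the low-priority table corresponds to a whole number of 4 KB packets, which is the condition the text recommends.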
84. 4194304

  - Enable low latency mode for TCP:

        sysctl -w net.ipv4.tcp_low_latency=1

7.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance

The following changes are recommended for improving IPv6 traffic performance.

  - Disable the TCP timestamps option for better CPU utilization:

        sysctl -w net.ipv4.tcp_timestamps=0

  - Enable the TCP selective acks option for better CPU utilization:

        sysctl -w net.ipv4.tcp_sack=1

7.2.3 Preserving Your Performance Settings after a Reboot

To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:

    <sysctl name1>=<value1>
    <sysctl name2>=<value2>
    <sysctl name3>=<value3>
    <sysctl name4>=<value4>

For example, "Tuning the Network Adapter for Improved IPv4 Traffic Performance" lists the following setting to disable the TCP timestamps option:

    sysctl -w net.ipv4.tcp_timestamps=0

In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:

    net.ipv4.tcp_timestamps=0

7.2.4 Tuning Power Management

Check that the output CPU frequency for each core is equal to the maximum supported and that all core frequencies are consistent.

  - Check the maximum supported CPU frequency:

        cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq

  - Check that core frequencies
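The conversion described in section 7.2.3, from a runtime sysctl command to a persistent /etc/sysctl.conf line, is mechanical and can be sketched in Python. The helper names below are hypothetical, the commands are assumed to have the "sysctl -w name=value" form, and the sketch works on text only, leaving the actual file write to the administrator.

```python
def sysctl_conf_line(command):
    """Convert 'sysctl -w name=value' into the 'name=value' line for sysctl.conf."""
    parts = command.split()
    if len(parts) != 3 or parts[0] != "sysctl" or parts[1] != "-w" or "=" not in parts[2]:
        raise ValueError("expected 'sysctl -w name=value': " + command)
    return parts[2]


def merge_settings(conf_text, commands):
    """Return conf_text with one line appended per setting that is not yet present."""
    lines = conf_text.splitlines()
    present = {line.strip() for line in lines}
    for command in commands:
        entry = sysctl_conf_line(command)
        if entry not in present:  # keep the result idempotent across repeated runs
            lines.append(entry)
            present.add(entry)
    return "\n".join(lines) + "\n"


existing = "# /etc/sysctl.conf\nnet.ipv4.tcp_sack=1\n"
wanted = [
    "sysctl -w net.ipv4.tcp_timestamps=0",
    "sysctl -w net.ipv4.tcp_sack=1",
]
print(merge_settings(existing, wanted), end="")
```

Only the timestamps line is appended in this run, because the tcp_sack setting is already present in the sample text.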
85. 5
        2.3.5 Post-installation Notes
    2.4 Updating Firmware After Installation
    2.5 Installing MLNX_OFED using YUM
        2.5.1 Setting up MLNX_OFED YUM Repository
        2.5.2 Installing MLNX_OFED using the YUM Tool
        2.5.3 Updating Firmware After Installation
    2.6 Uninstalling Mellanox OFED
    2.7 Uninstalling Mellanox OFED using the YUM Tool
Chapter 3 Configuration Files
    3.1 Persistent Naming for Network Interfaces
Chapter 4 Driver Features
    4.1 SCSI RDMA Protocol
        4.1.1 Overview
        4.1.2 SRP Initiator
    4.2 iSCSI Extensions for RDMA (iSER)
        4.2.1 Overview
        4.2.2 iSER Initiator
    4.3 IP over InfiniBand
        4.3.1 Introduction
        4.3.2 IPoIB Mode Setting
        4.3.3 IPoIB Configuration
        4.3.4
86. 6 32 19 or higher. Otherwise, an alternative custom sysfs interface is available.

  - mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
  - tc_wrap.py (package: ofed-scripts) requires python >= 2.5

4.6 Time-Stamping Service

Note: Time Stamping is currently at beta level. Please be aware that everything listed here is subject to change.

Note: Time Stamping is currently supported in ConnectX-3/ConnectX-3 Pro adapter cards only.

Time stamping is the process of keeping track of the creation of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed on the PCI, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.

4.6.1 Enabling Time Stamping

Time stamping is off by default and should be enabled before use.

To enable time stamping for a socket:

  - Call setsockopt() with SO_TIMESTAMPING and with the following flags:

    SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
    SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software
    SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp as generated by the hardware
    SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or fail
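As an illustration of the setsockopt() call described in section 4.6.1, the following Python sketch combines the four flags named above into one mask and applies it to a UDP socket. It is a sketch under stated assumptions, not the manual's own example: the manual describes the C interface, the helper names are hypothetical, the numeric constants are taken from the Linux headers (37 is the SO_TIMESTAMPING value on x86 Linux and may differ on other architectures), and any flags beyond these four are not covered.

```python
import socket
import sys

# Values from the Linux headers; Python's socket module may not export them.
SO_TIMESTAMPING = getattr(socket, "SO_TIMESTAMPING", 37)
SOF_TIMESTAMPING_TX_HARDWARE = 1 << 0
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SOF_TIMESTAMPING_RX_HARDWARE = 1 << 2
SOF_TIMESTAMPING_RX_SOFTWARE = 1 << 3


def timestamping_flags(tx=True, rx=True, software_fallback=True):
    """Build the flag mask: hardware stamping, optionally with software fallback."""
    flags = 0
    if tx:
        flags |= SOF_TIMESTAMPING_TX_HARDWARE
        if software_fallback:
            flags |= SOF_TIMESTAMPING_TX_SOFTWARE
    if rx:
        flags |= SOF_TIMESTAMPING_RX_HARDWARE
        if software_fallback:
            flags |= SOF_TIMESTAMPING_RX_SOFTWARE
    return flags


def enable_time_stamping(sock, flags):
    """Request time stamping on an existing socket."""
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, flags)


if __name__ == "__main__" and sys.platform.startswith("linux"):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            enable_time_stamping(sock, timestamping_flags())
            print("SO_TIMESTAMPING flags set:", hex(timestamping_flags()))
        except OSError as exc:
            print("setsockopt failed:", exc)
```

Keeping the mask construction separate from the socket call makes it easy to request, for example, receive-side stamping only by passing tx=False.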
87. 65535; 0 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................5.0 Gbps (IBA extension)
    LinkSpeedActive:.................5.0 Gbps

9.11 ibroute

Uses SMPs to display the forwarding tables (unicast: LinearForwardingTable or LFT, or multicast: MulticastForwardingTable or MFT) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.

Synopsis

    ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]

Output Files

Table 25 lists the various flags of the command.

Table 25 - ibportstate Flags and Options

    Flag           Optional/Mandatory   Default If Not Specified   Description

    -h (--help)    Optional                                        Print the help menu.

    -d(ebug)       Optional                                        Raise the IB debug level. May be used several
                                                                   times for higher debug levels (-ddd or -d -d -d).

    -a(ll)         Optional                                        Show all LIDs in range, including invalid entries.

    -v(erbose)     Optional                                        Increase verbosity level. May be used several
                                                                   times for addition
88. 8.8.2 Installing the Adaptive Routing
        8.8.3 Running Subnet Manager with Adaptive Routing Manager
        8.8.4 Querying Adaptive Routing Tables
        8.8.5 Adaptive Routing Manager Options File
    8.9 Congestion Control
        8.9.1 Congestion Control Overview
        8.9.2 Running OpenSM with Congestion Control Manager
        8.9.3 Configuring Congestion Control
        8.9.4 Configuring Congestion Control Manager Main Settings
Chapter 9 InfiniBand Fabric Diagnostic Utilities
    9.1 Overview
    9.2 Utilities Usage
        9.2.1 Common Configuration, Interface and Addressing
        9.2.2 InfiniBand Interface Definition
        9.2.3 Addressing
    9.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
    9.4 ibdiagnet (of ibutils) - IB Net Diagnostic
    9.5 ibdiagpath - IB diagnostic path
    9.6 ibv_devices
89. AN_INTERFACE_ib0
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1

4.3.3.3 Manually Configuring IPoIB

This manual configuration persists only until the next reboot or driver restart.

To manually configure IPoIB for the default partition, perform the following steps:

Step 1. To configure the interface, enter the ifconfig command with the following items:
   • The appropriate IB interface (ib0, ib1, etc.)
   • The IP address that you want to assign to the interface
   • The netmask keyword
   • The subnet mask that you want to assign to the interface

The following example shows how to configure an IB interface:

host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0

Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument.

The following example shows how to verify the configuration:

host1$ ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:65520 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).

4.3.4 Sub
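The NETMASK, NETWORK and BROADCAST values used in the IPoIB examples (address 11.4.3.175 with mask 255.255.0.0) follow mechanically from the address and the mask. The sketch below is only an illustration of that arithmetic using Python's standard ipaddress module; the helper name derive_subnet_fields is hypothetical and is not part of Mellanox OFED.

```python
import ipaddress


def derive_subnet_fields(address, netmask):
    """Return the netmask, network and broadcast addresses implied by address/netmask.

    Hypothetical helper for checking values before writing them to an
    ifcfg-ib<n> file or passing them to ifconfig.
    """
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    return {
        "NETMASK": str(iface.netmask),
        "NETWORK": str(iface.network.network_address),
        "BROADCAST": str(iface.network.broadcast_address),
    }


if __name__ == "__main__":
    # Values taken from the ib0 example in this section.
    for key, value in derive_subnet_fields("11.4.3.175", "255.255.0.0").items():
        print(f"{key}_ib0={value}")
```

Running the helper on the example address gives the same network (11.4.0.0) and broadcast (11.4.255.255) values that appear in the static configuration and in the ifconfig output.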
90. Burning a firmware binary image using mstflint that is already installed on your machine. Please refer to MSTFLINT_README.txt under docs/.

2. Burning a firmware image from a .mlx file using the mlxburn utility that is already installed on your machine.

The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2.

host1$ mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408/fw-25408-rel.mlx

Step 4. Reboot your machine after the firmware burning is completed.

2.5 Installing MLNX_OFED using YUM

2.5.1 Setting up MLNX_OFED YUM Repository

Step 1. Download the tarball to your host.

The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label><CPU arch>.tgz. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.

Step 2. Extract the MLNX_OFED tarball package to a shared location in your network.

tar xzf MLNX_OFED_LINUX-<MLNX_OFED version>-rhel6.4-x86_64.tgz

Step 3. Download and install Mellanox Technologies GPG-KEY.

The key can be downloaded via the following link:
http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox

wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
--2013-08-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Resolving www.mellanox.com... 72.3.194.0
Connecting to www.mellanox
91. Datagram, except for the Connect-IB™ adapter card, which uses IPoIB with Connected mode as default.

For better scalability and performance we recommend using the Datagram mode. However, the mode can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting 'SET_IPOIB_CM=yes'.

The SET_IPOIB_CM parameter is set to "auto" by default to enable the Connected mode for the Connect-IB™ card and Datagram for all other ConnectX® cards.

After changing the mode, you need to restart the driver by running:

/etc/init.d/openibd restart

To check the current mode used for outgoing connections, enter:

cat /sys/class/net/ib<n>/mode

4.3.3 IPoIB Configuration

Unless you ran the installation script mlnxofedinstall with the flag '-n', IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.

An IPoIB configuration can be based on DHCP (Section 4.3.3.1) or on a static configuration (Section 4.3.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3.3).

4.3
92. ED Release Notes file.

Required Disk Space for Installation   1GB
Device ID                              For the latest list of device IDs, please visit Mellanox website.
Operating System                       Linux operating system. For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
Installer Privileges                   The installation requires administrator privileges on the target machine.

2.2 Downloading Mellanox OFED

Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display.

The following example shows a system with an installed Mellanox HCA:

lspci -v | grep Mellanox
06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies Device 0024

Step 2. Download the ISO image to your host.

The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label><CPU arch>.iso. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.

Step 3. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page.

host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

2.3 Installing Mellanox OFED

The installation script, mlnxofedinstall, performs the foll
93. ED was installed using the yum tool, then it can be uninstalled as follows:

yum groupremove '<group name>' (1)

1. The '<group name>' must be the same group name that was previously used to install MLNX_OFED.

3 Configuration Files

For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt

3.1 Persistent Naming for Network Interfaces

To avoid network interface renaming after boot or driver restart, use the "/etc/udev/rules.d/70-persistent-net.rules" file.

Example for Ethernet interfaces:

# PCI device 0x15b3:0x1003 (mlx4_core)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:50", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:51", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a1", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a2", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"

Example for IPoIB interfaces:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="32", NAME="ib0"
S
94. Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.

host1$ cp /sbin/ifconfig /tmp/initrd_en/sbin

Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded.

Note: The order of the following commands (for loading modules) is critical.

echo "loading Mellanox ConnectX EN driver"
/sbin/insmod /lib/modules/mlnx_en/mlx4_core.ko
/sbin/insmod /lib/modules/mlnx_en/mlx4_en.ko

Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.

Step 9. Save the init file.

Step 10. Close initrd.

host1$ cd /tmp/initrd_en
host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
host1$ gzip /tmp/new_initrd_en.img

At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it properly.

A.10 iSCSI Boot

Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd (Linux). There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.

If y
95. Fabric Collective Accelerator

The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) process to the server CPUs. As a system-wide solution, FCA does not require any additional hardware. The FCA manager creates a topology based collective tree, and orchestrates an efficient collective operation using the CPUs in the servers that are part of the collective operation. FCA accelerates MPI collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.

FCA is built on the following main principles:

• Topology-aware Orchestration
  The MPI collective logical tree is matched to the physical topology. The collective logical tree is constructed to assure:
    - Maximum utilization of fast inter-core communication
    - Distribution of the results
• Communication Isolation
  Collective communications are isolated from the rest of the traffic in the fabric using a private virtual network (VLane), eliminating contention with other types of traffic.

After MLNX_OFED installation, FCA can be found at the /opt/mellanox/fca folder.

For further information on configuration instructions, please refer to the FCA User Manual.

5.5 ScalableUPC

5.5.1

Unified Parallel C (UPC) is an extension of the C programming lang
96. libibmad
Preparing... libibmad
Preparing... libibmad-devel
Preparing... libibmad-devel
Preparing... libibmad-static
Preparing... libibmad-static
Preparing... ibsim
Preparing... ibacm
Preparing... librdmacm
Preparing... librdmacm
Preparing... librdmacm-utils
Preparing... librdmacm-devel
Preparing... librdmacm-devel
Preparing... opensm-libs
Preparing... opensm-libs
Preparing... opensm
Preparing... opensm-devel
Preparing... opensm-devel
Preparing... opensm-static
Preparing... opensm-static
Preparing...
97. IDs are requested from the SM. These GIDs are mapped to vHCAs as follows:

vHCA number x is assigned the GID/GUID at index x of the physical GID table.

• Each vHCA port presents its own virtual PKey table. The virtual PKey table (presented to a VF) is a mapping of selected indexes of the physical PKey table. The host admin can control which PKey indexes are mapped to which virtual indexes using a sysfs interface (see Section on page 89). The physical PKey table may contain both full and partial memberships of the same PKey to allow different membership types in different virtual tables.
• Each vHCA port has its own virtual port state. A vHCA port is up if the following conditions apply:
    - The physical port is up
    - The virtual GID table contains the GIDs requested by the host admin
    - The SM has acknowledged the requested GIDs since the last time that the physical port went up
• Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask

To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs sub-tree has been added under the PF InfiniBand device.

4.13.7.2.1 SRIOV sysfs Administration Interfaces on the Hypervisor

Administration of GUIDs and PKeys is done via the sysfs interface in the Hypervisor (Dom0). This interface is under:

/sys/class/infiniband/<infiniband device>/iov

Under this directory, the following subdirectories can be fo
98. INUX un-installation script
<RPMS folders> - Directory of binary RPMs for a specific CPU architecture.
firmware/ - Directory of the Mellanox IB HCA firmware images (including Boot-over-IB)
src/ - Directory of the OFED source tarball
mlnx_add_kernel_support.sh - Script required to rebuild MLNX_OFED_LINUX for customized kernel version on supported Linux Distribution
docs/ - Directory of Mellanox OFED related documentation

1.3 Architecture

Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the versatility of markets that Mellanox OFED applies to.

Figure 1: Mellanox OFED Stack for ConnectX® Family Adapter Cards

[Figure 1 diagram. Components shown: uDAPL, MPI, uverbs + rdmacm, Sockets Layer, SCSI Mid Layer, TCP/UDP/ICMP, IP, Netdevice, SRP, iSER, eIPoIB, IPoIB, verbs + CMA (ib_core), mlx4_en, mlx5_ib (IB), mlx4_ib (IB and RoCE), Adapter Driver (mlx5_core), Adapter Driver (mlx4_core), Mellanox VPI Device (HCA/NIC)]

The following sub-sections briefly describe the various components of the Mellanox OFED stack.

1.3.1 mlx4 VPI Driver

mlx4 is the low level driver implementation for the ConnectX family adapters designed by Mellanox Technologies. ConnectX® family adapters can operate as an InfiniBand adapter, or as an Ethernet NIC. The OFED driver supports
99. InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into the following modules:

mlx4_core

Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.

mlx4_ib

Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

mlx4_en

A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer.

1.3.2 mlx5 Driver

mlx5 is the low level driver implementation for the Connect-IB™ adapters designed by Mellanox Technologies. Connect-IB™ operates as an InfiniBand adapter. The mlx5 driver is comprised of the following kernel modules:

mlx5_core

Acts as a library of common functions (e.g. initializing the device after reset) required by the Connect-IB™ adapter card.

mlx5_ib

Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

libmlx5

libmlx5 is the provider library that implements hardware-specific user-space functionality. If there is no compatibility between the firmware and the driver, the driver will not load and a message will be printed in the dmesg.

The following are the libmlx5 environment variables:
• MLX5_FREEZE_ON
100. InfiniBand connected or datagram transport service. IPoIB prepends the IP datagrams with an encapsulation header, and sends the outcome over the InfiniBand transport service. The transport service is Unreliable Datagram (UD) by default, but it may also be configured to be Reliable Connected (RC). The interface supports unicast, multicast and broadcast. For details, see Chapter 4.3, "IP over InfiniBand".

iSER

iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies. For further information, please refer to Chapter 4.2, "iSCSI Extensions for RDMA (iSER)".

SRP

SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 4.1, "SCSI RDMA Protocol" and Appendix B, "SRP Target Driver".

uDAPL

User Direct Access Programming Library (uDAPL) is a standard API that pr
101. MAGE.

Mellanox Technologies
350 Oakmead Parkway Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
Beit Mellanox
PO Box 586 Yokneam 20692
Israel
www.mellanox.com
Tel: +972 (0)74 723 7200
Fax: +972 (0)4 959 3245

© Copyright 2013. Mellanox Technologies. All Rights Reserved.

Mellanox®, Mellanox logo, BridgeX®, ConnectX®, CORE-Direct®, InfiniBridge®, InfiniHost®, InfiniScale®, MLNX-OS®, PhyX®, SwitchX®, UFM®, Virtual Protocol Interconnect® and Voltaire® are registered trademarks of Mellanox Technologies, Ltd.

Connect-IB™, ExtendX™, FabricIT™, Mellanox Open Ethernet™, Mellanox Virtual Modular Switch™, MetroX™, MetroDX™, ScalableHPC™, Unbreakable-Link™ are trademarks of Mellanox Technologies, Ltd.

All other trademarks are property of their respective owners.

Document Number: 2877

Table of Contents

Table of Contents
List of Figures
List of Tables
Chapter 1 Mellanox OFED Overview
   1.1 Introduction to Mellanox OFED
   1.2 Mellanox OFED Package
      1.2.1
      1.2.2 Software Components
      1.2.3
102. Mellanox Technologies

Mellanox OFED for Linux User Manual

Rev 2.0-3.0.0
Last Updated: 03 October, 2013

www.mellanox.com

NOTE:
THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DA
103. Options

-n <[src-name,]dst-name>   Names of the source and destination ports (as defined in the topology file; source may be omitted, in which case the local port is assumed to be the source)
-l <[src-lid,]dst-lid>     Source and destination LIDs (source may be omitted, in which case the local port is assumed to be the source)
-d <p1,p2,p3,...>          Directed route from the local node (which is the source) and the destination node
-c <count>                 The minimal number of packets to be sent across each link (default = 100)
-v                         Enable verbose mode
-t <topo-file>             Specifies the topology file name
-s <sys-name>              Specifies the local system name. Meaningful only if a topology file is specified
-i <dev-index>             Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
-p <port-num>              Specifies the local device's port number used to connect to the IB fabric
-o <out-dir>               Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>            Specifies the expected link width
-ls <2.5|5|10>             Specifies the expected link speed
-pm                        Dump all the fabric links' pm counters into ibdiagnet.pm
-pc                        Reset all the fabric links' pm counters
-P <PM=<Trash>>            If any of the provided pm is greater than its provided value, print it to screen
104. QoS Levels

• Application traffic
    - IPoIB (UD and CM) and SDP
    - Isolated from storage
    - Min BW of 50%
• SRP
    - Min BW 50%
    - Bottleneck at storage nodes

Administration

• OpenSM QoS policy file

In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.

qos-ulps
    default : 0
    ipoib : 1
    sdp : 1
    srp, target-port-guid SRPT1,SRPT2,SRPT3 : 2
end-qos-ulps

• OpenSM options file

qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:32
qos_vlarb_low 0:1,
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

8.7.3 EDC (3-tier): IPoIB, RDS, SRP

The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.

QoS Levels

• Management traffic (ssh)
    - IPoIB management VLAN (partition A)
    - Min BW 10%
• Application traffic
    - IPoIB application VLAN (partition B)
    - Isolated from storage and database
    - Min BW of 30%
• Database Cluster traffic
    - RDS
    - Min BW of 30%
• SRP
    - Min BW 30%
    - Bottleneck at storage nodes

Administration

• OpenSM QoS policy file

In the following policy file example, replace SRPT* with the real SRP Target port GUIDs (the values given to the target-port-guid keyword).

qos-ulps
    default : 0
    ipoib, pkey 0x8001 : 1
    ipoib, pkey 0x8002 : 2
    rds : 3
    srp, target-port-guid SRPT1,SRPT2,SRPT3 : 4
end-qos-ulps
105. R will abort and will allow OpenSM to proceed. To do so, set the following parameters:

max_errors
error_window

The values are:

max_errors = 0: zero tolerance - abort configuration on first error
error_window = 0: mechanism disabled - no error checking
[0-48K]

The default is: 5

8.9.4.1 Congestion Control Manager Options File

Table 15 - Congestion Control Manager General Options File

Option File    Description                                                                                Values
enable         Enables/disables Congestion Control mechanism on the fabric nodes.                         Values: <TRUE | FALSE>. Default: True
num_hosts      Indicates the number of nodes. The CC table values are calculated based on this number.    Values: [0-48K]. Default: 0 (based on the CCT calculation on the current subnet size)

Table 16 - Congestion Control Manager Switch Options File

Option File    Description                                                                Values
threshold      Indicates how aggressive the congestion marking should be.                 [0-0xf]. 0: no packet marking; 0xf: very aggressive. Default: 0xf
marking_rate   The mean number of packets between marking eligible packets with a FECN.   Values: [0-0xffff]. Default:
packet_size    Any packet less than this size [bytes] will not be marked with FECN.       Values: [0-0x3fc0]. Default: 0x200

Table 17 - Congestion Control Manager CA Options File

Option File    Description                                                     Values
port_control   Specifies the Congestion Control attribute for this port.       Values: 0
106. RP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.

Operation

When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA).

When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).

Manual Activation of High Availability

Initialization: (Execute after each boot of the driver)

1. Execute modprobe dm-multipath
2. Execute modprobe ib_srp
3. Make sure you have created file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and each HCA:

srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>

This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.

Now it is possible to access the SRP L
107. The list of network interfaces is available via the ifstat command.

Example:
iPXE> ifclose net1

A.8.3.4 autoboot

Starts the boot process from the device(s).

A.8.3.5 sanboot

Starts the boot process of an iSCSI target.

Example:
iPXE> sanboot iscsi:11.4.3.7::::iqn.2007-08.7.3.4.11:iscsiboot

A.8.3.6 echo

Echoes an environment variable.

Example:
iPXE> echo ${root-path}

A.8.3.7 dhcp

Attempts to open the network interface, then tries to connect to and communicate with the DHCP server to obtain the IP address and the file path from which the boot will occur.

Example:
iPXE> dhcp net1

A.8.3.8 help

Displays the available list of commands.

A.8.3.9 exit

Exits from the command line interface.

A.9 Diskless Machines

Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver.

This can be achieved by adding the device driver module into the initrd image and loading it.

Note: The 'initrd' image of some Linux distributions, such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux, cannot be edited prior to or during the installation process.

If you need to install Linux distributions over FlexBoot, please replace your 'initrd' images with the images found at:
www.mellanox.com > Products > Adapter IB/VPI SW >
108. UBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x1", ATTR{type}=="32", NAME="ib1"

4 Driver Features

4.1 SCSI RDMA Protocol

4.1.1 Overview

As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services.

Section 4.1.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target.

4.1.2 SRP Initiator

This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from http://www.t10.org.

The SRP Initiator supports:

• Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
• Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
• Basic functionality, task management and limited error handling

4.1.2.1 Loading SRP Initiator

To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or ch
109. UNs on /dev/mapper/.

Note: It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.

Note: It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the 'blacklist' section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed.

Automatic Activation of High Availability

• Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".
  For the changes in openib.conf to take effect, run:
  /etc/init.d/openibd restart
• From the next loading of the driver it will be possible to access the SRP LUNs on /dev/mapper/.
  Note: It is possible that regular (not SRP) LUNs may also be present; the SRP LUNs may be identified by their name.
• It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log

4.1.2.7 Shutting Down SRP

SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown.

Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases:

1. Without High Availability

When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting
110. This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.

-V
This option sets the maximum verbosity level and forces log flushing. The -V is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.

--D, -D <flags>
This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:

BIT    LOG LEVEL ENABLED
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - currently unused

Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

--debug, -d <number>
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:

OPT   Description
-d0 - Ignore other SM nodes
-d1 - Force single threaded dispatchi
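The -D flags field described above is a plain bit mask, so a given value can be decoded by testing each bit in turn. The following is a minimal sketch of that decoding, based only on the bit table in this section; the function name decode_log_flags is hypothetical and is not an OpenSM interface.

```python
# Bit values and level names as listed in the -D option table above.
# 0x80 is documented as currently unused, so it is not included here.
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags):
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in LOG_LEVELS.items() if flags & bit]


if __name__ == "__main__":
    for value in (0x0, 0x3, 0xFF):
        print(hex(value), decode_log_flags(value))
```

For example, the default value 0x3 decodes to ERROR and INFO, and 0 decodes to an empty list, which matches the statement that -D 0 disables all messages.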
111. VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus.

Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]:

[Figure: ASCII diagram of a 2D 6x5 torus, with x = 0 to 5 on the horizontal axis and y on the vertical axis; the switch labels were lost in extraction]

For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D.

Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics.

Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate route
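The "disjoint segments" condition above can be checked mechanically. The following Python sketch splits a 1D ring at its failed links and reports the remaining connected segments. It is only an illustration of the rule stated in the text, not the torus-2QoS implementation, and the function names are hypothetical.

```python
def ring_segments(ring, failed_links):
    """Split a 1D ring into the segments that stay connected.

    ring: switch names in ring order; the link from the last switch back
          to the first one is implied.
    failed_links: iterable of (switch, switch) pairs that have failed.
    """
    failed = {frozenset(link) for link in failed_links}
    n = len(ring)
    # Index i marks a failed link between ring[i] and ring[(i + 1) % n].
    cuts = [i for i in range(n)
            if frozenset((ring[i], ring[(i + 1) % n])) in failed]
    if not cuts:
        return [list(ring)]
    segments = []
    for k, cut in enumerate(cuts):
        start = (cut + 1) % n            # first switch after this cut
        end = cuts[(k + 1) % len(cuts)]  # last switch before the next cut
        segment = []
        i = start
        while True:
            segment.append(ring[i])
            if i == end:
                break
            i = (i + 1) % n
        segments.append(segment)
    return segments


def ring_is_routable(ring, failed_links):
    """A ring stays routable as long as it is not split into disjoint segments."""
    return len(ring_segments(ring, failed_links)) <= 1


if __name__ == "__main__":
    ring = ["m", "S", "n", "T", "o", "p"]
    print(ring_segments(ring, [("n", "T"), ("T", "o")]))
    print(ring_is_routable(ring, [("S", "n")]))
```

With the ring m-S-n-T-o-p-m from the example, failing n-T and T-o yields the two segments named in the text, while a single failed link leaves one connected chain.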
112. _64
    Host Driver Version .................... MLNX_OFED_LINUX-2.0-2.0.0 (OFED-2.0-2.0.0): 2.6.32-279.el6.x86_64
    Host [illegible] ....................... PASS
    Firmware on CA #0 NIC .................. v2.9.1000
    Firmware Check on CA #0 (NIC) .......... PASS
    [illegible line for CA #1]
    Firmware Check on CA #1 (NIC) .......... PASS
    [illegible] ............................ PASS
    Number of CA Ports Active .............. 4
    Port State of Port #1 on CA #0 (NIC) ... UP 1X QDR (Ethernet)
    Port State of Port #2 on CA #0 (NIC) ... UP 1X QDR (Ethernet)
    Port State of Port #1 on CA #1 (NIC) ... UP 1X QDR (Ethernet)
    Port State of Port #2 on CA #1 (NIC) ... UP 1X QDR (Ethernet)
    Error Counter Check on CA #0 (NIC) ..... NA (Eth ports)
    Error Counter Check on CA #1 (NIC) ..... NA (Eth ports)
    Kernel Syslog Check .................... PASS
    Node GUID on CA #0 (NIC) ............... 00:02:c9:03:00:07:[illegible]
    Node GUID on CA #1 (NIC) ............... 00:02:c9:03:00:35:[illegible]

Note: After the installer completes, information about the Mellanox OFED installation such as prefix, kernel version, and installation parameters can be retrieved by running the command /etc/infiniband/info.

2.3.4 Installation Results

Software

• Most of the MLNX_OFED packages are installed under the "/usr" directory, except for the following packages, which are installed under the "/opt" directory:
    • openshmem, [illegible], fca and ibutils
• The kernel modules are installed under:
    • /lib/modules/`uname -r`/updates on SLES and Fedora Dis
113. _mask

Output Files

Table 27 lists the various flags of the command.

Table 27 - perfquery Flags and Options

Flag                      Optional /   Default (If Not   Description
                          Mandatory    Specified)
-h(help)                  Optional                       Print the help menu
-d(ebug)                  Optional                       Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-G(uid)                   Optional                       Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
-a                        Optional                       Apply query to all ports
-l                        Optional                       Loop ports
-r                        Optional                       Reset the counters after reading them
-C <ca_name>              Optional                       Use the specified channel adapter or router
-P <ca_port>              Optional                       Use the specified port
-R                        Optional                       Reset the counters
-t <timeout_ms>           Optional                       Override the default timeout for the solicited MADs [msec]
-V(ersion)                Optional                       Show version info
<lid|guid>                Optional                       LID or GUID
[[port][reset_mask]]

Examples

perfquery -R 0x20 1
perfquery -e -R 0x20 1
perfquery -R -a 32
perfquery -R 32 2 0x0fff
perfquery -R 32 2 0xf000
perfquery -r 32 1       # read performance counters and reset
perfquery -e -r 32 1    # read extended performance counters and reset
114. _qos tool or by the lldpad daemon if DCBX is used.

Note: When using Raw Ethernet QP mapping, the TOS/sk_prio to UP mapping is lost.

Note: Performing the Raw Ethernet QP mapping forces the QP to transmit using the given UP. If packets with VLAN tag are transmitted, UP in the VLAN tag will be overwritten with the given UP.

4.5.6 Map Priorities with tc_wrap.py/mlnx_qos

Network flow that can be managed by QoS attributes is described by a User Priority (UP). A user's sk_prio is mapped to UP, which in turn is mapped into TC.

• Indicating the UP
    • When the user uses sk_prio, it is mapped into a UP by the "tc" tool. This is done by the tc_wrap.py tool, which gets a list of <= 16 comma-separated UPs and maps the sk_prio to the specified UP.
      For example, tc_wrap.py -ieth0 -u 1,5 maps sk_prio 0 of the eth0 device to UP 1 and sk_prio 1 to UP 5.
    • Setting set_egress_map in VLAN maps the skb_priority of the VLAN to a vlan_qos. The vlan_qos represents a UP for the VLAN device.
    • In RoCE, rdma_set_option with RDMA_OPTION_ID_TOS could be used to set the UP
    • When creating QPs, the sl field in the ibv_modify_qp command represents the UP
• Indicating the TC
    • After mapping the skb_priority to UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that
115. a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with the fewest target LIDs already assigned to it is selected.

If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop
b. First prefer the ones that go to a different systemImageGuid (than the previous LID of the same LMC group)
c. If none, prefer those which go through another NodeGuid
d. Fall back to the number-of-paths method (if all go to the same node)

8.5.1 Effect of Topology Changes

OpenSM will preserve existing routing in any case where there is no change in the fabric switches, unless the -r (--reassign_lids) option is specified.

-r, --reassign_lids
    This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.

If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.

In the case of using th
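The basic port-selection rule described above (LMC = 0 case) can be sketched in a few lines of Python. This is an illustration of the stated rule only, not OpenSM code: the function name, the dictionary-based inputs, and the tie-break on the lowest port number are all assumptions made for the example.

```python
def pick_port(min_hops, load):
    """Choose the output port for one target LID.

    min_hops: {port: hop count to the target LID through that port}
    load:     {port: number of target LIDs already routed through it};
              updated in place, mirroring the per-port counter.
    """
    best = min(min_hops.values())
    # Only ports that reach the LID with the minimum hop count qualify.
    candidates = [port for port, hops in min_hops.items() if hops == best]
    # Among those, take the least-loaded port (lowest port number on a tie,
    # which is an assumption of this sketch).
    port = min(candidates, key=lambda p: (load.get(p, 0), p))
    load[port] = load.get(port, 0) + 1
    return port


if __name__ == "__main__":
    load = {}
    hops = {1: 2, 2: 2, 3: 3}   # ports 1 and 2 share the MinHop
    print([pick_port(hops, load) for _ in range(4)])
    print(load)
```

Repeated calls alternate between the two MinHop ports, which is the balancing effect the counter is meant to achieve; port 3 is never chosen because it is not on a minimum-hop path.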
116. achine in an IB subnet.

By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file (messages) registers only general major events; the second file (opensm.log) includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

Note: If a fatal, non-recoverable error occurs, opensm exits.

Running OpenSM As Daemon

OpenSM can also run as a daemon. To run OpenSM in this mode, enter:
    host1# /etc/init.d/opensmd start

osmtest Description

osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory with all the object fields, and match it to a pre-saved one. See Section 8.3.2.

osmtest has the following test flows:
• Multicast Compliancy test
• Event Forwarding test
• Service Record registration test
• RMPP stress test
• Small SA Queries stress test

8.3.1 Syntax

osmtest [OPTIONS]

where OPTIONS are:

-f, --flow
    This option directs osmtest to run a specific flow:
    Flow    Description
    c       create an inventory file with all nodes, ports and paths
117. ailable.

In general LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies; it is topology agnostic and fares well in the face of faults.

It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path.

The algorithm was developed by Simula Research Laboratory.

Use the '-R lash -Q' option to activate the LASH algorithm.

Note: QoS support has to be turned on in order that SL/VL mappings are used.

Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.

For open regular cartesian meshes the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh it reorders the ports in dimension order
118. ailable MXM parameters and their default values, run the /opt/mellanox/mxm/bin/mxm_dump_config utility, which is part of the MXM RPM.

MXM parameters can be modified in one of the following methods:

• Modifying the default MXM parameters value as part of the mpirun:
    mpirun -x MXM_UD_RX_MAX_BUFFERS=128000 <...>
• Modifying the default MXM parameters value from SHELL:
    export MXM_UD_RX_MAX_BUFFERS=128000
    mpirun <...>

5.3.4 Configuring Multi-Rail Support

Multi-Rail support enables the user to use more than one of the active ports on the card, making better use of the resources. It provides a combined throughput among the used ports.

> To configure dual rail support:

• Specify the list of ports you would like to use to enable multi-rail support.
    -x MXM_[illegible]_PORTS=cardName:portNum
    mpirun -x MXM_[illegible]_PORTS=mlx4_0:1,mlx4_0:2 <...>

5.3.5 Configuring MXM over the Ethernet Fabric

> To configure MXM over the Ethernet fabric:

Step 1. Make sure the Ethernet port is active.
    ibv_devinfo

    Note: ibv_devinfo displays the list of cards and ports in the system. Please make sure (in the ibv_devinfo output) that the desired port has Ethernet at the link_layer field and that its state is PORT_ACTIVE.

Step 2. Specify the ports you would like to use, if there is a non-Ethernet active port in the card.
    mpirun -x MXM_[illegible]_PORTS=mlx4_0:1 <...>

5.4
119. airs of sources / destinations and groups these paths into virtual layers in such a way as to avoid deadlock.

Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link from HCA between and switch does not need virtual layers as deadlock will not arise between switch and HCA.

In more detail, the algorithm works as follows:

1. LASH determines the shortest path between all pairs of source / destination switches. Note, LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.

2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.

3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.

Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers av
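Step 2 above (layer assignment guarded by a channel dependency graph) can be illustrated with the following Python sketch. It is a simplified model under stated assumptions, not the opensm implementation: paths are given as lists of switch names, a "channel" is a directed switch-to-switch link, each path is tried against existing layers in order (first fit), and the balancing of step 3 is not modelled. All function names are hypothetical.

```python
def channel_dependencies(path):
    """Return (channel, next_channel) pairs for one path.

    A channel is a directed link (a, b); a path that uses channel (a, b)
    and then (b, c) makes (b, c) depend on (a, b).
    """
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))


def has_cycle(graph):
    """Depth-first search for a cycle in {node: set(successors)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True          # back edge: dependency cycle
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


def assign_layers(paths):
    """Place each path in the first layer whose dependency graph stays acyclic."""
    layers = []  # each layer: {"paths": [...], "cdg": {channel: set(channels)}}
    for path in paths:
        deps = channel_dependencies(path)
        for layer in layers:
            trial = {ch: set(succ) for ch, succ in layer["cdg"].items()}
            for src, dst in deps:
                trial.setdefault(src, set()).add(dst)
            if not has_cycle(trial):
                layer["cdg"] = trial
                layer["paths"].append(path)
                break
        else:
            # Adding the path would deadlock every existing layer: open a new one.
            cdg = {}
            for src, dst in deps:
                cdg.setdefault(src, set()).add(dst)
            layers.append({"paths": [path], "cdg": cdg})
    return [layer["paths"] for layer in layers]


if __name__ == "__main__":
    ring_paths = [["A", "B", "C"], ["B", "C", "D"], ["C", "D", "A"], ["D", "A", "B"]]
    for number, members in enumerate(assign_layers(ring_paths)):
        print(number, members)
```

On a four-switch ring, the first three two-hop paths fit in one layer, while the fourth would close a dependency cycle and is therefore placed in a second layer.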
120. al verbosity (-vvv or -v -v -v)

-V(ersion)        Optional    Show version info
-a(ll)            Optional    Show all LIDs in range, including invalid entries
-n(o_dests)       Optional    Do not try to resolve destinations
-D(irect)         Optional    Use directed path address arguments. The path is a comma-separated list of out ports.
                              Examples:
                              "0"          # self port
                              "0,1,2,1,4"  # out via port 1, then 2, ...
-G(uid)           Optional    Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
-M(ulticast)      Optional    Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
-s <smlid>        Optional    Use <smlid> as the target LID for SM/SA queries
-C <ca_name>      Optional    Use the specified channel adapter or router
-P <ca_port>      Optional    Use the specified port

Table 25 - ibportstate Flags and Options (continued)

Flag                        Optional /   Default (If Not   Description
                            Mandatory    Specified)
-t <timeout_ms>             Optional                       Override the default timeout for the solicited MADs [msec]
<dest dr_path|lid|guid>     Optional                       Destination's directed path, LID, or GUID
<startlid>                  Optional                       Starting LID in an MLID range
<endlid>                    Optional                       Ending LID in an MLID range

Examples

1. Dump all Lids with valid out ports of the switch with Lid 2.

> ibroute 2
Unicast l
121. all script with the '--without-fw-update' option and now you wish to (manually) update firmware on your adapter card(s), you need to perform the following steps:

Note: If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image" on page 205.

Note: The following steps are also appropriate in case you wish to burn newer firmware that you have downloaded from Mellanox Technologies' Web site (http://www.mellanox.com > Downloads > Firmware).

Step 1. Start mst.
    host1# mst start

Step 2. Identify your target InfiniBand device for firmware update.

    1. Get the list of InfiniBand device names on your machine.

    host1# mst status
    MST modules:
        MST PCI module loaded
        MST PCI configuration module loaded
        MST Calibre (I2C) module is not loaded

    MST devices:
    /dev/mst/mt25418_pciconf0 - PCI configuration cycles access.
        bus:dev.fn=02:00.0 addr.reg=88 data.reg=92
        Chip revision is: A0
    /dev/mst/mt25418_pci_cr0 - PCI direct access.
        bus:dev.fn=02:00.0 bar=0xdef00000 size 0x100000
        Chip revision is: A0
    /dev/mst/mt25418_pci_msix0 - PCI direct access.
        bus:dev.fn=02:00.0 bar=0xdeefe000 size 0x2000
    /dev/mst/mt25418_pci_uar - PCI direct access.
        bus:dev.fn=02:00.0 bar=0xdc800000 size 0x800000

    2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.

Step 3. Burn firmware.

    1.
122.
7.2.6.3.1 Running an Application on a Certain NUMA Node

In order to run an application on a certain NUMA node, the process affinity should be set either in the command line or with an external tool.

For example, if the adapter's NUMA node is 1 and NUMA 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.

> To run an application, run the following commands:
    taskset -c 8-15 ib_write_bw
or:
    taskset 0xff00 ib_write_bw -a

7.2.7 IRQ Affinity

The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.

The following command turns off the IRQ balancer:
    > /etc/init.d/irqbalance stop

The following command assigns the affinity of a single interrupt vector:
    > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity

Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

7.2.7.1 IRQ Affinity Configuration

Note: It is recommended to set each IRQ to a different core.

For Sandy Bridge or AMD systems, set the IRQ affinity to the adapter's NUMA node:

• For optimizing single port t
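The hexadecimal masks used with taskset and smp_affinity above follow directly from the rule "bit i stands for core i". The following Python sketch computes such masks and spreads a list of IRQs over the cores of one NUMA node, one core per IRQ. The helper names and the IRQ numbers are hypothetical; the sketch only prints values and does not write to /proc.

```python
def core_mask(cores):
    """Build an affinity mask with bit i set for every core i in cores."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def spread_irqs(irqs, cores):
    """Give each IRQ a single core, cycling through cores when IRQs outnumber them.

    Returns {irq: mask as a hex string without the 0x prefix}, the form
    expected by /proc/irq/<irq vector>/smp_affinity.
    """
    cores = list(cores)
    return {
        irq: format(1 << cores[index % len(cores)], "x")
        for index, irq in enumerate(irqs)
    }


if __name__ == "__main__":
    # Cores 8-15 (NUMA node 1 in the example above) give the taskset mask 0xff00.
    print(hex(core_mask(range(8, 16))))
    # Hypothetical IRQ numbers, spread over the same cores.
    for irq, mask in spread_irqs([40, 41, 42], range(8, 16)).items():
        print(f"echo {mask} > /proc/irq/{irq}/smp_affinity")
```

The printed echo lines are meant to be reviewed and run manually as root, after the IRQ balancer has been stopped as described above.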
123. and set the CC manager main settings, perform the following:

• To enable/disable the Congestion Control mechanism on the fabric nodes, set the following parameter:
    enable
    • The values are: <TRUE | FALSE>.
    • The default is: True

• CC manager configures CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive CC mechanism is in its response to congestion. To manually modify CC manager behavior by providing it with an arbitrary fabric size, set the following parameter:
    num_hosts
    • The values are: [0-48K].
    • The default is: 0 (based on the CCT calculation on the current subnet size)

• The smaller the value of the parameter, the faster HCAs will respond to the congestion and will throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter:
    marking_rate
    • The values are: [illegible]
    • The default is: 0xa

• You can set the minimal packet size that can be marked with FECN. Any packet less than this size [bytes] will not be marked with FECN. To do so, set the following parameter:
    packet_size
    • The values are: [0-0x3fc0].
    • The default is: 0x200

• When number of errors exceeds 'max_errors' of send/receive errors or timeouts in less than 'error_window' seconds, the CC MG
124. ange the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes".

Note: For the changes to take effect, run: /etc/init.d/openibd restart

Note: When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).

4.1.2.2 Manually Establishing an SRP Connection

The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically.

• Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:

    echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],\
    pkey=ffff,service_id=[service[0] value] > \
    /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target

See Section 4.1.2.3 for instructions on how the parameters in this echo command may be obtained.

Notes:
• Execution of the above "echo" command may take some time
• The SM must be running while the command executes
• It is possible to include additional parameters in the echo command:
    • max_cmd_per_lun - Default: 63
    • max_sect (short for max_sectors) - sets the request size of a command
    • io_class
125. are consistent:
    # cat /proc/cpuinfo | grep "cpu MHz"

• Check that the output frequencies are the same as the maximum supported.
  If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in section "Recommended BIOS Settings" on page 110 to verify that power state is disabled.
• Check whether the current CPU frequency is configured to the maximum available frequency:
    # cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

7.2.4.1 Setting the Scaling Governor

If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance:
• freq_table
• acpi_cpufreq: this module is architecture dependent.

It is also recommended to disable the module cpuspeed; this module is also architecture dependent.

> To set the scaling mode to performance, use:
    # echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

> To disable cpuspeed, use:
    # service cpuspeed stop

7.2.4.2 Kernel Idle Loop Tuning

The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wake-up time but may result in higher power consumption.

To tune the kernel idle loop, set the following options in the /etc/modprobe.d/mlx4.conf file:

• For MLNX_OFED 2.0.x:
    options mlx4_core enable_sys_tune=1
• For MLNX_EN 1.5.10:
126. at is UP (physical link state is LinkUp).

Examples

1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID - 3 in this case) to obtain additional link information using ibportstate.

> ibstatus mlx4_0:1
Infiniband device 'mlx4_0' port 1 status:
    default gid:    fe80:0000:0000:0000:0000:0000:9289:3895
    base lid:       0x3
    sm lid:         0x3
    state:          2: INIT
    phys state:     5: LinkUp
    rate:           20 Gb/sec (4X DDR)

> ibportstate -C mlx4_0 3 1 query
PortInfo:
# Port info: Lid 3 port 1
LinkState:.......................Initialize
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
LinkSpeedActive:.................5.0 Gbps

2. Query the status of two channel adapters using directed paths.

> ibportstate -C mlx4_0 -D 0 1
PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkState:.......................Initialize
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
LinkSpeedActive:
127. at recognizes an application that is pinned to a remote NUMA node and activates a flow that improves the out-of-the-box latency and throughput.

However, the NUMA node recognition must be enabled as described in section "Tuning for Intel® Sandy Bridge Platform" on page 116.

In systems which do not support SLIT, the following environment variable should be applied:
    MLX4_LOCAL_CPUS=0x[bit mask of local NUMA node]

Example for a local NUMA node whose cores are 0-7:
    MLX4_LOCAL_CPUS=0xff

Additional modification can apply to impact this feature by changing the following environment variable:
    MLX4_STALL_NUM_LOOP=[integer] (default: 400)

Note: The default value is optimized for most applications. However, several applications might benefit from increasing/decreasing this value.

7.2.6.2 Tuning for AMD® Architecture

On AMD architecture there is a difference between a 2 socket system and a 4 socket system.

• With a 2 socket system the PCIe adapter will be connected to socket 0 (nodes 0,1).
• With a 4 socket system the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).

7.2.6.3 Recognizing NUMA Node Cores

> To recognize NUMA node cores, run the following command:
    # cat /sys/devices/system/node/node[X]/cpulist | cpumap

Example:
    # cat /sys/devices/system/node/node1/cpulist
    1,3,5,7,9,11,13,15
    # cat /sys/devices/system/node/node1/cpumap
    0000aaaa
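The cpulist and cpumap files describe the same set of cores in two notations, and the mask form is what MLX4_LOCAL_CPUS expects. The following Python sketch converts between them. The function names are hypothetical, and the sketch works on strings only, so it can be tried without access to /sys.

```python
def parse_cpulist(text):
    """Expand a cpulist string such as "0-7" or "1,3,5" into a list of core numbers."""
    cores = []
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            cores.extend(range(int(low), int(high) + 1))
        elif part:
            cores.append(int(part))
    return cores


def cpulist_to_mask(text):
    """Return the bit mask (bit i set for core i) equivalent to a cpulist string."""
    mask = 0
    for core in parse_cpulist(text):
        mask |= 1 << core
    return mask


def parse_cpumap(text):
    """Parse a cpumap string; large systems print comma-separated 32-bit groups."""
    return int(text.strip().replace(",", ""), 16)


if __name__ == "__main__":
    # Cores 0-7 correspond to the MLX4_LOCAL_CPUS=0xff example above.
    print(f"MLX4_LOCAL_CPUS={cpulist_to_mask('0-7'):#x}")
    # The odd-numbered cores 1..15 correspond to the cpumap value 0000aaaa.
    print(cpulist_to_mask("1,3,5,7,9,11,13,15") == parse_cpumap("0000aaaa"))
```

In practice the input strings would be read from /sys/devices/system/node/node[X]/cpulist and cpumap for the adapter's NUMA node.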
128. at will join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric).

• To execute SRP daemon as a daemon you may run "run_srp_daemon" (found under /usr/sbin/), providing it with the same options used for running srp_daemon.

  Note: Make sure only one instance of run_srp_daemon runs per port.

• To execute SRP daemon as a daemon on all the ports, run "srp_daemon.sh" (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log.

• It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 4.1.2.6).

  For the changes in openib.conf to take effect, run:
    /etc/init.d/openibd restart

4.1.2.5 Multiple Connections from Initiator IB Port to the Target

Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA.

In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection comma
129. ated to support the AR, the AR Manager will need to be restarted (by restarting Subnet Manager) to allow it to configure the AR on this switch.
    This option can be changed on-the-fly.
    Default: true

AR_MODE <bounded|free>
    Adaptive Routing Mode:
    • free: no constraints on output port selection.
    • bounded: the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
    This option can be changed on-the-fly.
    Default: bounded

AGEING_TIME <usec>
    Applicable to bounded AR mode only. Specifies how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst (32-bit value).
    This option can be changed on-the-fly.
    Default: 30

MAX_ERRORS <N>
ERROR_WINDOW <N>
    When number of errors exceeds 'MAX_ERRORS' of send/receive errors or timeouts in less than 'ERROR_WINDOW' seconds, the AR Manager will abort, returning control back to the Subnet Manager.
    This option can be changed on-the-fly.
    Values for both options: [0-0xffff]
    • MAX_ERRORS = 0: zero tolerance - abort configuration on first error. Default: 10
    • ERROR_WINDOW = 0: mechanism disabled - no error checking. Default: 5

LOG_FILE <full path>
    AR Manager log file.
    This option can be changed on-the-fly.
    Default: /var/log/armgr
130. ations and Acronyms

Table 2 - Abbreviations and Acronyms

Abbreviation / Acronym    Whole Word / Description
B        (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
b        (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
FW       Firmware
HCA      Host Channel Adapter
HW       Hardware
IB       InfiniBand
iSER     iSCSI RDMA Protocol
LSB      Least significant byte
lsb      Least significant bit
MSB      Most significant byte
msb      Most significant bit
NIC      Network Interface Card
SW       Software
VPI      Virtual Protocol Interconnect
IPoIB    IP over InfiniBand
PFC      Priority Flow Control
PR       Path Record
RDS      Reliable Datagram Sockets
RoCE     RDMA over Converged Ethernet
SDP      Sockets Direct Protocol
SL       Service Level
SRP      SCSI RDMA Protocol
MPI      Message Passing Interface
EoIB     Ethernet over Infiniband
QoS      Quality of Service
ULP      Upper Level Protocol
VL       Virtual Lane
vHBA     Virtual SCSI Host Bus adapter
uDAPL    User Direct Access Programming Library

Glossary

The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is inclu
131. ble is organized by MLID.

Network Interface Card (NIC)
    A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.

Standby Subnet Manager
    A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.

Subnet Administrator (SA)
    An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.

Subnet Manager (SM)
    One of several entities involved in the configuration and control of an IB fabric.

Unicast Linear Forwarding Tables (LFT)
    A table that exists in every switch providing the port through which packets should be sent to each LID.

Virtual Protocol Interconnect (VPI)
    A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX®) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).

Related Documentation

Table 4 - Reference Documents

Document Name                                                    Description
InfiniBand Architecture Specification, Vol. 1, Release 1.2.1     [...] is provided by IBTA
IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002),        Physical Layer Specifications. Amendment: Media Access Control (MAC)
Document # PDF: 5594996                                          Parameters for 10 Gb/s Operation
132. ble motherboard BIOS
• Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
• Mellanox ConnectX® VPI Adapter Card family with SR-IOV capability

4.13.2 Setting Up SR-IOV

Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.

Step 1. Enable "SR-IOV" in the system BIOS.

    [Figure: BIOS SETUP UTILITY screenshot]

Step 2. Enable "Intel Virtualization Technology".

    [Figure: BIOS SETUP UTILITY screenshot, Intel Virtualization Tech set to Enabled]

Step 3. Install the hypervisor that supports SR-IOV.

Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel.

    For example, for Intel systems, add:

    default=0
    timeout=5
    splashimage=(hd0,0)/grub/splash.xpm.gz
    hiddenmenu
    title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
        initrd /initrd-2.6.32-36.x86-64.img

    Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.

Step 5. Install the MLNX_OFED driver for Linux that supports SR-IOV.

Step 6. Verify the HCA is configured to support SR-IOV.
133. bric links pmCounters

-P <PM=<Trash>>
    If any of the provided pm is greater than its provided value, print it to screen.

-skip <skip-option(s)>
    Skip the executions of the selected checks. Skip options (one or more can be specified): dup_guids zero_guids pm logical_state part ipoib all

-wt <file-name>
    Write out the discovered topology into the given file. This flag is useful if you later want to check for changes from the current state of the fabric. A directory named ibdiag_ibnl is also created by this option, and holds the IBNL files required to load this topology. To use these files you will need to set the environment variable named IBDM_IBNL_PATH to that directory. The directory is located in /tmp or in the output directory provided by the -o flag.

-load_db <file-name>
    Load subnet data from the given .db file, and skip the subnet discovery stage.
    Note: Some of the checks require actual subnet discovery, and therefore would not run when load_db is specified. These checks are: Duplicated/zero guids, link state, SMs status.

-h|--help
    Prints the help page information.

-V|--version
    Prints the version of the tool.

--vars
    Prints the tool's environment variables and their values.

Output Files

Table 20 - ibdiagnet (of ibutils) Output Files

Output File      Description
ibdiagnet.log    A dump of all the application reports generated according to the provided flags
ibdiagnet.lst    List of all the nodes, ports an
134. but only the actions such as joining multicast group, that need to be taken when using the API.

    • Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is displayed as zero when querying the port.
    • With RoCE, the alternate path is not set for RC QP and therefore APM is not supported.
    • Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with the relevant values before establishing a connection. Hence, it is recommended to work with RDMA-CM to establish a connection, as it takes care of filling the path record structure.
    • The GID table for each port is populated with N+1 entries, where N (N >= 0) is the number of VLAN devices over this port.

2 Installation

This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.

2.1 Hardware and Software Requirements

Table 1 - Software and Hardware Requirements

Requirements    Description
Platforms       A server platform with an adapter card based on one of the following Mellanox Technologies InfiniBand HCA devices:
                • MT27508 ConnectX-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
                • MT4113 Connect-IB (IB) (firmware: fw-Connect-IB)
                For the list of supported architecture platforms, please refer to the Mellanox OF
135. cessors

The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.

Table 11 - Recommended BIOS Settings for Intel Nehalem/Westmere Processors

             BIOS Option                        Values
General      Operating Mode / Power profile     Maximum Performance
Processor    C-States                           Disabled
             Turbo mode                         Disabled
             Hyper-Threading (1)                Disabled. Recommended for latency and message rate sensitive applications.
             CPU frequency select               Max performance
Memory       Memory speed                       Max performance
             Memory channel mode                Independent
             Node Interleaving                  Disabled (NUMA)
             Channel Interleaving               Enabled
             Thermal Mode                       Performance

(1) Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process, due to lower frequency of a single logical core when hyper-threading is enabled.

7.1.3.4 AMD Processors

The following table displays the recommended BIOS settings in machines with AMD-based processors.

Table 12 - Recommended BIOS Settings for AMD Processors

             BIOS Option                        Values
General      Operating Mode / Power profile     Maximum Performance
Processor    C-States                           Disabled
             Turbo mode                         Disabled
             HPC Optimizations                  Enabled
             CPU frequency select               Max performance
136. ch node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.

3. Fat-tree Routing Algorithm
   This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully populated, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.

4. LASH Routing Algorithm
   Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.

5. DOR Routing Algorithm
   Based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube and for meshes when cabled as a mesh.

6. Torus-2QoS Routing Algorithm
   Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Additionally, it can route around multiple failed fabric links or a single failed fabric switch without intro
137. ch rules such that the target QoS-Level definition is found. Given the QoS-Level, a path(s) search is performed with the given restrictions imposed by that level.

4.5 Quality of Service Ethernet

4.5.1 Quality of Service Overview

Quality of Service (QoS) is a mechanism for assigning a priority to a network flow (socket, rdma_cm connection) and managing its guarantees, limitations and its priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a two- or three-stage process. The TC is assigned the QoS attributes, and the different flows behave accordingly.

4.5.2 Mapping Traffic to Traffic Classes

Mapping traffic to TCs consists of several actions which are user controllable, some controlled by the application itself and others by the system/network administrators.

The following is the general flow for mapping traffic to Traffic Classes:
1. The application sets the required Type of Service (ToS).
2. The ToS is translated into a Socket Priority (sk_prio).
3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).
4. The UP is mapped to TC by the network/system administrator.
5. TCs hold the actual QoS parameters.

QoS can be applied on the following types of traffic. However, the general QoS flow may vary among them:
    • Plain Ethernet: Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver
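The mapping flow in steps 1 to 4 can be pictured as a chain of table lookups. The following is a minimal sketch only, not driver code: the table contents and the function name traffic_class_for are hypothetical and exist purely to show the ToS, sk_prio, UP, TC order. Real values are set by the application and by the administrators.

```python
# Hypothetical lookup tables. Real values are configured by the
# application and by the system/network administrators.
TOS_TO_SK_PRIO = {0: 0, 8: 2, 16: 4, 24: 6}   # ToS -> socket priority
SK_PRIO_TO_UP = {0: 0, 2: 1, 4: 3, 6: 5}      # sk_prio -> User Priority
UP_TO_TC = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}  # UP -> TC


def traffic_class_for(tos=None, sk_prio=None):
    """Follow steps 1-4: ToS -> sk_prio -> UP -> TC.

    An application may skip the ToS step and set sk_prio directly.
    """
    if sk_prio is None:
        sk_prio = TOS_TO_SK_PRIO[tos]
    up = SK_PRIO_TO_UP[sk_prio]
    return UP_TO_TC[up]


print(traffic_class_for(tos=16))     # ToS 16 -> sk_prio 4 -> UP 3 -> TC 1
print(traffic_class_for(sk_prio=6))  # sk_prio 6 -> UP 5 -> TC 2
```

Keeping each stage in its own table mirrors the split of responsibilities described above: the application owns the first stage, the administrators own the later ones.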
138. cified device. May specify more than one device.

Table 23 - ibstatus Flags and Options

Flag: <port>
    Optional / Mandatory: Optional, but requires specifying a device name
    Default (If Not Specified): All ports of the specified device
    Description: Print information for the specified port only (of the specified device)

Examples:

1. List the status of all available InfiniBand devices and their ports.

    > ibstatus
    Infiniband device 'mlx4_0' port 1 status:
        default gid:    fe80:0000:0000:0000:0000:0000:0007:3896
        base lid:       0x3
        sm lid:         0x3
        state:          4: ACTIVE
        phys state:     5: LinkUp
        rate:           20 Gb/sec (4X DDR)

    Infiniband device 'mlx4_0' port 2 status:
        default gid:    fe80:0000:0000:0000:0000:0000:0007:3897
        base lid:       0x1
        sm lid:         0x1
        state:          4: ACTIVE
        phys state:     5: LinkUp
        rate:           20 Gb/sec (4X DDR)

    Infiniband device 'mthca0' port 1 status:
        default gid:    fe80:0000:0000:0000:0002:c900:0101:d151
        base lid:       0x0
        sm lid:         0x0
        state:          2: INIT
        phys state:     5: LinkUp
        rate:           10 Gb/sec (4X)

    Infiniband device 'mthca0' port 2 status:
        default gid:    fe80:0000:0000:0000:0002:c900:0101:d152
        base lid:       0x0
        sm lid:         0x0
        state:          2: INIT
        phys state:     5: LinkUp
        rate:           10 Gb/sec (4X)

2. List
139. com|72.3.194.0|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 1354 (1.3K) [text/plain]
    Saving to: 'RPM-GPG-KEY-Mellanox'
    2013-08-20 13:52:30 (247 MB/s) - 'RPM-GPG-KEY-Mellanox' saved [1354/1354]

Step 4. Install the key.

    # sudo rpm --import RPM-GPG-KEY-Mellanox

Step 5. Check that the key was successfully imported.

    # rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
    gpg-pubkey-a9e4b643-520791ba gpg(Mellanox Technologies <support@mellanox.com>)

Step 6. Create a YUM repository configuration file called "/etc/yum.repos.d/mlnx_ofed.repo" with the following content:

    [mlnx_ofed]
    name=MLNX_OFED Repository
    baseurl=file:///<path to extracted MLNX_OFED package>
    enabled=1
    gpgkey=file:///<path to the downloaded key RPM-GPG-KEY-Mellanox>
    gpgcheck=1

Step 7. Check that the repository was successfully added.

    # yum repolist
    Loaded plugins: product-id, security, subscription-manager
    This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
    repo id      repo name                            status
    mlnx_ofed    MLNX_OFED Repository                 108
    rpmforge     RHEL 6Server - RPMforge.net - dag    4,597
    repolist: 8,351

2.5.2 Installing MLNX_OFED using the YUM Tool

After setting up the YUM repository for MLNX_OFED package, perform the follo
140. continue? [y/N]: y
    Running /usr/sbin/vendor_pre_uninstall.sh
    Removing OFED Software installations
    Running /bin/rpm -e --allmatches kernel-ib kernel-ib-devel libibverbs libibverbs-devel libibverbs-devel-static libibverbs-utils libmlx4 libmlx4-devel libibcm libibcm-devel libibumad libibumad-devel libibumad-static libibmad libibmad-devel libibmad-static librdmacm librdmacm-utils librdmacm-devel ibacm opensm-libs opensm-devel perftest compat-dapl compat-dapl-devel dapl dapl-devel dapl-devel-static dapl-utils srptools infiniband-diags guest ofed-scripts opensm-devel
    warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
    Running /tmp/2818-ofed_vendor_post_uninstall.sh

Step 3. Restart the server.

4.13.6 Burning Firmware with SR-IOV

The following procedure explains how to create a binary image with SR-IOV enabled that has 63 VFs. However, the number of VFs varies according to the working mode requirements.

To burn the firmware:

Step 1. Verify you have MFT installed in your machine.

Step 2. Enter the firmware directory, according to the HCA type (e.g. ConnectX-3). The path is: /mlnx_ofed/firmware/<device>/<FW version>

Step 3. Find the ini file that contains the HCA's PSID. Run:

    # ibv_devinfo | grep board_id
    board_id: MT_1090120019

If such an ini file cannot be found in the firmware directory, you may want to dump the configuration file using mstflint. Run:
141. creases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.

-V
    This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.

-vf <flags>
    This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:

    BIT     LOG LEVEL ENABLED
    0x01    ERROR (error messages)
    0x02    INFO (basic messages, low volume)
    0x04    VERBOSE (interesting stuff, moderate volume)
    0x08    DEBUG (diagnostic, high volume)
    0x10    FUNCS (function entry/exit, very high volume)
    0x20    FRAMES (dumps all SMP and GMP frames)
    0x40    ROUTING (dump FDB routing information)
    0x80    currently unused

    Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

-h, --help
    Display this usage info then exit.

8.3.2 Running osmtest

To run osmtest in the default mode, simply enter:

    host1# osmtest

The default mode runs all the flows except for the Quality of Service flow (see Section 8.6).

After installing opensm (and if the InfiniBand fabric is stable), it is
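The -vf flags field is a plain bitmask, so a value can be composed or decoded with bitwise operations. The sketch below uses the bit values from the table above; the helper names vf_flags and enabled_levels are illustrative and are not part of osmtest.

```python
# Log-level bits as listed for the osmtest -vf option.
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
}


def vf_flags(*levels):
    """OR together the bits for the named log levels."""
    flags = 0
    for name in levels:
        flags |= LOG_LEVELS[name]
    return flags


def enabled_levels(flags):
    """List the level names whose bit is set in flags."""
    return [name for name, bit in LOG_LEVELS.items() if flags & bit]


print(hex(vf_flags("ERROR", "INFO")))  # the default without -vf: 0x3
print(enabled_levels(0x0B))            # ['ERROR', 'INFO', 'DEBUG']
```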
142. d links in the fabric

Table 20 - ibdiagnet (of ibutils) Output Files

Output File         Description
ibdiagnet.fdbs      A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs    A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks     In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm        List of all the SMs (state and priority) in the fabric
ibdiagnet.pm        A dump of the pm Counters values of the fabric links
ibdiagnet.pkey      A dump of the existing partitions and their member host ports
ibdiagnet.mcg       A dump of the multicast groups, their properties and member host ports
ibdiagnet.db        A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option

In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.

After scanning the fabric, if the -r option is provided, a full report of the fabric qual
143. ded here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.

Table 3 - Glossary

Channel Adapter (CA), Host Channel Adapter (HCA)
    An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).

HCA Card
    A network adapter card based on an InfiniBand channel adapter device.

IB Devices
    Integrated circuit implementing InfiniBand compliant communication.

IB Cluster/Fabric/Subnet
    A set of IB devices connected by IB cables.

In-Band
    A term assigned to administration activities traversing the IB connectivity only.

Local Identifier (LID)
    An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.

Local Device/Node/System
    The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.

Local Port
    The IB port of the HCA through which IBDIAG tools connect to the IB fabric.

Master Subnet Manager
    The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.

Multicast Forwarding Tables
    A table that exists in every switch providing the list of ports to forward received multicast packet. The ta
144. destroy_flow(struct ibv_flow *flow_id)

Input parameters:
ibv_destroy_flow requires struct ibv_flow, which is the return value of ibv_create_flow in case of success.

Output parameters:
Returns 0 on success, or the value of errno on failure.

For further information, please refer to the ibv_destroy_flow man page.

Ethtool

Ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool man page for all the ways to specify a flow.

Examples:

    ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2

All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).

    ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2

All packets that contain the above source IP address and destination port are to be steered into rx-ring 2. When destination MAC is not given, the user's destination MAC is filled automatically.

    ethtool -u eth5

Shows all of ethtool's steering rules.

When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in kernel (v2.6.28).

MLX4 Driver Support

The mlx4 driver supports only a subset
145. dm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:

    • Establish an SRP connection by itself (without the need to issue the "echo" command described in Section 4.1.2.2)
    • Continue running in background, detecting new targets and establishing SRP connections with them (daemon mode)
    • Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N> where <N> is a digit
    • Enable High Availability operation (together with Device-Mapper Multipath)
    • Have a configuration file that determines the targets to connect to

1. srp_daemon commands equivalent to ibsrpdm:

    "srp_daemon -a -o" is equivalent to "ibsrpdm"
    "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"

    Note: These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.

2. srp_daemon extensions to ibsrpdm:

    • To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num> (and to generate output suitable for "echo"), you may execute:

        host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>

      To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run "ls /sys/class/infiniband".

    • To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.

    • Ex
146. ducing deadlocks, and without changing path SL values granted before the failure.

OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation (e.g. when one or more CAs/RTRs/leaf switches go down, or one or more of these nodes come back after being down). A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.

OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below).

The basic routing algorithm is comprised of two stages:

1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would go up after a down step was used.

2. Once MinHop matrices exist, each switch is visited and for each target LID
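The idea of stage 1, a hop count from every node to every target, can be illustrated with a plain breadth-first search. This is a simplified sketch and not OpenSM's implementation: it ignores ports, LIDs and the up/down direction tracking described above, and the four-switch fabric is a hypothetical example.

```python
from collections import deque


def min_hops_from(graph, target):
    """Breadth-first search outward from target; returns {node: hop count}."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def min_hop_matrix(graph):
    """Run one BFS per target; matrix[target][node] is the hop count."""
    return {target: min_hops_from(graph, target) for target in graph}


# Hypothetical fabric: a chain sw1 - sw2 - sw3, with sw4 attached to sw2.
fabric = {
    "sw1": ["sw2"],
    "sw2": ["sw1", "sw3", "sw4"],
    "sw3": ["sw2"],
    "sw4": ["sw2"],
}

matrix = min_hop_matrix(fabric)
print(matrix["sw3"]["sw1"])  # 2 hops from sw1 to sw3
```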
147. e, and one of its loops may experience a deadlock (due, for example, to high pressure).

The UPDN algorithm is based on the following main stages:

1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.

    Note: The user can override the node list manually.

    Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.

2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.

3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.

At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.

Up Dow
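The ranking stage and the rule it supports can be sketched as follows. This is an illustration under simplifying assumptions, not OpenSM code: ranks come from a multi-source BFS starting at the roots, and a path is treated as legal if it never moves toward a lower rank ("up") after it has moved toward a higher rank ("down"). Ties between equal ranks, which the text above says involve guid values, are ignored here. The two-spine, two-leaf fabric is hypothetical.

```python
from collections import deque


def rank_switches(graph, roots):
    """Multi-source BFS: roots get rank 0, others their distance to the nearest root."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def is_up_down_path(path, rank):
    """True if the path never goes up (lower rank) after going down (higher rank)."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if rank[b] > rank[a]:
            gone_down = True
        elif rank[b] < rank[a] and gone_down:
            return False
    return True


# Hypothetical fabric: two root (spine) switches, two leaf switches.
fabric = {
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
}
ranks = rank_switches(fabric, ["spine1", "spine2"])

print(ranks["leaf1"])                                        # 1
print(is_up_down_path(["leaf1", "spine1", "leaf2"], ranks))  # True: up, then down
print(is_up_down_path(["spine1", "leaf1", "spine2"], ranks)) # False: down, then up
```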
148. e ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. Default port number for RDS is 0x48CA, which makes a default Service-ID 0x00000000010648CA. The following two match rules are equivalent:

    rds : <SL>
    any, service-id 0x00000000010648CA : <SL>

8.6.6.4 SRP

Service ID for SRP varies from storage vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent:

    srp, target-port-guid 0x1234 : <SL>
    any, target-port-guid 0x1234 : <SL>

Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port guid only) should be placed at the end of the qos-ulps match rules.

8.6.6.5 MPI

SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on the MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section.

8.6.7 SL2VL Mapping and VL Arbitration

The OpenSM cached options file has a set of QoS related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:

    • Max VLs: the maximum number of VLs that will be on the subnet
    • High limit: the limit of High Priority component of VL Arbitration
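The RDS Service ID layout described above (a fixed prefix with the port in the low 16 bits) can be checked with a few lines of arithmetic. The function names below are illustrative only and do not belong to any OpenSM or RDS interface.

```python
RDS_SERVICE_ID_BASE = 0x0000000001060000  # 0x000000000106PPPP with PPPP = 0
RDS_DEFAULT_PORT = 0x48CA


def rds_service_id(port=RDS_DEFAULT_PORT):
    """Place the 16-bit port number in the low four hex digits."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return RDS_SERVICE_ID_BASE | port


def format_service_id(service_id):
    """Render a Service ID as 0x followed by 16 hex digits."""
    return "0x%016X" % service_id


print(format_service_id(rds_service_id()))  # 0x00000000010648CA
```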
149. e ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS-Class. The CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR.

4.4.4.1 IPoIB

IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.

4.4.4.2 SRP

The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service-ID in the PR/MPR by itself and uses that information in setting up the QP.

SRP Service-ID is defined by the SRP target I/O Controller (it also complies with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.

4.4.5 OpenSM Features

The QoS related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 8) can be split into two main parts:

I. Fabric Setup

During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.

II. PR/MPR Query Handling

OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first the request is matched against the defined mat
150. e appropriate driver stack (InfiniBand or Ethernet).

For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet.

6.2.1 Enabling Auto Sensing

Upon driver start up:
1. Sense the adapter card's port type.
   If a valid cable or module is connected (QSFP, SFP+, or SFP with EEPROM in the cable/module):
       • Set the port type to the sensed link type (IB/Ethernet)
   Otherwise:
       • Set the port type as default (Ethernet)

During driver run time:
    • Sense a link every 3 seconds if no link is sensed/detected
    • If sensed, set the port type as sensed

7 Performance

7.1 General System Configurations

The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.

7.1.1 PCI Express (PCIe) Capabilities

Table 9 - Recommended PCIe Configuration

PCIe Generation     3.0
Speed               8GT/s
Width               x8 or x16
Max Payload size    256
Max Read Request    4096

Note: For ConnectX3-based network adapters, 40GbE Ethernet adapters, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.

7.1.2 Memory Configuration

For high performance it is recommended to use the high
151. e bonding master configuration file (e.g. ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520

    Note: 65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 4.3.2, "IPoIB Mode Setting", on page 49) and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU or do not set MTU at all, performance of the interface might decrease.

    • In the bonding slave configuration file (e.g. ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
    • In the bonding slave configuration file (e.g. ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key)
    • For RHEL users:
      In /etc/modprobe.d/bond.conf add the following lines:

          alias bond0 bonding

    • For SLES users:
      It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.

It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.

    Note: Restarting openibd does not keep the bonding configuration via Network Scripts. You have
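The MTU rule in the note above can be written as a small decision function. This is a sketch of the stated rule only; the function name bond_mtu and the mode strings are illustrative, and returning 2044 for a bond with mixed slave modes is an assumption drawn from the statement that 65520 is valid only when all slaves are in Connected mode.

```python
CONNECTED_MTU = 65520  # valid only if every IPoIB slave is in Connected mode
DATAGRAM_MTU = 2044    # value given for slaves working in datagram mode


def bond_mtu(slave_modes):
    """Pick the bonding master MTU from the slaves' IPoIB modes.

    slave_modes: one 'connected' or 'datagram' string per IPoIB slave.
    """
    modes = list(slave_modes)
    if not modes:
        raise ValueError("at least one slave is required")
    for mode in modes:
        if mode not in ("connected", "datagram"):
            raise ValueError("unknown IPoIB mode: %r" % mode)
    if all(mode == "connected" for mode in modes):
        return CONNECTED_MTU
    # Assumption: any datagram-mode slave forces the smaller MTU.
    return DATAGRAM_MTU


print(bond_mtu(["connected", "connected"]))  # 65520
print(bond_mtu(["connected", "datagram"]))   # 2044
```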
152. e boot process if timeouts and max retries are set too high.

    • Example for discovering and connecting targets over iSER:

        iscsiadm -m discovery -o new -o old -t st -I iser -p <ip:port> -l

iSER also supports RoCE without any additional configuration required. To bond the RoCE interfaces, set the fail_over_mac option in the bonding driver.

4.3 IP over InfiniBand

4.3.1 Introduction

The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following capabilities:

    • VLAN simulation over an InfiniBand network via child interfaces
    • High Availability via Bonding
    • Varies MTU values:
        • up to 4k in Datagram mode
        • up to 64k in Connected mode
    • Uses any ConnectX IB ports (one or two)
    • Inserts IP/UDP/TCP checksum on outgoing packets
    • Calculates checksum on received packets
    • Supports net device TSO through ConnectX LSO capability to defragment large datagrams to MTU quanta
    • Dual operation mode: datagram and connected
    • Large MTU support through connected mode

IPoIB also supports the following software based enhancements:

    • Giant Receive Offload
    • Ethtool support

4.3.2 IPoIB Mode Setting

IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in
153. e file-based routing, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).

8.5.2 Min Hop Algorithm

The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'.

The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:

    -i <equalize-ignore-guids-file>
    --ignore_guids <equalize-ignore-guids-file>

This option provides the means to define a set of ports (by guids) that will be ignored by the link load equalization algorithm.

LMC awareness routes based on (remote) system or switch basis.

8.5.3 UPDN Algorithm

The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tre
154. e is not configured correctly
173    Failed to start the mst driver

2.3.3 Installation Procedure

Step 1. Log in to the installation machine as root.

Step 2. Mount the ISO image on your machine.

    host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt

Step 3. Run the installation script.

    ./mlnxofedinstall
    This program will install the MLNX_OFED_LINUX package on your machine.
    Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to continue? [y/N]: y

    Uninstalling the previous version of MLNX_OFED_LINUX

    Starting MLNX_OFED_LINUX-2.0-2.6.7 installation ...

    Installing mlnx-ofa_kernel RPM
    Preparing...                ########################################
    mlnx-ofa_kernel             ########################################
    Installing kmod-mlnx-ofa_kernel RPM
    Preparing...                ########################################
    kmod-mlnx-ofa_kernel        ########################################
    Installing mlnx-ofa_kernel-devel RPM
    Preparing...                ########################################
    mlnx-ofa_kernel-devel       ########################################
    Installing kmod-kernel-mft-mlnx RPM
    Preparing...                ########################################
    kmod-kernel-mft-mlnx        ########################################
    Installing knem m
155. e y e c oao   -E- Can't auto detect fw configuration file

Step 4. In case the installation script performed firmware updates to your network adapter hardware, it will ask you to reboot your machine.

Step 5. The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI:
    * soft memlock unlimited
    * hard memlock unlimited
These settings remove the limit on the amount of memory that can be pinned by a user space application. If desired, tune the value unlimited to a specific amount of RAM.

Step 6. For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 8, "OpenSM - Subnet Manager".

Step 7. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as:
    • HCA firmware version
    • Kernel architecture
    • Driver version
    • Number of active HCA ports along with their states
    • Node GUID

Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/.

    hca_self_test.ofed
    ---- Performing Adapter Device Self Test ----
    Number of CAs Detected ................. 2
    PCI Device Check ....................... PASS
    Kernel Arch ............................ x86
156. ecuting srp_daemon over a port without the -a option will only display the reachable targets via the port and to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
• It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. (See Section 4.1.2.5 for more details.)
• srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
• A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 4.1.2.4.

4.1.2.4 Automatic Discovery and Connection to Targets

• Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
• To connect to all the existing Targets in the fabric, run "srp_daemon -e -o". This utility will scan the fabric once, connect to every Target it detects, and then exit.
  Note: srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file.
• To connect to all the existing Targets in the fabric and to connect to new targets th
157. ed, and incoming packets will have the VLAN tag removed. Any VLAN-tagged packets sent by the VF are silently dropped. The default behavior is VGT.

The feature may be controlled on the Hypervisor from userspace via iproute2 / netlink:

    ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]
        ...
        [ vf NUM [ mac LLADDR ]
            [ vlan VLANID [ qos VLAN-QOS ] ]
            ...
            [ spoofchk { on | off } ] ]
        ...

Use:

    ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]

• where NUM = 0..max-vf-num
• vlan_id = 0..4095 (4095 means "set VGT")
• qos = 0..7

For example:
• ip link set dev eth2 vf 2 vlan <vlan_id> qos 3 - sets VST mode for VF #2 belonging to PF eth2, with qos = 3
• ip link set dev eth2 vf 2 vlan 4095 - sets mode for VF 2 back to VGT

4.13.7.3.2 Additional Ethernet VF Configuration Options

• Guest MAC configuration
By default, guest MAC addresses are configured to be all zeroes. In the mlnx_ofed guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, he/she should configure guest MACs before the guest driver comes up.

The guest MAC may be configured by using:

    ip link set dev <PF device> vf <NUM> mac <LLADDR>

For legacy guests, which do not generate random MACs, the administrator should always configure their MAC addresses via ip link, as above.

• Spoof checking
Spoof checking is cu
158. ed docs

    Device (05:00.0):
        05:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
        Link Width is not 8x
        PCI Link Speed: 5Gb/s
    Device (07:00.0):
        07:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Link Width: 8x
        PCI Link Speed: 5Gb/s

    Installation finished successfully.
    The firmware version on /dev/mst/mt26448_pci_cr0 - 2.9.1000 is up to date.
    Note: To force firmware update use '--force-fw-update' flag.
    The firmware version on /dev/mst/mt4099_pci_cr0 - 2.30.4450 is up to date.
    Note: To force firmware update use '--force-fw-update' flag.

In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following:

    The firmware version on /dev/mst/mt26448_pci_cr0 - 2.9.1000 is up to date.
    Note: To force firmware update use '--force-fw-update' flag.
    The firmware version on /dev/mst/mt4099_pci_cr0 - 2.11.500 is up to date.
    Note: To force firmware update use '--force-fw-update' flag.

In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.

Error message: ic
159. ed to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.

Adding the Ethernet Driver to the initrd File

The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it.
    host1$ mkdir /tmp/initrd_en
    host1$ cd /tmp/initrd_en
Step 3. Normally, the initrd image is zipped. Extract it using the following command:
    host1$ gzip -dc <initrd image> | cpio -id
    The initrd files should now be found under /tmp/initrd_en.
Step 4. Create a directory for the ConnectX EN modules and copy them.
    host1$ mkdir -p /tmp/initrd_en/lib/modules/mlnx_en
    host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
    host1$ cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en
Step 5. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
    host1$ cp /sbin/insmod /tmp/initrd_en/sbin/
Step 6. If you plan to give your 
160. ed with major operating system distributions.

Mellanox OFED is certified with the following products:
• Mellanox Messaging Accelerator (VMA) software: Multicast socket acceleration library that performs OS bypass for standard socket-based applications.
• Mellanox Unified Fabric Manager (UFM) software: Powerful platform for managing demanding scale-out computing fabric environments, built on top of the OpenSM industry-standard routing engine.
• Fabric Collective Accelerator (FCA): FCA is a Mellanox MPI-integrated software package that utilizes CORE-Direct technology for implementing the MPI collectives communications.

1.2 Mellanox OFED Package

1.2.1 ISO Image

Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images or as a tarball, one per supported Linux distribution and CPU architecture, that includes source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that performs the necessary steps to accomplish the following:
• Discover the currently installed kernel
• Uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor's commercial stack
• Install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
• Identify the currently installed InfiniBand HCAs and perform the required firmware updates

1.2.2 Software Components

MLNX_OFED_LINUX contains the fo
161. educing CPU overhead. It works by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. LRO is available in kernel versions < 3.1 for untagged traffic.
    Note: LRO will be done whenever possible. Otherwise, GRO will be done. Generic Receive Offload (GRO) is available throughout all kernels.

ethtool -c eth<x>
    Queries interrupt coalescing settings.

Table 6 - ethtool Supported Options

Options / Description

ethtool -C eth<x> adaptive-rx on|off
    Enables/disables adaptive interrupt moderation.
    By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.

ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
    Sets the values for packet rate limits and for moderation time high and low values.
    • Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value.
    • Below a lower limit of packet rate, the moderation time will be set to its lowest value.

ethtool -C eth<x> [rx-usecs N] [rx-frames N]
    Sets the interrupt coalescing settings when the adaptive moderation is disabled.
    Note: usec settings correspond to the time to wait after
162. ee of credit loops routing
• Two levels of QoS, assuming switches support 8 data VLs
• Ability to route around a single failed switch, and/or multiple failed links, without
    - introducing credit loops
    - changing path SL values
• Very short run times, with good scaling properties as fabric size increases

8.5.7.1 Unicast Routing

Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses as follows:

    sl = 0;
    for (d = 0; d < torus_dimensions; d++)
        /* path_crosses_dateline(d) returns 0 or 1 */
        sl |= path_crosses_dateline(d) << d;

For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports as follows:

    for (sl = 0; sl < 16; sl++)
        /* cdir(port) reports which torus coordinate direction a switch port
         * "points" in, and returns 0, 1, or 2 */
        sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport));

Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 
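
The dateline encoding and the SL2VL rule described in Section 8.5.7.1 can be tried out with a small Python sketch. This is only an illustration of the two formulas, not OpenSM code: the function names path_sl and sl2vl and the list-of-flags input are assumptions made for this example.

```python
def path_sl(crosses_dateline):
    """Encode which datelines a path crosses into an SL value.

    crosses_dateline holds one 0/1 flag per torus dimension
    (hypothetical input format chosen for this sketch).
    """
    sl = 0
    for d, crossed in enumerate(crosses_dateline):
        sl |= (1 if crossed else 0) << d
    return sl


def sl2vl(sl, cdir_oport):
    """Return the VL bit selected for an SL on a given output port.

    cdir_oport is the torus coordinate direction (0, 1, or 2)
    in which the output port points.
    """
    return 0x1 & (sl >> cdir_oport)


# A path in a 3D torus that crosses the datelines of dimensions 0 and 2.
sl = path_sl([1, 0, 1])
print(sl)                                # 5
print([sl2vl(sl, c) for c in range(3)])  # [1, 0, 1]
```

Each output port only looks at the SL bit that belongs to its own coordinate direction, which is how three SL bits collapse into a single VL bit per link.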
163. en saved in /home/<username>/.ssh/id_rsa.
    Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
    The key fingerprint is:
    38 10 29 0  4  08 00 4   0   50 0  05 44   7 9  05 <username>@host1

Step 2. Check that the public and private keys have been generated.
    host1$ cd /home/<username>/.ssh/
    host1$ ls
    host1$ ls -la
    total 40
    drwx------ 2 root root 4096 Mar 5 04:57 .
    drwxr-x--- 13 root root 4096 Mar 4 18:27 ..
    -rw------- 1 root root 1675 Mar 5 04:57 id_rsa
    -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub

Step 3. Check the public key.
    host1$ cat id_rsa.pub
    ssh-rsa AAAAB3NzaClyc2EAAAABI WAAAQEA1zVY8VBHQh90kZN70A11bUQ74RxXm4 zHeczyVxpYHaDPyDmqezbYMKrCIVz  d10bH ZkCOrpLYviU00UHd3fvNT  Ms0gcGg08PysUf 12FyYjira2Plxyg  mkHLGGqVut fEMmABZ3wNCUg6J2X  3G uiuSWXeubZmbXcMrP   wAIWByfH8ajwo6A5SWioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorX  NCOZBe4kTnUqm63nQ2zi1qVMdL9FrCmalxIOu9 SQUAjwONevaMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrod  ZlfCQ <username>@host1

Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine.
    host1$ cat id_rsa.pub | xargs ssh host2 "echo >>/home/<username>/.ssh/authorized_keys2"
    <username>@host2's password: // Enter password
    For a local machine, simply add the key to authorized_keys2.
    host1$ cat id_rsa.pub >> authorized_keys2

Step 5. Test.
    hos
164. erion - its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and QoS Level has only one constraint - Service Level (SL).

The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and list of match rule criteria below.

8.6.4 Policy File Syntax Guidelines

• Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
• Comments are started with the pound sign (#) and terminated by EOL.
• Any keyword should be the first non-blank in the line, unless it's a comment.
• Keywords that denote section/subsection start have matching closing keywords.
• Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR requests that didn't match any of the matching rules.
• Any section/subsection of the policy file is optional.

8.6.5 Examples of Advanced Policy File

As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level.

Here's an example of the shortest policy file:

    qos-levels
        qos-level
            name: DEFAULT
            sl: 0
        end-qos-level
    end-qos-levels

Port groups section is missing because there are no match rules, which means that port groups are not referred anywhere, and there is no 
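
As a rough illustration of the syntax guidelines in Section 8.6.4, the following Python sketch scans a policy text and checks that a QoS Level named DEFAULT is present. It is not the OpenSM parser: the helper name qos_level_names is hypothetical, and treating everything after a pound sign as a comment is a simplifying assumption.

```python
def qos_level_names(policy_text):
    """Return the names of all qos-level blocks found in policy_text."""
    names = []
    in_level = False
    for raw in policy_text.splitlines():
        # Assumption: '#' starts a comment anywhere on the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank lines and pure comments are ignored
        if line == "qos-level":
            in_level = True
        elif line == "end-qos-level":
            in_level = False
        elif in_level and line.startswith("name:"):
            names.append(line[len("name:"):].strip())
    return names


SHORTEST_POLICY = """
qos-levels
    qos-level
        name: DEFAULT
        sl: 0
    end-qos-level
end-qos-levels
"""

names = qos_level_names(SHORTEST_POLICY)
print(names)               # ['DEFAULT']
print("DEFAULT" in names)  # True
```

A check like this can catch a missing DEFAULT level before the file is handed to the Subnet Manager, but it does not validate the other sections.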
165. es a work flow for local HCA (adapter) sniffing:
• Run ibdump with the desired options
• Run the application whose traffic you wish to analyze
• Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode)
• Open Wireshark and load the generated file

How to Get Wireshark:

Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details.

Note: Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system.

Synopsis

    ibdump [options]

Output Files:

    -d, --ib-dev=<dev>            use RDMA device <dev> (default first device found).
                                  The relevant devices can be listed by running the 'ibv_devinfo' command.
    -i, --ib-port=<port>          use port <port> of IB device (default 1)
    -w, --write=<file>            dump file name (default "sniffer.pcap").
                                  '-' stands for stdout - enables piping to tcpdump or tshark.
    -o, --output=<file>           alias for the '-w' option. Do not use - for backward compatibility
    -b, --max-burst=<log2 burst>  log2 of the maximal burst size that can be captured with no packets loss.
                                  Each entry takes ~ MTU bytes of memory (default 12 - 4096 entries)
    -s, --silent                  do not print progress indication.

1. Run ibdump

Ap
166. est memory speed with fewest DIMMs and populate all memory channels for every CPU installed.

For further information, please refer to your vendor's memory configuration instructions or memory configuration tool available online.

7.1.3 Recommended BIOS Settings

Note: These performance optimizations may result in higher power consumption.

7.1.3.1 General

Set BIOS power management to Maximum Performance.

7.1.3.2 Intel Sandy Bridge Processors

The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors.

Table 10 - Recommended BIOS Settings for Intel Sandy Bridge Processors

             BIOS Option                       Values
General      Operating Mode / Power profile    Maximum Performance
Processor    C-States                          Disabled
             Turbo mode                        Enabled
             Hyper-Threading (note 1)          HPC: disabled; Data Centers: enabled
             CPU frequency select              Max performance
Memory       Memory speed                      Max performance
             Memory channel mode               Independent
             Node Interleaving                 Disabled / NUMA
             Channel Interleaving              Enabled
             Thermal Mode                      Performance

1. Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process, due to lower frequency of a single logical core when hyper-threading is enabled.

7.1.3.3 Intel Nehalem/Westmere Pro
167. et detected, in human readable form.

Sample output:

    IO Unit Info:
        port LID:        0103
        port GID:        fe800000000000000002c90200402bd5
        change ID:       0002
        max controllers: 0x10
    controller[ 1]
        GUID:      0002c90200402bd4
        vendor ID: 0002c9
        device ID: 005a44
        IO class : 0100
        ID:        LSI Storage Systems SRP Driver 200400a0b81146a1
        service entries: 1
            service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1

b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:

    ibsrpdm -d <umad device>

2. Assistance in creating an SRP connection

a. To generate output suitable for utilization in the "echo" command of Section 4.1.2.2, add the "-c" option to ibsrpdm:

    ibsrpdm -c

Sample output:

    id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1

b. To establish a connection with an SRP Target using the output from the "ibsrpdm -c" example above, execute the following command:

    echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the "fdisk -l" command.

srp_daemon

The srp_daemon utility is based on ibsrp
168. et_irq_affinity_cpulist.sh 0-1 eth2
    set_irq_affinity_cpulist.sh 2-3 eth3
    set_irq_affinity_cpulist.sh 4-5 eth4
    set_irq_affinity_cpulist.sh 6-7 eth5

7.2.8 Tuning Multi-Threaded IP Forwarding

To optimize NIC usage as IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
    • For MLNX_OFED-2.0.x:
        options mlx4_en inline_thold=0
        options mlx4_core high_rate_steer=1
    • For MLNX_EN-1.5.10:
        options mlx4_en num_lro=0 inline_thold=0
        options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
        set_irq_affinity_bynode.sh <numa node> <interface>
4. Forwarding from one interface to another:
        set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
5. Disable adaptive interrupt moderation and set status values, using:
        ethtool -C <interface> adaptive-rx off

8 OpenSM - Subnet Manager

8.1 Overview

OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).

8.2 opensm Description

opensm is an InfiniBand compliant Subnet Manager and Subnet
169. evice to boot from an iSCSI target:

    host host1 {
        filename "";
        # For a ConnectX device with ports configured as InfiniBand, comment out
        # the following line
        # option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
        # For a ConnectX device with ports configured as Ethernet, comment out
        # the following line
        # hardware ethernet 00:02:c9:00:00:bb;
    }

A.11 WinPE

Mellanox FlexBoot enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe.

Appendix B: SRP Target Driver

The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net).

By interfacing with an SCST driver, it is possible to work with and support a lot of IO modes on real or virtual devices in the back end.

1. scst_vdisk - fileio and blockio modes. This allows turning software raid volumes, LVM volumes, IDE disks, block devices and normal files into SRP luns.
2. NULLIO mode allows measuring the performance without sending IOs to real devices.

B.1 Prerequisites and Installation

1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target.

On distribu
170. fabric is routed with torus-2QoS.

Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover, provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details. Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If "no_fallback" was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will continue to remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval, and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored.

In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath. To verify that a torus fabric is routed free of credit 
171. ffic cannot be steered. It is treated as "other" protocol by hardware (from the first packet) and not considered as UDP traffic.

Note: We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes.

4.13 Single Root IO Virtualization (SR-IOV)

Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. In ConnectX-3 adapter cards, Mellanox adapters are capable of exposing 63 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals those of the Physical Function.

SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual machines direct hardware access to network resources, hence increasing their performance.

In this chapter we will demonstrate setup and configuration of SR-IOV in a Red Hat Linux environment using the Mellanox ConnectX VPI adapter cards family.

4.13.1 System Requirements

To set up an SR-IOV environment, the following is required:
• MLNX_OFED Driver
• A server/blade with an SR-IOV-capa
172. file is prof_sel.

The supported values for profiles are:
• 0 -          number of resources
• 1 - medium number of resources
• 2 - large number of resources (default)

Appendix E: Lustre Compilation over MLNX_OFED

To compile Lustre version 2.3.65 and higher:

    ./configure --with-o2ib=/usr/src/ofa_kernel/default/
    make rpms

To compile older Lustre versions:

    EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" ./configure --with-o2ib=/usr/src/ofa_kernel/default/
    EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" make rpms

For Lustre 2.1.3, due to a duplicate definition of the INVALID_UID macro, the following patch must be applied:

    --- lustre-2.1.3/lustre/include/lustre_cfg.h 2012-09-17 14:26:46.000000000 +0200
    +++ lustre-2.1.3/lustre/include/lustre_cfg.h.new 2013-09-07 10:45:07.121772824 +0200
    @@ -288,7 +288,9 @@
     #include <lustre/lustre_user.h>
    +#ifndef INVALID_UID
     #define INVALID_UID (-1)
    +#endif
173. fo

2. Create and burn the composite image. Run:

    flint -dev <mst device name> brom <expansion ROM image>

Example on Linux:

    flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom

Example on Windows (note 1):

    flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom

Removing the Expansion ROM Image

Remove the expansion ROM image. Run:

    flint -dev <mst device name> drom

Note: When removing the expansion ROM image, you also remove FlexBoot from the boot device list.

A.3 Preparing the DHCP Server in Linux Environment

The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB.

A.3.1 Installing the DHCP Server

Install the DHCP client/server embedded within the Linux distribution.

1. Depending on the OS, the device name may be superseded with a prefix.

A.3.2 Configuring the DHCP Server

A.3.2.1 For ConnectX Family Devices

When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The value of the client identifier is composed of a prefix - ff:00:00:00:00:00:02:00:00:02:c9:00 - and an 8-byte port GUID (all separated by colons and represented in hexadecima
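
The composition rule in Section A.3.2.1 (fixed prefix followed by the 8-byte port GUID, colon-separated hexadecimal) can be illustrated with a short Python sketch. The function name dhcp_client_identifier is hypothetical, and the GUID used below is only a sample value.

```python
PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def dhcp_client_identifier(port_guid):
    """Build the DHCP client identifier string for an 8-byte port GUID.

    port_guid may be written with or without ':' separators, e.g.
    '0002c90300001039' or '00:02:c9:03:00:00:10:39'.
    """
    digits = port_guid.replace(":", "").lower()
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("port GUID must be 8 bytes of hexadecimal")
    guid_bytes = [digits[i:i + 2] for i in range(0, 16, 2)]
    return PREFIX + ":" + ":".join(guid_bytes)


print(dhcp_client_identifier("0002c90300001039"))
# ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39
```

The resulting string is the value that a dhcp-client-identifier option in the DHCP server configuration would need to match for that port.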
174. for SHMEM programs running over InfiniBand.

The latest ScalableSHMEM software can be downloaded from the Mellanox website.

5.1.2 Running SHMEM with FCA

The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager and MPI nodes without requiring additional hardware. The FCA manager creates a topology-based collective tree, and orchestrates an efficient collective operation using the switch-based CPUs on the MPI/ScalableSHMEM nodes.

FCA accelerates MPI/ScalableSHMEM collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.

Note: FCA is disabled by default and must be configured prior to using it from the ScalableSHMEM.

To enable FCA by default in the ScalableSHMEM:
1. Edit the /opt/mellanox/openshmem/2.2/etc/openmpi-mca-params.conf file.
2. Set the scoll_fca_enable parameter to 1.
    scoll_fca_enable=1
3. Set the scoll_fca_np parameter to 0.
    scoll_fca_np=0

To enable FCA in the shmemrun command line, add the following:
    -mca scoll_fca_enable 1
    -mca scoll_fca_enable_np 0

To disable FCA:
    -mca scoll_fca_e
175. for all users.

2. The mpi-selector command.

This command is a CLI equivalent of the mpi-selector-menu, allowing for the same functionality as mpi-selector-menu but without the interactive menus and prompts. It is suitable for scripting.

5.2.4 Compiling MPI Applications

Compiling MVAPICH Applications

Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html.

To review the default configuration of the installation, check the default configuration file:

    /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf

Compiling Open MPI Applications

Please refer to http://www.open-mpi.org/faq/?category=mpi-apps.

5.3 MellanoX Messaging

MellanoX Messaging (MXM) provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware including:
• Multiple transport support including RC and UD
• Proper management of HCA resources and memory structures
• Efficient memory registration
• One-sided communication semantics
• Connection management
• Receive side tag matching
• Intra-node shared memory communication

These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel commun
176. from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command:

    host1$ ping -c 5 11.4.3.176
    PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
    64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
    64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
    64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
    64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
    64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
    --- 11.4.3.176 ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 3999ms rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2

4.3.6 Bonding IPoIB

To create an interface configuration script for the ibX and bondX interfaces, you should use the standard syntax (depending on your OS).

Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver.
• Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.g., ifcfg-ib0)
• The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup)
• Bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence, the only supported value is the default: 0 (or "none" in SLES11)

For a persistent bonding IPoIB Network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:
• In th
177. fy an individual configuration for each HCA.

This parameter should be specified as an options line in the file /etc/modprobe.d/mlx4_core.conf.

For example, to configure all HCAs to have Port1 as ETH and Port2 as IB (port type 1 is IB and 2 is ETH), insert the following line:
options mlx4_core port_type_array=2,1

To set HCAs individually, you may use a string of Domain:bus:device.function-x;y
For example, if you have a pair of HCAs whose PFs are 0000:04:00.0 and 0000:05:00.0, you may specify that the first will have both ports as IB and the second will have both ports as ETH, as follows:
options mlx4_core port_type_array=0000:04:00.0-1;1,0000:05:00.0-2;2

Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.

4.13.7.2 Virtual Function InfiniBand Ports

Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments.

Sharing the same physical port(s) among multiple vHCAs is achieved as follows:

• Each vHCA port presents its own virtual GID table. The virtual GID table for the InfiniBand ports consists of a single entry (at index 0) that maps to a unique index in the physical GID table. The vHCA of the PF maps to physical GID index 0. To obtain GIDs for other vHCAs, alias GU
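The per-HCA port_type_array string described in item 177 can be assembled mechanically. The following is a minimal sketch only, not part of the driver or its tools: the helper name is hypothetical, and the separators (':' and '.' inside the PF address, '-' before the port types, ';' between the two ports, ',' between HCAs) are an assumption, since the punctuation is not legible in this copy of the manual. Port type 1 is taken to mean IB and 2 to mean ETH, as in the two-HCA example.

```python
# Hypothetical helper: builds an "options mlx4_core port_type_array=..." line.
# Assumed separator syntax: <domain:bus:device.function>-<port1>;<port2>, comma-joined.
PORT_TYPES = {"ib": 1, "eth": 2}  # 1 = IB, 2 = ETH, per the example in the text


def port_type_array_line(per_hca):
    """per_hca maps a PF address to a (port1, port2) pair of 'ib' / 'eth'."""
    entries = []
    for pf_address, (port1, port2) in per_hca.items():
        entries.append(f"{pf_address}-{PORT_TYPES[port1]};{PORT_TYPES[port2]}")
    return "options mlx4_core port_type_array=" + ",".join(entries)


line = port_type_array_line({
    "0000:04:00.0": ("ib", "ib"),    # first HCA: both ports IB
    "0000:05:00.0": ("eth", "eth"),  # second HCA: both ports ETH
})
print(line)
```

The resulting line would go into /etc/modprobe.d/mlx4_core.conf; only PFs need to be listed, because VFs inherit their port types from their PF.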
178. g mft
Preparing srptools
Preparing rds-tools
Preparing rds-devel
Preparing ibutils2
Preparing ibutils
Preparing cc_mgr
Preparing dump_pr
Preparing ar_mgr
Preparing ibdump
Preparing infiniband-diags
Preparing infiniband-diags-compat
Preparing qperf
Preparing fca
INFO: updating

IMPORTANT NOTE:

• The FCA Manager and FCA MPI Runtime library are installed in the /opt/mellanox/fca directory.
• The FCA Manager will not be started automatically.
• To start FCA Manager now, type:
  /etc/init.d/fca_managerd start
• There should be a single process of FCA Manager running per fabric.
• To start FCA Manager automatically after boot, type:
  /etc/init.d/fca_managerd install_service
• Check /opt/mellanox/fca/share/doc/fca/README.txt for quick 
179. get timeouts on the AR-related queries to these switches.

8.8.2 Installing the Adaptive Routing

Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of Mellanox OFED installation.

8.8.3 Running Subnet Manager with Adaptive Routing Manager

Adaptive Routing (AR) Manager can be enabled/disabled through the SM options file.

8.8.3.4 Enabling Adaptive Routing

To enable Adaptive Routing, perform the following:

1. Create the Subnet Manager options file. Run:
   opensm -c <options-file-name>

2. Add 'armgr' to the 'event_plugin_name' option in the file:
   # Event plugin name(s)
   event_plugin_name armgr

3. Run Subnet Manager with the new options file:
   opensm -F <options-file-name>

Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf.

To provide an alternative location, perform the following:

1. Add 'armgr --conf_file <ar-mgr-options-file-name>' to the 'event_plugin_options' option in the file:
   # Options string that would be passed to the plugin(s)
   event_plugin_options armgr --conf_file <ar-mgr-options-file-name>

2. Run Subnet Manager with the new options file:
   opens
180. gies
The InfiniBand Architecture Specification that
Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Parameters, Physical Layers, and Management

Table 4 - Reference Documents

Document Name: Firmware Release Notes for Mellanox adapter devices
Description: See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.

Document Name: MFT User's Manual
Description: Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.

Document Name: MFT Release Notes
Description: Release Notes for the Mellanox Firmware Tools. See under the docs/ folder of the installed package.

Support and Updates Webpage

Please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.

1 Mellanox OFED Overview

1.1 Introduction to Mellanox OFED

Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers.

All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are support
181. gt/vdisk/vdisk
echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices

Example 2: working with scst_vdisk FILEIO mode
(Using the md0 device and the file 10G-file)

a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices

2. Run:
   For all distributions except SLES 11:
   > modprobe ib_srpt
   For SLES 11:
   > modprobe -f ib_srpt

For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt into the SLES 11 distribution's kernel:

ib_srpt: no symbol version for scst_unregister
ib_srpt: Unknown symbol scst_unregister
ib_srpt: no symbol version for scst_register
ib_srpt: Unknown symbol scst_register
ib_srpt: no symbol version for scst_unregister_target_template
ib_srpt: Unknown symbol scst_unregister_target_template

B. On Initiator Machines

On Initiator machines, manually perform the following steps:

1. Run:
   modprobe ib_srp
2. Run:
   ibsrpdm -d /dev/infiniband/umadX
   (to discover a new SRP target)
   umad0: port 1 of the first HCA
   umad1: port 2 of the first HCA
   umad2: 
182. guration for VMA use (to be used with any installation parameter)

--guest                  Install packages required by guest os
--hypervisor             Install packages required by hypervisor os
-v|-vv|-vvv              Set verbosity level
--umad-dev-rw            Grant non-root users read/write permission for umad devices instead of default
--enable-affinity        Run mlnx_affinity script upon boot
--disable-affinity       Disable mlnx_affinity script (Default)
--enable-sriov           Burn SR-IOV enabled firmware
--add-kernel-support     Add kernel support (Run mlnx_add_kernel_support.sh)
--skip-distro-check      Do not check MLNX_OFED vs Distro matching
--total-vfs <0-63>       Maximum number of Virtual Functions in SR-IOV mode. Default: 16. Implies --enable-sriov
--hugepages-overcommit   Setting 80% of MAX_MEMORY as overcommit for huge page allocation. Per priority bit mask (uint). Default: 0
-q                       Set quiet - no messages will be printed

2.3.2.1 mlnxofedinstall Return Codes

Table 2 lists the mlnxofedinstall script return codes and their meanings.

Table 2 - mlnxofedinstall Return Codes

Return Code   Meaning
0             The installation ended successfully
1             The installation failed
2             No firmware was found for the adapter device
22            Invalid parameter
28            Not enough free space
171           Not applicable to this system configuration. This can occur when the required hardware is not present on the system
172           Prerequisites are not met. For example, missing the required software installed or the hardwar
183. h device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).

• Debug utilities
  A set of debug utilities (e.g., itrace, mstdump, isw, and i2c).

For additional details, please refer to the MFT User's Manual (docs/).

1.4 Quality of Service

Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB and Eth network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.

QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager".

1.5 RDMA over Converged Ethernet (RoCE)

RoCE allows InfiniBand (IB) transport applications to work over an Ethernet network. RoCE encapsulates the InfiniBand transport and the GRH headers in Ethernet packets bearing a dedicated ether type (0x8915). Thus, any VERB application that works in an InfiniBand fabric can work in an Ethernet fabric as well.

RoCE is enabled only for drivers that support VPI (currently, only mlx4). When working with RDMA applications over the Ethernet link layer, the following points should be noted:

• The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API 
184. h_ipoib interface     cat  sys class net ethX eth vifs  For example       cat  sys class net eth5 eth vifs    SLAVE ib0 1 MAC 9a c2 1   d7 3b 63 VLAN N A  SLAVE ib0 2 MAC 52 54 00 60 55 88 VLAN N A  SLAVE ib0 3 MAC 52 54 00 60 55 89 VLAN N A    Each ethX interface has at lease one ibX Y slave to serve the PIF itself  In the VIFs list of ethX  you will notice that ibX 1 is always created to serve applications running from the Hypervisor on  top of the ethX interface directly     For InfiniBand applications that require native IPoIB interfaces  e g  CMA   the original IPoIB  interfaces ibX can still be used  For example  CMA and ethX drivers can co exist and make use  of IPoIB ports  CMA can use ib0  while eth0 ipoib interface will use ibX Y interfaces      gt  To see the list of eIPoIB interfaces     cat  sys class net eth ipoib interfaces  For example       cat  sys class net eth ipoib interfaces  eth4 over IB port  ib0  eth5 over IB port  ibl    The example above shows  two elIPoIB interfaces  where eth4 runs traffic over ib0  and eth5 runs  traffic over ibl     74 Mellanox Technologies      Rev 2 0 3 0 0      Figure 3  An Example of a Virtual Network                Host  ib0 2 ib0 3       bot KVM GUEST1     or one                     LAN            t     vito 2     via port  1       etho fe ip  ge             tapo        1                 1                         Ss KVM GUEST2   v a     bro   vifo 3                  The example above shows a few IPoIB instances that server
185. here "x" is the root switch and each "+" is a non-root switch:

[Figure: full-fabric spanning tree on the 2D 6x5 torus, x = 0..5, y = 0..4]

For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop.

In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.

Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.

In the presence of link or switch failures that result in a fabric for which torus
186. hmalloc_use_hugepages 5

If using compound pages is not possible, then the user will fall back to the regular hugepages mechanism.

> To force use of compound pages allocator:
Run the following command:
/opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5 -x MR_FORCE_CONTIG_PAGES=1

For further information on the Contiguous Pages, please refer to Section 4.9, "Contiguous Pages", on page 74.

5.1.5 Running ScalableSHMEM Application

The ScalableSHMEM framework contains the shmemrun utility, which launches the executable from a service node to compute nodes. This utility accepts the same command line parameters as mpirun from the OpenMPI package.

For further information, please refer to the OpenMPI MCA parameters documentation at:
http://www.open-mpi.org/faq/?category=running

Run "shmemrun --help" to obtain the ScalableSHMEM job launcher runtime parameters.

ScalableSHMEM contains support for the environment module system (http://modules.sf.net). The modules configuration file can be found at:
/opt/mellanox/openshmem/2.2/etc/shmem_modulefile

5.2 Message Passing Interface

5.2.1 Overview

Mellanox OFED for Linux includes the following Message Passing Interface (MPI) implementations over InfiniBand:

• Open MPI 1.4.6 & 1.6.1 - an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH2 1.7 - an MPI-1 implementation by Ohio State University

These MPI 
187. i EE 181  9 7 byv devini xo m u eed ste tene f uo a 181  9 8  abdev2nieldeV su             ER dE Ur 182  9 0  Abstatusz cc mo             safe SR yau ws apaspa EU RTI E TUBES 183  9 10  abportstate                  COEPI SR US 185             tute al aie 188  DAZ SMP GUC sz ont pa          pote       NYA deis 191  9  T3   pertqueLy    eee uyta uy                   wah tede Pape ng 194  9 14 ibcheckerns  eee seu ua EXER ER NEN UR 197         uo iod dere ed exe tee ee      re    ARES eR 199  9 16 IBV casynewalche                                                            Qa CEA 202  9 T7  abdump  ts y        k tase vied rath    ees ado NES es 203  Appendix A Mellanox FlexBoot                                       205  AGT  OVGIVICW  coe rime      205  A 2 Burning the Expansion ROM Image                                205        Preparing the DHCP Server in Linux Environment                    206  A4 Subnet Manager   OpenSM                                      208              SEV ED nette eei ed eere kostet tre Deis Oed 208  A 6 BIOS Configuration                                            208  AJ Operation zuo u uui ewe RD eee eins cg  pa ents SEEN S AV 209  A 8 Command Line Interface  CLI                                     210  A9  Diskless Machines                                         hn 212  AAO  3SCSEBOOL  ht  eet ere ttt Rhye aah    tiet hd Sh hasa be 217  AT O WARE    u                          oe E asa St 218  Appendix    SRP Target Driver                          
188. ication libraries.

The latest MXM software can be downloaded from the Mellanox website.

5.3.1 Compiling OpenMPI with MXM

Step 1. Install MXM from RPM:
        % rpm -ihv mxm-x.y.z-1.x86_64.rpm
        MXM will be installed automatically in the /opt/mellanox/mxm folder.

Step 2. Enter the OpenMPI source directory and run:
        % cd $OMPI_HOME
        % ./configure --with-mxm=/opt/mellanox/mxm <... other configure parameters ...>
        % make all && make install

MLNX_OFED v2.0 or later comes with a pre-installed version of MXM v1.1 and OpenMPI compiled with MXM v1.1.

> To upgrade MLNX_OFED v2.0 or later with a newer MXM:

Step 1. Remove MXM v1.1:
        % rpm -e mxm
Step 2. Remove the pre-compiled OpenMPI:
        % rpm -e mlnx-openmpi_gcc
Step 3. Install the new MXM and compile OpenMPI with it.

To run OpenMPI without MXM, run:
% mpirun -mca mtl ^mxm <...>

When upgrading to MXM v1.5, OMPI compiled with MXM v1.1 should be recompiled with MXM v1.5.

5.3.2 Enabling MXM in OpenMPI

MXM Rev 2.0-3.0.0 is automatically selected by OpenMPI when the Number of Processes (NP) is greater than or equal to 128. To enable MXM for any NP, use the following OpenMPI parameter:
-mca mtl_mxm_np <number>

To activate MXM for any NP, run:
% mpirun -mca mtl_mxm_np 0 <... other mpirun parameters ...>

5.3.3 Tuning MXM Settings

The default MXM settings are already optimized. To check the av
189. ids  0x0 0x8  of switch Lid 2 guid 0x0002c902fffff00a  MT47396 Infiniscale III    Mellanox Technologies      Lid Out Destination    Port Info    0x0002 000    Switch portguid 0x0002c902fffff00a   MT47396 Infi    Technologies     0x0003 021    Switch portguid 0x000b8cffff004016   MT47396 Infi    Technologies     0x0006 007    Channel Adapter portguid 0x0002c90300001039   swl  0x0007 021    Channel Adapter portguid 0x0002c9020025874a   swl  0x0008 008    Channel Adapter portguid 0x0002c902002582cd   swl             5 valid lids dumped       2  Dump all Lids with valid out ports of the switch with Lid 2        ibroute 2    niscale III Mellanox    niscale III Mellanox    37 HCA 1   Si SHEAR  36 HCA 1        Unicast lids  0x0 0x8  of switch Lid 2 guid 0x0002c902fffff00a  MT47396 Infiniscale III    Mellanox Technologies      Lid Out Destination    Port Info    0x0002 000    Switch portguid 0x0002c902fffff00a   MT47396 Infi    Technologies     0x0003 021    Switch portguid 0x000b8cffff004016   MT47396 Infi    Technologies     0x0006 007    Channel Adapter portguid 0x0002c90300001039   swl  0x0007 021    Channel Adapter portguid 0x0002c9020025874a   swl  0x0008 008    Channel Adapter portguid 0x0002c902002582cd   swl             5 valid lids dumped       niscale III Mellanox    niscale III Mellanox    Sr          57            36 HCA 1        3  Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2      gt  Honours 2 9 7    190 Mellanox Technologies      Rev
190. imilar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query did not match any other rule.

All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section.

Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.

8.6.6.1 IPoIB

An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition.

The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:

ipoib : <SL>
ipoib, pkey 0x7fff : <SL>
any, pkey 0x7fff : <SL>

8.6.6.2 SDP

An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:

sdp : <SL>
any, service-id 0x0000000000010000-0x000000000001ffff : <SL>

8.6.6.3 RDS

Similar to SDP, an RDS PR query is matched by Servic
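The SDP Service ID layout in 8.6.6.2 (0x000000000001PPPP, with PPPP holding the remote TCP/IP port) can be checked with a few lines of arithmetic. This is an illustrative sketch only; the function names are hypothetical and are not part of OpenSM.

```python
# Service ID layout from the text: 0x000000000001PPPP, PPPP = remote TCP/IP port.
SDP_SERVICE_ID_BASE = 0x0000000000010000


def sdp_service_id(tcp_port):
    """Return the SDP Service ID for a remote TCP/IP port (0..65535)."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("TCP port must fit in 4 hex digits")
    return SDP_SERVICE_ID_BASE | tcp_port


def is_sdp_service_id(service_id):
    """True if the ID falls in the range used by the equivalent 'any, service-id' rule."""
    return SDP_SERVICE_ID_BASE <= service_id <= SDP_SERVICE_ID_BASE | 0xFFFF


print(f"0x{sdp_service_id(22):016x}")  # port 22 is 0x16
```

Every value this helper can return lies inside 0x0000000000010000-0x000000000001ffff, which is why the "sdp" rule and the explicit service-id range rule match the same queries.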
191. implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 7 lists some useful MPI links.

Table 7 - Useful MPI Links

MPI Standard     http://www-unix.mcs.anl.gov/mpi
Open MPI         http://www.open-mpi.org
MVAPICH 2 MPI    http://mvapich.cse.ohio-state.edu/
MPI Forum        http://www.mpi-forum.org

This chapter includes the following sections:

• Section 5.2.2, "Prerequisites for Running MPI", on page 99
• Section 5.2.3, "MPI Selector - Which MPI Runs", on page 100
• Section 5.2.4, "Compiling MPI Applications", on page 100

5.2.2 Prerequisites for Running MPI

For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging into and running commands on remote computers and/or servers.

5.2.2.1 SSH Configuration

The following steps describe how to configure password-less access over SSH:

1. Generate an ssh key on the initiator machine (host1):
   host1$ ssh-keygen -t rsa
   Generating public/private rsa key pair.
   Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
   Enter passphrase (empty for no passphrase):
   Enter same passphrase again:
   Your identification has be
192. in a dimension that is not the last dimension routed by DOR. Here the failed switches are O and T:

[Figure: 2D torus diagram marking source S, destination D, and the failed switches O and T]

In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.

8.5.7.2 Multicast Routing

Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, w
193. interfaces

You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.

This section describes how to:

• Create a subinterface (Section 4.3.4.1)
• Remove a subinterface (Section 4.3.4.2)

4.3.4.1 Creating a Subinterface

In the following procedure, ib0 is used as an example of an IB subinterface.

To create a child interface (subinterface), follow this procedure:

Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 1 will give a PKey with the value 0x8001.

Step 2. Create a child interface by running:
        host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child
        Example:
        host1$ echo 1 > /sys/class/net/ib0/create_child
        This will create the interface ib0.8001.

Step 3. Verify the configuration of this interface by running:
        host1$ ifconfig <subinterface>.<subinterface PKey>
        Using the example of Step 2:
        host1$ ifconfig ib0.8001
        ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
        BROADCAST
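The PKey rule in Step 1 of 4.3.4.1 (the value written to create_child is used with its most significant bit set) determines the name of the child interface. A minimal sketch of that arithmetic follows; the function names are hypothetical and serve only as illustration.

```python
def effective_pkey(value):
    """Return the 16-bit PKey actually used: the given value with the MSB set."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("PKey value must be a 16-bit unsigned number")
    return value | 0x8000


def child_interface_name(parent, value):
    """Name of the child interface created for `value` on `parent`, e.g. ib0.8001."""
    return f"{parent}.{effective_pkey(value):04x}"


print(hex(effective_pkey(1)))          # value 1 -> 0x8001, as in Step 1
print(child_interface_name("ib0", 1))  # ib0.8001, as in Step 2
```

This mirrors the example in the procedure: writing 1 to /sys/class/net/ib0/create_child yields the interface ib0.8001.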
194. ion. By default, the common switch in a torus seed is taken as the origin of the coordinate system used to describe switch location. The position parameter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed.

next_seed

If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification.

For maximum resiliency, no seed specification should share a switch with any other seed specification. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS can grant path SL values that are constant, regardless of which seed was used to initiate topology discovery.

portgroup_max_ports max_ports

This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails.

port_order p1 p2 p3 ...

This keyword specifies the order in which CA ports on a destination switch are visited when computing routes. When the fabric contains switches connected
195. ion driver (probe_vf).

options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1

Parameter: num_vfs
Recommended Value:
• Absent, or zero: The SR-IOV mode is not enabled in the driver, hence no VFs will be available.
• Its value is a single number in the range of 0-63: The driver will enable the num_vfs VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
• Its format is a string which allows the user to specify the num_vfs parameter separately per installed HCA. Its format is: "bb:dd.f-v,bb:dd.f-v,..."
  bb:dd.f = bus:device.function of the PF of the HCA
  v = number of VFs to enable for that HCA
This parameter can be set in one of the following ways. For example:
• num_vfs=5 - The driver will enable 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
• num_vfs=00:04.0-5,00:07.0-8 - The driver will enable 5 VFs on the HCA positioned in BDF 00:04.0 and 8 on the one in 00:07.0.
Note: PFs not included in the above list will not have SR-IOV enabled.

Parameter: port_type_array
Recommended Value:
Specifies the protocol type of the ports. It is either one array of 2 port types 't1,t2' for all devices, or a list of BDF to port_type_array 'bb:dd.f-t1;t2,...' (string).

Parameter: probe_vf
Recommended Value:
• Absent, or zero: No VFs will be used by the PF driver.
• Its value is a single number in the range of 0-63: Physical Function drive
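The two accepted shapes of num_vfs in item 195 (a single number for all HCAs, or a per-PF list) can be illustrated with a small parser. This is a sketch under assumptions, not driver code: the function name is hypothetical, and the per-PF syntax "bb:dd.f-v,bb:dd.f-v" is assumed from the garbled example in this copy of the manual.

```python
def parse_num_vfs(value):
    """Parse a num_vfs value into {PF address: VF count}.

    A plain number applies to every HCA and is returned under the key "*".
    The per-PF form is assumed to be "bb:dd.f-v,bb:dd.f-v,...".
    """
    def check(count):
        if not 0 <= count <= 63:
            raise ValueError("number of VFs must be in the range 0-63")
        return count

    if "-" not in value:
        return {"*": check(int(value))}
    result = {}
    for entry in value.split(","):
        pf_address, count = entry.rsplit("-", 1)
        result[pf_address] = check(int(count))
    return result


print(parse_num_vfs("5"))
print(parse_num_vfs("00:04.0-5,00:07.0-8"))
```

In the per-PF form, any PF that does not appear in the returned mapping would not have SR-IOV enabled, as the note in the table states.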
196. is performed instead (B0 Steering).

When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the flow steering table for the guest/master.

> To enable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Set the parameter log_num_mgm_entry_size to -1 by writing the line:
        options mlx4_core log_num_mgm_entry_size=-1
Step 3. Restart the driver.

> To disable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Remove the line: options mlx4_core log_num_mgm_entry_size=-1
Step 3. Restart the driver.

4.12.2 Flow Domains and Priorities

Flow steering defines the concept of domain and priority. Each domain represents a user agent that can attach a flow. The domains are prioritized. A higher priority domain will always supersede a lower priority domain when their flow specifications overlap. Setting a lower priority value will result in higher priority.

In addition to the domain, there is a priority within each of the domains. Each domain can have at most 2^12 priorities in accordance with its needs.

The following are the domains in descending order of priority:

• User Verbs allows a user application QP to be attached into a specified flow when using the ibv_create_flow and ibv_destroy_flow verbs

  ibv_create_flow
  struct ibv_flow *ibv_create_flow(struct ibv_qp *qp, struct ibv_flow_attr *flow)
  Input parameters:
  • struct ibv_qp *qp - the attached QP.
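The enable and disable procedures in item 196 amount to adding or removing one options line in /etc/modprobe.d/mlnx.conf. The sketch below shows that edit on the file contents as a string; it is illustrative only, the function names are hypothetical, and it assumes the option sits on a line of its own rather than being combined with other mlx4_core options.

```python
FLOW_STEERING_OPTION = "options mlx4_core log_num_mgm_entry_size=-1"


def enable_flow_steering(conf_text):
    """Return conf_text with the flow steering option present exactly once."""
    lines = conf_text.splitlines()
    if FLOW_STEERING_OPTION not in (line.strip() for line in lines):
        lines.append(FLOW_STEERING_OPTION)
    return "\n".join(lines) + "\n"


def disable_flow_steering(conf_text):
    """Return conf_text with the flow steering option removed."""
    lines = [l for l in conf_text.splitlines() if l.strip() != FLOW_STEERING_OPTION]
    return "\n".join(lines) + ("\n" if lines else "")


enabled = enable_flow_steering("options mlx4_core num_vfs=5\n")
print(enabled, end="")
print(disable_flow_steering(enabled), end="")
```

Either change takes effect only after the driver is restarted, as Step 3 of both procedures says.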
197. ith torus-2QoS.

Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated. Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the range 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.

8.5.7.5 Operational Considerations

Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement.

If a change in fabric topology causes changes in path SL values required to route without credit loops, in general all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing by running applications is not required when the
198. ities is displayed. This report includes:

• SM report
• Number of nodes and systems
• Hop-count information: maximal hop-count, an example path, and a hop-count histogram
• All CA-to-CA paths traced
• Credit loop report
• mgid-mlid-HCAs multicast group and report
• Partitions report
• IPoIB report

In case the IB fabric includes only one CA, then CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.

Error Codes

1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package

9.5 ibdiagpath - IB diagnostic path

ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path.

The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID route (or by-name) addressing is employed, then the source and destination ports of a route are specified
199. itrd_ib
    host1$ cd /tmp/initrd_ib

Step 3. Normally, the initrd image is zipped. Extract it using the following command:
    host1$ gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_ib.

Step 4. Create a directory for the InfiniBand modules and copy them.
    host1$ mkdir -p /tmp/initrd_ib/lib/modules/ib
    host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1$ cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
    host1$ cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modu
200. its will be used. When omitted, P_Key will be autogenerated.

flag - used to indicate IPoIB capability of this partition.
defmember=full|limited - specifies default membership for port guid list. Default is limited.

Currently recognized flags are:

ipoib - indicates that this partition may be used for IPoIB; as a result, an IPoIB-capable MC group will be created.
rate=<val> - specifies rate for this IPoIB MC group (default is 3, 10 Gbps)
mtu=<val> - specifies MTU for this IPoIB MC group (default is 4, 2048)
sl=<val> - specifies SL for this IPoIB MC group (default is 0)
scope=<val> - specifies scope for this IPoIB MC group (default is 2, link local)

Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).

PortGUIDs list:

PortGUID - GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
full or limited - indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.

There are two useful keywords for PortGUID definition:

• 'ALL' means all end ports in this subnet
• 'SELF' means subnet manager's port

An empty list means that there are no ports in this partition.

Notes:

• White space is permitted between delimiters ('=', ',', ':', ';').
• The line can be wrapped after ':' after a Partition Definition and between.
• PartitionName does not need to
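The note on IBTA-encoded values ("mtu 4 for 2048") can be illustrated with a small lookup. This is only a sketch: the helper names are hypothetical, and the full code table (1 = 256 through 5 = 4096) is taken from the IBTA MTU enumeration rather than from this manual, which quotes only the single value above.

```python
# Sketch: translate between IBTA MTU codes and byte sizes.
# Assumption: the code table below follows the IBTA enumeration; the manual
# itself only states that code 4 means 2048 bytes.
IBTA_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def mtu_code_to_bytes(code: int) -> int:
    """Return the MTU in bytes for an IBTA MTU code."""
    if code not in IBTA_MTU_BYTES:
        raise ValueError(f"unknown IBTA MTU code: {code}")
    return IBTA_MTU_BYTES[code]


def bytes_to_mtu_code(size: int) -> int:
    """Return the IBTA MTU code to write in the partition file for a byte size."""
    for code, value in IBTA_MTU_BYTES.items():
        if value == size:
            return code
    raise ValueError(f"{size} bytes is not a valid IB MTU")


if __name__ == "__main__":
    # The default IPoIB MC group MTU in the partition file is code 4.
    print(mtu_code_to_bytes(4))
```

A helper of this kind is mainly useful for catching the common mistake of writing the byte size (2048) where the encoded value (4) is expected.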
201. l, it can also be explicitly referred by any match rule.

IV) QoS Matching Rules (denoted by qos-match-rules)

Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, such that the first match takes precedence.

Each rule has a name of the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule.

Queries can be matched by:

• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for destination port)
• PKey
• QoS class
• Service ID

To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule.

For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.

8.6.3 Simple QoS Policy Definition

Simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one crit
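The matching semantics of the advanced qos-match-rules section (rules scanned in file order, first match wins, a query must satisfy every criterion of a rule) can be sketched as follows. This is an illustration only, not OpenSM's implementation: the dict-based representation of rules and queries, the field names, and the level names are assumptions made for the example.

```python
# Sketch of first-match rule selection for PR/MPR queries.
# Assumption: a rule is (criteria, qos_level), where criteria maps a field
# name to the set of accepted values; a query maps field names to values and
# contains only the fields whose component-mask bit is on.

def rule_matches(criteria: dict, query: dict) -> bool:
    """A query matches only if it satisfies ALL of the rule's criteria."""
    for field, accepted in criteria.items():
        # A field missing from the query cannot satisfy a criterion on it.
        if field not in query or query[field] not in accepted:
            return False
    return True


def select_qos_level(rules: list, query: dict, default_level: str = "default") -> str:
    """Scan rules in order of appearance; the first matching rule wins."""
    for criteria, level in rules:
        if rule_matches(criteria, query):
            return level
    return default_level


RULES = [
    ({"service_id": {0x1234}, "pkey": {0x8001}}, "storage"),
    ({"service_id": {0x1234}}, "generic"),
]

if __name__ == "__main__":
    # Only Service ID is set: the first rule also needs a PKey, so it is skipped.
    print(select_qos_level(RULES, {"service_id": 0x1234}))
```

Because the scan stops at the first match, more specific rules need to appear before more general ones in the policy file.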
202. l digits.

Extracting the Port GUID - Method I

To obtain the port GUID, run the following commands:

Note: The following commands assume that the Mellanox Firmware Tools package has been installed on the client machine.

    host1# mst start
    host1# mst status

The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via the following query command:

    flint -d <MST_DEVICE_NAME> q

Example with ConnectX-2 QDR (MHJH29B-XTR Dual 4X IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R6 HCA Card, CX4 Connectors) as the adapter device:

    Image type:   ConnectX
    FW Version:   2.9.1000
    Rom Info:     type=PXE version=3.3.400 devid=26428 proto=VPI
    Device ID:    26428
    Description:  Node             Port1            Port2            Sys image
    GUIDs:        0002c9030005cffa 0002c9030005cffb 0002c9030005cffc 0002c9030005cffd
    MACs:                          0002c905cffa     0002c905cffb
    Board ID:     (MT_0DD0110009)
    VSD:
    PSID:         MT_0DD0110009

Assuming that FlexBoot is connected via Port 1, then the Port GUID is 00:02:c9:03:00:05:cf:fb.

Extracting the Port GUID - Method II

An alternative method for obtaining the port GUID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 8 bytes can be captured from the boot session as shown in the figure below.

Mellanox ConnectX FlexBoot v3.3.400  iPXE 1.0.0     Open So
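Method I ends with reading the Port 1 GUID off the flint output and writing it as eight byte pairs. The conversion can be sketched as below. The function names are hypothetical, and the exact "GUIDs:" line layout and the colon separator are assumptions based on the example output above.

```python
import string

# Sketch: pick the Port1 GUID out of a flint "GUIDs:" line and format it as
# colon-separated bytes. Assumption: the line carries four 16-hex-digit
# values in the order Node, Port1, Port2, Sys image.

def parse_flint_guids(guids_line: str) -> dict:
    """Split a 'GUIDs:' line into its four labelled values."""
    label, _, values = guids_line.partition(":")
    fields = values.split()
    if label.strip() != "GUIDs" or len(fields) != 4:
        raise ValueError("not a flint GUIDs line with four values")
    return dict(zip(("node", "port1", "port2", "sys_image"), fields))


def format_guid(guid_hex: str) -> str:
    """Turn 16 hex digits into eight colon-separated byte pairs."""
    if len(guid_hex) != 16 or any(c not in string.hexdigits for c in guid_hex):
        raise ValueError(f"expected 16 hex digits, got {guid_hex!r}")
    return ":".join(guid_hex[i:i + 2] for i in range(0, 16, 2))


if __name__ == "__main__":
    line = "GUIDs: 0002c9030005cffa 0002c9030005cffb 0002c9030005cffc 0002c9030005cffd"
    print(format_guid(parse_flint_guids(line)["port1"]))
```

If FlexBoot is connected via Port 2 instead, the same helper applies with the "port2" key.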
203. l system        Specifies the local device s port number    used to connect to the IB fabric       Specifies the local port GUID value of the    port used to connect to the IB fabric  If  GUID given is 0 than ibdiagnet displays  a list of possible port GUIDs and waits  for user input          Specifies opensm path records dump file path     src dst to SL mapping generated by SM plugin   ibdiagnet will use this mapping for MADs  sending and credit loop check  if  r option  selected        Provides a report of the fabric qualities     Indicates that UpDown credit loop checking    should be done against automatically determined  roots     Specifies the directory where the output  files will be placed    default   var tmp ibdiagnet2           Skip the executions of the given stage  Applicable    skip stages     all   dup guids   dup node desc   lids     links   sm   pm   nodes info   speed width check    pkey   aguid        Skip the load of the given library name     Applicable skip plugins    libibdiagnet_cable diag plugin    libibdiagnet_cable diag plugin 2 1 1      Reset all the fabric PM counters     If any of the provided PM is greater then  its provided value than print it     Specifies the seconds to wait between first  counters sample and second counters sample   If seconds given is 0 than no second counters  sample will be done    default 1       InfiniBand Fabric Diagnostic Utilities      Rev 2 0 3 0 0                       Provides    BER test for each port  Calculate 
204. le FCA module under Scalable UPC     export GASNET FCA ENABLE CMD LINE 1     gt  To set FCA verbose level     export GASNET FCA VERBOSE CMD LINE 10           gt  To set the minimal number of processes threshold to activate FCA                export GASNET FCA NP CMD LINE 1    Mellanox Technologies 107      Rev 2 0 3 0 0 HPC Features        ScalableUPC contains modules configuration file  http   modules sf net  which can be    found at  opt mellanox bupc 2 2 etc bupc modulefile   Fr    5 5 3 Various Executable Examples    The following are various executable examples    gt  To run a ScalableUPC application without FCA support     9      upcrun  np 128  fca enable 0   executable filename         Torun UPC applications with FCA enabled for any number of processes     export GASNET FCA ENABLE CMD LINE 1 GASNET FCA NP CMD LINE 0    upcrun  np 64   executable filename         Torun UPC application on 128 processes  verbose mode     9      upcrun  np 128  fca enable 1  fca np 10  fca verbose 5   executable filename       gt  To run UPC application  offload to FCA Barrier and Broadcast only     9      upcrun  np 128  fca ops                        executable filename      108 Mellanox Technologies    Rev 2 0 3 0 0    Mellanox Technologies 109      Rev 2 0 3 0 0 Working With VPI    6 Working With VPI    VPI allows ConnectX ports to be independently configured as either IB or Eth     6 1 Port Type Management    ConnectX ports can be individually configured to work as InfiniBand or Ethe
205. les/ib

Step 5. IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:
    host1$ cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules

Step 6. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
    host1$ cp /sbin/insmod /tmp/initrd_ib/sbin/

Step 7. If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
    host1$ cp /sbin/ifconfig /tmp/initrd_ib/sbin

Step 8. If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step.
To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with.
Copy the DHCP client v3.1.3 file and all the relevant files as described below.
    host1$ cp <path to DHCP client v3.1.3>/dhclient /tmp/initrd_ib/sbin
    host1$ cp <path to DHCP client v3.1.3>/dhclient-script /tmp/initrd_ib/sbin
    host1$ mkdir -p /tmp/initrd_ib/var/state/dhcp
    host1$ touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
    host1$ cp /bin/uname /tmp/initrd_ib/bin
    host1$ cp /usr/bin/expr /tmp/initrd_ib/bin
    host1$ cp /sbin/ifconfig /tmp/initrd_ib/bin
    host1$ cp /bin/hostname /tmp/initrd_ib/bin

Step 9. Create a configuratio
206. ling SR IOV Driver                                          86  4 13 6 Burning Firmware with SR IOV                                       86  4 13 7 Configuring Pkeys and GUIDs under SR IOV                            87                                  got us      freed HL pe bogota p 93  4 14 1 CORE Drrect Overview uu cererii dn eda Hea EEA MERI esa 93   AAD  Bthtool   2 voie nto e      eda n 94  4 16 Dynamically Connected Transport                                             95  Chapter 5 HPC Features                                                  96  5 1 Shared Memory Access                                             96  5 1 1 Mellanox ScalableSHMEM                   eens 96   5 1 2  Running SHMEM with FCA                                          97   5 1 3 Running ScalableSHMEM with                                          97   5 1 4  Running SHMEM with Contiguous                                          98   5 1 5 Running ScalableSHMEM Application                                  98   5 2 Message Passing Interface                                           98     98   5 2 2 Prerequisites for Running MPI           teens 99   5 2 3 MPI Selector   Which MPI Runs                                      100   5 2 4 Compiling MPI Applications                                         100   5 3 MellanoX Messaging                                              101  5 3 1 Compiling OpenMPI with MXM                                      101   5 3 2 Enabling MXM in OpenMPI               
207. ll affect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in a host byte order; it is fixed now, but you may need this option to interoperate with old OpenSM running on a little endian machine.

--reassign_lids, -r
    This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.

--routing_engine, -R <engine name>
    This option chooses routing engine(s) to use instead of the default Min Hop algorithm. Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. If all configured routing engines fail, OpenSM will always attempt to route with Min Hop unless 'no_fallback' is included in the list of routing engines.
    Supported engines: updn, file, ftree, lash, dor, torus-2QoS

--do_mesh_analysis
    This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing.

--lash_start_vl <vl number>
    Sets the starting VL to use for the lash routing algorithm. Defaults to 0.

--sm_sl <sl number>
    Sets the SL to use to communicate with the SM/SA. Defaults to 0.

--con
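The fallback behaviour described for the routing engine option (engines tried in the order given, then Min Hop unless 'no_fallback' is listed) can be sketched as a small function. This illustrates the stated rule only and is not OpenSM code; the function name is hypothetical, and the "minhop" label used for the Min Hop fallback is an assumption.

```python
# Sketch: compute the order in which routing engines would be attempted for a
# comma-separated engine list. Assumption: "minhop" is used here as the label
# for the default Min Hop algorithm.

def routing_attempt_order(engine_list: str) -> list:
    """Return the engines in the order they are tried, including any fallback."""
    engines = [name.strip() for name in engine_list.split(",") if name.strip()]
    fallback_allowed = "no_fallback" not in engines
    order = [name for name in engines if name != "no_fallback"]
    # Min Hop is attempted last unless the list explicitly disables fallback.
    if fallback_allowed and "minhop" not in order:
        order.append("minhop")
    return order


if __name__ == "__main__":
    print(routing_attempt_order("ftree,updn"))
    print(routing_attempt_order("torus-2QoS,no_fallback"))
```

Listing 'no_fallback' is the stricter choice: if every listed engine fails, Min Hop is not attempted.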
208. llowing software components:

• Mellanox Host Channel Adapter Drivers
    • mlx5, mlx4 (VPI), which is split into multiple modules:
        • mlx4_core (low-level helper)
        • mlx4_ib (IB)
        • mlx5_ib
        • mlx5_core
        • mlx4_en (Ethernet)
• Mid-layer core
    • Verbs, MADs, SA, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
    • IPoIB, RDS, SRP Initiator, SRP
    NOTE: RDS was not tested by Mellanox Technologies.
• MPI
    • Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces
    • OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces
    • MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
• OpenSM: InfiniBand Subnet Manager
• Utilities
    • Diagnostic tools
    • Performance tests
• Firmware tools (MFT)
• Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
• Documentation

1.2.3 Firmware

The ISO image includes the following firmware items:

• Firmware images (.mlx format) for ConnectX-3/ConnectX-3 Pro/Connect-IB network adapters
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX-3 HCA devices

1.2.4 Directory Structure

The ISO image of MLNX_OFED_LINUX contains the following files and directories:

• mlnxofedinstall - This is the MLNX_OFED_LINUX installation script.
• ofed_uninstall.sh - This is the MLNX_OFED_L
209. lnx RPM   Preparing  ak HH H      HH HH H HH H HH HH H Het HH H HHH HHH   H HH HH  knem mlnx   H      HH HH H  H   HH HHH HHH HHH HH HHH HHH HH HH  Installing kmod knem mlnx RPM   Prepari ng  7 HH H      HH HHH HHH HH HH H        H  H  HHH HHH HH HH  kmod knem mlnx HH H HHH HH HH H HHH HH   H      HH HHHH HH H HH H HH HH  Installing mpi selector RPM   Preparing     HH H      HH HH H HHH HH   H Het HH          HH H HH H HH HH  mpi sel ector Het HHH HH HH H HHH HH HHH Het   H  H  HH H HHH HH HH                                                                                                                                                                                                                                                                                                                28 Mellanox Technologies    Installing user level RPMs   Preparing      ofed scripts   Preparing     libibverbs  Preparing     libibverbs  Preparing     libibverbs devel  Preparing     libibverbs devel  Preparing     libibverbs devel static  Preparing     libibverbs devel static  Preparing     libibverbs utils  Preparing      libmlx4  Preparing     libmlx4  Preparing     libmlx4 devel  Preparing     libmlx4 devel  Preparing     libmlx5  Preparing     libmlx5  Preparing     libmlx5 devel  Preparing     libmlx5 devel  Preparing     libexgb3  Preparing     libexgb3  Preparing     libexgb3 devel  Preparing     libexgb3 devel  Preparing     libcxgb4                                               
210. log          LOG SIZE    size  in MB gt        This option defines maximal AR Manager log  file size in MB  The logfile will be truncated and  restarted upon reaching this limit    This option cannot be changed on the fly        0  unlimited log file size   Default  5          8 8 5 1 1 Per switch AR Options    A user can provide per switch configuration options with the following syntax     Mellanox Technologies 165      Rev 2 0 3 0 0 OpenSM     Subnet Manager      SWITCH  lt GUID gt      lt switch option 1 gt     lt switch option 2 gt         The following are the per switch options     Table 14   Adaptive Routing Manager Pre Switch Options File          Option File Description Values  ENABLE  Allows you to enable disable the AR on this Default  true   lt true false gt  switch  If the general ENABLE option value is   set to  false   then this per switch option is  ignored     This option can be changed on the fly        AGEING_TIME  Applicable to bounded AR mode only  Specifies   Default  30   lt usec gt  how much time there should be no traffic in  order for the switch to declare a transmission  burst as finished and allow changing the output  port for the next transmission burst  32 bit  value     In the pre switch options file this option refers to  the particular switch only   This option can be changed on the fly                    8 8 5 1 2 Example of Adaptive Routing Manager Options File    ENABLE  true    LOG FILE   tmp ar_mgr log   LOG SIZE  100    MAX ERRORS  10  
211. loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.

8.5.7.6 Torus-2QoS Configuration File Syntax

The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.

    [torus|mesh] x_radix[m|M|t|T] y_radix[m|M|t|T] z_radix[m|M|t|T]

Either torus or mesh must be the first keyword in the configuration, and sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology.

Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).

    xp_link sw0_GUID sw1_GUID
    yp_link sw0_GUID sw1_GUID
    zp_link sw0_GUID sw1_GUID
    xm_link sw0_GUID sw1_GUID
    ym_link sw0_GUID sw1_GUID
    zm_link sw0_GUID sw1_GUID

These keyw
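The equivalence of "mesh 3T 4 5" and "torus 3 4M 5M" stated in the topology line description can be checked with a short parser. This is a sketch of the rule as written in the text, not the OpenSM parser; the function name and the (radix, looped) tuple representation are assumptions.

```python
# Sketch: normalise a torus-2QoS topology line into one (radix, looped) pair
# per dimension. The leading keyword sets the default for each dimension and
# an m/M or t/T suffix overrides it for that dimension only.

def parse_topology(line: str) -> list:
    """Parse '[torus|mesh] x_radix y_radix z_radix' into [(radix, looped), ...]."""
    tokens = line.split()
    if not tokens or tokens[0] not in ("torus", "mesh"):
        raise ValueError("first keyword must be 'torus' or 'mesh'")
    radix_tokens = tokens[1:4]  # any further tokens on the line are ignored
    if len(radix_tokens) != 3:
        raise ValueError("expected x, y and z radix specifications")
    default_looped = tokens[0] == "torus"
    dims = []
    for token in radix_tokens:
        looped = default_looped
        if token[-1] in "mM":      # mesh: open dimension
            looped, token = False, token[:-1]
        elif token[-1] in "tT":    # torus: looped dimension
            looped, token = True, token[:-1]
        dims.append((int(token), looped))
    return dims


if __name__ == "__main__":
    print(parse_topology("mesh 3T 4 5") == parse_topology("torus 3 4M 5M"))
```

Normalising both spellings to the same per-dimension form makes it easy to compare two configuration files that describe the same fabric differently.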
212. m -F <options file name>

See an example of AR Manager options file with all the default values in "Example of Adaptive Routing Manager Options File" on page 166.

8.8.3.2 Disabling Adaptive Routing

There are two ways to disable Adaptive Routing Manager:

1. By disabling it explicitly in the Adaptive Routing configuration file.
2. By removing the 'armgr' option from the Subnet Manager options file.

Note: The Adaptive Routing mechanism is automatically disabled once the switch receives setting of the usual linear routing table (LFT). Therefore, no action is required to clear Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.

8.8.4 Querying Adaptive Routing Tables

When Adaptive Routing is active, the content of the usual Linear Forwarding Routing Table on the switch is invalid, thus the standard tools that query LFT (e.g. smpquery, dump_lfts.sh, and others) cannot be used. To query the switch for the content of its Adaptive Routing table, use the 'smparquery' tool that is installed as a part of the Adaptive Routing Manager package. To see its usage details, run 'smparquery -h'.

8.8.5 Adaptive Routing Manager Options File

The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following:

1. Add 'armgr --conf_file <ar-mgr-options-file-name>
213. min_guids sysfs interface.

To configure the GUID at index <n> on port <port_num>:

    cd /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids
    echo <your desired guid> > n

Example:

    cd /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids
    echo "0x002fffff8118" > 3

Note:
• echo "0x0" means let the SM assign a value to that GUID
• echo "0xffffffffffffffff" means delete that GUID
• echo <any other value> means request the SM to assign this GUID to this index

Step 3. Read the administrative status of the GUID index.
To read the administrative status of GUID index m on port n:

    cat /sys/class/infiniband/mlx4_0/iov/ports/<n>/admin_guids/<m>

Step 4. Check the operational state of a GUID.

    /sys/class/infiniband/mlx4_0/iov/ports/<n>/gids (where n = 1 or 2)

The values indicate what gids are actually configured on the firmware/hardware, and all the entries are R/O.

Step 5. Compare the value you read under the "admin_guids" directory at that index with the value under the "gids" directory, to verify the change requested in Step 3 has been accepted by the SM, and programmed into the hardware port GID table.

If the value under admin_guids/<m> is different from the value under gids/<m>, the request is still in progress.

4.13.7.2.3 Partitioning IPoIB Communication using PKeys

PKeys are used to partition IPoIB communication between the Virtual Machines a
214. n 30  BW                Traffic class  IPoIB  Service Level  3    Policy  min 1   k  IT   D      LA          App A Server    Virtual Server  App B Server    8 7 QoS Configuration Examples    The following are examples of QoS configuration for different cluster deployments  Each exam   ple provides the QoS level assignment and their administration via OpenSM configuration files     8 7 1 Typical HPC Example  MPI and Lustre    Assignment of QoS Levels     MPI     Separate from I O load     Min BW of 70     Storage Control  Lustre MDS      Low latency     Storage Data  Lustre OST      Min BW 30     Administration       MPI 15 assigned an SL via the command line    host1  mpirun  s1 0       OpenSM QoS policy file      In the following policy file example  replace        and        with the real port  GUIDs     Mellanox Technologies 159      Rev 2 0 3 0 0 OpenSM     Subnet Manager      8 7 2    qos ulps   default  0   default SL  for  MPT    any  target port guid OST1 0ST2 0ST3 0ST4 1   SL for Lustre OST   any  target port guid MDS1 MDS2  2   SL for Lustre  MDS    end qos ulps    OpenSM options file    qos max vls 8   qos high limit 0   qos vlarb high 2 1   qos vlarb low 0 96 1 224   Cos SAW 01 2210 9 0907  109       ls 309        10  IS    EDC SOA  2 tier   IPoIB and SRP    The following is an example of QoS configuration for a typical enterprise data center  EDC  with  service oriented architecture  SOA   with IPoIB carrying all application traffic and SRP used for  storage     
215. n file for the DHCP client (as described in Section 4.3.3.1) and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):

dhclient.conf:

    # The value indicates a hexadecimal number
    # For a ConnectX device
    interface "ib0" {
    send dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
    }

Step 10. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded.

Note: The order of the following commands (for loading modules) is critical.

    echo "loading ipv6"
    /sbin/insmod /lib/modules/ipv6.ko
    echo "loading IB driver"
    /sbin/insmod /lib/modules/ib/ib_addr.ko
    /sbin/insmod /lib/modules/ib/ib_core.ko
    /sbin/insmod /lib/modules/ib/ib_mad.ko
    /sbin/insmod /lib/modules/ib/ib_sa.ko
    /sbin/insmod /lib/modules/ib/ib_cm.ko
    /sbin/insmod /lib/modules/ib/ib_uverbs.ko
    /sbin/insmod /lib/modules/ib/ib_ucm.ko
    /sbin/insmod /lib/modules/ib/ib_umad.ko
    /sbin/insmod /lib/modules/ib/iw_cm.ko
    /sbin/insmod /lib/modules/ib/rdma_cm.ko
    /sbin/insmod /lib/modules/ib/rdma_ucm.ko
    /sbin/insmod /lib/modules/ib/mlx4_core.ko
    /sbin/insmod /lib/modules/ib/mlx4_ib.ko
    /sbin/insmod /lib/modules/ib/ib_mthca.ko

Note: The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release note
216. n routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.

8.5.3.1 UPDN Algorithm Usage

Activation through OpenSM:

• Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
• Use '-a <root_guid_file>' for adding an UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.

Notes on the guid list file:

1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.

8.5.4 Fat-tree Routing Algorithm

The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, any Constant Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also preven
217. nable 0 -mca coll_fca_enable 0

For more details on FCA installation and configuration, please refer to the FCA User Manual found on the Mellanox website.

5.1.3 Running ScalableSHMEM with MXM

The MellanoX Messaging (MXM) library provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:

• Multiple transport support including RC, XRC and UD
• Proper management of HCA resources and memory structures
• Efficient memory registration
• One-sided communication semantics
• Connection management
• Receive side tag matching
• Intra-node shared memory communication

These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.

5.1.4 Running SHMEM with Contiguous Pages

Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

To activate MLNX_OFED 2.0 and the contiguous pages allocator with SHMEM:
Run the following argument to enable compound pages with SHMEM:

    /opt/mellanox/openshmem/2.1/bin/shmemrun -mca s
218. nd.

Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, there is need for a different initiator_ext value on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path.

If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:

    id ext 200500A0B81146A1 ioc 9011 0002  90200402           dgid fe800000000000000002c90200402bed                        service id 200500a0b81146a1 initiator ext ed2b400002c90200

Notes:

1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRPHA_ENABLE to yes).

4.1.2.6 High Availability (HA)

Overview

High Availability works using the Device-Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices.

Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each S
219. nd the Dom0 by mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a virtual pkey index other than zero.

The below describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2.

In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2 will not be able to communicate with each other, nor with Dom0.

This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm1s will have the same PKey and both vm2s will have the same PKey (different from the vm1's), and the Dom0's will have the default PKey (different from the vm's PKeys at index 0).

OpenSM must be used to configure the physical PKey tables on both hosts.

• The physical PKey table on both hosts (Dom0) will be configured by OpenSM to be:

    index 0 = 0xffff
    index 1 = 0xb000
    index 2 = 0xb030

• The vm1's virt-to-physical PKey mapping will be:

    pkey_idx 0 = 1
    pkey_idx 1 = 0

• The vm2's virt-to-phys PKey mapping will be:

    pkey_idx 0 = 2
    pkey_idx 1 = 0

so that the default PKey will reside on the vms at index 1 instead of at index 0.

The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs will all use different PKeys.

To partition IPoIB commu
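The effect of the virtual-to-physical PKey mappings listed above can be traced with a short lookup. This is only a model of the tables given in the text; the dictionary layout, the function names, and the identity mapping used for Dom0 are assumptions made for illustration.

```python
# Sketch: resolve which physical PKey each guest sees at a virtual index.
# The physical table and the vm1/vm2 mappings are the values from the text.
PHYSICAL_PKEYS = [0xFFFF, 0xB000, 0xB030]  # physical table set by OpenSM

VIRT_TO_PHYS = {
    "dom0": {0: 0, 1: 1, 2: 2},  # assumption: Dom0 sees the physical table as is
    "vm1": {0: 1, 1: 0},
    "vm2": {0: 2, 1: 0},
}


def guest_pkey(guest: str, virt_index: int) -> int:
    """Return the PKey a guest sees at the given virtual PKey index."""
    return PHYSICAL_PKEYS[VIRT_TO_PHYS[guest][virt_index]]


def ipoib_pkey(guest: str) -> int:
    """IPoIB QPs are created to use the PKey at index 0."""
    return guest_pkey(guest, 0)


if __name__ == "__main__":
    for name in ("dom0", "vm1", "vm2"):
        print(name, hex(ipoib_pkey(name)))
```

Since the three guests resolve index 0 to three different PKeys, their IPoIB traffic ends up in three separate partitions, while the default PKey remains visible to the VMs at index 1.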
220. nect_roots, -z
    This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases, this can violate the "pure" deadlock-free algorithm, so use it carefully.

--ucast_cache, -A
    This option enables unicast routing cache to prevent routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. in case of host reboot. This option becomes very handy when the cluster size is thousands of nodes.

--lid_matrix_file, -M <file name>
    This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.

--lfts_file, -U <file name>
    This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.

--sadb_file, -S <file name>
    This option specifies the name of the SA DB dump file from where SA database will be loaded.

--root_guid_file, -a <path to file>
    Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--cn_guid_file, -u <path to file>
    Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--io_guid_file, -G <path t
221. need defining them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.

The following example shows all the possible options and keywords in the policy file and their syntax:

    #
    # See the comments in the following example.
    # They explain different keywords and their meaning.
    #
    port-groups
        port-group # using port GUIDs
            name: Storage
            # "use" is just a description that is used for logging
            # Other than that, it is just a comment
            use: SRP Targets
            port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
            port-guid: 0x1000000000FFFF
        end-port-group

        port-group
            name: Virtual Servers
            # The syntax of the port name is as follows:
            #   "node_description/Pnum".
            # node_description is compared to the NodeDescription of the node,
            # and "Pnum" is a port number on that node.
            port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
        end-port-group

        # using partitions defined in the partition policy
        port-group
            name: Partitions
            partition: Part1
            pkey: 0x1234
        end-port-group

        # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
        # or ALL (for all the nodes in the subnet)
        port-group
            name: CAs and SM
            node-type: CA, SELF
        end-port-group
    end-port-groups

    qos-setup
        # This section of the policy file describes how to set up SL2VL and VL 
222. nes the order on which the ports would be chosen for routing.

8.6 Quality of Service Management in OpenSM

8.6.1 Overview

When Quality of Service (QoS) in OpenSM is enabled (using the "-Q" or "--qos" flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows:

- The request is matched against the defined matching rules such that the QoS Level definition is found.
- Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.

Figure 4: QoS Manager (diagram labels: Administrator, QoS Policy Config File, QoS Manager, OSM, InfiniBand subnet with OFED 1.3 based nodes)

There are two ways to define QoS policy:

- Advanced: the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to enforce various QoS constraints on the requested PR/MPR.
- Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs.

8.6.2 Advanced QoS Policy File

The
223. ng
    -d2  - Force log flushing after each log message
    -d3  - Disable multicast support
    -d10 - Put OpenSM in testability mode
    Without -d, no debug options are enabled.

--help, -h
    Display this usage info then exit.

8.2.2 Environment Variables

The following environment variables control opensm behavior:

- OSM_TMP_DIR
  Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log/.

- OSM_CACHE_DIR
  opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
  guid2lid - stores the LID range assigned to each GUID

8.2.3 Signaling

When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.

Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.

8.2.4 Running opensm

The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.

To run opensm in the default mode, simply enter:

    host1# opensm

Note that opensm needs to be run on at least one m
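As a small illustration of the two environment variables, the following shell sketch resolves the directories opensm would use, falling back to the defaults stated above. The helper names `osm_tmp_dir` and `osm_cache_dir` are hypothetical, and treating an empty variable the same as an unset one is an assumption of this sketch rather than documented opensm behavior.

```shell
#!/bin/sh
# Print the directory for opensm temporary files (opensm-subnet.lst,
# opensm.fdbs, opensm.mcfdbs). Assumption: empty is treated like unset.
osm_tmp_dir() {
    printf '%s\n' "${OSM_TMP_DIR:-/var/log/}"
}

# Print the directory holding cached data such as guid2lid.
osm_cache_dir() {
    printf '%s\n' "${OSM_CACHE_DIR:-/var/cache/opensm}"
}

echo "temporary files: $(osm_tmp_dir)"
echo "cache directory: $(osm_cache_dir)"
```

Exporting `OSM_TMP_DIR` or `OSM_CACHE_DIR` before starting opensm changes the corresponding location; with neither set, the defaults apply.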
224. nge 0 to 15 in their header SL field.

- Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL).
- The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.

DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.

4.4.2 QoS Architecture

QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the "chronology approach" to describe how the overall system works.

1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS Levels. The policy also defines how to decide which QoS Level each application, ULP or service uses.

2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS Class, Service ID and PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also 
225. nication using PKeys:

Step 1. Create a file "/etc/opensm/partitions.conf" on the host on which OpenSM runs, containing the lines:

    Default=0x7fff,ipoib : ALL=full ;
    Pkey1=0x3000,ipoib : ALL=full;
    Pkey3=0x3030,ipoib : ALL=full;

This will cause OpenSM to configure the physical port PKey tables on all physical ports on the network as follows:

    pkey idx | pkey value
    ---------|-----------
    0        | 0xFFFF
    1        | 0xB000
    2        | 0xB030

(the most significant bit indicates if a PKey is a full PKey).

The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.

Step 2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.

Step a. Check the PCI ID for the Physical Function and the Virtual Functions.

    lspci | grep Mel

Step b. Assuming that on Host1 the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0", on Host1 do the following:

    cd /sys/class/infiniband/mlx4_0/iov
    0000:02:00.0 0000:02:00.1 0000:02:00.2 ...

1. 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical function. 0000:02:00.X contain the virtual-to-physical mapping tables for the virtual functions.

Do not touch the Dom0 mapping table (under <nnnn>:<nn>:00.0). Modify only tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that vm1 uses VF 0000:02:00.1 and vm2 uses VF 0000:02:00.2.

Step c. Configure the virtual-to-physical PKey mapping for the VMs.

    echo 0 > 0000:02
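The relationship between the values written in partitions.conf and the values that appear in the physical PKey table follows from the membership bit mentioned above: a full-membership PKey has its most significant bit (0x8000) set. A minimal shell sketch of that arithmetic, with hypothetical helper names `pkey_full` and `pkey_is_full`, and assuming table indices follow the order of the lines in the file as in the example:

```shell
#!/bin/sh
# pkey_full PKEY: set the full-membership bit (0x8000) of a 16-bit PKey.
pkey_full() {
    printf '0x%04x\n' $(( $1 | 0x8000 ))
}

# pkey_is_full PKEY: succeed when the full-membership bit is set.
pkey_is_full() {
    [ $(( $1 & 0x8000 )) -ne 0 ]
}

# Reproduce the table from the three partition definitions above.
idx=0
for pkey in 0x7fff 0x3000 0x3030; do
    echo "$idx = $(pkey_full "$pkey")"
    idx=$((idx + 1))
done
```

This is why 0x3000 and 0x3030 in the configuration file show up as 0xB000 and 0xB030 in the port tables when the members are declared with full membership.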
226. numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1.

-s LIST, --tsa LIST
    Transmission algorithm for each TC. LIST is comma separated algorithm names for each TC. Possible algorithms: strict, ets. Example: ets,strict,ets sets TC0,TC2 to ETS and TC1 to strict. The rest are unchanged.

-t LIST, --tcbw LIST
    Set minimal guaranteed %BW for ETS TCs. LIST is comma separated percents for each TC. Values set to TCs that are not configured to the ETS algorithm are ignored, but must be present. Example: if TC0,TC2 are set to ETS, then 10,0,90 will set TC0 to 10% and TC2 to 90%. Percents must sum to 100.

-r LIST, --ratelimit LIST
    Rate limit for TCs (in Gbps). LIST is a comma separated Gbps limit for each TC. Example: 1,8,8 will limit TC0 to 1 Gbps, and TC1,TC2 to 8 Gbps each.

-i INTF, --interface INTF
    Interface name.

-a
    Show all interface's TCs.

Get Current Configuration:

Set ratelimit: 3 Gbps for tc0, 4 Gbps for tc1 and 2 Gbps for tc2:

Configure QoS: map UP 0,7 to tc0, 1,2,3 to tc1 and 4,5,6 to tc2; set tc0,tc1 as ets and tc2 as strict; divide ets 30% for tc0 and 70% for tc1:

    tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
        up: 0
            skprio: 0
            skprio: 1
            skprio: 2 (tos: 8)
            skprio: 3 
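Since the ETS percentages must sum to 100 and every TC needs an entry even when it is not an ETS class, it can help to check the lists before applying them. The sketch below only prints a command line and does not run it. The helper names `sum_csv` and `build_qos_cmd` and the interface name `eth3` are hypothetical; only the `-i`, `-s` and `-t` options described above are used.

```shell
#!/bin/sh
# sum_csv LIST: print the sum of a comma-separated list of integers.
sum_csv() {
    total=0
    old_ifs=$IFS; IFS=,
    for v in $1; do total=$((total + v)); done
    IFS=$old_ifs
    echo "$total"
}

# build_qos_cmd INTF TSA TCBW: print (do not execute) an mlnx_qos command
# line, refusing bandwidth lists whose percentages do not sum to 100.
build_qos_cmd() {
    if [ "$(sum_csv "$3")" -ne 100 ]; then
        echo "tcbw percentages must sum to 100: $3" >&2
        return 1
    fi
    echo "mlnx_qos -i $1 -s $2 -t $3"
}

# tc0 and tc1 as ETS with 30% and 70%; tc2 strict, so its 0 is ignored
# by ETS but must still be present in the list.
build_qos_cmd eth3 ets,ets,strict 30,70,0
```

Printing the command first makes it easy to review the per-TC lists side by side before changing the adapter configuration.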
227. o file>
    Set the I/O nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--port-shifting
    Attempt to shift port routes around to remove alignment problems in routing tables.

--scatter-ports <random seed>
    Randomize the best port chosen for a route.

--max_reverse_hops, -H <hop_count>
    Set the max number of hops the wrong way around an I/O node is allowed to do (connectivity for I/O nodes on top switches).

--ids_guid_file, -m <path to file>
    Name of the map file with the set of IDs which will be used by the Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).

--guid_routing_order_file, -X <path to file>
    Set the order port guids will be routed for the MinHop and Up/Down routing algorithms to the guids provided in the given file (one to a line).

--torus_config <path to file>
    This option defines the file name for the extra configuration info needed for the torus-2QoS routing engine. The default name is "/etc/opensm/torus-2QoS.conf".

--once, -o
    This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.

--sweep, -s <interval>
    This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.

--timeout, -t <milliseconds>
    This option specifies the time in milliseconds used
228. o this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.

The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.

1. Ports that are connected to the same remote switch are referenced as a "port group".
2. The list of compute nodes (CNs) can be specified by the "-u" or "--cn_guid_file" OpenSM options.

8.5.4.1 Routing between non-CN Nodes

The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.

[Figure: fat-tree scheme showing Spine1, Spine2 and Spine3, the switches below them, and the non-CN nodes]

Going do
229. ock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.

Extending this argument shows that in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:

[Diagram: 6x6 2D torus with x and y coordinates 0 to 5, showing the switches referenced in the text below]

Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set.

As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent 
230. ode. Assume the answer is "yes" to all questions.

Table 29 - mstflint Switches (Sheet 3 of 3)

    Switch         | Affected/Relevant Commands | Description
    ---------------|----------------------------|------------
    -no            | All                        | Non-interactive mode. Assume the answer is "no" to all questions.
    -vsd <string>  | burn                       | Write this string of up to 208 characters to VSD upon a burn command.
    -use_image_ps  | burn                       | Burn vsd as it appears in the given image; do not keep the existing VSD on Flash.
    -dual_image    | burn                       | Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
    -v             |                            | Print version info.

Table 30 - mstflint Commands

    Command                          | Description
    ---------------------------------|------------
    b[urn]                           | Burn Flash
    q[uery]                          | Query miscellaneous Flash/firmware characteristics
    v[erify]                         | Verify the entire Flash
    bb                               | Burn Block: burn the given image as is, without running any checks
    sg                               | Set GUIDs
    ri <out-file>                    | Read the firmware image on the Flash into the specified file
    dc <out-file>                    | Dump Configuration: print a firmware configuration file for the given image to the specified output file
    e[rase] <addr>                   | Erase sector
    rw <addr>                        | Read one DWORD from Flash
    ww <addr> <data>                 | Write one DWORD to Flash
    wwne <addr>                      | Write one DWORD to Flash without sector erase
    wbne <addr> <size> <data ...>    |
231. oduces the following files in the output directory (which is defined by the -o option described below).

Synopsis

    [-i|--device <dev-name>] [-p|--port <port-num>]
    [-g|--guid <GUID in hex>] [--vlr <file>]
    [-r|--routing] [-u|--fat_tree] [-o|--output_path <directory>]
    [--skip <stage>] [--skip_plugin <library name>]
    [--pc] [-P|--counter <<PM>=<value>>]
    [--pm_pause_time <seconds>] [--ber_test] [--ber_use_data]
    [--ber_thresh <value>] [--extended_speeds <dev-type>]
    [--pm_per_lane] [--ls <2.5|5|10|14|25|FDR10>]
    [--lw <1x|4x|8x|12x>] [-w|--write_topo_file <file name>]
    [-t|--topo_file <file>] [--out_ibnl_dir <directory>]
    [--screen_num_errs <num>] [--smp_window <num>]
    [--gmp_window <num>] [--max_hops <max-hops>]
    [-V|--version] [-h|--help] [-H|--deep_help]

Options

-i|--device <dev-name>
    Specifies the name of the device of the port used to connect to the IB fabric (in case of multiple devices on the loca
232. omotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative.

This release of the uDAPL reference implementation package for both the DAT 1.2 and 2.0 specifications is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack.

For more information about the DAT collaborative, go to the following site:

http://www.datcollaborative.org

1.3.5 MPI

Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:

- Open MPI: an open source MPI-2 implementation by the Open MPI Project
- OSU MVAPICH: an MPI-1 implementation by Ohio State University

Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.

1.3.6 InfiniBand Subnet Manager

All InfiniBand-compliant ULPs require proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 8, "OpenSM - Subnet Manager".

1.3.7 Diagno
233. options mlx4_en enable_sys_tune=1

7.2.4.3 OS Controlled Power Management

Some operating systems can override the BIOS power management configuration and enable C-states by default, which results in a higher latency.

To resolve the high latency issue, please follow the instructions below:

1. Edit the /boot/grub/grub.conf file or any other bootloader configuration file.
2. Add the following kernel parameters to the bootloader command:
       intel_idle.max_cstate=0 processor.max_cstate=1
3. Reboot the system.

Example:

    title RH6.2x64
    root (hd0,0)
    kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1

7.2.5 Interrupt Moderation

Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.

    > ethtool -c eth1
    Coale
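After the reboot in step 3, it is worth confirming that both kernel parameters actually reached the running kernel. A minimal sketch, assuming a Linux system where the boot command line is readable from /proc/cmdline; the helper name `has_cstate_params` is hypothetical:

```shell
#!/bin/sh
# has_cstate_params CMDLINE: succeed only if both C-state limits from the
# procedure above appear as whole words in the given kernel command line.
has_cstate_params() {
    case " $1 " in
        *" intel_idle.max_cstate=0 "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" processor.max_cstate=1 "*) ;;
        *) return 1 ;;
    esac
    return 0
}

# Check the running kernel when /proc/cmdline is available.
if [ -r /proc/cmdline ]; then
    if has_cstate_params "$(cat /proc/cmdline)"; then
        echo "C-state limits are active"
    else
        echo "C-state limits are missing from the kernel command line"
    fi
fi
```

The function takes the command line as an argument so the same check can be applied to a candidate bootloader entry before rebooting.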
234. ords are used to seed the torus/mesh topology. For example, "xp_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the positive x direction, while "xm_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch.

In general, it is not necessary to configure both the positive and negative directions for a given coordinate; either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases both the positive and negative coordinate directions must be specified.

Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration.

    x_dateline position
    y_dateline position
    z_dateline position

In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior.

The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinate radix-1 for that dimens
235. ou choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver as described in Section A.8.

A.10.1 Configuring an iSCSI Target in Linux Environment

Prerequisites

Step 1. Make sure that an iSCSI Target is installed on your server side.

You can download and install an iSCSI Target from the following location:
http://sourceforge.net/projects/iscsitarget/files/iscsitarget/

Step 2. Dedicate a partition on your iSCSI Target on which you will later install the operating system.

Step 3. Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:

    Lun 0 Path=/dev/sda5,Type=fileio

Example of an iSCSI Target iqn line:

    Target iqn.2007-08.7.3.4.10:iscsiboot

Step 4. Start your iSCSI Target.

Example:

    host1# /etc/init.d/iscsitarget start

Configuring the DHCP Server to Boot From an iSCSI Target

Configure DHCP as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP".

Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target:

    Filename "";
    option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn";

The following is an example for configuring an IB/ETH d
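The root-path value follows the iPXE convention `iscsi:<servername>:<protocol>:<port>:<LUN>:<targetname>`, with the three middle fields left empty in the line above. A small shell sketch that assembles the value; the helper name `iscsi_root_path` is hypothetical, the address 192.0.2.10 is a documentation placeholder, and the IQN is the one from the example target line, with its punctuation reconstructed as an assumption:

```shell
#!/bin/sh
# iscsi_root_path TARGET_IP TARGET_IQN: print a root-path value with the
# protocol, port and LUN fields left empty so that defaults are used.
iscsi_root_path() {
    printf 'iscsi:%s::::%s\n' "$1" "$2"
}

# Placeholder address and the example IQN; substitute your own values.
iscsi_root_path 192.0.2.10 iqn.2007-08.7.3.4.10:iscsiboot
```

The printed string is what goes between the quotes of the `option root-path` line in the DHCP configuration.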
236. owing:

- Discovers the currently installed kernel
- Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
- Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

2.3.1 Pre-installation Notes

- The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

  Pre-existing configuration files will be saved with the extension ".conf.rpmsave".

- If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).

- If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the "mlnx_add_kernel_support.sh" script located under the docs/ directory.

  On Redhat and SLES distributions with an errata kernel installed there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.

Usage 
237. pendix A: Mellanox FlexBoot

A.1 Overview

Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet.

Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products.

FlexBoot is based on the open source project iPXE available at http://www.ipxe.org.

FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other service.

For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com > Products > Software > InfiniBand/VPI Drivers).

The binary code is exported by the device as an expansion ROM image.

A.1.1 Tested Platforms

See the Mellanox FlexBoot Release Notes (FlexBoot_release_notes.txt).

A.1.2 FlexBoot in Mellanox OFED

The FlexBoot package is provided as a ta
238. port 1 of the second HCA.
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered scsi disks)

Example:
Assume that you use port 1 of the first HCA in the system, i.e., mthca0.

    root@lab104# ibsrpdm -c -d /dev/infiniband/umad0
    id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
    root@lab104# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

OR

- You can edit /etc/infiniband/openib.conf to load the SRP driver and SRP High Availability (HA) daemon automatically, that is, set "SRP_LOAD=yes" and "SRPHA_ENABLE=yes".
- To set up and use the HA feature, you need the dm-multipath driver and multipath tool.
- Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable/use the HA feature.

The following is an example of an SRP Target setup file:

    *********************** srpt.sh *******************************
    #!/bin/sh
    modprobe scst scst_threads=1
    modprobe scst_vdisk scst_vdisk_ID=100

    echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk3 /dev/sdd BLOCKIO" > /p
239. ported. Run:

    lspci | grep Mellanox
    03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
    03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
    03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)

Where:
- "03:00" represents the Physical Function
- "03:00.X" represents the Virtual Function connected to the Physical Function

4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup

To enable SR-IOV and Para Virtualization on the same setup:

Step 1. Create a bridge.

    vim /etc/sysconfig/network-scripts/ifcfg-bridge0
    DEVICE=bridge0
    TYPE=Bridge
    IPADDR=<bridge IP address>
    NETMASK=255.255.0.0
    BOOTPROTO=static
    ONBOOT=yes
    NM_CONTROLLED=no
    DELAY=0

Step 2. Change the related interface (in the example below bridge0 is created over eth5).

    DEVICE=eth5
    BOOTPROTO=none
    STARTMODE=on
    HWADDR=00:02:c9:2e:66:52
    TYPE=Ethernet
    NM_CONTROLLED=no
    ONBOOT=yes
    BRIDGE=bridge0

Step 3. Restart the service network.
240. ppears (see figure).

    Press Ctrl-B for the iPXE command line...

Alternatively, you may skip invoking the CLI right after POST and invoke it, instead, right after FlexBoot starts booting.

Once the CLI is invoked, you will see the following prompt:

    iPXE>

Operation

The CLI resembles a Linux shell, where the user can run commands to configure and manage one or more PXE port network interfaces. Each port is assigned a network interface called neti, where i is 0, 1, 2, ..., <# of interface>. Some commands are general and are applied to all network interfaces. Other commands are port specific, therefore the relevant network interface is specified in the command.

Command Reference

A.8.3.1 ifstat

Displays the available network interfaces (in a similar manner to Linux's ifconfig).

    iPXE> ifstat
    net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (closed)
      [Link:down, TX:8 TXE:2 RX:11 RXE:11]
      [Link status: The socket is not connected]
      [TXE: 2 x "No such file or directory"]
      [RXE: x "The socket is not connected"]
      [RXE: 8 x "Operation canceled"]
    net1: 00:02:c9:0c:78:12 on PCI02:00.0 (open)
      [Link:up, TX:12 TXE:0 RX:0 RXE:0]
    iPXE>

A.8.3.2 ifopen

Opens the network interface net<x>. The list of network interfaces is available via the ifstat command.

Example:

    iPXE> ifopen net1

A.8.3.3 ifclose

Closes the network interface net<x>. 
241. r will use probe_vf VFs and this will be applied to all ConnectX HCAs on the host.

- Its format is a string which allows the user to specify the probe_vf parameter separately per installed HCA.
  Its format is: "bb:dd.f-v,bb:dd.f-v,..."
  - bb:dd.f = bus:device.function of the PF of the HCA
  - v = number of VFs to use in the PF driver for that HCA

This parameter can be set in one of the following ways. For example:

- probe_vfs=5 - The PF driver will probe 5 VFs on the HCA and this will be applied to all ConnectX HCAs on the host.
- probe_vfs=00:04.0-5,00:07.0-8 - The PF driver will probe 5 VFs on the HCA positioned in BDF 00:04.0 and 8 for the one in 00:07.0.

Note: PFs not included in the above list will not use any of their VFs in the PF driver.

The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per a single VM. However, the number of VFs varies upon the working mode requirements.

The protocol types are:
- Port 1 = IB
- Port 2 = Ethernet
- port_type_array=2,2 (Ethernet, Ethernet)
- port_type_array=1,1 (IB, IB)
- port_type_array=1,2 (VPI: IB, Ethernet)
- NO port_type_array module parameter: ports are IB

Step 9. Reboot the server.

If SR-IOV is not supported by the server, the machine might not come out of boot/load.

Step 10. Load the driver and verify the SR-IOV is sup
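The per-HCA string format is easy to get wrong, so a quick way to see how a value breaks down can be useful. The following shell sketch splits such a string into one "BDF count" pair per line. It is a text-processing illustration only and does not touch the driver; the helper name `parse_probe_vf` and the use of "*" to mean "all HCAs" are choices made for this sketch.

```shell
#!/bin/sh
# parse_probe_vf VALUE: print one "BDF VFs" pair per line. A bare number
# (no "bb:dd.f-" prefix) applies to every HCA and is printed with "*".
parse_probe_vf() {
    old_ifs=$IFS; IFS=,
    for entry in $1; do
        case $entry in
            *-*) echo "${entry%-*} ${entry##*-}" ;;   # split at the last "-"
            *)   echo "* $entry" ;;
        esac
    done
    IFS=$old_ifs
}

parse_probe_vf 00:04.0-5,00:07.0-8
parse_probe_vf 5
```

For the two example values above, the first call lists 5 VFs for the PF at 00:04.0 and 8 for the PF at 00:07.0, and the second reports 5 VFs for all HCAs.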
242. r with the larger send_queue_size/recv_queue_size values, set the following ib_ipoib module parameters:

    send_queue_size=1024 recv_queue_size=1024

- Use Jumbo Frames (JF) up to 64K (domu-domu).
  In UD mode, the maximum MTU value is 4092 bytes.
  In CM mode, the maximum MTU value is 65520 bytes.
  Make sure that all interfaces, including the guest interface and its virtual bridge, have the same MTU value. For further information on MTU and JF settings, please refer to the Hypervisor User Manual.

- Tune the TCP/IP stack using sysctl (dom0/domu):

    /sbin/sysctl_perf_tuning

- Enable irqbalancer (dom0/domu):

    /etc/init.d/irqbalance start

- Other performance tuning for the KVM environment, such as vCPU pinning and NUMA tuning, may apply. For further information, please refer to the Hypervisor User Manual.

Contiguous Pages

Contiguous Pages improves performance by allocating user memory regions over physically contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.

Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers to the Contiguous Pages.

To activate, set the below environment variables with values of PREFER_CONTIG or CONTIG.

- For QP: MLX_QP_ALLOC_TYPE
- For CQ: MLX_CQ_ALLOC_TYPE

The following are all the possible values that can be allocated to the buffer:

Table 3 - Buffer Values
243. raffic, run:
  set_irq_affinity_bynode.sh <numa node> <interface>

For optimizing dual port traffic, run:
  set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>

To show the current IRQ affinity settings, run:
  show_irq_affinity.sh <interface>

7.2.7.2 Auto Tuning Utility

MLNX_OFED 2.0.x introduces a new affinity tool called mlnx_affinity. This tool can automatically adjust your affinity settings for each network interface according to the system architecture.

Usage:
  Start:   # mlnx_affinity start
  Stop:    # mlnx_affinity stop
  Restart: # mlnx_affinity restart

mlnx_affinity can also be started by driver load/unload.

To enable mlnx_affinity by default:
  Add the line below to the /etc/infiniband/openib.conf file:
  RUN_AFFINITY_TUNER=yes

7.2.7.3 Tuning for Multiple Adapters

When optimizing the system performance for using more than one adapter, it is recommended to separate the adapters' core utilization so there will be no interleaving between interfaces.
The following script can be used to separate each adapter's IRQs to a different set of cores:

  # set_irq_affinity_cpulist.sh <cpu list> <interface>

<cpu list> can be either a comma-separated list of single core numbers (0,1,2,3) or core groups (0-3).

Example:
If the system has 2 adapters on the same NUMA node (0-7), each with 2 interfaces, run the following:

  # /etc/init.d/irqbalancer stop
  # s
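The two <cpu list> forms mentioned in the fragment above, single core numbers such as 0,1,2,3 and core groups such as 0-3, can be expanded into explicit core numbers as sketched below. The helper name expand_cpu_list is hypothetical, and accepting a mix of both forms in one string is an assumption; the script itself may be stricter.

```python
def expand_cpu_list(cpu_list):
    """Expand '0,1,2,3' or '0-3' style strings into a list of core numbers."""
    cores = []
    for part in cpu_list.split(","):
        part = part.strip()
        if "-" in part:
            # Core group: inclusive range "first-last".
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending core group: {part!r}")
            cores.extend(range(first, last + 1))
        else:
            cores.append(int(part))
    return cores


if __name__ == "__main__":
    print(expand_cpu_list("0,1,2,3"))
    print(expand_cpu_list("0-3"))
```

Expanding the lists this way makes it easy to confirm that the core sets given to two adapters do not overlap, which is the point of the separation recommended above.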
244. rball (.tgz extension) containing the files specified in Appendix A.1.1, "Tested Platforms" (page 205).

1. A PXE ROM image file for each of the supported Mellanox network adapter devices. Specifically, the following images are included:
   ConnectX® / ConnectX®-2 / ConnectX®-3 images:
   ConnectX_FlexBoot-<PCI Device ID>_ROM-<version>.mrom
   where the number after the "ConnectX_FlexBoot" prefix indicates the corresponding PCI Device ID of the ConnectX® / ConnectX®-2 / ConnectX®-3 device.
2. Additional documents under docs/dhcp.

A.2 Burning the Expansion ROM Image

A.2.1 Burning the Image on ConnectX® / ConnectX®-2 / ConnectX®-3

This section is valid for ConnectX® / ConnectX®-2 devices with firmware versions 2.8.0600 or later and ConnectX®-3 firmware

Prerequisites

1. Expansion ROM Image
   The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file FlexBoot_release_notes.txt.

2. Firmware Burning Tools
   You need to install the Mellanox Firmware Tools (MFT) package (version 2.7.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Downloads.

Image Burning Procedure

To burn the composite image, perform the following steps:

1. Obtain the MST device name. Run:
   # mst start
   # mst status
   The device name will be of the form: mt<dev_id>_pci  cro con
245. re carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.

4.7.1.2 Masked Fetch and Add (MFetchAdd)

The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:

    bit_adder(ci, b1, b2, *co)
    {
        value = ci + b1 + b2
        *co = !!(value & 2)
        return value & 1
    }

    #define MASK_IS_SET(mask, attr) (!!((mask) & (attr)))

    bit_position = 1
    carry = 0
    atomic_response = 0
    for i = 0 to 63
    {
        if (i != 0)
            bit_position = bit_position << 1
        bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
                                MASK_IS_SET(compare_add, bit_position), &new_carry)
        if (bit_add_res)
            atomic_response |= bit_position
        carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
    }
    return atomic_response

4.8 Ethernet Tunneling Over IPoIB Driver (eIPoIB)

The eth_ipoib driver provides a standard Ethernet interface to be used as a Physical Interface (PIF) into the Hypervisor virtual network, and serves one or more Virtual Interfaces (VIF). This driver supports L2 Switching
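The MFetchAdd pseudocode earlier in this fragment (section 4.7.1.2) can be made concrete with a small software model. The sketch below is not the hardware implementation: it follows the pseudocode's rule that a set bit in the boundary mask stops the carry from leaving that bit position, and the function and parameter names are chosen here for illustration only.

```python
def masked_fetch_add(value, add, boundary_mask, width=64):
    """Software model of the masked add: bitwise ripple-carry addition in
    which a set bit in `boundary_mask` marks the top bit of a field, so the
    carry out of that bit is discarded instead of propagated."""
    result = 0
    carry = 0
    for i in range(width):
        bit = 1 << i
        total = ((value >> i) & 1) + ((add >> i) & 1) + carry
        if total & 1:
            result |= bit
        carry = total >> 1
        if boundary_mask & bit:
            carry = 0  # field boundary: do not carry into the next field
    return result


if __name__ == "__main__":
    # The low byte is its own field (boundary at bit 7), so it wraps around
    # without disturbing the byte above it.
    print(hex(masked_fetch_add(0x00FF, 0x0001, 0x80)))
    # With no boundaries, the operation is an ordinary 64-bit add.
    print(hex(masked_fetch_add(0x00FF, 0x0001, 0x00)))
```

Treating the mask bit as the most significant bit of each field is what lets one 64-bit operand carry several independent counters of different widths.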
246.
    1.2.4 Directory Structure
  1.3 Architecture
    1.3.1 mlx4 VPI Driver
    1.3.2 mlx5 Driver
    1.3.3 Mid-layer Core
    1.3.4 ULPs
    1.3.5
    1.3.6 InfiniBand Subnet Manager
    1.3.7 Diagnostic Utilities
    1.3.8 Mellanox Firmware Tools
  1.4 Quality of Service
  1.5 RDMA over Converged Ethernet (RoCE)
Chapter 2 Installation
  2.1 Hardware and Software Requirements
  2.2 Downloading Mellanox OFED
  2.3 Installing Mellanox OFED
    2.3.1 Pre-installation Notes
    2.3.2 Installation Script
    2.3.3 Installation Procedure
    2.3.4 Installation Results
247. re shared by the new MR. Once the MR is shared, it can be used even if the original MR was destroyed.
The request to share the MR can be repeated numerous times, and an arbitrary number of Memory Regions can potentially share the same physical memory locations.

Usage:
  Uses the "handle" field that was returned from ibv_reg_mr as the mr_handle.
  Supplies the desired "access mode" for that MR.
  Supplies the address field, which can be either NULL or any hint as the required output. The address and its length are returned as part of the ibv_mr struct.

To achieve high performance, it is highly recommended to supply an address that is aligned in the same way as the original memory region address. Generally, it may be an alignment to a 4M address.
For further information on how to use the ibv_reg_shared_mr verb, please refer to the ibv_reg_shared_mr man page and/or to the ibv_shared_mr sample program, which demonstrates a basic usage of this verb.
Further information on the ibv_shared_mr sample program can be found in the ibv_shared_mr man page.

4.11 XRC - eXtended Reliable Connected Transport Service for InfiniBand

XRC allows significant savings in the number of QPs and their associated memory resources required to establish all-to-all process connectivity in large clusters.

It significantly improves the scalability of the solution for large clusters of multicore end-nodes by reducing the required resources.
For further details, please refer to the "Annex A1
248. recommended to run the following command in order to generate the inventory file:
  host1# osmtest -f c
Immediately afterwards, run the following command to test opensm:
  host1# osmtest -f a
Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.

8.4 Partitions

OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the '--Pconfig' or '-P' flags.

The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed.

The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end-ports are assigned partial membership.

8.4.4 File Format

Notes:
  Line content following a '#' character is a comment and is ignored by the parser.

General File Format:
  <Partition Definition>:<PortGUIDs list> ;

Partition Definition:
  [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

where
  PartitionName   string, will be used with logging. When omitted, an empty string will be used.
  PKey            P_Key value for this partition. Only low 15 b
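The relationship between the 15-bit partition value and full or partial (limited) membership mentioned in the fragment above can be shown numerically. The sketch below relies on the InfiniBand convention that the most significant bit of the 16-bit P_Key is the membership bit (1 for full, 0 for limited); that convention comes from the InfiniBand architecture rather than from this text, and the function name is hypothetical.

```python
def pkey_with_membership(pkey, full):
    """Combine the low 15 bits of a P_Key with the membership bit (bit 15)."""
    base = pkey & 0x7FFF          # only the low 15 bits identify the partition
    return base | 0x8000 if full else base


if __name__ == "__main__":
    # Default partition 0x7fff: full member vs. limited member.
    print(hex(pkey_with_membership(0x7FFF, True)))
    print(hex(pkey_with_membership(0x7FFF, False)))
```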
249. rget IB port GUID in the PR/MPR query.

Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file.

The shortest policy file in this case would be as follows:

    qos-ulps
        default : 0 # default SL
    end-qos-ulps

It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all. Below is an example of a simple QoS policy with all the possible keywords:

    qos-ulps
        default                              : 0 # default SL
        sdp, port-num 30000                  : 0 # SL for application running on
                                                 # top of SDP when a destination
                                                 # TCP/IP port is 30000
        sdp, port-num 10000-20000            : 0
        sdp                                  : 1 # default SL for any other
                                                 # application running on top of SDP
        rds                                  : 2 # SL for RDS traffic
        ipoib, pkey 0x0001                   : 0 # SL for IPoIB on partition with
                                                 # pkey 0x0001
        ipoib                                : 4 # default IPoIB partition,
                                                 # pkey=0x7FFF
        any, service-id 0x6234               : 6 # match any PR/MPR query with a
                                                 # specific Service ID
        any, pkey 0x0ABC                     : 6 # match any PR/MPR query with a
                                                 # specific PKey
        srp, target-port-guid 0x1234         : 5 # SRP when SRP Target is located
                                                 # on a specified IB port GUID
        any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query
                                                 # with a specific target port GUID
    end-qos-ulps

S
250. ring) to map device function numbers to their probe_vf values (e.g. '0000:04:00.0-3,002b:1c:0b.a-13').
Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for probe_vf value (e.g. 13). (string)

log_num_mgm_entry_size:  log mgm size, that defines the num of qp per mcg, for example: 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12. To activate device managed flow steering when available, set to -1. (int)
high_rate_steer:         Enable steering mode for higher packet rate (default off). (int)
fast_drop:               Enable fast packet drop when no receive WQEs are posted. (int)
enable_64b_cqe_eqe:      Enable 64 byte CQEs/EQEs when the FW supports this, if non-zero (default: 1). (int)
log_num_mac:             Log2 max number of MACs per ETH port (1-7). (int)
log_num_vlan:            (Obsolete) Log2 max number of VLANs per ETH port (0-7). (int)
log_mtts_per_seg:        Log2 number of MTT entries per segment (0-7) (default: 0). (int)
port_type_array:         Either pair of values (e.g. '1,2') to define uniform port1/port2 types configuration for all devices functions or a string to map device function numbers to their pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1'

Further parameters listed on this page: log_num_qp, log_num_srq, log_rdmarc_per_qp, log_num_cq, log_num_mcg, log_num_mpt, log_num_mtt, enable_qos, internal_err_reset.

mlx4_en Parameters

inline_thold, udp_rss, pfctx, pfcrx
251. rmally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.

-s, --stress
    This option runs the specified stress test instead of the normal test suite. Stress test options are as follows:
      OPT   Description
      -s1   Single-MAD response SA queries
      -s2   Multi-MAD (RMPP) response SA queries
      -s3   Multi-MAD (RMPP) Path Record SA queries
    Without -s, stress testing is not performed.

-M, --Multicast_Mode
    This option specifies the length of the Multicast test:
      OPT   Description
      -M1   Short Multicast Flow (default) - single mode
      -M2   Short Multicast Flow - multiple mode
      -M3   Long Multicast Flow - single mode
      -M4   Long Multicast Flow - multiple mode
    Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC.
    Multiple mode: could be run with other apps using MC with OpenSM.
    Without -M, default flow testing is performed.

-t, --timeout
    This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.

-l, --log_file
    This option defines the log to be the given file. By default the log goes to /var/log/osm.log. For the log to go to standard output use -f stdout.

-v, --verbose
    This option in
252. rnet ports. By default both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded.

Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices.

Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".

Possible port types are:
  "eth":  Ethernet
  "ib":   InfiniBand
  "auto": Link sensing mode. Detect port type based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.

The port link type can be configured for each device in the system at run time using the "/sbin/connectx_port_config" script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically).

In the next stage the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.

This utility also has a non-interactive mode:

  /sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]

6.2 Auto Sensing

Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load th
253. roc/scsi_tgt/vdisk/vdisk
echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
echo "add vdisk1 1" >/proc/scsi_tgt/groups/Default/devices
echo "add vdisk2 2" >/proc/scsi_tgt/groups/Default/devices
echo "add vdisk3 3" >/proc/scsi_tgt/groups/Default/devices

modprobe ib_srpt

echo "add "mgmt"" >/proc/scsi_tgt/trace_level
echo "add "mgmt_dbg"" >/proc/scsi_tgt/trace_level
echo "add "out_of_mem"" >/proc/scsi_tgt/trace_level

*********************** End srpt.sh ****************************

B.3 How to Unload/Shutdown

1. Unload ib_srpt
   $ modprobe -r ib_srpt
2. Unload scst and its dev_handlers first
   $ modprobe -r scst_vdisk scst
3. Unload ofed
   $ /etc/rc.d/openibd stop

Appendix C: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:

  options mlx4_core parameter=<value>
and/or
  options mlx4_ib parameter=<value>
and/or
  options mlx4_en parameter=<value>

The following sections list the available mlx4 parameters.

C.1 mlx4_ib Parameters

sm_guid_assign:  Enable SM alias_GUID assignment if sm_guid_assign > 0 (Default: 1). (int)
dev_assign_str:  Map device function numbers to IB device numbers (e.g. '0000:04:00.0-0,002b:1c:0b.a-1'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for IB
254. ror_window = 0: mechanism disabled (no error checking). Default: 5

cc_statistics_cycle
    Enables CC MGR to collect statistics from all nodes every cc_statistics_cycle [seconds].
    Default: 0. When the value is set to 0, no statistics are collected.

9 InfiniBand Fabric Diagnostic Utilities

9.1 Overview

The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric.

9.2 Utilities Usage

This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.

9.2.1 Common Configuration, Interface and Addressing

Topology File (Optional)

An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.

For diagnostic tools to fully support the topology file, the user may need to p
255. rovide the local system name (if the local hostname is not used in the topology file).

To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option "-t <topology file name>".
2. Define the environment variable IBDIAG_TOPO_FILE.

To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option "-s <local system name>".
2. Define the environment variable IBDIAG_SYS_NAME.

9.2.2 InfiniBand Interface Definition

The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option "-p <local port number>" (see below).
2. Define the environment variable IBDIAG_PORT_NUM.

In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: "-i <index of local device>".
2. Define the environment variable IBDIAG_DEV_IDX.

9.2.3 Addressing

This 
256. rrently available only on upstream kernels newer than 3.1.

  ip link set dev <PF device> vf <NUM> spoofchk [on | off]

4.13.7.3.3 RoCE Support

RoCE is supported on VFs and VLANs may be used. For RoCE, the hypervisor can support RoCE over up to 15 VLANs. There are 127 VLANs available per port (for the Hypervisor and all guests together). The Hypervisor is allocated 16 GIDs, which can support 15 VLANs. The remaining VLANs are allocated equally among the number of VFs requested in the "num_vfs" mlx4_core module parameter.

VLANs will not work in VST mode (packets will simply not be sent, nor will they arrive).

4.14 CORE-Direct

4.14.1 CORE-Direct Overview

CORE-Direct provides a solution for off-loading the MPI collectives operations from the software library to the network. CORE-Direct accelerates MPI applications and solves the scalability issues in large scale systems by eliminating the issues of operating system noise and jitter.

It addresses the collectives communication scalability problem by off-loading a sequence of data-dependent communications to the Host Channel Adapter (HCA). This solution provides the hooks needed to support computation and communication overlap. Additionally, it provides a means to reduce the effects of system noise and application skew on application scalability.

The relevant verbs to be used for CORE-Direct:
  ibv_create_qp_ex
  ibv_modify_cq
  ibv_query_device_ex
  ibv_post_task 
257. rs Flags and Options

Flag                 Optional / Mandatory          Default (If Not Specified)   Description
-h(help)             Optional                                                   Print the help menu.
-b                   Optional                                                   Print in brief mode. Reduce the output to show only if errors are present, not what they are.
-v(erbose)           Optional                                                   Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-G(uid)              Optional                                                   Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
-T <threshold_file>  Optional                                                   Use specified threshold file.
-s                   Optional                                                   Show the predefined thresholds.
-N | -nocolor        Optional                      color mode                   Use mono mode rather than color mode.

Table 28 - ibcheckerrs Flags and Options

Flag                 Optional / Mandatory          Default (If Not Specified)   Description
-C <ca_name>         Optional                                                   Use the specified channel adapter or router.
-P <ca_port>         Optional                                                   Use the specified port.
-t <timeout_ms>      Optional                                                   Override the default timeout for the solicited MADs [msec].
<lid | guid>         Mandatory (with -G flag)                                   Use the specified port's or node's LID/GUID (with -G option).
<port>               Mandatory (without -G flag)                                Use the specified port.

Examples

1. Check aggregated node counter for LID 0x2:

   > ibcheckerrs 2
   #warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
   255 (threshold 10) lid 2 port 255 
258. rt2. Note: This switch is applicable only for Mellanox Technologies Ethernet products.

Switch             Affected/Relevant Commands   Description
-macs <MACs>       burn, sg                     Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-blank_guids       burn                         Burn the image with blank GUIDs and MACs (where applicable). These values can be set later using the sg command (see Table 30 below).
-clear_semaphore   No commands allowed          Force clear the Flash semaphore on the device. No command is allowed when this switch is used. Warning: May result in system instability or Flash corruption if the device or another application is currently using the Flash.
-i(mage) <image>   burn, verify                 Binary image file.
-qq                burn, query                  Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
-nofs              burn                         Burn image in a non-failsafe manner.
-skip_is           burn                         Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant sector difference is detected.
-byte_mode         burn, write                  Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
-s(ilent)          burn                         Do not print burn progress messages.
-y(es)             All                          Non interactive m
259. ry Access

The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.

The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabilities provide an excellent low-level interface for PGAS applications.

A SHMEM program follows the single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem, and periodically communicate with other PEs to exchange information on which the next communication phase depends.

The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).

SHMEM routines support remote data transfer through:
  "put" operations: data transfer to a different PE
  "get" operations: data
260. s:
           /sbin/insmod /lib/modules/ib/ipoib_helper.ko
           /sbin/insmod /lib/modules/ib/ib_ipoib.ko
Step 11.  In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
           /sbin/insmod /lib/modules/ib/ib_ipoib.ko lro=0
Step 12.  Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example:
           /sbin/dhclient -cf /sbin/dhclient.conf ib1
Step 13.  Save the init file.
Step 14.  Close initrd.
           host1$ cd /tmp/initrd_ib
           host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
           host1$ gzip /tmp/new_init_ib.img
Step 15.  At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_init_ib.img.gz. Copy it to the original initrd location and rename it properly.

A.9.2 Case II: Ethernet Ports

The Ethernet driver requires loading the following modules in the specified order (see the example below):
  mlx4_core.ko
  mlx4_en.ko

A.9.2.1 Example: Adding an Ethernet Driver to initrd (Linux)

Prerequisites

1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1 on page 50, and connect
261. s, then do it in software.
SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp.
SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base.
SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in software.
SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
SOF_TIMESTAMPING_RAW/SYS determine how they are reported.

To enable time stamping for a net device:

An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:

Send side time sampling:
  Enabled by ifreq.hwtstamp_config.tx_type when

Receive side time sampling:
  Enabled by ifreq.hwtstamp_config.rx_filter when

    /* possible values for hwtstamp_config->rx_filter */
    enum hwtstamp_rx_filters {
        /* time stamp no incoming packet at all */
        HWTSTAMP_FILTER_NONE,
        /* time stamp any incoming packet */
        HWTSTAMP_FILTER_ALL,
        /* return value: time stamp all packets requested plus some others */
        HWTSTAMP_FILTER_SOME,
        /* PTP v1, UDP, any kind of event packet */
        HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
        /* PTP v1, UDP, Sync packet */
        HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
        /* PTP v1, UDP, Delay_req packet */
        HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
        /* PTP v2, UDP, any kind of event packet */
        HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
262. s across such links in a round-robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.

Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it. In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.

Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch. For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadl
263. sce parameters for eth1:
Adaptive RX: on  TX: off

pkt-rate-low: 400000
pkt-rate-high: 450000

rx-usecs: 16
rx-frames: 88
rx-usecs-irq: 0
rx-frames-irq: 0

> ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0

> ethtool -c eth1
Coalesce parameters for eth1:
Adaptive RX: off  TX: off

pkt-rate-low: 400000
pkt-rate-high: 450000

rx-usecs: 0
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

Tuning for NUMA Architecture

Tuning for Intel® Sandy Bridge Platform

The Intel Sandy Bridge processor has an integrated PCI Express controller. Thus every PCIe adapter is connected directly to a NUMA node.

On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected.

In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.

To see if your system supports PCIe adapter's NUMA node detection:

  # cat /sys/class/net/[interface]/device/numa_node
  # cat /sys/devices/[PCI root]/[PCIe function]/numa_node

Example for supported system:

  # cat /sys/class/net/eth3/device/numa_node
  0

Example for unsupported system:

  # cat /sys/class/net/ib0/device/numa_node
  -1

7.2.6.1.1 Improving Application Performance on Remote NUMA Node

Verbs API applications that mostly use polling will be affected when using the remote NUMA node.

libmlx4 has a built-in enhancement th
264. section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies.

The following addressing modes can be used to define the IB ports:

  Using a Directed Route to the destination (Tool option "-d"):
    This option defines a directed route of output port numbers from the local port to the destination.

  Using port LIDs (Tool option "-l"):
    In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.

  Using port names defined in the topology file (Tool option "-n"):
    This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology; then the tool operates as in the "-l" option.

9.3 ibdiagnet (of ibutils2) - IB Net Diagnostic

This version of ibdiagnet is included in the ibutils2 package, and it is run by default after installing Mellanox OFED. To use this ibdiagnet version, run: ibdiagnet

Please see ibutils2_release_notes.txt for additional information and known issues.

ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then pr
265. sion and exits

--config, -F <file-name>
  The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists).

--create-config, -c <file-name>
  OpenSM will dump its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template.

--guid, -g <GUID in hex>
  This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.

--lmc, -l <LMC>
  This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e. multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.

--priority, -p <PRIORITY>
  This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. Range goes from 0 (lowest priority) to 15 (highest).

--smkey, -k <SM_Key>
  This option specifies the SM's SM_Key (64 bits). This wi
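The relationship between the LMC value and the number of LIDs per port described for the LMC option above is simple arithmetic. A minimal sketch follows; the helper name lids_per_port is hypothetical and is not an OpenSM command.

```shell
#!/bin/sh
# lids_per_port LMC: print how many LIDs each port is assigned for a given LMC.
# Hypothetical helper for illustration; it is not part of OpenSM.
lids_per_port() {
    lmc=$1
    case "$lmc" in
        [0-7]) ;;   # valid range stated in the option description
        *) echo "LMC must be in the range 0-7" >&2; return 1 ;;
    esac
    # Number of LIDs per port is 2^LMC
    echo $((1 << lmc))
}

lids_per_port 0   # default LMC: one LID, hence one path between two ports
lids_per_port 3   # eight LIDs per port
```

With the default LMC of 0 each port has a single LID, which is why only one path exists between any two ports.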
266. start instructions
[Installer progress output: a "Preparing..." progress bar followed by a per-package progress bar, for these packages in order: mxm, bupc, infinipath-psm, infinipath-psm-devel, mvapich2, openmpi, openshmem, mpitests_mvapich2, mpitests_openmpi, mlnxof
267. stic Utilities

Mellanox OFED includes the following two diagnostic packages for use by network and data center managers:

• ibutils - Mellanox Technologies diagnostic utilities
• infiniband-diags - OpenFabrics Alliance InfiniBand diagnostic tools

1.3.8 Mellanox Firmware Tools

The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for:

• Generating a standard or customized Mellanox firmware image
• Querying for firmware information
• Burning a firmware image to a single InfiniBand node

MFT includes the following tools:

• mlxburn - provides the following functions:
  • Generation of a standard or customized Mellanox firmware image for burning (in .bin (binary) or .img format)
  • Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device
  • Querying the firmware version loaded on an HCA board
  • Displaying the VPD (Vital Product Data) of an HCA board
• flint
  This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file.

Footnote 1. OpenSM is disabled by default. See Chapter 8, "OpenSM - Subnet Manager" for details on enabling it.

• spark
  This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII switc
268. stre Compilation over MLNX OFED" page 226

2.0-3.0.0  August 2013
Updated the following sections:
• Section 1.3.4, "ULPs" on page 21
• Section 4.12, "Flow Steering" on page 77 and its subsections
• Section 1.3.3, "Mid-layer Core" on page 21
• Section 4.8, "Ethernet Tunneling Over IPoIB Driver (eIPoIB)" on page 70
• Section 8.2.1, "opensm Syntax" on page 121
• Appendix C, "mlx4 Module Parameters" page 223
Added the following sections:
• Section 1.5, "RDMA over Converged Ethernet (RoCE)" on page 23
• Section 4.5, "Quality of Service Ethernet" on page 59 and its subsections
• Section 4.11, "XRC - eXtended Reliable Connected Transport Service for InfiniBand" on page 76
• Section 4.13.7, "Configuring Pkeys and GUIDs under SR-IOV" on page 87 and its subsections
• Section 4.15, "Ethtool" on page 94
• Appendix E, "Lustre Compilation over MLNX OFED" page 226

2.0-2.0.5  April 2013
Initial release

About this Manual

This Preface provides general information concerning the scope and organization of this User's Manual.

Intended Audience

This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.

Common Abbrevi
269. t

ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).

Synopsis

ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt] [-pm] [-pc] [-P <<PM>=<Value>>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-skip <ibdiag_check/s>] [-load_db <db_file>]

Options:

-c <count>       Min number of packets to be sent across each link (default = 10)
-v               Enable verbose mode
-r               Provides a report of the fabric qualities
-t <topo-file>   Specifies the topology file name
-s <sys-name>    Specifies the local system name. Meaningful only if a topology file is specified
-i <dev-index>   Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
-p <port-num>    Specifies the local device's port num used to connect to the IB fabric
-o <out-dir>     Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>  Specifies the expected link width
-ls <2.5|5|10>   Specifies the expected link speed
-pm              Dump all the fabric links, pm Counters into ibdiagnet.pm
-pc              Reset all the fa
270. t (FlexBoot, Download Tab).

A.9.1 Case 1: InfiniBand Ports

The IB driver requires loading the following modules in the specified order (see Section A.9.1.1 for an example):

• ib_addr.ko
• ib_core.ko
• ib_mad.ko
• ib_sa.ko
• ib_cm.ko
• ib_uverbs.ko
• ib_ucm.ko
• ib_umad.ko
• iw_cm.ko
• rdma_cm.ko
• rdma_ucm.ko
• mlx4_core.ko
• mlx4_ib.ko
• ib_mthca.ko
• ipoib_helper.ko - this module is not required for all OS kernels. Please check the release notes.
• ib_ipoib.ko

A.9.1.1 Example: Adding an IB Driver to initrd (Linux)

Prerequisites

1. The FlexBoot image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine.
3. An initrd file.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.

Adding the IB Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it.

host1$ mkdir /tmp/in
271. t plugins' help if exists)
-H|--deep_help   Prints deep help information (including plugins' help)

Output Files

Table 19 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.

Table 19 - ibdiagnet (of ibutils2) Output Files

Output File              Description
ibdiagnet2.lst           Fabric links in LST format
ibdiagnet2.sm            Subnet Manager
ibdiagnet2.pm            Ports Counters
ibdiagnet2.fdbs          Unicast FDBs
ibdiagnet2.mcfdbs        Multicast FDBs
ibdiagnet2.nodes_info    Information on nodes
ibdiagnet2.db_csv        ibdiagnet internal database

An ibdiagnet run performs the following stages:

• Fabric discovery
• Duplicated GUIDs detection
• Links in INIT state and unresponsive links detection
• Counters fetch
• Error counters check
• Routing checks
• Link width and speed checks
• Alias GUIDs check
• Subnet Manager check
• Partition keys check
• Nodes information

Return Codes

0 - Success
1 - Failure (with description)

9.4 ibdiagnet (of ibutils) - IB Net Diagnostic

Note: This version of ibdiagnet is included in the ibutils package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils2 package, you need to specify the full path: /opt/bin/ibdiagnet

a
272. t when it has insufficient configuration for a torus with radix 4 dimensions.

In the event the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.

8.5.7.4 Quality Of Service Configuration

OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines. Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations. For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used w
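The statement that torus-2QoS honors only the high-order bit of a unicast SL can be illustrated with a short calculation. This is a sketch under the assumption that the SL is the usual 4-bit InfiniBand field (values 0-15), so the high-order bit is bit 3. The function name torus2qos_level is hypothetical.

```shell
#!/bin/sh
# torus2qos_level SL: print the QoS level (0 or 1) that the high-order bit
# of a 4-bit SL value selects. Hypothetical helper, for illustration only.
torus2qos_level() {
    sl=$1
    case "$sl" in
        ''|*[!0-9]*) echo "SL must be an integer in the range 0-15" >&2; return 1 ;;
    esac
    if [ "$sl" -gt 15 ]; then
        echo "SL must be an integer in the range 0-15" >&2
        return 1
    fi
    # Keep only bit 3, the high-order bit of the 4-bit SL
    echo $(( (sl >> 3) & 1 ))
}

torus2qos_level 0    # SLs 0-7 collapse to level 0
torus2qos_level 8    # SLs 8-15 collapse to level 1
```

Under this reading, SLs 0 and 8 are the lowest values that select each of the two levels, which is consistent with the multicast guidance above.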
273. t1$ ssh host2 uname
Linux

5.2.3 MPI Selector - Which MPI Runs

Mellanox OFED contains a simple mechanism for system administrators and end users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details.

Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new shell (e.g., logout and login again). Other packages (such as environment modules) provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality.

The MPI selector functionality can be invoked in one of two ways:

1. The mpi-selector-menu command.
   This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended 
274. table (IBA 7.6.9)
• VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template
• VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template
• SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15 (Note that VL15 used here means drop this SL).

There are separate QoS configuration parameters sets for various target types: CAs, routers, switch external ports, and switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:

• qos_ca_ - QoS configuration parameters set for CAs
• qos_rtr_ - parameters set for routers
• qos_sw0_ - parameters set for switches' port 0
• qos_swe_ - parameters set for switches' external ports

Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):

qos_ca_max_vls 15
qos_ca_high_limit 0
qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
qos_swe_max_vls 15
qos_swe_high_limit 0
qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
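To make the VL arbitration table format concrete, the sketch below parses a table written as comma-separated VL:weight pairs (the separators are assumed from the usual OpenSM option file notation) and sums the weights. The function name vlarb_total_weight is hypothetical; OpenSM provides no such command.

```shell
#!/bin/sh
# vlarb_total_weight TABLE: sum the weights in a VL arbitration table given
# as comma-separated VL:weight pairs. Hypothetical helper, illustration only.
vlarb_total_weight() {
    total=0
    old_ifs=$IFS
    IFS=,
    for entry in $1; do          # split the table on commas
        weight=${entry#*:}       # keep the part after the colon
        total=$((total + weight))
    done
    IFS=$old_ifs
    echo "$total"
}

# Low-priority table from the CA defaults: VL0 has weight 0, VL1-VL14 weight 4
vlarb_total_weight "0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4"
```

For the low-priority default this prints 56 (fourteen VLs with weight 4), which shows that VL0 receives no low-priority share while VL1 through VL14 are weighted equally.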
275. the status of specific ports of specific devices:

> ibstatus mthca0:1 mlx4_0:2
Infiniband device 'mthca0' port 1 status:
    default gid:  fe80:0000:0000:0000:0002:c900:0101:d151
    base lid:     0x0
    sm lid:       0x0
    state:
    phys state:   5: LinkUp
    rate:         10 Gb/sec (4X)

Infiniband device 'mlx4_0' port 2 status:
    default gid:  fe80:0000:0000:0000:0000:0000:0007:3897
    base lid:     0x1
    sm lid:       0x1
    state:        4: ACTIVE
    phys state:   5: LinkUp
    rate:         20 Gb/sec (4X DDR)

9.10 ibportstate

Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port.

If the queried port is a switch port, then ibportstate can be used to:
• disable, enable or reset the port
• validate the port's link width and speed against the peer port

Synopsis

ibportstate [-d] [-e] [-v] [-V] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid>] <portnum> [<op> [<value>]]

Output Files

Table 24 lists the various flags of the command.

Table 24 - ibportstate Flags and Options

Flag           Optional/Mandatory   Default (If Not Specified)   Description
-h(help)       Optional                                          Print the help menu
-d(ebug)       Optional                                          Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-e(rr_show)    Optional                                          Show send and receive errors (timeouts and others)
276. tion default kernels you can run scst_vdisk blockio mode to obtain good performance.

2. Download and install the SCST driver. The supported version is 1.0.1.1.
   a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloads.html
   b. Untar scst-1.0.1.1:
      $ tar zxvf scst-1.0.1.1.tar.gz
      $ cd scst-1.0.1.1
   c. Install scst-1.0.1.1 as follows:
      $ make && make install

B.2 How to Run

A. On an SRP Target machine:

1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).

Note: Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0 (it is not required to have the lun numbers in ascending order except that the first lun must always be 0).

Note: Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the ib_srpt module but does not load scst nor its dev_handlers.

Note: The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.

Example 1: Working with VDISK BLOCKIO mode
(Using the md0 device, sda, and cciss/c1d0)

a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_t
277. tion specifies the prefix routes file. Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. Default file is /etc/opensm/prefix-routes.conf

--consolidate_ipv6_snm_req
  Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.

--consolidate_ipv4_mask
  Use mask for IPv4 multicast groups multiplexing per MGID scope and P_Key.

--pid_file <path to file>
  Specifies the file that contains the process ID of the opensm daemon. The default is /var/run/opensm.pid

--max_seq_redisc
  Specifies the maximum number of failed discovery loops done by the SM before completing the whole heavy sweep cycle.

--mc_secondary_root_guid <GUID in hex>
  This option defines the GUID of the multicast secondary root switch.

--mc_primary_root_guid <GUID in hex>
  This option defines the GUID of the multicast primary root switch.

--guid_routing_order_no_scatter
  Do not use scatter for ports defined in the guid_routing_order file.

--pr_full_world_queries_allowed
  This option allows OpenSM to respond to full World Path Record queries (path record for each pair of ports in a fabric).

--enable_crashd
  This option causes OpenSM to run a Crash Daemon child process that allows a backtrace dump in case of fatal terminating signals.

--log_prefix <prefix text>
  Prefix to syslog messages from OpenSM.

--verbose 
278. to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart.

4.4 Quality of Service InfiniBand

4.4.1 Quality of Service Overview

Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.

Figure 2: I/O Consolidation Over InfiniBand
[Diagram labels: Servers, IB-Ethernet Gateway, IB-Fibre Channel Gateway, Block Storage]

QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager".

The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources.

The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:

• Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
• Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served
• Packets carry class of service marking in the ra
279. tributions
• /lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat like Distributions
• /lib/modules/`uname -r`/updates/dkms/ on Ubuntu

• Firmware:
  • The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
    a. You run the installation script in default mode; that is, without the option '--without-fw-update'.
    b. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.

    Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.

  • In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.

    Error message:
    -I- Querying device ...
    -E- Can't auto detect fw configuration file: ...

2.3.5 Post-installation Notes

• Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.
• The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.

2.4 Updating Firmware After Installation

In case you ran the mlnxofedinst
280. troducing new set of PR/MPR attributes.

4.4.3 Supported Policy

The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:

I. Port Group

A set of CAs, Routers or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.

II. Fabric Setup

Defines how the SL2VL and VLArb tables should be set up.

Note: In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).

III. QoS Levels Definition

This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and, optionally, Max MTU, Max Rate, Packet Lifetime and Path Bits.

Note: Path Bits are not implemented in OFED.

IV. Matching Rules

A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply. The matching expressions are defined for the following fields:

• SRC and DST to lists of port groups
• Service-ID to a list of Service-ID values or ranges
• QoS Class to a list of QoS Class values or ranges

4.4.4 CMA Features

The CMA interface supports Servic
281. ts credit loop deadlocks.

If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:

• Tree rank should be between two and eight (inclusively)
• Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all
• Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches
• Switches of the same rank should have the same number of ports in each UP-going port group
• Switches of the same rank should have the same number of ports in each DOWN-going port group
• All the CAs have to be at the same tree level (rank)

If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules:

• Tree rank should be between two and eight (inclusively)
• All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks.

Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree.

Note that although the fat-tree algorithm supports trees with non-integer CBB ratio, the routing will not be as balanced as in the case of integer CBB ratio. In addition t
282. uage designed for high performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared, partitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor.

In order to express parallelism, UPC extends ISO C 99 with the following constructs:

• An explicitly parallel execution model
• A shared address space
• Synchronization primitives and a memory consistency model
• Memory management primitives

The UPC language evolved from experiences with three other earlier languages that proposed parallel extensions to ISO C 99: AC, Split-C, and Parallel C Preprocessor (PCP). UPC is not a superset of these three languages, but rather an attempt to distill the best characteristics of each. UPC combines the programmability advantages of the shared memory programming paradigm and the control over data layout and performance of the message passing programming paradigm.

Mellanox ScalableUPC is based on the Berkeley UPC package (see http://upc.lbl.gov/) and contains the following enhancements:

• GasNet library used within UPC integrated with
283. und

ports - The actual (physical) port resource tables

Port GID tables:
• ports/<n>/gids/<n> where 0 <= n <= 127 (the physical port gids)
• ports/<n>/admin_guids/<n> where 0 <= n <= 127 (allows examining or changing the administrative state of a given GUID)
• ports/<n>/pkeys/<n> where 0 <= n <= 126 (displays the contents of the physical pkey table)

<pci_id> directories - one for Dom0 and one per guest. Here, you may see the mapping between virtual and physical pkey indices, and the virtual to physical gid 0.

Currently, the GID mapping cannot be modified, but the pkey virtual-to-physical mapping can.

These directories have the structure:
• <pci_id>/port/<m>/gid_idx/0 where m = 1..2 (this is read-only)
and
• <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126

For instructions on configuring pkey_idx, please see below.

4.13.7.2.2 Configuring an Alias GUID (under ports/<n>/admin_guids)

Step 1. Determine the GUID index of the PCI Virtual Function that you want to pass through to a guest.

For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially need to see which GUID index is used for this function.

To do so:

cat /sys/class/infiniband/iov/0000:02:00.3/port/<port_num>/gid_idx/0

The value returned will present which guid index to modify.

Step 2. Modify the physical GUID table via the ad
284. uration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of the Mellanox OFED for Linux installation.

The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:

dhcpd <IB network interface name> -d

Example:

host1# dhcpd ib0 -d

4.3.3.1.2 DHCP Client (Optional)

Note: A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux)".

In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:

dhclient -cf <client conf file> <IB network interface name>

Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:

# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}

Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf:

# The
285. urce Network Boot Firmware ...

net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  [Link status: The socket is not connected]
Waiting for link-up on net0... ok

Placing Client Identifiers in /etc/dhcpd.conf

The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server:

host host1 {
next-server 11.4.3.7;
filename "pxelinux.0";
fixed-address 11.4.3.130;
option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}

A.4 Subnet Manager - OpenSM

Note: This section applies to ports configured as InfiniBand only.

FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM - Subnet Manager" on page 121.

Note: To use OpenSM caching on large InfiniBand clusters (> 100 nodes), it is recommended to use the OpenSM options described in Section 8.2.1, "opensm Syntax" on page 121.

A.5 TFTP Server

When you set the 'filename' parameter in your DHCP configuration file to a non-empty filename, the client will ask for this file to be passed through TFTP. For this reason you need to install a TFTP server.

A.6 BIOS Configuration
286. wing:

1. View the available package groups by invoking:

# yum grouplist | grep MLNX_OFED
MLNX_OFED ALL
MLNX_OFED BASIC
MLNX_OFED GUEST
MLNX_OFED HPC
MLNX_OFED HYPERVISOR
MLNX_OFED VMA
MLNX_OFED VMA-ETH
MLNX_OFED VMA-VPI

2. Install the desired group:

# yum groupinstall 'MLNX_OFED ALL'
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Setting up Group Process
Resolving Dependencies
--> Running transaction check
---> Package ar_mgr.x86_64 0:1.0-0.11.g22fff4a will be installed
...
rds-devel.x86_64 0:2.0.6mlnx-1
rds-tools.x86_64 0:2.0.6mlnx-1
srptools.x86_64 0:0.0.4mlnx3-OFED.2.0.2.6.7.11.ge863cb7
Complete!

2.5.3 Updating Firmware After Installation

Installing MLNX_OFED using the YUM tool does not automatically update the firmware. To update the firmware to the version included in the MLNX_OFED package, you can either:

• Run the mlnxofedinstall script with the '--fw-update-only' flag, or
• Update the firmware to the latest version available on the Mellanox Technologies Web site, as described in Section 2.4, "Updating Firmware After Installation" on page 36.

2.6 Uninstalling Mellanox OFED

Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.

2.7 Uninstalling Mellanox OFED using the YUM Tool

If MLNX OF
287. wn to compute nodes.

To solve this problem, a list of non-CN nodes can be specified by the "-G" or "--io_guid_file" option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by "-H" or "--max_reverse_hops"). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree. In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.

Note: Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.

8.5.4.2 Activation through OpenSM

Use the "-R ftree" option to activate the fat-tree algorithm.

Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.

8.5.5 LASH Routing Algorithm

LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest-path routing algorithm that enables topology-agnostic, deadlock-free routing within communication networks.

When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all p
288. Table 24 - ibportstate Flags and Options (Continued)

Flag                          Optional /   Default (If      Description
                              Mandatory    Not Specified)
-v(erbose)                    Optional                      Increase verbosity level. May be used several times
                                                            for additional verbosity (-vvv or -v -v -v).
-V(ersion)                    Optional                      Show version info.
-D(irect)                     Optional                      Use directed path address arguments. The path is a
                                                            comma-separated list of out ports. Examples:
                                                              "0"          # self port
                                                              "0,1,2,1,4"  # out via port 1, then 2, ...
-G(uid)                       Optional                      Use GUID address argument. In most cases, it is the
                                                            Port GUID. Example: "0x08f1040023"
-s <smlid>                    Optional                      Use <smlid> as the target lid for SM/SA queries.
-C <ca_name>                  Optional                      Use the specified channel adapter or router.
-P <ca_port>                  Optional                      Use the specified port.
-t <timeout_ms>               Optional                      Override the default timeout for the solicited MADs
                                                            [msec].
<dest dr_path | lid | guid>   Optional                      Destination's directed path, LID, or GUID.
<portnum>                     Optional                      Destination's port number.
<op> [<value>]                Optional     query            Define the allowed port operations: enable, disable,
                                                            reset, speed, and query.

In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:

1. The first ACTIVE port that is found.

2. If not found, the first port th
289. xc024 0xc040 0xc041 0xc042
    12 valid mlids dumped

9.12 smpquery

Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.

Synopsis

    smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>]
             [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]

Output Files

Table 26 lists the various flags of the command.

Table 26 - smpquery Flags and Options

Flag            Optional /   Default (If      Description
                Mandatory    Not Specified)
-h(help)        Optional                      Print the help menu.
-d(ebug)        Optional                      Raise the IB debug level. May be used several times for
                                              higher debug levels (-ddd or -d -d -d).
-e(rr_show)     Optional                      Show send and receive errors (timeouts and others).
-v(erbose)      Optional                      Increase verbosity level. May be used several times for
                                              additional verbosity (-vvv or -v -v -v).
-D(irect)       Optional                      Use directed path address arguments. The path is a
                                              comma-separated list of out ports. Examples:
                                                "0"          # self port
                                                "0,1,2,1,4"  # out via port 1, then 2, ...
-G(uid)         Optional                      Use GUID address argument. In most cases, it is the
                                              Port GUID. Example: "0x08f1040023"
-s <smlid>
290. you will need to actively turn it off. Running the SM without the CC Manager is not sufficient, as the hardware still continues to function in accordance with the previous CC configuration.

For further information on how to turn OFF CC, please refer to Section 8.9.3, "Configuring Congestion Control Manager", on page 167.

8.9.3 Configuring Congestion Control Manager

Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and CC Manager behavior by modifying some of the options. To do so, perform the following:

1. Find the "event_plugin_options" option in the SM options file, and add the following:

    conf_file <cc-mgr-options-file-name>

    # Options string that would be passed to the plugin(s)
    event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>

2. Run the SM with the new options file:

    opensm -F <options-file-name>

Note: To turn CC OFF, set "enable" to "FALSE" in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.

For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" on page 167.

For further details on the list of CC Manager options, please refer to the IB spec.

8.9.4 Configuring Congestion Control Manager Main Settings

To fine tune CC mechanism and CC Manager behavior
    