MLNX_EN for Linux User Manual

Contents

1. 6.3.6 Tuning for NUMA Architecture
   6.3.7 IRQ Affinity
   6.3.8 Tuning Multi Threaded IP

   List of Tables
   Table 1: Package Content
   Table 2: Flow Specific Parameters
   Table 3: Recommended PCIe Configuration
   Table 4: Recommended BIOS Settings for Intel Sandy Bridge Processors
   Table 5: Recommended BIOS Settings for Intel Nehalem/Westmere Processors
   Table 6: Recommended BIOS Settings for AMD Processors

   1 Overview
   This document provides information on the MLNX_EN Linux driver and instructions for installing the driver on Mellanox ConnectX adapter cards supporting 10Gb/s and 40Gb/s Ethernet.
   The MLNX_EN driver release exposes the following capabilities:
   - Single/Dual port
   - Up to 16 Rx queues per port
   - 16 Tx queues per port
   - Rx steering mode: Receive Core Affinity (RCA)
   - MSI-X or INTx
   - Adaptive interrupt moderation
   - HW Tx/Rx checksum calculation
   - Large Send Offload (i.e., TCP Segmentation Offload)
   - Large Receive Offload
   - Multi-core NAPI support
   - VLAN Tx/Rx acceleration (HW
2. Table 6: Recommended BIOS Settings for AMD Processors

   BIOS Option                       Values
   Memory   Memory speed             Max performance
            Memory channel mode      Independent
            Node Interleaving        Disabled / NUMA
            Channel Interleaving     Enabled
            Thermal Mode             Performance

   6.3 Performance Tuning for Linux
   You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results are significantly dependent on the CPU and chipset efficiency.

   6.3.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
   The following changes are recommended for improving IPv4 traffic performance:
   - Disable the TCP timestamps option for better CPU utilization:
       sysctl -w net.ipv4.tcp_timestamps=0
   - Enable the TCP selective acks option for better throughput:
       sysctl -w net.ipv4.tcp_sack=1
   - Increase the maximum length of processor input queues:
       sysctl -w net.core.netdev_max_backlog=250000
   - Increase the TCP maximum and default buffer sizes using setsockopt():
       sysctl -w net.core.rmem_max=4194304
       sysctl -w net.core.wmem_max=4194304
       sysctl -w net.core.rmem_default=4194304
       sysctl -w net.core.wmem_default=4194304
       sysctl -w net.core.optmem_max=4194304
   - Increase me
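The IPv4 settings listed above can be collected in one place before they are applied. The following is a minimal sketch, not part of the driver package: the function name ipv4_tuning_commands is hypothetical, and the selective-acks key is assumed to be net.ipv4.tcp_sack (the standard Linux name). It only prints the commands, so nothing changes until an administrator chooses to run them.

```shell
#!/bin/sh
# Print (do not apply) the sysctl commands for the IPv4 settings listed above.
# ipv4_tuning_commands is a hypothetical helper name used for illustration.
ipv4_tuning_commands() {
    while read -r key value; do
        printf 'sysctl -w %s=%s\n' "$key" "$value"
    done <<'EOF'
net.ipv4.tcp_timestamps 0
net.ipv4.tcp_sack 1
net.core.netdev_max_backlog 250000
net.core.rmem_max 4194304
net.core.wmem_max 4194304
net.core.rmem_default 4194304
net.core.wmem_default 4194304
net.core.optmem_max 4194304
EOF
}

# Dry run: review the output, then pipe it to "sh" as root to apply it.
ipv4_tuning_commands
```

Printing first makes it easy to compare the values against a specific system before applying them, since the results depend on the CPU and chipset.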
3.     mstflint -dev <PCI device> dc > <ini device file>

   Step 4. Edit the ini file that you found in the previous step, and add the following lines to the [HCA] section in order to support 63 VFs:
       ;; SRIOV enable
       total_vfs = 63
       num_pfs = 1
       sriov_en = true
   Note: Some servers might have issues accepting 63 Virtual Functions or more. In such case, please set the number of total_vfs to any required value.

   Step 5. Create a binary image using the modified ini file.
     Step a. Download the Mellanox Firmware Tools (www.mellanox.com > Products > Adapter IB/VPI SW > Firmware Tools) and install the package.
     Step b. Run:
       mlxburn -fw ./<fw name>.mlx -conf ./<modified ini file> -wrimage <file name>.bin
   The file <file name>.bin is a firmware binary file with SR-IOV enabled that has 63 VFs. It can be spread across all machines and can be burnt using mstflint, which is part of the bundle, using the following command:
       mstflint -dev <PCI device> -image <file name>.bin b
   Note: After burning the firmware, the machine must be rebooted. If the driver is only restarted, the machine may hang and a reboot using power OFF/ON might be required.

   5.4.7 Ethernet Virtual Function Configuration when Running SR-IOV
   5.4.7.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
   When running ETH ports on VFs, the ports may be configured to simply pass through packets
4. 5.1.5 Quality of Service Tools
   5.1.5.1 mlnx_qos
   mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates directly with the driver, thus does not require setting up a DCBX daemon on the system.
   The mlnx_qos tool enables the administrator of the system to:
   - Inspect the current QoS mappings and configuration. The tool will also display maps configured by the TC and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
   - Set UP to TC mapping
   - Assign a transmission algorithm to each TC (strict or ETS)
   - Set minimal BW guarantee to ETS TCs
   - Set rate limit to TCs. For unlimited ratelimit, set the ratelimit to 0.

   Usage:
       mlnx_qos -i <interface> [options]

   Options:
       --version                  show program's version number and exit
       -h, --help                 show this help message and exit
       -p LIST, --prio_tc=LIST    maps UPs to TCs. LIST is 8 comma separated TC numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1
       -s LIST, --tsa=LIST        Transmission algorithm for each TC. LIST is comma separated algorithm names for each TC. Possible algorithms: strict, ets. Example: ets,strict,ets sets TC0, TC2 to ETS and TC1 to strict. The rest are unchanged.
       -t LIST, --tcbw=LIST       Set minimal guaranteed %BW for ETS TCs. LIST is comma separated percents for each TC. Values set to TCs that are not configured to ETS algor
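To make the -p list format concrete, the sketch below expands an 8-entry list into one line per User Priority. It does not call mlnx_qos or touch the driver; show_prio_tc is a hypothetical helper name, and the only assumption is the list layout described above (position = UP, value = TC).

```shell
#!/bin/sh
# Expand an mlnx_qos "-p" style list: the Nth entry is the TC assigned to UP N.
# show_prio_tc is a hypothetical helper, not part of the mlnx_qos tool.
show_prio_tc() {
    up=0
    for tc in $(printf '%s' "$1" | tr ',' ' '); do
        printf 'UP %d -> TC%d\n' "$up" "$tc"
        up=$(( up + 1 ))
    done
}

# The example list from the option description: UPs 0-3 on TC0, UPs 4-7 on TC1.
show_prio_tc 0,0,0,0,1,1,1,1
```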
5. 5.4.1 System Requirements
   5.4.2 Setting Up SR-IOV
   5.4.3 Enabling SR-IOV and Para Virtualization on the Same Setup
   5.4.4 Assigning a Virtual Function to a Virtual Machine
   5.4.5 Uninstalling SR-IOV Driver
   5.4.6 Burning Firmware with SR-IOV
   5.4.7 Ethernet Virtual Function Configuration when Running SR-IOV
   Chapter 6 Performance Tuning
   6.1 Increasing Packet Rate
   6.2 General System Configurations
   6.2.1 PCI Express (PCIe) Capabilities
   6.2.2 Memory Configuration
   6.2.3 Recommended BIOS Settings
   6.3 Performance Tuning for Linux
   6.3.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
   6.3.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
   6.3.3 Preserving Your Performance Settings after a
   6.3.4 Tuning Power Management
   6.3.5 Interrupt Moderation
6. A user's sk_prio is mapped to a UP, which in turn is mapped into a TC.

   Indicating the UP
   - When the user uses sk_prio, it is mapped into a UP by the tc tool. This is done by the tc_wrap.py tool, which gets a list of <= 16 comma separated UPs and maps the sk_prio to the specified UP. For example, tc_wrap.py -i eth0 -u 1,5 maps sk_prio 0 of the eth0 device to UP 1 and sk_prio 1 to UP 5.
   - Setting set_egress_map in VLAN maps the skb_priority of the VLAN to a vlan_qos. The vlan_qos represents a UP for the VLAN device.
   - In RoCE, rdma_set_option with RDMA_OPTION_ID_TOS could be used to set the UP.
   - When creating QPs, the sl field in the ibv_modify_qp command represents the UP.

   Indicating the TC
   After mapping the skb_priority to a UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that, mlnx_qos should be used. mlnx_qos gets a list of mappings between UPs and TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0 and UPs 4-7 to TC1.

   5.1.4 Quality of Service Properties
   The different QoS properties that can be assigned to a TC are:
   - Strict Priority (see "Strict Priority")
   - Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
   - Rate Limit (see "Rate Limit")

   5.1.4.1 Strict Priority
   When setting a TC's transmission algorithm to be 'strict', then this TC has absolute (strict) priority over other TC strict prioritie
7. Mellanox Technologies
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   U.S.A.
   www.mellanox.com
   Tel: (408) 970-3400
   Fax: (408) 970-3403

   Mellanox Technologies, Ltd.
   Beit Mellanox
   PO Box 586 Yokneam 20692
   Israel
   www.mellanox.com
   Tel: +972 (0)74 723 7200
   Fax: +972 (0)4 959 3245

   Copyright 2013. Mellanox Technologies. All Rights Reserved.
   Mellanox, Mellanox logo, BridgeX, ConnectX, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, MLNX-OS, PhyX, SwitchX, UFM, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd.
   Connect-IB, ExtendX, FabricIT, Mellanox Open Ethernet, Mellanox Virtual Modular Switch, MetroX, MetroDX, ScalableHPC, Unbreakable-Link are trademarks of Mellanox Technologies, Ltd.
   All other trademarks are property of their respective owners.

   Document Number: 2950

   Table of Contents
   Table of Contents
   List of Tables
   Chapter 1 Overview
   1.1 Package Contents
   Chapter 2 Driver Installation
   2.1 Software Dependencies
   2.2 Installing the Driver
   2.3 Loa
8.     cat /sys/devices/system/node/node1/cpumap
       0000aaaa

   6.3.6.3.1 Running an Application on a Certain Node
   In order to run an application on a certain NUMA node, the process affinity should be set either in the command line or with an external tool. For example, if the adapter's NUMA node is 1 and NUMA 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.
   To run an application, run the following commands:
       taskset -c 8-15 ib_write_bw -a
   or:
       taskset 0xff00 ib_write_bw -a

   6.3.7 IRQ Affinity
   The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.
   The following command turns off the IRQ balancer:
       /etc/init.d/irqbalance stop
   The following command assigns the affinity of a single interrupt vector:
       echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity
   Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

   6.3.7.1 IRQ Affinity Configuration
   It is recommended to set each IRQ to a different core. For Sandy Bridge or AM
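The two taskset forms above are equivalent because bit i of the mask stands for core i. A small sketch of that arithmetic follows; cores_to_mask is a hypothetical helper name, and it only handles one contiguous core range.

```shell
#!/bin/sh
# Build the hexadecimal affinity mask for a contiguous range of cores:
# bit i of the mask is set when core i belongs to the range.
# cores_to_mask is a hypothetical helper used for illustration.
cores_to_mask() {
    i=$1
    last=$2
    mask=0
    while [ "$i" -le "$last" ]; do
        mask=$(( mask | (1 << i) ))
        i=$(( i + 1 ))
    done
    printf '%x\n' "$mask"
}

# Cores 8-15, as in the taskset example above.
cores_to_mask 8 15
```

The same kind of mask is what gets written to /proc/irq/<irq vector>/smp_affinity when assigning an interrupt vector to a set of cores.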
9. 55 loc 5 action 2
   All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).
   - ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
     All packets that contain the above source IP address and destination port are to be steered into rx-ring 2. When the destination MAC is not given, the user's destination MAC is filled in automatically.
   - ethtool -u eth5
     Shows all of ethtool's steering rules.
   When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in the kernel (v2.6.28).

   MLX4 Driver Support
   The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure.
   The following are the flow specific parameters:

   Table 2: Flow Specific Parameters
                ether    tcp4/udp4                                  ip4
   Mandatory    dst                                                 src-ip/dst-ip
   Optional     vlan     src-ip, dst-ip, src-port, dst-port, vlan   src-ip, dst-ip, vlan

   RFS
   RFS is an in-kernel logic responsible for load balancing between CPUs by attaching flows to CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to suppor
10. Rx interrupt moderation, and last shows the new setting.
       ethtool -c eth1
       Coalesce parameters for eth1:
       Adaptive RX: on  TX: off
       pkt-rate-low: 400000
       pkt-rate-high: 450000
       rx-usecs: 16
       rx-frames: 88
       rx-usecs-irq: 0
       rx-frames-irq: 0

       ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0

       ethtool -c eth1
       Coalesce parameters for eth1:
       Adaptive RX: off  TX: off
       pkt-rate-low: 400000
       pkt-rate-high: 450000
       rx-usecs: 0
       rx-frames: 0
       rx-usecs-irq: 0
       rx-frames-irq: 0

   6.3.6 Tuning for NUMA Architecture
   6.3.6.1 Tuning for Intel® Sandy Bridge Platform
   The Intel Sandy Bridge processor has an integrated PCI express controller. Thus every PCIe adapter is connected directly to a NUMA node. On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected. In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.
   To see if your system supports PCIe adapter's NUMA node detection:
       cat /sys/class/net/[interface]/device/numa_node
       cat /sys/devices/[PCI root]/[PCIe function]/numa_node
   Example for supported system:
       cat /sys/class/net/eth3/device/numa_node
       0
   Example for unsupported system:
       cat /sys/class/net/ib0/device/numa_node
       -1

   6.3.6.1.1 Improving Application Performance on Remote NUMA Node
   Verbs API applicatio
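The numa_node check above can be wrapped so that a script distinguishes a reported node from an unsupported system. This is a sketch under stated assumptions: adapter_numa_node is a hypothetical helper name, the optional second argument (an alternative sysfs root) exists only to make the function testable, and a negative value is treated as "unsupported", matching the example above.

```shell
#!/bin/sh
# Report the NUMA node of a network interface from sysfs.
# adapter_numa_node is a hypothetical helper; $2 optionally overrides /sys.
adapter_numa_node() {
    iface=$1
    sysroot=${2:-/sys}
    file="$sysroot/class/net/$iface/device/numa_node"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 1
    fi
    node=$(cat "$file")
    if [ "$node" -lt 0 ]; then
        # A negative value means the platform does not report the node.
        echo "unsupported"
    else
        echo "node $node"
    fi
}

# Usage (interface name is an example): adapter_numa_node eth3
```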
11. VLAN stripping/insertion)
   - Ethtool support
   - Net device statistics
   - SR-IOV support
   - Flow steering
   - Ethernet Time Stamping (at beta level)

   1.1 Package Contents
   This driver kit contains the following:

   Table 1: MLNX_EN Package Content
   Components          Description
   - mlx4 driver: mlx4 is the low level driver implementation for the ConnectX adapters designed by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter and as an Ethernet NIC. To accommodate the two flavors, the driver is split into modules: mlx4_core, mlx4_en, and mlx4_ib. Note: mlx4_ib is not part of this package.
     - mlx4_core: Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand, Ethernet and FC functions can share a device without interfering with each other.
     - mlx4_en: Handles Ethernet specific functions and plugs into the netdev mid-layer.
   - mstflint: An application to burn a firmware binary image.
   - Software modules: Sources of all software modules (under conditions mentioned in the modules' LICENSE files).
   - Documentation: Release Notes, README.

   2 Driver Installation
   2.1 Software Dependencies
   To install the driver software
12. for RedHat and SUSE distributions
   - NonKMP installation mode, where the sources are rebuilt with the running kernel. This mode is used for vanilla kernels.
   If the vanilla kernel is installed as rpm, please use the "--disable-kmp" flag when installing the driver.

   The kernel module sources are placed under /usr/src/mellanox-mlnx-en-2.0/.
   To recompile the driver:
       cd /usr/src/mellanox-mlnx-en-2.0/
       scripts/mlnx_en_patch.sh
       make
       make install
   The uninstall and performance tuning scripts are installed.
   If the driver was installed without kmp support, the sources would be located under /usr/src/mlnx_en-2.0/.

   2.3 Loading the Driver
   Step 1. Make sure no previous driver version is currently loaded:
       modprobe -r mlx4_en
   Step 2. Load the new driver version:
       modprobe mlx4_en
   The result is a new net-device appearing in the 'ifconfig -a' output. For details on driver usage and configuration, please refer to Section 3, "Ethernet Driver Usage and Configuration".
   On Ubuntu OS, the "mlnx-en" service is responsible for loading the mlx4_en driver upon boot.

   2.4 Unloading the Driver
   To unload the Ethernet driver:
       modprobe -r mlx4_en

   2.5 Uninstalling the Driver
   To uninstall the mlnx_en driver:
       /sbin/mlnx_en_uninstall.sh
13. mapping flow:
   1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
   2. ToS is translated into the sk_prio using a fixed translation:
       TOS 0  <=> sk_prio 0
       TOS 8  <=> sk_prio 2
       TOS 24 <=> sk_prio 4
       TOS 16 <=> sk_prio 6
   3. The Socket Priority is mapped to the UP:
      - If the underlying device is a VLAN device, egress_map is used, controlled by the vconfig command. This is a per-VLAN mapping.
      - If the underlying device is not a VLAN device, the tc command is used. In this case, even though the tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio to UP mapping.
      Mapping the sk_prio to the UP is done by using:
       tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7
   4. The UP is mapped to the TC as configured by the mlnx_qos tool, or by the lldpad daemon if DCBX is used.

   Socket applications can use setsockopt (SK_PRIO, value) to directly set the sk_prio of the socket. In this case, the ToS to sk_prio fixed mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS.
   In case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.

   5.1.3 Map Priorities with tc_wrap.py/mlnx_qos
   Network flow that can be managed by QoS attributes is described by a User Priority (UP)
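The fixed ToS translation in step 2 of the mapping flow has only four entries, so it can be written as a lookup. The sketch below mirrors those four values exactly; tos_to_skprio is a hypothetical helper name, and any other ToS value is rejected rather than guessed.

```shell
#!/bin/sh
# Fixed ToS -> sk_prio translation, as listed in step 2 of the mapping flow.
# tos_to_skprio is a hypothetical helper used for illustration.
tos_to_skprio() {
    case "$1" in
        0)  echo 0 ;;
        8)  echo 2 ;;
        24) echo 4 ;;
        16) echo 6 ;;
        *)  echo "no fixed translation listed for ToS $1" >&2
            return 1 ;;
    esac
}

tos_to_skprio 24
```

Applications that need more than these four priorities can set sk_prio directly, as the text notes, instead of going through ToS.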
14. of event packet                                       HWTSTAMP_FILTER_PTP_V1_L4_EVENT
   PTP v1, UDP, Sync packet                              HWTSTAMP_FILTER_PTP_V1_L4_SYNC
   PTP v1, UDP, Delay_req packet                         HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ
   PTP v2, UDP, any kind of event packet                 HWTSTAMP_FILTER_PTP_V2_L4_EVENT
   PTP v2, UDP, Sync packet                              HWTSTAMP_FILTER_PTP_V2_L4_SYNC
   PTP v2, UDP, Delay_req packet                         HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ
   802.AS1, Ethernet, any kind of event packet           HWTSTAMP_FILTER_PTP_V2_L2_EVENT
   802.AS1, Ethernet, Sync packet                        HWTSTAMP_FILTER_PTP_V2_L2_SYNC
   802.AS1, Ethernet, Delay_req packet                   HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ
   PTP v2/802.AS1, any layer, any kind of event packet   HWTSTAMP_FILTER_PTP_V2_EVENT
   PTP v2/802.AS1, any layer, Sync packet                HWTSTAMP_FILTER_PTP_V2_SYNC
   PTP v2/802.AS1, any layer, Delay_req packet           HWTSTAMP_FILTER_PTP_V2_DELAY_REQ

   Note: for receive side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.

   5.2.2 Getting Time Stamping
   Once time stamping is enabled, the time stamp is placed in the socket Ancillary data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached. It can be received with recvmsg(flags=MSG_ERRQUEUE). The call return
15. 0-5,00:07.0-8
   The driver will enable 5 VFs on the HCA positioned in BDF 00:04.0 and 8 on the one in 00:07.0.
   Note: PFs not included in the above list will not have SR-IOV enabled.

   probe_vf
   - Absent, or zero: no VFs will be used by the PF driver.
   - Its value is a single number in the range of 0-63: the Physical Function driver will use probe_vf VFs, and this will be applied to all ConnectX HCAs on the host.
   - Its format is a string which allows the user to specify the probe_vf parameter separately per installed HCA. Its format is: "bb:dd.f-v,bb:dd.f-v,..."
       bb:dd.f = bus:device.function of the PF of the HCA
       v = number of VFs to use in the PF driver for that HCA
   This parameter can be set in one of the following ways. For example:
   - probe_vf=5
     The PF driver will probe 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
   - probe_vf=00:04.0-5,00:07.0-8
     The PF driver will probe 5 VFs on the HCA positioned in BDF 00:04.0 and 8 for the one in 00:07.0.
   Note: PFs not included in the above list will not use any of their VFs in the PF driver.

   The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per a single VM. However, the number of VFs varies according to the working mode requirements.

   Step 9. Reboot the server.
   If SR-IOV is not supported by the server, the machine might not come o
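The per-HCA string format can be checked before it is passed to the driver by splitting it into its PF and VF-count parts. The sketch below assumes the "bb:dd.f-v,bb:dd.f-v" layout described above; parse_vf_spec is a hypothetical helper name and performs no validation beyond the split.

```shell
#!/bin/sh
# Split a per-HCA VF specification ("bb:dd.f-v,bb:dd.f-v,...") into
# one line per PF. parse_vf_spec is a hypothetical helper.
parse_vf_spec() {
    for entry in $(printf '%s' "$1" | tr ',' ' '); do
        bdf=${entry%-*}     # everything before the last '-': bus:device.function
        vfs=${entry##*-}    # everything after the last '-': number of VFs
        printf 'PF %s: %s VFs\n' "$bdf" "$vfs"
    done
}

parse_vf_spec 00:04.0-5,00:07.0-8
```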
16. D systems, set the IRQ affinity to the adapter's NUMA node:
   - For optimizing single-port traffic, run:
       set_irq_affinity_bynode.sh <numa node> <interface>
   - For optimizing dual-port traffic, run:
       set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
   - To show the current affinity settings, run:
       show_irq_affinity.sh <interface>

   6.3.7.2 Auto Tuning Utility
   MLNX_EN 2.0.x introduces a new affinity tool called mlnx_affinity. This tool can automatically adjust your affinity settings for each network interface according to the system architecture.
   Usage:
   - Start
       mlnx_affinity start
   - Stop
       mlnx_affinity stop
   - Restart
       mlnx_affinity restart
   mlnx_affinity can also be started by driver load/unload.
   To enable affinity by default, add the line below to the /etc/infiniband/openib.conf file:
       RUN_AFFINITY_TUNER=yes

   6.3.7.3 Tuning for Multiple Adapters
   When optimizing the system performance for using more than one adapter, it is recommended to separate the adapters' core utilization so there will be no interleaving between interfaces.
   The following script can be used to separate each adapter's IRQs to a different set of cores:
       set_irq_affinity_cpulist.sh <cpu list> <interface>
   <cpu list> can be either a comma separated list of single core numbers (0,1,2,3) or core groups (0-3).
   Exam
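The two <cpu list> notations (single cores and core groups) describe the same thing once a group is expanded. The sketch below shows that expansion; expand_cpulist is a hypothetical helper name and is not how set_irq_affinity_cpulist.sh itself is implemented.

```shell
#!/bin/sh
# Expand a cpu list such as "0-3,8" into individual core numbers.
# expand_cpulist is a hypothetical helper used for illustration.
expand_cpulist() {
    out=""
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case "$part" in
            *-*) i=${part%-*}; end=${part#*-} ;;   # core group "first-last"
            *)   i=$part; end=$part ;;             # single core number
        esac
        while [ "$i" -le "$end" ]; do
            out="$out${out:+ }$i"
            i=$(( i + 1 ))
        done
    done
    printf '%s\n' "$out"
}

expand_cpulist 0-3,8
```

Expanding both adapters' lists this way makes it easy to confirm that their core sets do not overlap before assigning IRQs.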
17. Mellanox Technologies
   MLNX_EN for Linux User Manual
   Rev 2.0-3.0.0
   Last Modified: October 07, 2013
   www.mellanox.com

   NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
18. _cm connection) and manage its guarantees, limitations and its priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a 2/3 stages process. The TC is assigned with the QoS attributes and the different flows behave accordingly.

   5.1.1 Mapping Traffic to Traffic Classes
   Mapping traffic to TCs consists of several actions which are user controllable, some controlled by the application itself and others by the system/network administrators.
   The following is the general mapping traffic to Traffic Classes flow:
   1. The application sets the required Type of Service (ToS).
   2. The ToS is translated into a Socket Priority (sk_prio).
   3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).
   4. The UP is mapped to TC by the network/system administrator.
   5. TCs hold the actual QoS parameters.

   QoS can be applied on the following types of traffic. However, the general QoS flow may vary among them:
   - Plain Ethernet: Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver.
   - RoCE: Applications use the RDMA API to transmit using QPs.
   - Raw Ethernet QP: Applications use the VERBs API to transmit using a Raw Ethernet QP.

   5.1.2 Plain Ethernet Quality of Service Mapping
   Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver.
   The following is the Plain Ethernet QoS
19. as is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging). In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag inserted, and incoming packets will have the VLAN tag removed. Any VLAN-tagged packets sent by the VF are silently dropped. The default behavior is VGT.
   The feature may be controlled on the Hypervisor from userspace via iproute2/netlink:
       ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]
           ...
           [ vf NUM [ mac LLADDR ]
               [ vlan VLANID [ qos VLAN-QOS ] ]
               ...
               [ spoofchk { on | off } ] ]
           ...
   Use:
       ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]
   where NUM = 0..max-vf-num, vlan_id = 0..4095 (4095 means "set VGT"), and qos = 0..7.
   For example:
   - ip link set dev eth2 vf 2 vlan <vlan_id> qos 3
     sets VST mode for VF #2 belonging to PF eth2, with qos = 3
   - ip link set dev eth2 vf 2 vlan 4095
     sets the mode for VF 2 back to VGT

   5.4.7.2 Additional Ethernet VF Configuration Options
   - Guest MAC configuration
     By default, guest MAC addresses are configured to be all zeroes. In the MLNX_EN guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, he/she should configure guest MACs before the guest driver comes up. The guest MAC may be configured by using:
       ip link set dev PF devi
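The value ranges given for the VST command (vlan_id 0..4095, qos 0..7) can be enforced before anything is sent to the Hypervisor. The sketch below only builds and prints the ip link command line; vf_vlan_command is a hypothetical helper name, and VLAN id 10 in the usage line is an example value that does not come from the manual.

```shell
#!/bin/sh
# Build (and print, not run) the "ip link" command that sets a VF's VLAN/QoS.
# vf_vlan_command is a hypothetical helper; arguments: PF, VF number,
# VLAN id (0..4095, where 4095 switches the VF back to VGT), optional QoS (0..7).
vf_vlan_command() {
    pf=$1
    vf=$2
    vlan=$3
    qos=${4:-}
    if [ "$vlan" -lt 0 ] || [ "$vlan" -gt 4095 ]; then
        echo "vlan_id must be in 0..4095" >&2
        return 1
    fi
    cmd="ip link set dev $pf vf $vf vlan $vlan"
    if [ -n "$qos" ]; then
        if [ "$qos" -lt 0 ] || [ "$qos" -gt 7 ]; then
            echo "qos must be in 0..7" >&2
            return 1
        fi
        cmd="$cmd qos $qos"
    fi
    printf '%s\n' "$cmd"
}

vf_vlan_command eth2 2 10 3     # VST: example VLAN id 10, qos 3
vf_vlan_command eth2 2 4095     # back to VGT
```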
20. ce> vf <NUM> mac <LLADDR>
     For legacy guests, which do not generate random MACs, the administrator should always configure their MAC addresses via ip link, as above.
   - Spoof checking
     Spoof checking is currently available only on upstream kernels newer than 3.1.
       ip link set dev <PF device> vf <NUM> spoofchk [on | off]

   6 Performance Tuning
   6.1 Increasing Packet Rate
   To increase packet rate (especially for small packets), set the value of the "high_rate_steer" module parameter in the mlx4 module to 1 (default 0).
   Note: Enabling this mode will cause the following chassis management features to stop working:
   - NC-SI

   6.2 General System Configurations
   The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.

   6.2.1 PCI Express (PCIe) Capabilities
   Table 3: Recommended PCIe Configuration
   PCIe Generation     3.0
   Speed               8GT/s
   Width               x8 or x16
   Max Payload size    256
   Max Read Request    4096
   Note: For ConnectX3 based network adapters, 40GbE Ethernet adapters, it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.

   6.2.2 Memory Conf
21. config/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly, therefore it is not necessary to add it.

   5.4.5 Uninstalling SR-IOV Driver
   To uninstall the SR-IOV driver, perform the following:
   Step 1. For Hypervisors, detach all the Virtual Functions (VF) from all the Virtual Machines (VM), or stop the Virtual Machines that use the Virtual Functions.
           Please be aware, stopping the driver when there are VMs that use the VFs will cause the machine to hang.
   Step 2. Run the script below. Please be aware, uninstalling the driver deletes the entire driver's file, but does not unload the driver.
       /sbin/mlnx_en_uninstall.sh
       MLNX_EN uninstall done
   Step 3. Restart the server.

   5.4.6 Burning Firmware with SR-IOV
   The following procedure explains how to create a binary image with SR-IOV enabled that has 63 VFs. However, the number of VFs varies according to the working mode requirements.
   To burn the firmware:
   Step 1. Verify you have MFT installed in your machine.
   Step 2. Enter the firmware directory according to HCA type (e.g. ConnectX-3). The path is: mlnx_en/firmware/<device>/<FW version>
   Step 3. Find the ini file that contains the HCA's PSID. Run:
       mstflint -d 03:00.0 q | grep PSID
       PSID: MT_1090110019
   If such an ini file cannot be found in the firmware directory, you may want to dump the configuration file using mstflint. Run
22. ding the Driver
   2.4 Unloading the Driver
   2.5 Uninstalling the Driver
   Chapter 3 Ethernet Driver Usage and Configuration
   Chapter 4 Firmware
   4.1 Installing Firmware Tools
   4.2 Updating Adapter Card
   Chapter 5 Driver Features
   5.1 Quality of Service
   5.1.1 Mapping Traffic to Traffic Classes
   5.1.2 Plain Ethernet Quality of Service Mapping
   5.1.3 Map Priorities with tc_wrap.py/mlnx_qos
   5.1.4 Quality of Service Properties
   5.1.5 Quality of Service Tools
   5.2 Time Stamping Service
   5.2.1 Enabling Time Stamping
   5.2.2 Getting Time Stamping
   5.3 Flow Steering
   5.3.1 Enable/Disable Flow
   5.3.2 Flow Domains and Priorities
   5.4 Single Root IO Virtualization
23. ernel.
   For example, to Intel systems:
       default=0
       timeout=5
       splashimage=(hd0,0)/grub/splash.xpm.gz
       hiddenmenu
       title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
           root (hd0,0)
           kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
           initrd /initrd-2.6.32-36.x86-64.img
   Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.

   Step 5. Install the MLNX_EN driver for Linux that supports SR-IOV.
   Step 6. Verify the HCA is configured to support SR-IOV:
       [root@selene ~]# mstflint -dev <PCI Device> dc
   - Verify that the following fields appear in the [HCA] section:
       [HCA]
       num_pfs = 1
       total_vfs = 5
       sriov_en = true
   HCA parameters can be configured during firmware update using the mlnxofedinstall script and running the '--enable-sriov' and '--total-vfs <0-63>' installation parameters. If the current firmware version is the same as the one provided with MLNX_EN, run it in combination with the '--force-fw-update' parameter.
   Note: This configuration option is supported only in HCAs whose configuration file (INI) is included in MLNX_EN.

   Parameter    Recommended Value
   num_pfs      1 (Note: This field is optional and might not always appear.)
   total_vfs    63
   sriov_en     true

   - If the HCA does not support SR-IOV, please contact Mel
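The verification in Step 6 can be scripted by scanning the dumped configuration for the sriov_en field. This is a sketch under stated assumptions: hca_sriov_enabled is a hypothetical helper name, the configuration is assumed to be INI text with an [HCA] section header and "key = value" lines as shown above, and the text is read from standard input so that no device access is needed to try it out.

```shell
#!/bin/sh
# Succeed when INI text on stdin has "sriov_en = true" inside the [HCA] section.
# hca_sriov_enabled is a hypothetical helper used for illustration.
hca_sriov_enabled() {
    awk '
        /^\[/ { in_hca = ($0 ~ /^\[HCA\]/) }              # track the current section
        in_hca && /^[ \t]*sriov_en[ \t]*=[ \t]*true/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (device placeholder as in Step 6):
#   mstflint -dev <PCI Device> dc | hca_sriov_enabled && echo "SR-IOV enabled"
```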
24. Step 4. Attach a virtual NIC to the VM.
   Step 5. Add the MAC 52:54:00:E7:77:99 to the /sys/class/net/eth5/fdb table on the HV.

   5.4.4 Assigning a Virtual Function to a Virtual Machine
   This section describes a mechanism for adding an SR-IOV VF to a Virtual Machine.

   5.4.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
   Step 1. Run the virt-manager.
   Step 2. Double click on the virtual machine and open its Properties.
   Step 3. Go to Details -> Add hardware -> PCI host device.
           [Figure: virt-manager "Add new virtual hardware" dialog, with "Physical Host Device" among the hardware types]
   Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).
   Step 5. If the Virtual Machine is up, reboot it; otherwise start it.
   Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
       lspci | grep Mellanox
       00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
   Step 7. Add the device to the /etc/sys
25. ice connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals those of the Physical Function.
   SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual machines direct hardware access to network resources, hence increasing their performance.
   In this chapter we will demonstrate setup and configuration of SR-IOV in a Red Hat Linux environment using the Mellanox ConnectX VPI adapter cards family.

   5.4.1 System Requirements
   To set up an SR-IOV environment, the following is required:
   - MLNX_EN Driver
   - A server/blade with an SR-IOV-capable motherboard BIOS
   - Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
   - Mellanox ConnectX VPI Adapter Card family with SR-IOV capability

   5.4.2 Setting Up SR-IOV
   Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.
   Step 1. Enable "SR-IOV" in the system BIOS.
           [Figure: BIOS Setup Utility, SR-IOV option set to Enabled]
   Step 2. Enable "Intel Virtualization Technology".
           [Figure: BIOS Setup Utility, Intel Virtualization Technology option]
   Step 3. Install the hypervisor that supports SR-IOV.
   Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux k
26. iguration

For high performance, it is recommended to use the highest memory speed with the fewest DIMMs and to populate all memory channels for every CPU installed. For further information, please refer to your vendor's memory configuration instructions or the memory configuration tool available online.

6.2.3 Recommended BIOS Settings

Note: These performance optimizations may result in higher power consumption.

6.2.3.1 General

Set BIOS power management to Maximum Performance.

6.2.3.2 Intel Sandy Bridge Processors

The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors.

Table 4: Recommended BIOS Settings for Intel Sandy Bridge Processors

    Category    BIOS Option                     Values
    General     Operating Mode / Power profile  Maximum Performance
    Processor   C-States                        Disabled
                Turbo mode                      Enabled
                Hyper-Threading (a)             HPC: disabled; Data Centers: enabled
                CPU frequency select            Max performance
    Memory      Memory speed                    Max performance
                Memory channel mode             Independent
                Node Interleaving               Disabled / NUMA
                Channel Interleaving            Enabled
                Thermal Mode                    Performance

a. Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process due to the lower frequency of a single logical core when hy
27. in the /etc/modprobe.d/mlx4.conf file:
- For MLNX_EN 2.0.x:
    options mlx4_core enable_sys_tune=1
- For MLNX_EN 1.5.10:
    options mlx4_en enable_sys_tune=1

6.3.4.3 OS Controlled Power Management

Some operating systems can override the BIOS power management configuration and enable C-states by default, which results in higher latency.

To resolve the high latency issue, follow the instructions below:
1. Edit the /boot/grub/grub.conf file or any other bootloader configuration file.
2. Add the following kernel parameters to the bootloader command line:
    intel_idle.max_cstate=0 processor.max_cstate=1
3. Reboot the system.

Example:
    title RH6.2x64
    root (hd0,0)
    kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1

6.3.5 Interrupt Moderation

Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turns off
28. ithm are ignored, but must be present. Example: if TC0 and TC2 are set to ETS, then 10,0,90 will set TC0 to 10% and TC2 to 90%. Percents must sum to 100.
    -r LIST, --ratelimit=LIST
        Rate limit for TCs (in Gbps). LIST is a comma-separated Gbps limit for each TC. Example: 1,8,8 will limit TC0 to 1 Gbps, and TC1 and TC2 to 8 Gbps each.
    -i INTF, --interface=INTF
        Interface name
    -a
        Show all interface's TCs

Get Current Configuration

Set ratelimit: 3 Gbps for tc0, 4 Gbps for tc1, and 2 Gbps for tc2

Configure QoS: map UP 0,7 to tc0; 1,2,3 to tc1; and 4,5,6 to tc2. Set tc0 and tc1 as ets and tc2 as strict; divide ets 30% for tc0 and 70% for tc1.

    [command line illegible in source]
    tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
        up: 0
            skprio: 0
            skprio: 1
            skprio: 2 (tos: 8)
            skprio: 3
            skprio: 4 (tos: 24)
            skprio: 5
            skprio: 6 (tos: 16)
            skprio: 7
            skprio: 8
            skprio: 9
            skprio: 10
            skprio: 11
            skprio: 12
            skprio: 13
            skprio: 14
            skprio: 15
        up: 7
    tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
        up: 1
        up: 2
        up: 3
    tc: 2 ratelimit: 2 Gbps, tsa: strict
        up: 4
        up: 5
        up: 6

5.1.5.2 tc and tc_wrap.py

The tc tool is used to set up the sk_prio to UP mapping using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The tc_wrap.py
29. kernel sources must be installed on the machine.

The MLNX_EN driver cannot coexist with OFED software on the same machine. Hence, when installing MLNX_EN, all OFED packages should be removed (done by the mlnx_en install script).

Installing the Driver

Step 1. Download the driver package from the Mellanox site:
    http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=27&menu_section=35
Step 2. Install the driver:
    tar xzvf mlnx_en-2.0-3.0.0.tgz
    cd mlnx_en-2.0-3.0.0
    ./install.sh

To install mlnx_en-2.0-3.0.0 on XenServer 6.1:
    rpm -ihv RPMS/xenserver6u1/i386/`uname -r`/mlnx_en*rpm

The package consists of several source RPMs. The install script rebuilds the source RPMs and then installs the created binary RPMs. The created kernel module binaries are located at:
- For KMP RPMs installation:
    On SLES (mellanox-mlnx-en-kmp RPM): /lib/modules/<kernel-ver>/updates/mellanox-mlnx-en
    On RHEL (kmod-mellanox-mlnx-en RPM): /lib/modules/<kernel-ver>/extra/mellanox-mlnx-en
- For non-KMP RPMs (mlnx_en RPM):
    On SLES: /lib/modules/<kernel-ver>/updates/mlnx_en
    On RHEL: /lib/modules/<kernel-ver>/extra/mlnx_en

The mlnx_en installer supports 2 modes of installation. The install script selects the mode of driver installation depending on the running OS/kernel version.
- Kernel Module Packaging (KMP) mode, where the source rpm is rebuilt for each installed flavor of the kernel. This mode is used
30. lanox Support (support@mellanox.com).

Step 7. Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist; otherwise delete its contents.

Notes:
1. If the fields in the example above do not appear in the [HCA] section, SR-IOV is not supported in the used INI.
2. If SR-IOV is supported but not enabled, it is sufficient to set "sriov_en = true" in the INI to enable it.

Step 8. Insert an "options" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical function driver (probe_vf):
    options mlx4_core num_vfs=5 probe_vf=1

Parameter: num_vfs
Recommended Value:
- Absent, or zero: The SR-IOV mode is not enabled in the driver, hence no VFs will be available.
- A single number in the range of 0-63: The driver will enable the num_vfs VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
- A string which allows the user to specify the num_vfs parameter separately per installed HCA. Its format is "bb:dd.f-v,bb:dd.f-v,..." where:
    bb:dd.f = bus:device.function of the PF of the HCA
    v = number of VFs to enable for that HCA

This parameter can be set in one of the following ways. For example:
- num_vfs=5: The driver will enable 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
- num_vfs=00:04
31. mory thresholds to prevent packet dropping:
    sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
- Enable low latency mode for TCP:
    sysctl -w net.ipv4.tcp_low_latency=1

6.3.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance

The following changes are recommended for improving IPv6 traffic performance:
- Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0
- Enable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=1

6.3.3 Preserving Your Performance Settings after a Reboot

To preserve your performance settings after a reboot, add them to the file /etc/sysctl.conf as follows:
    <sysctl name1>=<value1>
    <sysctl name2>=<value2>
    <sysctl name3>=<value3>
    <sysctl name4>=<value4>

For example, section "Tuning the Network Adapter for Improved IPv4 Traffic Performance" lists the following setting to disable the TCP timestamps option:
    sysctl -w net.ipv4.tcp_timestamps=0
In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:
    net.ipv4.tcp_timestamps=0

6.3.4 Tuning Power Management

Check that the output CPU frequency for each core is equal to the maximum supported and that all core frequencies are consistent.
- Check the maximum suppor
32. mware upgrades and provides instructions for (1) installing Mellanox firmware update tools (MFT), (2) downloading FW, and (3) updating adapter card firmware.

4.1 Installing Firmware Tools

The driver package compiles and installs the Mellanox mstflint utility under /usr/local/bin/. You may also use this tool to burn a card-specific firmware binary image. See the file /tmp/mlnx_en/src/utils/mstflint/README for details.

Alternatively, you can download the current Mellanox Firmware Tools package (MFT) from www.mellanox.com > Products > Adapter IB/VPI SW > Firmware Tools. The tools package to download is "SW for Linux" (tarball name is mft-X.X.X.tgz).

For help in identifying your adapter card, please visit http://www.mellanox.com/content/pages.php?pg=firmware_HCA_FW_identification.

4.2 Updating Adapter Card Firmware

Using a card-specific binary firmware image file, enter the following command:
    mstflint -d <pci device> -i <image_name.bin> b

For burning firmware using the MFT package, please check the MFT user's manual under www.mellanox.com > Products > Adapter IB/VPI SW > Firmware Tools.

Note: After burning new firmware to an adapter, reboot the machine so that the new firmware can take effect.

5 Driver Features

5.1 Quality of Service

Quality of Service (QoS) is a mechanism of assigning a priority to a network flow (socket, rdma
33. n of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed on the PCI, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.

5.2.1 Enabling Time Stamping

Time stamping is off by default and should be enabled before use.

To enable time stamping for a socket:
Call setsockopt() with SO_TIMESTAMPING and with the following flags:
    SOF_TIMESTAMPING_TX_HARDWARE:  try to obtain send time stamp in hardware
    SOF_TIMESTAMPING_TX_SOFTWARE:  if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software
    SOF_TIMESTAMPING_RX_HARDWARE:  return the original, unmodified time stamp as generated by the hardware
    SOF_TIMESTAMPING_RX_SOFTWARE:  if SOF_TIMESTAMPING_RX_HARDWARE is off or fails, then do it in software
    SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
    SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base
    SOF_TIMESTAMPING_SOFTWARE:     return system time stamp generated in software
    SOF_TIMESTAMPING_TX/RX determine how time stamps are generated
    SOF_TIMESTAMPING_RAW/SYS determine how they are reported

To enable time stamping for a net device:
An admin-privileged user can enable/disable time stamping th
34. nd Configuration

3 Ethernet Driver Usage and Configuration

To assign an IP address to the interface:
    ifconfig eth<x> <ip>
  where 'x' is the OS-assigned interface number.

To check driver and device information:
    ethtool -i eth<x>
  Example:
    ethtool -i eth2
    driver: mlx4_en
    version: 2.1.8 (Oct 06 2013)
    firmware-version: 2.30.3110
    bus-info: 0000:1a:00.0

To query stateless offload status:
    ethtool -k eth<x>

To set stateless offload status:
    ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off]

To query interrupt coalescing settings:
    ethtool -c eth<x>

To enable/disable adaptive interrupt moderation:
    ethtool -C eth<x> adaptive-rx on|off
  By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.

To set the values for packet rate limits and for moderation time high and low:
    ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
  Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.

To set interrupt coalescing settings when adaptive moderation is disabled:
    ethtool -C eth<x> [rx-usecs N] [rx-frames N]
  Note: usec settings correspo
35. nd to the time to wait after the last packet is sent/received before triggering an interrupt.

To query pause frame settings:
    ethtool -a eth<x>

To set pause frame settings:
    ethtool -A eth<x> [rx on|off] [tx on|off]

To query ring size values:
    ethtool -g eth<x>

To modify rings size:
    ethtool -G eth<x> [rx <N>] [tx <N>]

To obtain additional device statistics:
    ethtool -S eth<x>

To perform a self diagnostics test:
    ethtool -t eth<x>

The driver defaults to the following parameters:
- Both ports are activated (i.e., a net device is created for each port)
- The number of Rx rings for each port is the nearest power of 2 of the number of CPU cores, limited by 16
- LRO is enabled with 32 concurrent sessions per Rx ring

Some of these values can be changed using module parameters, which can be displayed by running:
    modinfo mlx4_en

To set non-default values to module parameters, add to the /etc/modprobe.conf file:
    options mlx4_en <param_name>=<value> <param_name>=<value> ...

Values of all parameters can be observed in /sys/module/mlx4_en/parameters/.

4 Firmware Programming

The adapter card was shipped with the most current firmware available. This section is intended for future fir
36. ns that mostly use polling will have an impact when using the remote NUMA node.

libmlx4 has a built-in enhancement that recognizes an application that is pinned to a remote NUMA node and activates a flow that improves the out-of-the-box latency and throughput. However, the NUMA node recognition must be enabled as described in section "Tuning for Intel Sandy Bridge Platform".

In systems which do not support SLIT, the following environment variable should be applied:
    MLX4_LOCAL_CPUS=0x[bit mask of local NUMA node]
Example for a local NUMA node whose cores are 0-7:
    MLX4_LOCAL_CPUS=0xff

Additional modification can apply to impact this feature by changing the following environment variable:
    MLX4_STALL_NUM_LOOP=[integer] (default: 400)

Note: The default value is optimized for most applications. However, several applications might benefit from increasing/decreasing this value.

6.3.6.2 Tuning for AMD Architecture

On AMD architecture, there is a difference between a 2-socket system and a 4-socket system.
- With a 2-socket system, the PCIe adapter will be connected to socket 0 (nodes 0,1).
- With a 4-socket system, the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).

6.3.6.3 Recognizing NUMA Node Cores

To recognize NUMA node cores, read either the cpulist or the cpumap file of the node:
    cat /sys/devices/system/node/node[X]/cpulist | cpumap
Example:
    cat /sys/devices/system/node/node1/cpulist
    15
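
The bit mask expected by MLX4_LOCAL_CPUS can be derived from the cpulist string reported for the local NUMA node. The following is a minimal Python sketch, not part of the driver package. The helper name cpulist_to_mask is hypothetical, and the sketch assumes the usual sysfs cpulist syntax of comma-separated cores and ranges (for example 0-7 or 1,3,5,7).

```python
def cpulist_to_mask(cpulist: str) -> str:
    """Convert a sysfs-style cpulist (e.g. "0-7" or "1,3,5,7") to a hex bit mask.

    Hypothetical helper for illustration only: bit N of the mask is set
    when core N appears in the list.
    """
    mask = 0
    for part in cpulist.strip().split(","):
        if not part:
            continue  # tolerate empty fields such as a trailing comma
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
        else:
            first = last = int(part)
        for core in range(first, last + 1):
            mask |= 1 << core
    return hex(mask)


if __name__ == "__main__":
    # Cores 0-7 give the mask used in the example above.
    print("MLX4_LOCAL_CPUS=" + cpulist_to_mask("0-7"))  # MLX4_LOCAL_CPUS=0xff
```

The result can then be exported in the environment of the application before it starts. Whether a given system needs this at all depends on its SLIT support, as described above.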
37. per-threading is enabled.

6.2.3.3 Intel Nehalem/Westmere Processors

The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.

Table 5: Recommended BIOS Settings for Intel Nehalem/Westmere Processors

    Category    BIOS Option                     Values
    General     Operating Mode / Power profile  Maximum Performance
    Processor   C-States                        Disabled
                Turbo mode                      Disabled
                Hyper-Threading (a)             Disabled. Recommended for latency and message rate sensitive applications.
                CPU frequency select            Max performance
    Memory      Memory speed                    Max performance
                Memory channel mode             Independent
                Node Interleaving               Disabled / NUMA
                Channel Interleaving            Enabled
                Thermal Mode                    Performance

a. Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process due to the lower frequency of a single logical core when hyper-threading is enabled.

6.2.3.4 AMD Processors

The following table displays the recommended BIOS settings in machines with AMD based processors.

Table 6: Recommended BIOS Settings for AMD Processors

    Category    BIOS Option                     Values
    General     Operating Mode / Power profile  Maximum Performance
    Processor   C-States                        Disabled
                Turbo mode                      Disabled
                HPC Optimizations               Enabled
                CPU frequency select            Max performance
38. ple

If the system has 2 adapters on the same NUMA node (0-7), each with 2 interfaces, run the following:
    /etc/init.d/irqbalancer stop
    set_irq_affinity_cpulist.sh 0-1 eth2
    set_irq_affinity_cpulist.sh 2-3 eth3
    set_irq_affinity_cpulist.sh 4-5 eth4
    set_irq_affinity_cpulist.sh 6-7 eth5

6.3.8 Tuning Multi-Threaded IP Forwarding

To optimize NIC usage as IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
   - For MLNX_EN 2.0.x:
       options mlx4_en inline_thold=0
       options mlx4_core high_rate_steer=1
   - For MLNX_EN 1.5.10:
       options mlx4_en num_lro=0 inline_thold=0
       options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
       set_irq_affinity_bynode.sh <numa node> <interface>
4. Forwarding from one interface to another:
       set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
5. Disable adaptive interrupt moderation and set status values using:
       ethtool -C <interface> adaptive-rx off
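
The per-interface core assignment in the IRQ affinity example follows a simple pattern: the cores of the NUMA node are divided evenly among the interfaces. A minimal Python sketch of that split is given here. It only builds the command strings and does not run them. The function name and the even-split policy are assumptions made for illustration; the script name is the one used in the example.

```python
def affinity_commands(first_core, last_core, interfaces):
    """Split cores first_core..last_core evenly across the given interfaces.

    Returns one command string per interface. Hypothetical helper: the
    even split is an assumed policy, and nothing is executed.
    """
    cores = last_core - first_core + 1
    if not interfaces or cores % len(interfaces) != 0:
        raise ValueError("core count must divide evenly among interfaces")
    per_interface = cores // len(interfaces)
    commands = []
    for index, name in enumerate(interfaces):
        start = first_core + index * per_interface
        end = start + per_interface - 1
        # A single core is written as "N", several cores as "N-M".
        cpulist = str(start) if start == end else f"{start}-{end}"
        commands.append(f"set_irq_affinity_cpulist.sh {cpulist} {name}")
    return commands


if __name__ == "__main__":
    for line in affinity_commands(0, 7, ["eth2", "eth3", "eth4", "eth5"]):
        print(line)
```

For cores 0-7 and interfaces eth2 through eth5, this reproduces the four commands of the example. Other splits (for instance, giving a busier interface more cores) may suit a particular workload better.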
39. regular L2 steering is performed instead.

Note: When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the flow steering table for the guest/master.

To enable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Set the parameter log_num_mgm_entry_size to -1 by writing the option mlx4_core log_num_mgm_entry_size=-1.
Step 3. Restart the driver.

To disable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Remove the options mlx4_core log_num_mgm_entry_size=-1.
Step 3. Restart the driver.

5.3.2 Flow Domains and Priorities

Flow steering defines the concept of domain and priority. Each domain represents a user agent that can attach a flow. The domains are prioritized. A higher priority domain will always supersede a lower priority domain when their flow specifications overlap. Setting a lower priority value will result in higher priority.

In addition to the domain, there is a priority within each of the domains. Each domain can have at most 2^12 priorities in accordance with its needs.

The following are the domains in descending order of priority:
- Ethtool
  The Ethtool domain is used to attach an Rx ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool manpage for all the ways to specify a flow.
  Examples:
    ethtool -U eth5 flow-type ether dst 00:11:22:33:44
40. rough calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:

Send side time sampling: enabled by ifreq.hwtstamp_config.tx_type.

    /* possible values for hwtstamp_config->tx_type */
    enum hwtstamp_tx_types {
        /*
         * No outgoing packet will need hardware time stamping;
         * should a packet arrive which asks for it, no hardware
         * time stamping will be done.
         */
        HWTSTAMP_TX_OFF,
        /*
         * Enables hardware time stamping for outgoing packets;
         * the sender of the packet decides which are to be
         * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
         * before sending the packet.
         */
        HWTSTAMP_TX_ON,
        /*
         * Enables time stamping for outgoing packets just as
         * HWTSTAMP_TX_ON does, but also enables time stamp insertion
         * directly into Sync packets. In this case, transmitted Sync
         * packets will not receive a time stamp via the socket error
         * queue.
         */
        HWTSTAMP_TX_ONESTEP_SYNC,
    };

Note: for send side time stamping, currently only HWTSTAMP_TX_OFF and HWTSTAMP_TX_ON are supported.

Receive side time sampling: enabled by ifreq.hwtstamp_config.rx_filter.

    /* possible values for hwtstamp_config->rx_filter */
    enum hwtstamp_rx_filters {
        /* time stamp no incoming packet at all */
        HWTSTAMP_FILTER_NONE,
        /* time stamp any incoming packet */
        HWTSTAMP_FILTER_ALL,
        /* return value: time stamp all packets requested plus some others */
        HWTSTAMP_FILTER_SOME,
        /* PTP v1, UDP, any kind
41. s coming before it (as determined by the TC number: TC 7 is highest priority, TC 0 is lowest). It also has an absolute priority over non-strict TCs (ETS).

This property needs to be used with care, as it may easily cause starvation of other TCs.

A higher strict priority TC is always given the first chance to transmit. Only if the highest strict priority TC has nothing more to transmit will the next highest TC be considered. Non-strict priority TCs will be considered last to transmit.

This property is extremely useful for low-latency, low-bandwidth traffic: traffic that needs to get immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.

5.1.4.2 Minimal Bandwidth Guarantee (ETS)

After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy. If, for instance, TC0 is set to 80% guarantee and TC1 to 20% (the TCs sum must be 100), then the BW left after servicing all strict priority TCs will be split according to this ratio. Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.

5.1.4.3 Rate Limit

Rate limit defines a maximum bandwidth allowed for a TC. Please note that 10% deviation from the requested values is considered acceptable.
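
The arithmetic behind the minimal bandwidth guarantee and the rate limit tolerance can be illustrated with a short Python sketch. It is a worked example only, not driver code. Both function names are hypothetical, and reading the 10% deviation as relative to the requested value is an assumption.

```python
def ets_minimum_shares(leftover_gbps, guarantees_percent):
    """Split the bandwidth left after strict-priority TCs by ETS percentages.

    Returns the guaranteed minimum per TC. These are minimums only: a TC
    that does not use its share leaves the remainder to the other TCs.
    """
    if sum(guarantees_percent) != 100:
        raise ValueError("ETS percentages must sum to 100")
    return [leftover_gbps * percent / 100 for percent in guarantees_percent]


def within_rate_limit_tolerance(requested_gbps, measured_gbps, tolerance=0.10):
    """Check whether a measured rate is within 10% of the requested limit.

    Assumption: the deviation is taken relative to the requested value.
    """
    return abs(measured_gbps - requested_gbps) <= tolerance * requested_gbps


if __name__ == "__main__":
    # Illustrative numbers: 8 Gbps remain after the strict-priority TCs,
    # TC0 is guaranteed 80% and TC1 20%.
    print(ets_minimum_shares(8, [80, 20]))
    # A TC limited to 3 Gbps that is measured at 3.2 Gbps.
    print(within_rate_limit_tolerance(3, 3.2))
```

With 8 Gbps left over, TC0 is guaranteed at least 6.4 Gbps and TC1 at least 1.6 Gbps. If TC1 sends less than that, TC0 may use the difference.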
42. s the original outgoing packet data, including all headers prepended down to and including the link layer, the scm_timestamping control message, and a sock_extended_err control message with ee_errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.

Note: When time stamping is enabled, VLAN stripping is disabled. For more info, please refer to Documentation/networking/timestamping.txt in kernel.org.

5.3 Flow Steering

Note: Flow Steering is applicable to the mlx4 driver only.

Flow steering is a new model which steers network flows, based on flow specifications, to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP, and a priority. Flow steering rules can be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses an opposed terminology of a flow attribute (ibv_flow_attr), defined by a combination of specifications (struct ibv_flow_spec_).

5.3.1 Enable/Disable Flow Steering

Flow Steering is disabled by default and
43. t the RFS logic by implementing the ndo_rx_flow_steer, which, in turn, calls the underlying flow steering mechanism with the RFS domain.

Enabling the RFS requires enabling the 'ntuple' flag via the ethtool. For example, to enable ntuple for eth0, run:
    ethtool -K eth0 ntuple on

RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steering support.

Note: RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.

- All of the rest
  The lowest priority domain serves the following users:
  - The mlx4 Ethernet driver attaches its unicast and multicast MAC addresses to its QP using L2 flow specifications.

Note: Fragmented UDP traffic cannot be steered. It is treated as 'other' protocol by hardware (from the first packet) and not considered as UDP traffic.

5.4 Single Root IO Virtualization (SR-IOV)

Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. Mellanox adapters are capable of exposing, in ConnectX-3 adapter cards, 63 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately. Each VF can be seen as an addition dev
44. ted CPU frequency:
    cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
- Check that core frequencies are consistent:
    cat /proc/cpuinfo | grep "cpu MHz"
- Check that the output frequencies are the same as the maximum supported. If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in section "Recommended BIOS Settings" to verify that the power state is disabled.
- Check the current CPU frequency to see whether it is configured to the maximum available frequency:
    cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

6.3.4.1 Setting the Scaling Governor

If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance:
- freq_table
- acpi_cpufreq: this module is architecture dependent.

It is also recommended to disable the module cpuspeed; this module is also architecture dependent.

To set the scaling mode to performance, use:
    echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

To disable cpuspeed, use:
    service cpuspeed stop

6.3.4.2 Kernel Idle Loop Tuning

The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wake-up time but may result in higher power consumption.

To tune the kernel idle loop, set the following options
45. tool will use either the sysfs or the tc tool to configure the sk_prio to UP mapping.

Usage:
    tc_wrap.py -i <interface> [options]

Options:
    --version             show program's version number and exit
    -h, --help            show this help message and exit
    -u SKPRIO_UP, --skprio_up=SKPRIO_UP
                          maps sk_prio to UP. LIST is <=16 comma separated UP. index of element is sk_prio.
    -i INTF, --interface=INTF
                          Interface name

Example: set skprio 0-2 to UP0, and skprio 3-7 to UP1 on eth4

    UP 0
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
    UP 1
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
    UP 2
    UP 3
    UP 4
    UP 5
    UP 6
    UP 7

5.1.5.3 Additional Tools

The tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is a part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.
- qos tool (package: ofed-scripts) requires python >= 2.5
- tc_wrap.py (package: ofed-scripts) requires python >= 2.5

5.2 Time Stamping Service

Note: Time Stamping is currently at beta level. Please be aware that everything listed here is subject to change.

Note: Time Stamping is currently supported in ConnectX-3/ConnectX-3 Pro adapter cards only.

Time stamping is the process of keeping track of the creatio
46. ut of boot load

Step 10. Load the driver and verify that SR-IOV is supported. Run:
    lspci | grep Mellanox
    03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
    03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
Where:
- "03:00" represents the Physical Function
- "03:00.X" represents the Virtual Function connected to the Physical Function

5.4.3 Enabling SR-IOV and Para Virtualization on the Same Setup

To enable SR-IOV and Para Virtualization on the same setup:
Step 1. Create a bridge:
    vim /etc/sysconfig/network-scripts/ifcfg-bridge0
    DEVICE=bridge0
    TYPE=Bridge
    IPADDR=12.195.15.1
    NETMASK=255.255.0.0
    BOOTPROTO=static
    ONBOOT=yes
    NM_CONTROLLED=no
    DELAY=0
Step 2. Change the related interface (in the example below, bridge0 is created over eth5):
    DEVICE=eth5
    BOOTPROTO=none
    STARTMODE=on
    HWADDR=00:02:c9:2e:66:52
    TYPE=Ethernet
    NM_CONTROLLED=no
    ONBOOT=yes
    BRIDGE=bridge0
Step 3. Restart the service network.
