
Performance Tuning Guidelines for Mellanox Network Adapters


Contents

1. This setting increases the number of times the selected path successfully receives poll hits, which improves the latency but causes increased CPU utilization.
• Disable the following polling parameters by setting their values to 0:
  VMA_RX_POLL_OS_RATIO
  VMA_SELECT_POLL_OS
  When disabled, only offloaded sockets are polled.

3.12.4 Handling Single-Threaded Processes
You can improve performance for single-threaded processes:
• Change the threading parameter to VMA_THREAD_MODE=0.
This setting helps to eliminate VMA locks and improve performance (see the combined sketch below).

3.13 Performance Tuning for Virtualized Environment
3.13.1 Tuning for Hypervisor
It is recommended to configure the iommu to pass-thru option in order to improve hypervisor performance.
To configure the iommu to pass-thru option:
• Add the following to the kernel parameters: intel_iommu=on iommu=pt
The virtualization service might enable global IPv4 forwarding, which in turn will cause all interfaces to disable their large receive offload (LRO) capability.
To re-enable the large receive offload capability using ethtool:
  ethtool -K <interface> lro on

4 Performance Tuning for Windows
This document describes how to modify Windows registry parameters in order to improve performance.
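Tying together the VMA settings and the LRO re-enable step from the item above, a minimal sketch; the application name my_app and the interface eth2 are placeholders, not values from the guide:
  VMA_RX_POLL_OS_RATIO=0 VMA_SELECT_POLL_OS=0 VMA_THREAD_MODE=0 LD_PRELOAD=libvma.so ./my_app
  ethtool -K eth2 lro on     # re-enable LRO if the virtualization service disabled it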
2. Please be aware that if the file does not exist, it must be created with the same name as the one stated above.
• For MLNX_OFED 2.x:
  options mlx4_core enable_sys_tune=1

3.4 NUMA Architecture Tuning
3.4.1 Tuning for Intel Sandy Bridge Platform / Ivy Bridge Processors
The Intel Sandy Bridge processor has an integrated PCI Express controller. Thus every PCIe adapter is connected directly to a NUMA node.
On a system with more than one NUMA node, performance is better when using the local NUMA node to which the PCIe adapter is connected.
In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.
To see if your system supports a PCIe adapter's NUMA node detection:
  cat /sys/class/net/<interface>/device/numa_node
  cat /sys/devices/<PCI root>/<PCIe function>/numa_node
Example for a supported system:
  cat /sys/class/net/eth3/device/numa_node
  0
Example for an unsupported system:
  cat /sys/class/net/ib0/device/numa_node
  -1

3.4.1.1 Improving Application Performance on Remote Sandy Bridge Node
Verbs API applications that mostly use polling will have an impact when using the remote Sandy Bridge node. libmlx4 and libmlx5 have a built-in enhancement that recognizes an application that is pinned to a remote Sandy Bridge node and activates a flow that improves the out-of-the-box latency and throughput. However, the Sandy Bridge node recognition must be enabled as described in Section 3.4.1.
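As a quick way to apply the check above to every interface at once, a small shell sketch; it only walks sysfs, and a value of -1 means the platform does not report the adapter's NUMA node:
  for netdev in /sys/class/net/*/device/numa_node; do
      iface=$(basename $(dirname $(dirname $netdev)))      # e.g. eth3
      echo "$iface: $(cat $netdev)"                         # prints the adapter's NUMA node
  done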
3. (See Setting the Scaling Governor.)
6. Increase the number of persistent huge pages in the kernel's huge page pool for user-space targets such as TGT:
  echo 3000 > /proc/sys/vm/nr_hugepages
For kernel-space targets such as LIO/SCST, decrease the number of persistent huge pages or set it to zero:
  echo 0 > /proc/sys/vm/nr_hugepages
7. Set the IRQ affinity hints (see IRQ Affinity Hints).

3.12 Tuning VMA Parameters
This section provides guidelines for improving performance with VMA. It is intended for administrators who are familiar with VMA and should be used in conjunction with the VMA User Manual and the VMA Release Notes.
You can minimize latency by tuning VMA parameters. It is recommended to test VMA performance tuning on an actual application.
We suggest that you try the following VMA parameters one by one, and in combination, to find the optimum for your application. For more information about each parameter, see the VMA User Manual.
To perform tuning, add the VMA configuration parameters when you run VMA, after LD_PRELOAD, for example:
  LD_PRELOAD=libvma.so VMA_MTU=200 ./my_application

3.12.1 Memory Allocation Type
We recommend using contiguous pages (the default). However, in case you want to use huge pages, do the following:
• Before running VMA, enable kernel and VMA huge pages, for example:
  echo 1000000000 > /proc/sys/kernel/shmmax
  echo 800 > /proc/sys/vm/nr_hugepages
  Note: Increase the amount of shared memory (bytes) and huge pages if you receive a warning about an insufficient number of huge pages allocated in the system.
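The "one by one, then in combination" approach above can be scripted as successive runs of the same benchmark; a sketch in which my_application and the parameter values are purely illustrative:
  LD_PRELOAD=libvma.so VMA_MTU=200 ./my_application
  LD_PRELOAD=libvma.so VMA_MTU=200 VMA_RX_POLL=200000 ./my_application
  LD_PRELOAD=libvma.so VMA_MTU=200 VMA_RX_POLL=200000 VMA_THREAD_MODE=0 ./my_application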
4. 3.4.4 Running an Application on a Certain NUMA Node
In order to run an application on a certain NUMA node, the process affinity should be set, either on the command line or with an external tool.
For example, if the adapter's NUMA node is 1 and NUMA node 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.
To run an application, run the following commands (a sketch that derives the core list automatically follows below):
  taskset -c 8-15 ib_write_bw -a
or:
  taskset 0xff00 ib_write_bw -a

3.5 Interrupt Moderation Tuning
Note: This section applies to both Ethernet and IPoIB interfaces.
Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.
To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and finally show the new setting:
  > ethtool -c eth1
  Coalesce parameters for eth1:
  Adaptive RX: on  TX: off
  pkt-rate-low: 100000
  pkt-rate-high: 400000
  rx-usecs: 16
  rx-frames: 128
  rx-usecs-irq: 0
  rx-frames-irq: 0
  > ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0
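A sketch that reads the adapter's NUMA node and core list from sysfs instead of hard-coding 8-15; the interface name eth2 is a placeholder and the system must report NUMA topology:
  node=$(cat /sys/class/net/eth2/device/numa_node)
  cores=$(cat /sys/devices/system/node/node${node}/cpulist)     # e.g. "8-15"
  taskset -c ${cores} ib_write_bw -a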
5. Revision | Date | Description (continued)
• Tuning for Intel Microarchitecture Code name Sandy Bridge / Ivy Bridge Platforms
• Tuning for Windows Server 2012 / 2012 R2
• Removed the following section:
  • Reducing DMAs
• Added the following section:
  • Verbs Applications
1.10 | December 2013
• Updated section Performance Testing
1.10 | October 2013
• Updated section Kernel Idle Loop Tuning
• Added section Performance Tuning for Virtualized Environment
1.9 | September 2013
• Updated section Interrupt Moderation
1.8 | June 2013
• Removed section Tuning for Windows Server 2008 and its sub-sections
• Added the following sections:
  • Recognizing NUMA Node Cores
  • Finding the Closest NUMA Node to the NIC
1.7 | April 2013
• Updated the following sections:
  • Recommended BIOS Settings
  • Power Management
  • Tuning for Intel Sandy Bridge
  • IRQ Affinity Configuration
  • Multi-Threaded IP Forwarding Tuning
  • Tuning for Multiple Adapters
• Replaced Tuning for IPoIB Interfaces with Auto Tuning Utility
• Added section Improving Application Performance on Remote Sandy Bridge Node
1.6 | October 2012
• Added the following sections:
  • Recognizing NUMA Node Cores
  • Running an Application on a Certain NUMA Node
• Updated the following sections:
  • Tuning the Network Adapter
6. 1.1 Relevant Mellanox Drivers
2 General System Configurations
2.1 PCI Express (PCIe) Capabilities
2.2 Memory Configuration
2.3 System Monitoring and Profilers
2.4 Recommended BIOS Settings
2.4.1 General
2.4.2 Intel Haswell Processors
2.4.3 Intel Sandy Bridge Processors / Ivy Bridge Processors
2.4.4 Intel Nehalem/Westmere Processors
2.4.5 AMD Processors
3 Performance Tuning for Linux
3.1 IRQ Affinity
3.1.1 IRQ Affinity Hints
3.1.2 IRQ Affinity Configuration
3.1.3 Auto Tuning Utility
3.1.4 Tuning for Multiple Adapters
3.2 ConnectX-4 100GbE Tuning
3.3 Power Management Tuning
3.3.1 OS Controlled Power Management
3.3.2 Checking Core Frequency
3.3.3 Setting the Scaling Governor
3.3.4 Kernel Idle Loop Tuning
3.4 NUMA Architecture Tuning
7. Example: When IPoIB is used in connected mode, it has only a single rx queue.
To enable RPS:
  LOCAL_CPUS=`cat /sys/class/net/ib0/device/local_cpus`
  echo $LOCAL_CPUS > /sys/class/net/ib0/queues/rx-0/rps_cpus
For further information, please refer to https://www.kernel.org/doc/Documentation/networking/scaling.txt

3.9 Tuning with sysctl
You can use the Linux sysctl command to modify default system network parameters that are set by the operating system, in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems; the results are significantly dependent on the CPU and chipset efficiency.

3.9.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
The following changes are recommended for improving IPv4 traffic performance:
• Disable the TCP timestamps option for better CPU utilization:
  sysctl -w net.ipv4.tcp_timestamps=0
• Enable the TCP selective acks option for better throughput:
  sysctl -w net.ipv4.tcp_sack=1
• Increase the maximum length of processor input queues:
  sysctl -w net.core.netdev_max_backlog=250000
• Increase the TCP maximum and default buffer sizes using setsockopt():
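For multi-queue interfaces, the same device-local CPU mask can be applied to every rx queue; a sketch assuming the interface is eth2:
  LOCAL_CPUS=$(cat /sys/class/net/eth2/device/local_cpus)
  for q in /sys/class/net/eth2/queues/rx-*/rps_cpus; do
      echo ${LOCAL_CPUS} > ${q}
  done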
8.   <sysctl name3>=<value3>
  <sysctl name4>=<value4>
For example, Tuning the Network Adapter for Improved IPv4 Traffic Performance lists the following setting to disable the TCP timestamps option:
  sysctl -w net.ipv4.tcp_timestamps=0
In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:
  net.ipv4.tcp_timestamps=0

3.10 Verbs Applications Optimization
3.10.1 Single Thread Applications
When running verbs applications that have only a single thread per process, it is recommended to enable the following environment variable:
• For the ConnectX-3 adapter family:
  MLX4_SINGLE_THREADED=1
• When using the Connect-IB adapter family:
  MLX5_SINGLE_THREADED=1
When single thread is enabled, the hardware library will remove expensive locks from the code and improve performance.

3.11 Performance Tuning for iSER
To perform tuning for iSER:
1. Set the SCSI scheduler to noop:
  echo noop > /sys/block/<block_dev>/queue/scheduler
2. Disable the SCSI add_random:
  echo 0 > /sys/block/<block_dev>/queue/add_random
3. Disable IO merges:
  echo 2 > /sys/block/<block_dev>/queue/nomerges
4. Disable hyper-threading in the BIOS configuration.
5. Set the CPU scaling governor to performance, if supported
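Steps 1-3 above are per block device; a sketch that applies them to several iSER-attached devices at once (sdb and sdc are placeholder device names):
  for dev in sdb sdc; do
      echo noop > /sys/block/${dev}/queue/scheduler      # step 1
      echo 0    > /sys/block/${dev}/queue/add_random     # step 2
      echo 2    > /sys/block/${dev}/queue/nomerges       # step 3
  done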
9.   sysctl -w net.core.rmem_max=4194304
  sysctl -w net.core.wmem_max=4194304
  sysctl -w net.core.rmem_default=4194304
  sysctl -w net.core.wmem_default=4194304
  sysctl -w net.core.optmem_max=4194304
• Increase memory thresholds to prevent packet dropping:
  sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
• Enable low latency mode for TCP:
  sysctl -w net.ipv4.tcp_low_latency=1
• The following variable is used to tell the kernel how much of the socket buffer space should be used for the TCP window size, and how much to save for an application buffer:
  sysctl -w net.ipv4.tcp_adv_win_scale=1
A value of 1 means the socket buffer will be divided evenly between the TCP window size and the application.

3.9.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
The following changes are recommended for improving IPv6 traffic performance:
• Disable the TCP timestamps option for better CPU utilization:
  sysctl -w net.ipv4.tcp_timestamps=0
• Enable the TCP selective acks option for better throughput:
  sysctl -w net.ipv4.tcp_sack=1

3.9.3 Preserving Your sysctl Settings after a Reboot
To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:
  <sysctl name1>=<value1>
  <sysctl name2>=<value2>
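The persistent form of the IPv4 settings above, written as an /etc/sysctl.conf fragment (see also Preserving Your sysctl Settings after a Reboot); the values simply mirror the runtime commands:
  net.ipv4.tcp_timestamps = 0
  net.ipv4.tcp_sack = 1
  net.core.netdev_max_backlog = 250000
  net.core.rmem_max = 4194304
  net.core.wmem_max = 4194304
  net.core.rmem_default = 4194304
  net.core.wmem_default = 4194304
  net.core.optmem_max = 4194304
  net.ipv4.tcp_rmem = 4096 87380 4194304
  net.ipv4.tcp_wmem = 4096 65536 4194304
  net.ipv4.tcp_low_latency = 1
  net.ipv4.tcp_adv_win_scale = 1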
10. Mellanox Technologies
350 Oakmead Parkway, Suite 100
Sunnyvale, CA 94085 U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403
Copyright 2015. Mellanox Technologies. All Rights Reserved.
Mellanox, Mellanox logo, BridgeX, CloudX logo, Connect-IB, ConnectX, CoolBox, CORE-Direct, GPUDirect, InfiniHost, InfiniScale, Kotura, Kotura logo, Mellanox Federal Systems, Mellanox Open Ethernet, Mellanox Scalable HPC, Mellanox Connect Accelerate Outperform logo, Mellanox Virtual Modular Switch, MetroDX, MetroX, MLNX-OS, Open Ethernet logo, PhyX, SwitchX, TestX, The Generation of Open Ethernet logo, UFM, Virtual Protocol Interconnect, Voltaire and Voltaire logo are registered trademarks of Mellanox Technologies, Ltd.
Accelio, CyPU, FPGADirect, HPC-X, InfiniBridge, LinkX, Mellanox Care, Mellanox CloudX, Mellanox Multi-Host, Mellanox NEO, Mellanox PeerDirect, Mellanox Socket Direct, Mellanox Spectrum, NVMeDirect, StPU, Spectrum logo, Switch-IB, Unbreakable-Link are trademarks of Mellanox Technologies, Ltd.
All other trademarks are property of their respective owners.
Document Number: 3368

Table of Contents
Document Revision History
1 Introduction
11.
  C-States: Disabled
  Turbo mode: Enabled
  Hyper-Threading: HPC: disabled; Data Centers: enabled
  IO non-posted prefetching: Enabled (if the BIOS option does not exist, please contact your BIOS vendor)
  CPU frequency select: Max performance
Memory:
  Memory speed: Max performance
  Memory channel mode: Independent
  Node Interleaving: Disabled / NUMA
  Channel Interleaving: Enabled
  Thermal Mode: Performance

2.4.3 Intel Sandy Bridge Processors / Ivy Bridge Processors
The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors.
BIOS Option: Values
General:
  Operating Mode / Power profile: Maximum Performance
Processor:
  C-States: Disabled
  Turbo mode: Enabled
  Hyper-Threading: HPC: disabled; Data Centers: enabled
  CPU frequency select: Max performance
Memory:
  Memory speed: Max performance
  Memory channel mode: Independent
  Node Interleaving: Disabled / NUMA
  Channel Interleaving: Enabled
  Thermal Mode: Performance

2.4.4 Intel Nehalem/Westmere Processors
The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.
BIOS Option: Values
General:
  Operating Mode / Power profile: Maximum Performance
Processor:
  C-States: Disabled
  Turbo mode: Disabled
  Hyper-Threading: Disabled (recommended for latency and message-rate-sensitive applications)
12. Mellanox Technologies - Connect. Accelerate. Outperform.
Performance Tuning Guidelines for Mellanox Network Adapters
Revision 1.16
www.mellanox.com

NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE), ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
13.
• D: The distance between the NUMA node closest to the physical PCI slot where the NIC is installed and the NUMA node where processor core C resides.
We recommend using only cores that have D=0, implying they are within the NUMA node closest to the NIC.

4.5 Tuning for Windows 2008 R2
Please use the perf_tuning.exe tool that comes with the MLNX_VPI driver. It will recognize the adapter's NUMA node automatically and set the relevant registry keys accordingly.
This tool is based on information retrieved from a tuning document that can be found here:
http://msdn.microsoft.com/en-us/windows/hardware/gg463392.aspx
The following are the auto-tuning options:
• Optimized for single port: use when most of the traffic is utilizing one of the NIC ports.
  perf_tuning.exe -s -c1 <connection name>
• Optimized for dual port: use when most of the traffic is utilizing both of the NIC ports.
  perf_tuning.exe -d -c1 <first connection name> -c2 <second connection name>
• Optimized for IP routing (RFC2544):
  perf_tuning.exe -f -c1 <first connection name> -c2 <second connection name>
• For multicast streams tuning:
  perf_tuning.exe -mc -c1 <first connection name> -c2 <second connection name>
• For single connection applications:
  perf_tuning.exe -st -c1 <first connection name>
Auto-tuning can be performed using the User Interface as well.
14. This document describes important tuning parameters and settings that can improve performance for Mellanox drivers. Each setting, along with its potential effect, is described to help in making an informed judgment concerning its relevance to the user's system, the system workload, and the performance goals.
Tuning is relevant for both Ethernet and IPoIB network interfaces.

1.1 Relevant Mellanox Drivers
The tuning guidelines described in this document apply to the following Mellanox Software drivers:
• On Linux: Mellanox Ethernet Driver MLNX_EN for Linux version 2.x and later
• On Linux: Mellanox VPI Driver MLNX_OFED for Linux version 2.x and later
• On Windows: Mellanox OFED for Windows MLNX_VPI version 4.80 and later

2 General System Configurations
The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.

2.1 PCI Express (PCIe) Capabilities
Table 2: Recommended PCIe Configuration
  PCIe Generation: 3.0
  Speed: 8GT/s
  Width: x8 or x16
  Max Payload size: 256
  Max Read Request: 4096
Note: For ConnectX-3 based network adapters (40GbE Ethernet adapters), it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.
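To verify that the adapter actually negotiated the recommended link, the PCIe status can be read with lspci; a sketch, where 04:00.0 is a placeholder bus address:
  lspci | grep -i mellanox                                    # find the adapter's bus address
  lspci -s 04:00.0 -vvv | grep -E "LnkCap|LnkSta|MaxPayload|MaxReadReq"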
15.
  CPU frequency select: Max performance
Memory:
  Memory speed: Max performance
  Memory channel mode: Independent
  Node Interleaving: Disabled / NUMA
  Channel Interleaving: Enabled
  Thermal Mode: Performance

2.4.5 AMD Processors
The following table displays the recommended BIOS settings in machines with AMD-based processors.
BIOS Option: Values
General:
  Operating Mode / Power profile: Maximum Performance
Processor:
  C-States: Disabled
  Turbo mode: Disabled
  HPC Optimizations: Enabled
  CPU frequency select: Max performance
Memory:
  Memory speed: Max performance
  Memory channel mode: Independent
  Node Interleaving: Disabled / NUMA
  Channel Interleaving: Enabled
  Thermal Mode: Performance

Note: Hyper-Threading can increase the message rate for multi-process applications by providing more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when Hyper-Threading is enabled.

3 Performance Tuning for Linux
3.1 IRQ Affinity
The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores.
16. If this occurs, change the hash type to XOR, which is more optimal for a small number of connections:
  ethtool --set-priv-flags <interface> mlx4_rss_xor_hash_function on
Note: RPS does not work when using the XOR hash type.

3.7.2 ConnectX-3 / ConnectX-3 Pro Optimized Steering
As of MLNX_OFED 2.3-1.0.0, ConnectX-3 / ConnectX-3 Pro adapter cards can be configured for optimized steering mode.
Note: Optimized steering mode may improve the Ethernet packet rate; however, sideband management is not functional in this mode.
To use this optimization:
1. Edit /etc/modprobe.d/mlnx.conf:
  options mlx4_core log_num_mgm_entry_size=-7
2. Restart the driver.

3.8 Receive Packet Steering (RPS)
Receive Packet Steering (RPS) is a software implementation of RSS, called later in the datapath. Contrary to RSS, which selects the queue and consequently the CPU that runs the hardware interrupt handler, RPS selects the CPU to perform protocol processing on top of the interrupt handler.
RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (ON by default for SMP). Even when compiled in, RPS remains disabled until explicitly configured. The list of CPUs to which RPS may forward traffic can be configured for each receive queue using a sysfs file entry:
  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
For interfaces that have a single queue, or whose number of queues is less than the number of NUMA node cores, it is recommended to configure the rps_cpus mask to the device NUMA node core list to gain better parallelism of multi-queue interfaces.
17. 1.14 | January 2015
• Added section System Monitoring and Profilers
1.13 | September 2014
• Removed section Multi-Thread Applications
1.13 | August 2014
• Added the following sections:
  • Performance Tuning for iSER
• Updated the following sections:
  • Tuning the Network Adapter
  • Performance Testing
  • IRQ Affinity Hints
  • Multi-Threaded IP Forwarding Tuning
  • ConnectX-3 / ConnectX-3 Pro Optimized Steering
1.12 | May 2014
• Added the following sections:
  • IRQ Affinity Hints
  • Receive Side Scaling (RSS) and its subsections
  • Receive Packet Steering (RPS)
• Updated the following sections:
  • IRQ Affinity Configuration
  • OS Controlled Power Management
  • Setting the Scaling Governor
  • Verbs Applications Optimization
1.11 | March 2014
• Updated Tuning the Network Adapter for Improved IPv4 Traffic Performance
1.11 | February 2014
• Updated the following sections:
  • Relevant Mellanox Drivers
  • Intel Sandy Bridge Processors / Ivy Bridge Processors
  • Setting the Scaling Governor
  • Kernel Idle Loop Tuning
  • Interrupt Moderation Tuning
  • Tuning for Intel Sandy Bridge Platform / Ivy Bridge Processors
  • Improving Application Performance on Remote Sandy Bridge Node
  • Auto Tuning Utility
  • Multi-Threaded IP Forwarding Tuning
  • Memory Allocation Type
  • Reducing Memory Footprint
  • Polling Configurations
18. 3.4.1 Tuning for Intel Sandy Bridge Platform / Ivy Bridge Processors
3.4.2 Tuning for AMD Architecture
3.4.3 Recognizing NUMA Node Cores
3.4.4 Running an Application on a Certain NUMA Node
3.5 Interrupt Moderation Tuning
3.6 Multi-Threaded IP Forwarding Tuning
3.7 Receive Side Scaling (RSS)
3.7.1 RSS Hash tuning
3.7.2 ConnectX-3 / ConnectX-3 Pro Optimized Steering
3.8 Receive Packet Steering (RPS)
3.9 Tuning with sysctl
3.9.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
3.9.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
3.9.3 Preserving Your sysctl Settings after a Reboot
3.10 Verbs Applications Optimization
3.10.1 Single Thread Applications
3.11 Performance Tuning for iSER
3.12 Tuning VMA Parameters
3.12.1 Memory Allocation Type
3.12.2 Reducing Memory Footprint
3.12.3 Polling Configurations
3.12.4 Handling Single-Threaded Processes
19. Command line example:
• Receiver:
  ntttcp_x64.exe -r -t 15 -m 16,*,<interface IP>
• Sender:
  ntttcp_x64.exe -s -t 15 -m 16,*,<same IP as above>
Note: Running the commands above with the "-a 8" parameter may result in a performance improvement due to the higher number of overlapped IOs allowed.
More details and the tool binaries can be found here: http://gallery.technet.microsoft.com/NTttcp-Version-528-Now-f8b12769
20. The following script can be used to separate each adapter's IRQs to a different set of cores:
  set_irq_affinity_cpulist.sh <cpu list> <interface>
<cpu list> can be either a comma-separated list of single core numbers (0,1,2,3) or a core group (0-3).
Example: If the system has 2 adapters on the same NUMA node (cores 0-7), each with 2 interfaces, run the following:
  /etc/init.d/irqbalance stop
  set_irq_affinity_cpulist.sh 0,1 eth2
  set_irq_affinity_cpulist.sh 2,3 eth3
  set_irq_affinity_cpulist.sh 4,5 eth4
  set_irq_affinity_cpulist.sh 6,7 eth5

3.2 ConnectX-4 100GbE Tuning
Line rate performance with ConnectX-4 100GbE can be achieved by most operating systems without special tuning. The number of streams needed varies from 4 to 16, depending on the system strength and the OS/kernel.
In some Linux distributions, Hardware LRO (HW LRO) must be enabled to reach the required line rate performance.
To enable HW LRO:
  ethtool --set-priv-flags <interface> hw_lro on (default: off)
In case tx-nocache-copy is enabled (this is the case for some kernels, e.g. kernel 3.10, which is the default for RH7.0), tx-nocache-copy should be disabled.
To disable tx-nocache-copy:
  ethtool -K <interface> tx-nocache-copy off

3.3 Power Management Tuning
3.3.1 OS Controlled Power Management
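Before and after changing the ConnectX-4 related flags above, their current state can be inspected; a sketch with eth2 as a placeholder interface name:
  ethtool --show-priv-flags eth2 | grep -i hw_lro          # HW LRO state (where the flag is supported)
  ethtool -k eth2 | grep tx-nocache-copy                   # tx-nocache-copy state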
21. For further information, please refer to the section Tuning the Network Adapter.

4.5.1 Tuning for Multiple Adapters
When optimizing the system performance for using more than one adapter, it is recommended to separate each adapter's core utilization so there will be no interleaving between interfaces.
Please use the perf_tuning.exe manual option to separate each adapter's cores to a different set of cores:
  perf_tuning.exe -m -c1 <first connection name> -b <base RSS processor number> -n <number of RSS processors>
Example: If the system has 2 adapters on the same NUMA node (cores 0-7), each with 2 interfaces, run the following:
  perf_tuning.exe -m -c1 <first connection name> -b 0 -n 2
  perf_tuning.exe -m -c1 <second connection name> -b 2 -n 2
  perf_tuning.exe -m -c1 <third connection name> -b 4 -n 2
  perf_tuning.exe -m -c1 <fourth connection name> -b 6 -n 2

4.5.2 Recognizing NUMA Node Cores
To recognize NUMA node cores, perform the following:
1. Open the Task Manager.
2. Go to the Processes tab.
3. Right-click on one of the processes and choose "Set affinity". A table of the available cores and NUMA nodes will be displayed.

4.6 Performance Testing
The preferred tool for performance testing is NTttcp. The tool was developed by Microsoft and is well optimized for Windows operating systems.
22. 1.5 | May 2012
• Added the following sections:
  • Memory Configuration
  • Tuning for IPoIB/EoIB Interfaces
  • Kernel Idle Loop Tuning
• Updated the following sections:
  • IRQ Affinity Configuration
  • Recommended BIOS Settings
  • Tuning for Multiple Adapters
  • Tuning for Windows 2008 R2
1.4 | April 2012
• Added Tuning for NUMA Architecture sections
• Rearranged sections in chapter 3
1.3 | March 2012
• Added new section Tuning Power Management
1.2 | January 2012
• Updated versions of adapters to make the document more generic
• Merged the sections on BIOS Power Management Settings and Intel Hyper-Threading Technology into the new section Recommended BIOS Settings
• Added sections to Performance Tuning for Linux
• Added section Tuning for Windows 2008 R2
• Added new chapter Tuning VMA Parameters
1.1
• Updated the following sections:
  • Intel Hyper-Threading Technology
  • Tuning the Network Adapter for Improved IPv4 Traffic Performance
  • Example Script for Setting Interrupt Affinity
• Added new section Tuning IP Forwarding

1 Introduction
Depending on the application of the user's system, it may be necessary to modify the default configuration of network adapters based on the ConnectX adapters.
23. • Set VMA_MEM_ALLOC_TYPE. When set, VMA attempts to allocate data buffers as huge pages.

3.12.2 Reducing Memory Footprint
A smaller memory footprint reduces cache misses, thereby improving performance. Configure the following parameters to reduce the memory footprint:
• If your application uses small messages, reduce the VMA MTU using:
  VMA_MTU=200
• The default number of RX buffers is 200K. Reduce the number of RX buffers to 30-60K using:
  VMA_RX_BUFS=30000
  Note: This value must not be less than the value of VMA_RX_WRE times the number of offloaded interfaces.
• The same can be done for TX buffers by changing VMA_TX_BUFS and VMA_TX_WRE.

3.12.3 Polling Configurations
You can improve performance by setting the following polling configurations:
• Increase the number of times to unsuccessfully poll an Rx for VMA packets before going to sleep, using:
  VMA_RX_POLL=200000
  or infinite polling, using:
  VMA_RX_POLL=-1
  This setting is recommended when Rx path latency is critical and CPU usage is not critical.
• Increase the duration in micro-seconds (usec) in which to poll the hardware on the Rx path before blocking for an interrupt, using:
  VMA_SELECT_POLL=100000
  or infinite polling, using:
  VMA_SELECT_POLL=-1
24. Please note that modifying the registry incorrectly might lead to serious problems, including the loss of data and system hang, and you may need to reinstall Windows. As such, it is recommended to back up the registry on your system before implementing the recommendations included in this document. If the modifications you apply lead to serious problems, you will be able to restore the original registry state. For more details about backing up and restoring the registry, please visit www.microsoft.com.

4.1 Tuning the Network Adapter
To improve the network adapter performance, activate the performance tuning tool as follows:
1. Select Start --> Control Panel.
2. Open Network Connections.
3. Right-click on one of the entries "Mellanox ConnectX Ethernet Adapter" and select Properties.
4. Select the Performance tab.
5. Choose one of the Tuning Scenarios:
  • Single port traffic: improves performance when running single port traffic each time.
  • Dual port traffic: improves performance when running on both ports simultaneously.
  • Forwarding traffic: improves performance when running routing scenarios (for example, via IXIA).
  • Multicast traffic (available in Mellanox WinOF v4.2 and above): improves performance when the main traffic runs on multicast.
  • Single stream traffic (available in Mellanox WinOF v4.2 and above): optimizes tuning for applications with a single connection.
  • Default: balanced tuning. Applies default values to various factors which may affect performance.
25. 4.2.3 Running an Application on a Certain NUMA Node
In order to run an application on a certain NUMA node, the process affinity should be set, either on the command line or with an external tool.
For example, if the adapter's NUMA node is 1 and NUMA node 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.
To run an application, run the following command:
  start /affinity 0xff00 nd_write_bw -S/-C <ip>

4.3 Tuning for Windows Server 2012 / 2012 R2
4.3.1 Recognizing NUMA Node Cores
To recognize NUMA node cores, perform the following:
1. Open the Task Manager.
2. Go to the Performance tab.
3. Choose CPU.
4. Right-click on the graph and choose "Change graph to" --> "Logical processors".
Hovering over a CPU will display its NUMA node.

4.4 Finding the Closest NUMA Node to the NIC
Note: BIOS support for ACPI SLIT must be enabled.
To find the closest NUMA node to the NIC, perform the following:
1. Open a PowerShell window.
2. Execute:
  Get-NetAdapterRss -name "<Connection Name>"
where <Connection Name> is the name assigned to the desired interface, e.g. "Ethernet 1".
Expected output: the RssProcessorArray field displays the closest NUMA node. The array should have entries of the form G:C/D, where:
• G: The processor group
• C: The processor core ID
26. 3.13 Performance Tuning for Virtualized Environment
3.13.1 Tuning for Hypervisor
4 Performance Tuning for Windows
4.1 Tuning the Network Adapter
4.2 Tuning for NUMA Architecture
4.2.1 Tuning for Intel Microarchitecture Code name Sandy Bridge / Ivy Bridge Platforms
4.2.2 Tuning for AMD Architecture
4.2.3 Running an Application on a Certain NUMA Node
4.3 Tuning for Windows Server 2012 / 2012 R2
4.3.1 Recognizing NUMA Node Cores
4.4 Finding the Closest NUMA Node to the NIC
4.5 Tuning for Windows 2008 R2
4.5.1 Tuning for Multiple Adapters
4.5.2 Recognizing NUMA Node Cores
4.6 Performance Testing

List of Tables
Table 1: Document Revision History
Table 2: Recommended PCIe Configuration

Document Revision History
Table 1: Document Revision History
Revision | Date | Description
1.16 | November 2015
• Added section ConnectX-4 100GbE Tuning
1.15 | May 2015
• Added section Intel Haswell Processors
27. Some operating systems can override the BIOS power management configuration and enable C-states by default, which results in higher latency. There are several options to resolve this:
• When using MLNX_OFED 2.2-x.x.x or higher, Ethernet interfaces can be configured to automatically request low latency from the OS. This can be done using ethtool:
  ethtool --set-priv-flags <interface> pm_qos_request_low_latency on (default: off)
  This improves latency and packet loss, while power consumption can remain low when traffic is idle.
• When using IPoIB or an older driver, it is possible to force high power via kernel parameters:
  a. Edit the /boot/grub/grub.conf file or any other bootloader configuration file.
  b. Add the following kernel parameters to the bootloader command:
    intel_idle.max_cstate=0 processor.max_cstate=1
  c. Reboot the system.
  Example:
    title RH6.4x64
    root (hd0,0)
    kernel /vmlinuz-RH6.4x64-2.6.32-358.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1
• Temporarily request low CPU latency from user mode. This can be done by a program that opens /dev/cpu_dma_latency and writes the required latency, while keeping the file descriptor open (see the sketch below).
For further information, please refer to the kernel document Linux/Documentation/power/pm_qos_interface.txt.
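A minimal shell sketch of the user-mode option above: hold /dev/cpu_dma_latency open with a 32-bit zero written to it (root is required, and the request is dropped as soon as the descriptor is closed):
  exec 3> /dev/cpu_dma_latency          # keep fd 3 open for the duration of the workload
  printf '\x00\x00\x00\x00' >&3         # value 0 = request the lowest possible latency
  # ... run the latency-sensitive workload here ...
  exec 3>&-                             # close the fd to release the request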
28. 3.3.2 Checking Core Frequency
Check that the output CPU frequency for each core is equal to the maximum supported and that all core frequencies are consistent.
• Check the maximum supported CPU frequency:
  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
• Check that the core frequencies are consistent:
  cat /proc/cpuinfo | grep "cpu MHz"
• Check that the output frequencies are the same as the maximum supported. If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in section Recommended BIOS Settings to verify that the power state is disabled.
• Check the current CPU frequency to see whether it is configured to the maximum available frequency:
  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

3.3.3 Setting the Scaling Governor
If the CPU frequency scaling modules are loaded, CPU scaling is supported and you can improve performance by setting the scaling mode to performance.
To set the scaling mode to performance, use this command for every cpu (see the loop sketch below):
  echo performance > /sys/devices/system/cpu/cpu<cpu number>/cpufreq/scaling_governor

3.3.4 Kernel Idle Loop Tuning
The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wakeup time but may result in higher power consumption.
To tune the kernel idle loop, set the following options in the /etc/modprobe.d/mlnx.conf file:
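A sketch that sets the governor on every core and then re-checks the frequencies, following the steps above:
  for gov in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
      echo performance > ${gov}
  done
  grep "cpu MHz" /proc/cpuinfo | sort -u      # all cores should now report the maximum frequency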
29.   > ethtool -c eth1
  Coalesce parameters for eth1:
  Adaptive RX: off  TX: off
  pkt-rate-low: 100000
  pkt-rate-high: 400000
  rx-usecs: 0
  rx-frames: 0
  rx-usecs-irq: 0
  rx-frames-irq: 0
Note: When working with a 1GbE network, it is recommended to disable the interrupt moderation in order to get the full 1GbE throughput. To do so, run:
  ethtool -C eth11 adaptive-rx off rx-usecs 0 rx-frames 0

3.6 Multi-Threaded IP Forwarding Tuning
To optimize NIC usage for IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
  options mlx4_en inline_thold=0
  • For MLNX_OFED 2.3-1.0.0:
    options mlx4_core log_num_mgm_entry_size=-7
  • For MLNX_OFED 2.2-1.x.x and lower:
    options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
  set_irq_affinity_bynode.sh <numa node> <interface>
4. Forwarding from one interface to another:
  set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
5. Disable adaptive interrupt moderation and set the static values, using:
  ethtool -C <interface> adaptive-rx off rx-usecs 0 tx-frames 64

3.7 Receive Side Scaling (RSS)
3.7.1 RSS Hash tuning
The default RSS hash calculated by the adapter is a Toeplitz function. On some workloads it is possible that a small number of connections will not be distributed optimally across receive queues.
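Putting the IP forwarding steps listed above together for one NUMA node and two interfaces; a sketch in which the node number, interface names and MLNX_OFED branch are placeholders:
  echo "options mlx4_en inline_thold=0" >> /etc/modprobe.d/mlx4.conf
  echo "options mlx4_core log_num_mgm_entry_size=-7" >> /etc/modprobe.d/mlx4.conf   # MLNX_OFED 2.3-1.0.0; a driver restart is needed for this to take effect
  set_irq_affinity_bynode.sh 0 eth2 eth3
  ethtool -C eth2 adaptive-rx off rx-usecs 0 tx-frames 64
  ethtool -C eth3 adaptive-rx off rx-usecs 0 tx-frames 64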
30. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.
The following command turns off the IRQ balancer:
  > /etc/init.d/irqbalance stop
The following command assigns the affinity of a single interrupt vector:
  > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity
Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

3.1.1 IRQ Affinity Hints
As of MLNX_OFED 2.2-1.x.x, the driver uses an affinity hints API that allows the irqbalance service to set the affinity automatically. On some kernels, the irqbalance service needs to be restarted in order for these changes to take effect.
To check whether the affinity hints are working properly, run the following command at least 10 seconds after the interface is up:
  show_irq_affinity.sh <interface>
If all the rows are 0 or 00000, it means the hints did not work and irqbalance needs to be restarted.

3.1.2 IRQ Affinity Configuration
Note: It is recommended to set each IRQ to a different core.
For optimal functionality, it is recommended to download the latest tuning scripts from the web:
  cd /tmp
  wget http://www.mellanox.com/related-docs/prod_software/mlnx_irq_affinity.tgz
  tar -xzf /tmp/mlnx_irq_affinity.tgz --directory=/usr/sbin/ --overwrite
For systems that have Sandy Bridge, Ivy Bridge or AMD CPUs, set the IRQ affinity to the adapter's NUMA node:
• For optimizing single-port traffic, run:
  set_irq_affinity_bynode.sh <numa node> <interface>
• For optimizing dual-port traffic, run:
  set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
31. 2.2 Memory Configuration
For high performance, it is recommended to use the highest memory speed with the fewest DIMMs, and to populate all memory channels for every CPU installed.
For further information, please refer to your vendor's memory configuration instructions or the memory configuration tool available online.

2.3 System Monitoring and Profilers
It is recommended to disable system profilers and/or monitoring tools while running performance benchmarks. System profilers and/or monitoring tools use the host's resources, hence running them in parallel to benchmark jobs may affect the performance in various degrees, based on the traffic type and/or pattern and the nature of the benchmark. In order to measure optimal performance, make sure to stop all system profilers and monitoring tools (such as sysstat, vmstat, iostat, mpstat, dstat, etc.) before running any benchmark tool.

2.4 Recommended BIOS Settings
Note: These performance optimizations may result in higher power consumption.

2.4.1 General
Set the BIOS power management to Maximum Performance.

2.4.2 Intel Haswell Processors
The following table displays the recommended BIOS settings in machines with Intel code name Haswell based processors.
BIOS Option: Values
General:
  Operating Mode / Power profile: Maximum Performance
Processor:
32. that have Sandy Bridge Ivy Bridge or AMD CPUs set the IRQ affinity to the adapter s NUMA node e For optimizing single port traffic run set irq affinity bynode sh numa node interface e For optimizing dual port traffic run set irq affinity bynode sh numa node lt interfacel gt lt interface2 gt 14 Performance Tuning Guidelines for Mellanox Network Adapters Revision 1 16 e To show the current irq affinity settings run show_irq_affinity sh lt interface gt 3 1 3 Auto Tuning Utility MLNX_OFED 2 x introduces a new affinity tool called n1nx affinity This tool can automatically adjust your affinity settings for each network interface according to the system architecture This tool will disable the IRQ balancer service from running at boot time To disable it immediately need to stop the service manually service irqbalance stop Usage e Start service irqbalance stop mme e SESTE e Stop mlnx affinity stop service irqbalance start e Restart mlnx affinity restart minx affinity can also be started by driver load unload gt To enable minx affinity by default e Add the line below to the etc infiniband openib conf file RUNEAT ERNE RY eS TUNE Raves Note This tool is not a service it run once and exits mw 3 1 4 Tuning for Multiple Adapters When optimizing the system performance for using more than one adapter It is recommended to separate the adapter s core ut
33. 6. Click the "Run Tuning" button.
Clicking the Run Tuning button will change several registry entries (described below) and will check for system services that might decrease network performance. It will also generate a log including the applied changes.
Users can view this log to restore the previous values. The log path is:
  %HOMEDRIVE%\Windows\System32\LogFiles\PerformanceTunning.log
This tuning is needed on one adapter only, and only once after the installation, as long as these entries are not changed directly in the registry or by some other installation or script.

4.2 Tuning for NUMA Architecture
4.2.1 Tuning for Intel Microarchitecture Code name Sandy Bridge / Ivy Bridge Platforms
The Intel Sandy Bridge processor has an integrated PCI Express controller. Thus every PCIe adapter is connected directly to a NUMA node.
On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected.

4.2.2 Tuning for AMD Architecture
On AMD architecture there is a difference between a 2-socket system and a 4-socket system:
• With a 2-socket system, the PCIe adapter will be connected to socket 0 (nodes 0,1).
• With a 4-socket system, the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).
34. In systems which do not support SLIT, the following environment variable should be applied:
  MLX4_LOCAL_CPUS=0x<bit mask of local NUMA node>
Example for a local Sandy Bridge node whose cores are 0-7:
• When using ConnectX-3 adapter cards:
  MLX4_LOCAL_CPUS=0xff
• When using Connect-IB adapter cards:
  MLX5_LOCAL_CPUS=0xff
An additional modification can be applied to impact this feature, by changing the following environment variable:
  MLX4_STALL_NUM_LOOP=<integer> (default: 400)
Note: The default value is optimized for most applications. However, several applications might benefit from increasing or decreasing this value.

3.4.2 Tuning for AMD Architecture
On AMD architecture there is a difference between a 2-socket system and a 4-socket system:
• With a 2-socket system, the PCIe adapter will be connected to socket 0 (nodes 0,1).
• With a 4-socket system, the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).

3.4.3 Recognizing NUMA Node Cores
To recognize NUMA node cores, run the following commands:
  cat /sys/devices/system/node/node<X>/cpulist
  cat /sys/devices/system/node/node<X>/cpumap
Example:
  cat /sys/devices/system/node/node1/cpulist
  1,3,5,7,9,11,13,15
  cat /sys/devices/system/node/node1/cpumap
  0000aaaa
