Building your own InfiniBand FDR Cluster

Contents

1. [Figure, continued: 1944-node fat-tree diagram. IS5030 leaf switch: 18 links to compute nodes, 6 x 4X QDR uplinks.]

   Cabling Tips
   - Label all cables between the leaf and the core switches.
   - Failure to do so will hamper best efforts to isolate, identify, and troubleshoot any bad cables when the cluster is brought up.
   - Sample cable labeling:
       Server to L1 switch: Node Name, Leaf/Slot/Port at one end; Leaf/Slot/Port, Node Name at the other
       Switch to switch:    Leaf/Slot/Port, Spine/Slot/Port at one end; Spine/Slot/Port, Leaf/Slot/Port at the other

   Cluster Configurator
   - Cluster configuration tool available from Mellanox: http://calc.mellanox.com/clusterconfig
   - The Mellanox InfiniBand Configurator is an online tool to configure clusters based on a fat-tree topology with two levels of switch systems.
2. [Figure: The world of ibdiags. Tools shown by scope (Single Node, Src/Dest Pair, Subnet): ibstat, ibstatus, ibaddr, ibroute, sminfo, smpdump, smpquery, perfquery, ibcheckport, ibchecknode, ibcheckerrs, ibportstate, ibcheckportwidth, ibcheckportstate, ibsysstat, ibtracert, ibping, saquery, ibnetdiscover, ibdiscover.pl, ibchecknet, ibswitches, ibhosts, ibnodes, ibcheckwidth, ibclearerrors, ibclearcounters.]

   The world of ibdiags: Single Node Scope
   - ibstat: show host adapters status
   - ibstatus: similar to ibstat but implemented as a script
   - ibaddr: shows the LID range and default GID of the target (default is the local port)
   - ibroute: display unicast and multicast forwarding tables of switches
   - sminfo: query the SMInfo attribute on a node
   - smpdump: simple solicited SMP query tool; output is a hex dump
   - smpquery: formatted SMP query tool
   - perfquery: dump (and optionally clear) performance/error counters of the destination port
   - ibch
3. Tutorial: Building InfiniBand Clusters with the Open Fabrics Software Stack
   HPC Advisory Council Lugano Switzerland Workshop, March 13-15, 2012
   Todd Wilde, Director of Technical Computing and HPC, Mellanox

   InfiniBand Overview

   The InfiniBand Architecture
   - Industry standard defined by the InfiniBand Trade Association
   - Defines a System Area Network architecture
       - Comprehensive specification: from physical to applications
   - Architecture supports:
       - Host Channel Adapters (HCA)
       - Target Channel Adapters (TCA)
       - Switches
       - Routers
   - Facilitated HW design for:
       - Low latency / high bandwidth
       - Transport offload
   [Figure: InfiniBand subnet with processor nodes, HCAs, switches, consoles, and gateways to Ethernet and Fibre Channel subsystems.]

   InfiniBand Feature Highlights
   - Serial high-bandwidth links
       - 56Gb/s HCA and switch links
       - Over 12 GB/s data throughput
   - Ultra low latency
       - Under 1 us application to application
   - Reliable, lossless, self-managing fabric
       - Link-level flow control
       - Congestion control to prevent HOL blocking
   - Quality of Service
       - Independent I/O channels at the adapter level
       - Virtual Lanes at the link level
   - Cluster scalability/flexibility
       - Parallel routes between end nodes
       - Multiple cluster topologies possible
   - Simp
4. (OSU latency output, continued; latency in us) 8.76 13.35 18.46 30.28 52.84 99.88 191.46 375.02 748.70 1481.48

   OFED InfiniBand Diagnostic Tools

   GUIDs: Globally Unique Device IDs
   - All HCAs and switches ship with GUIDs burned into the non-volatile memory of the device.
   - The GUID is 64 bits (8 bytes) long.
   - GUIDs are similar to MAC addresses in that they are globally unique and assigned by the manufacturer.
   - There are three types of GUIDs:
       - A single NODE GUID for every switch or HCA device
       - A PORT GUID for every InfiniBand terminating port, e.g. Port 1 and Port 2 of an HCA, or Port 0 (the management port) of a switch
       - A SYSTEM GUID, which allows a group of devices to be identified as a single unit, e.g. a large director-class switch has a single SYSTEM GUID
   - An example GUID: 0x0002c90200123456

   LIDs: Locally assigned IDs
   - All HCA ports, and all Port 0 management ports on a switch, are assigned LIDs.
   - LIDs are dynamically assigned by the Subnet Manager and can change from power cycle to power cycle (but usually do not).
   - The LID is 16 bits long: 64K LIDs (48K unicast and 16K multicast).
   - LIDs are used to route InfiniBand packets through the subnet.
   - Example unicast LID: 0x1000, or 4096

   [Figure: The world of ibdiags (tool overview by scope).]
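The LID and GUID sizes in item 4 can be illustrated with a short Python sketch. The exact range boundaries (0x0001 to 0xBFFF for unicast, 0xC000 to 0xFFFE for multicast, with 0x0000 reserved and 0xFFFF permissive) are an assumption based on the usual InfiniBand addressing convention; the slides only state "48K unicast and 16K multicast". The function names are hypothetical.

```python
# Sketch: classify 16-bit LIDs and format 64-bit GUIDs.
# Range boundaries are assumptions (usual InfiniBand convention),
# not values quoted from the slides.

UNICAST_LAST = 0xBFFF    # unicast LIDs: 0x0001..0xBFFF (about 48K)
MULTICAST_LAST = 0xFFFE  # multicast LIDs: 0xC000..0xFFFE (about 16K)


def classify_lid(lid: int) -> str:
    """Return 'reserved', 'unicast', 'multicast', or 'permissive'."""
    if not 0 <= lid <= 0xFFFF:
        raise ValueError("a LID must fit in 16 bits")
    if lid == 0:
        return "reserved"
    if lid <= UNICAST_LAST:
        return "unicast"
    if lid <= MULTICAST_LAST:
        return "multicast"
    return "permissive"


def format_guid(guid: int) -> str:
    """Render a GUID as 16 hex digits (64 bits), zero-padded."""
    if not 0 <= guid < 2**64:
        raise ValueError("a GUID must fit in 64 bits")
    return f"0x{guid:016x}"


if __name__ == "__main__":
    print(classify_lid(0x1000))              # example unicast LID from the slide
    print(classify_lid(0xC000))              # first multicast LID
    print(format_guid(0x2C90200123456))      # example GUID from the slide
```

The zero-padding in `format_guid` matters because tools such as ibnetdiscover often print GUIDs without leading zeros, so the same GUID can appear in two textual forms.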
5. caguid=0x2c902002312a8
   Ca 2 "H-0002c902002312a8"   # "mtilab47 HCA-1"
   [1](2c902002312a9) "S-000b8cffff004207"[6]   # lid 12 lmc 0 "MT47396 Infiniscale-III Mellanox Technologies" lid 9 4xDDR

   vendid=0x2c9
   devid=0x6274
   sysimgguid=0x2c90200230e57
   caguid=0x2c90200230e54
   Ca 1 "H-0002c90200230e54"   # "mtilab55 HCA-1"
   [1](2c90200230e55) "S-000b8cffff004207"[5]   # lid 22 lmc 0 "MT47396 Infiniscale-III Mellanox Technologies" lid 9 4xDDR

   ibutils: ibdiagnet / ibdiagpath
   Integrated Cluster Utilities

   ibutils in a nutshell
   - ibdiagnet
       - Examines all the paths in the network
       - Looks for cross-path issues
       - Network balancing
       - Covers all L2 issues on links
   - The latest MLNX_OFED stack includes ibdiagnet2, which includes performance optimizations and many advanced features
   - ibdiagpath
       - Source-to-destination path-based analysis
       - Covers all L2 issues on the path
       - Includes extensive link-level analysis

   ibdiagnet functionality
   - Topology
       - Info: dump topology in "topo", "lst" and "ibnetdiscover" formats
       - Info: optionally report on cable information (i.e. vendor, cable length, part number)
       - Error: duplicate GUIDs
       - Error: connectivity mismatch to reference topology
       - Warn: link speed/width change from reference topology
       - Error: optional report on any port below a given speed/width
   - SM
       - Info: all active SMs, their status and priority
6.     - Error: missing or multiple masters
       - Error: illegal LID: 0, duplicated, not meeting LMC
       - Error: invalid link parameters: OpVLs, MTU
       - Error: link width/speed not matching maximal supported
   - Error Counters
       - Info: a full dump of all IB port counters of the entire subnet
       - Error: error counters over some limit (user controlled)
       - Error: error counters increasing during the run

   ibdiagnet functionality
   - Routing
       - Info: histogram of hops from CA to CA
       - Info: histogram of number of CA-to-CA paths on every port
       - Info: multicast groups and their members (including sender-only)
       - Error: no unicast route between every CA and every other CA
       - Error: (on request) no unicast route between every CA/SW and every other CA/SW
       - Error: credit loops found (optionally including multicast)
       - Error: multicast routing loops, disconnects, garbage
   - Partitions
       - Info: all partitions, ports and membership status
       - Error: mismatching host partitions and attached switches' ingress port tables
   - IPoIB
       - Info: available broadcast domains, their parameters and member end-points
       - Warn: sub-optimal domain parameters (rate too small, rate not met by some nodes)
   - Bit Error Check
       - Error: given some threshold and time between samples

   ibdiagnet functionality
   - QoS
       - Info: admissible SLs on the path (including the details where they block, etc.)
       - Info: PathRecord for every SL (optionally limit by give
7. ibdiagpath -t `pwd`/network.topo -n H-3
   -I- PM Counters Info
   -W- SL1-1/U1/P3 lid=0x0002 guid=0x0002c90000000201 dev=47396
       Performance Monitor counter : Value
       port_rcv_errors : 0xed4 (Increase by 7 during ibdiagpath scan.)

   ibdiagpath -t `pwd`/network.topo -n H-23
   -I- PM Counters Info
   -W- SL1-2/U1/P11 lid=0x0008 guid=0x0002c90000000207 dev=47396
       Performance Monitor counter : Value
       port_rcv_errors : 0x603 (Increase by 8 during ibdiagpath scan.)

   Advanced ibutil Topics

   Subnet Manager faults
   The Subnet Manager (SM) is mandatory for setting up port ID, links and routes.

   Subnet Manager Reporting
   - One and only one "master" SM
       - Error: when there is no master or more than one master
       - Report: all master and standby SM ports
   - The SM is responsible for configuring links
       - Error: when Neighbor MTU is not correctly set by the SM
       - Error: if operational VLs do not match the other side of the link
   - Packet routes are configured by the SM
       - Error: when not all nodes are assigned a unique address (LID)
       - Error: if routes from every node to every other node are not set
       - Error: if multicast routes for each member o
8. [Figure, continued: 324-node full-CBB fat tree. IS5030 leaf switches L1-1, L1-2, ...; 18 links to compute nodes per leaf switch; uplink labels "2 x 4X QDR Uplinks" and "1 x 4X QDR Uplinks".]

   1944 Node Full CBB Using 36- and 648-Port Switches
   [Figure: 1944-node fat-tree diagram; no further text recoverable.]
9. has:12 switches and:128 HCAs
   -I- Checking credit loops
   -I- Analyzing Fabric for Credit Loops 1 SLs, 1 VLs used.
   -I- no credit loops found

   ibdiagnet run: a good case
   -I- mgid-mlid-HCAs table
   mgid                                  | mlid   | PKey   | QKey       | MTU  | rate   | HCAs
   0xff12401b80010000:0x00000000ffffffff | 0xc000 | 0x8001 | 0x00000b1b | 2048 | 20Gbps | 128
   0xff12401b80020000:0x00000000ffffffff | 0xc001 | 0x8002 | 0x00000b1b | 2048 | 20Gbps | 128
   0xff12401b80030000:0x00000000ffffffff | 0xc002 | 0x8003 | 0x00000b1b | 2048 | 20Gbps | 128

   -I- Stages Status Report:
   STAGE                        Errors  Warnings
   Bad GUIDs/LIDs Check         0       0
   Link State Active Check      0       0
   General Devices Info Report  0       0
   Performance Counters Report  0       0
   Partitions Check             0       0
   IPoIB Subnets Check          0       0
   Subnet Manager Check         0       0
   Fabric Qualities Report      0       0
   Credit Loops Check           0       0
   Multicast Groups Report      0       0
   Please see /tmp/ibmgtsim.7021/ibdiagnet.log for complete log

   ibdiagpath run: a good case
   ibdiagpath -t network.topo -l 128
   -I- Parsing topology definition: /local/ez/OSM_REGRESSION/SRC/ibutils/ibdiag/demo/network.topo
   -I- Defined 145/145 systems/nodes
10. topo -r
   -I- Checking credit loops
   Analyzing Fabric for Credit Loops 1 SLs, 1 VLs used.
   Found credit loop on: SW L2 1 P3  VL: 0
   - BT credit loop through: SW L1 8 P2  VL: 0
   - BT credit loop through: SW L2 8 P3  VL: 0
   - BT credit loop through: SW L1 4 P2  VL: 0
   - BT credit loop through: SW L2 4 P3  VL: 0
   - BT credit loop through: SW L1 7 P2  VL: 0
   - BT credit loop through: SW L2 7 P3  VL: 0
   - BT credit loop through: SW L1 3 P2  VL: 0
   - BT credit loop through: SW L2 3 P3  VL: 0
   - BT credit loop through: SW L1 6 P2  VL: 0
   - BT credit loop through: SW L2 6 P3  VL: 0
   - BT credit loop through: SW L1 2 P2  VL: 0
   - BT credit loop through: SW L2 2 P3  VL: 0
   - BT credit loop through: SW L1 5 P2  VL: 0
   - BT credit loop through: SW L2 5 P3  VL: 0
   - BT credit loop through: SW L1 1 P2  VL: 0
   - BT credit loop through: SW L2 1 P3  VL: 0
   - BT credit loop through: H 1 P1  VL: 0
   -E- credit loops in routing
   -E- Total Credit Loop Check Errors: 1

   Partition Configuration Issues
   Partitions are similar to VLAN IDs but are enforced on hosts and switch ports.
   - Network
       - Warning: partition enforcement by leaf switches mismatches hosts
       - Report: node groups, which nodes can comm
11. - /usr/bin/ib_send_bw
   - /usr/bin/ib_send_lat
   Usage
   - Server: <test name> <options>
   - Client: <test name> <options> <server IP address>
   Note: the same options must be passed to both server and client. Use -h for all options.

   MPI benchmarks
   - Prerequisites for running MPI:
       - The mpirun_rsh launcher program requires automatic (i.e. password-less) login onto the remote machines.
       - There must also be an /etc/hosts file to specify the IP addresses of all machines that MPI jobs will run on.
       - Details on this procedure can be found in the Mellanox OFED User's Manual.
   - Basic format (mvapich):
       - mpirun_rsh -np procs node1 node2 node3 BINARY
   - See the MPI User Manual for various tuning parameters for InfiniBand, e.g. transport types at large scale.
   - Other flags:
       -show: show only
       -paramfile: environment variables
       -hostfile: list of hosts
       ENV=VAL (i.e. VIADEV_RENDEZVOUS_THRESHOLD=8000)

   [root@lisbon001]# mpirun_rsh -np 2 lisbon001 lisbon002 /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_latency
   # OSU MPI Latency Test v3.1.1
   # Size: 0 ... 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304
   # Latency (us), in output order: 1.37 1.36 1.37 1.37 1.38 1.38 1.45 1.55 2.43 2.56 2.86 3.47 4.75 6.03
12. 0000000002 dev=23108 priority:0
   -I- Fabric qualities report
   -I- Parsing FDBs file: /tmp/ibmgtsim.7021/ibdiagnet.fdbs
   -I- Defined 2465 fdb entries for: 17 switches
   -I- Parsing Multicast FDBs file: /tmp/ibmgtsim.7021/ibdiagnet.mcfdbs
   -I- Defined 450 Multicast Fdb entries for: 17 switches
   -I- Verifying all CA to CA paths ...

   CA to CA: LFT ROUTE HOP HISTOGRAM
   The number of CA pairs that are in each number of hops distance.
   This data is based on the result of the routing algorithm.
   HOPS  NUM-CA-CA-PAIRS
   2     1364
   4     14892

   ibdiagnet run: a good case (continued)
   LFT CA to CA: SWITCH OUT PORT - NUM DLIDS HISTOGRAM
   Number of actual Destination LIDs going through each switch out port, considering all the CA-to-CA paths. Ports driving CAs are ignored (as they must have = Nca - 1). If the fabric is routed correctly, the histogram should be narrow for all ports on the same level of the tree.
   A detailed report is provided in /tmp/ibdmchk.sw_out_port_num_dlids.
   NUM-DLIDS  NUM-SWITCH-PORTS
   1          20
   2          84
   3          21
   4          2
   5          1
   9          28
   10         72
   11         28

   -I- Scanning all multicast groups for loops and connectivity...
   -I- Multicast Group:0xC000 has:12 switches and:128 HCAs
   -I- Multicast Group:0xC001 has:12 switches and:128 HCAs
   -I- Multicast Group:0xC002
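A quick sanity check on the hop histogram in item 12 can be sketched in Python. It assumes the histogram counts ordered (source, destination) CA pairs, so that a fully routed fabric of N CAs yields N x (N - 1) pairs; that interpretation is an assumption, although the figures in the report (1364 + 14892 pairs for 128 HCAs) are consistent with it. The function names are hypothetical.

```python
# Sketch: check that a CA-to-CA hop histogram accounts for every CA pair.
# Assumption: the histogram counts ordered (source, destination) pairs.


def ordered_ca_pairs(num_cas: int) -> int:
    """Number of ordered pairs of distinct CAs."""
    return num_cas * (num_cas - 1)


def histogram_is_complete(hop_histogram: dict, num_cas: int) -> bool:
    """True if the pair counts over all hop distances add up to every pair."""
    return sum(hop_histogram.values()) == ordered_ca_pairs(num_cas)


if __name__ == "__main__":
    # Values copied from the report: hops -> number of CA pairs.
    hops = {2: 1364, 4: 14892}
    print(ordered_ca_pairs(128))
    print(histogram_is_complete(hops, 128))
```

If the sum falls short of N x (N - 1), some CA pairs have no route, which corresponds to the "missing paths" error case shown elsewhere in the deck.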
13. 0x00000b1b MTU:2048Byte rate:20Gbps SL:0x00
   -I- Subnet: IPv4 PKey:0x0003 QKey:0x00000b1b MTU:2048Byte rate:20Gbps SL:0x00

   -I- QoS on Path Check
   -I- The following SLs can be used: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

   -I- Stages Status Report:
   STAGE                                  Errors  Warnings
   LFT Traversal: local to destination    0       0
   Performance Counters Report            0       0
   Path Partitions Check                  0       0
   Path IPoIB Check                       0       0
   QoS on Path Check                      0       0
   Please see /tmp/ibdiagpath.log for complete log
14. -I- Traversing the path from local to destination
   -I- From: H-1/U1/P1 lid=0x0001 guid=0x0002c90000000002 dev=23108
   -I- To:   SL1-1/U1/P1 lid=0x0002 guid=0x0002c90000000201 dev=47396
   -I- From: SL1-1/U1/P21 lid=0x0002 guid=0x0002c90000000201 dev=47396
   -I- To:   SL2-5/U1/P1 lid=0x0020 guid=0x0002c9000000021f dev=47396
   -I- From: SL2-5/U1/P5 lid=0x0020 guid=0x0002c9000000021f dev=47396
   -I- To:   SL1-3/U1/P21 lid=0x0009 guid=0x0002c90000000209 dev=47396
   -I- From: SL1-3/U1/P9 lid=0x0009 guid=0x0002c90000000209 dev=47396
   -I- To:   H-33/U1/P1 lid=0x0080 guid=0x0002c900000000de dev=23108

   -I- PM Counters Info
   -I- No illegal PM counters values were found

   -I- Path Partitions Report
   -I- Source H-1/U1/P1 lid=0x0001 guid=0x0002c90000000002 dev=23108 Port 1
       PKeys: 0xffff 0x8001 0x8002 0x8003
   -I- Destination H-33/U1 lid=0x0080 guid=0x0002c900000000de dev=23108
       PKeys: 0x7fff 0x8001 0x8002 0x8003
   -I- Path shared PKeys: 0x8001 0xffff 0x8002 0x8003

   ibdiagpath run: a good case (continued)
   -I- IPoIB Path Check
   -I- Subnet: IPv4 PKey:0x0001 QKey:0x00000b1b MTU:2048Byte rate:20Gbps SL:0x00
   -I- Subnet: IPv4 PKey:0x0002 QKey
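The "Path shared PKeys" line in item 14 can be reproduced with a small Python sketch. It assumes the usual InfiniBand P_Key convention, which the slides do not spell out: the top bit (0x8000) marks full membership, the low 15 bits identify the partition, and two ports can communicate on a partition only if at least one of them is a full member. The function name `shared_pkeys` is hypothetical, and reporting the shared key in its full-membership form is a choice made to match the output above.

```python
# Sketch: compute which partitions two ports can use to talk to each other.
# Assumptions: bit 15 of a P_Key is the full-membership flag; two limited
# members of the same partition cannot communicate.

FULL_MEMBER = 0x8000
BASE_MASK = 0x7FFF


def shared_pkeys(src_pkeys, dst_pkeys):
    """Return P_Keys usable between source and destination, in source order."""
    dst_by_base = {pkey & BASE_MASK: pkey for pkey in dst_pkeys}
    shared = []
    for pkey in src_pkeys:
        peer = dst_by_base.get(pkey & BASE_MASK)
        if peer is None:
            continue  # destination is not in this partition at all
        if (pkey | peer) & FULL_MEMBER:
            # At least one side is a full member; report the full form.
            shared.append(pkey | peer)
    return shared


if __name__ == "__main__":
    source = [0xFFFF, 0x8001, 0x8002, 0x8003]       # PKeys of H-1/U1/P1
    destination = [0x7FFF, 0x8001, 0x8002, 0x8003]  # PKeys of H-33/U1
    print([hex(p) for p in shared_pkeys(source, destination)])
```

With the values from the report, the default partition is shared because the source holds 0xffff (full) even though the destination holds only 0x7fff (limited).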
15. Extra Info

   ibdiagnet run: a good case
   ibdiagnet -ls 10 -lw 4x
   -I- Parsing Subnet file: /tmp/ibmgtsim.7021/ibdiagnet.lst
   -I- Defined 145/145 systems/nodes

   -I- Bad Guids/LIDs Info
   -I- skip option set, no report will be issued

   -I- Links With Logical State = INIT
   -I- No bad Links (with logical state = INIT) were found

   -I- General Device Info

   -I- PM Counters Info
   -I- No illegal PM counters values were found

   -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
   -I- PKey:0x0001 Hosts:128 full:128 limited:0
   -I- PKey:0x0002 Hosts:128 full:128 limited:0
   -I- PKey:0x0003 Hosts:128 full:128 limited:0
   -I- PKey:0x7fff Hosts:128 full:1 limited:127

   -I- IPoIB Subnets Check
   -I- Subnet: IPv4 PKey:0x0001 QKey:0x00000b1b MTU:2048Byte rate:20Gbps SL:0x00
   -I- Subnet: IPv4 PKey:0x0002 QKey:0x00000b1b MTU:2048Byte rate:20Gbps SL:0x00
   -I- Subnet: IPv4 PKey:0x0003 QKey:0x00000b1b MTU:2048Byte rate:20Gbps SL:0x00

   ibdiagnet run: a good case (continued)
   -I- Bad Links Info
   -I- No bad link were found

   -I- Summary Fabric SM-state-priority
   SM - master
   The Local Device: H-1/P1 lid=0x0001 guid=0x0002c9
16. X or 4X
   LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
   LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps
   LinkSpeedActive: 5.0 Gbps

   perfquery
   - Queries InfiniBand ports' performance and error counters. It can also reset counters.
   > perfquery
   # Port counters: Lid 6 port 1

   smpquery: attribute details
   - Reports relevant node, port, and switch info.

   ibnetdiscover: cluster topology report
   - Reports a complete topology of the cluster
   - Shows all interconnect connections, reporting:
       - Port LIDs
       - Port GUIDs
       - Host names
       - Link speed
   - A GUID-to-switch-name file can be used for a more readable topology

   ibnetdiscover: cluster topology report (continued)
   > ibnetdiscover --node-name-map my_guid_map_file
   vendid=0x2c9
   devid=0xb924
   sysimgguid=0xb8cffff004207
   switchguid=0xb8cffff004207(b8cffff004207)
   Switch 24 "SWITCH 1"   # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 9 lmc 0
   [5]  "H-0002c90200230e54"[1](2c90200230e55)   # "mtilab55 HCA-1" lid 22 4xDDR
   [6]  "H-0002c902002312a8"[1](2c902002312a9)   # "mtilab47 HCA-1" lid 12 4xDDR
   [14] "H-0002c90300000268"[2](2c9030000026a)   # "mtilab40 HCA-1" lid 20 4xDDR
   [18] "H-0002c9020021ad78"[1](2c9020021ad79)   # "mtilab54 HCA-1" lid 21 4xDDR   <- link speed

   devid=0x6282
   sysimgguid=0x2c902002312ab
17. _LOAD=yes
   - Optionally, manually start and stop services:
       /etc/init.d/openibd start|stop|restart|status

   IPoIB in a Nutshell
   - Encapsulation of IP packets over IB
   - Uses IB as "layer two" for IP
       - Supports both UD service (up to 2KB MTU) and RC service (connected mode, up to 64KB MTU)
   - IPv4, IPv6, ARP and DHCP support
   - Multicast support
   - VLANs support
   - Benefits:
       - Transparent to legacy applications
       - Allows leveraging of existing management infrastructure

   IPoIB Configuration
   - Requires assigning an IP address and a subnet mask to each HCA port (like any other network adapter)
   - The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on
   - Configuration can be based on DHCP or on a static configuration
       - Modify /etc/sysconfig/network-scripts/ifcfg-ib0:
           DEVICE=ib0
           BOOTPROTO=static
           IPADDR=10.10.0.1
           NETMASK=255.255.255.0
           NETWORK=10.10.0.0
           BROADCAST=10.10.0.255
           ONBOOT=yes
       - ifconfig ib0 10.10.0.1 up

   Subnet Management Basics
   - Each subnet must have a Subnet Manager (SM)
   - Standby Subnet Managers are supported
   - SM tasks: topology discovery, forwarding table initialization, fabric maintenance
   - Local IDs (LIDs) are used to identify end ports/nodes and route packets
   - Every entity (CA, SW, Router) must support a Subnet Manag
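The NETWORK and BROADCAST values in the static ib0 configuration of item 17 follow from IPADDR and NETMASK. A minimal Python sketch, using only the standard `ipaddress` module, shows one way to derive them; the helper name `derive_ifcfg` is hypothetical and the sketch does not write any file.

```python
# Sketch: derive the NETWORK and BROADCAST entries of an ifcfg-style
# static configuration from the address and netmask.
import ipaddress


def derive_ifcfg(device: str, ipaddr: str, netmask: str) -> dict:
    """Return the key/value pairs of a static interface configuration."""
    iface = ipaddress.ip_interface(f"{ipaddr}/{netmask}")
    net = iface.network
    return {
        "DEVICE": device,
        "BOOTPROTO": "static",
        "IPADDR": str(iface.ip),
        "NETMASK": str(net.netmask),
        "NETWORK": str(net.network_address),
        "BROADCAST": str(net.broadcast_address),
        "ONBOOT": "yes",
    }


if __name__ == "__main__":
    # Values from the slide's ib0 example.
    for key, value in derive_ifcfg("ib0", "10.10.0.1", "255.255.255.0").items():
        print(f"{key}={value}")
```

Deriving the two dependent fields rather than typing them by hand avoids mismatches when the same template is stamped out for many nodes.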
18. and information on Mellanox FAT Tree Topology

   [Screenshot: Mellanox InfiniBand Configurator form. Fields: Select OEM (Mellanox); Select Level 1 Switch (36-port Switch); Select Level 2 Switch (648-port Switch); Maximum Nodes; Select Blocking Configuration (Non-blocking); Select Switch Configuration Data Rate (FDR 56 Gb/s, FDR-10 40 Gb/s, QDR 40 Gb/s); Select Adapter Configuration Data Rate (FDR 56 Gb/s, FDR-10 40 Gb/s, QDR 40 Gb/s); Select Host Adapter Type (Dual Port, Single Port).]

   Basics of the OFED InfiniBand Stack

   MLNX_OFED Installation
   - Pre-built RPM install:
       1. mount -o rw,loop MLNX_OFED_LINUX-1.4-rhel5.3.iso /mnt
       2. cd /mnt
       3. ./mlnxofedinstall
   - Building RPMs for unsupported kernels:
       1. mount -o rw,loop MLNX_OFED_LINUX-1.4-rhel5.3.iso /mnt
       2. cd /mnt/src
       3. cp OFED-1.4.tgz /root (this is the original OFED distribution tarball)
       4. tar zxvf OFED-1.4.tgz
       5. cd OFED-1.4
       6. copy ofed.conf to the OFED-1.4 directory
       7. ./install.pl -c ofed.conf

   MLNX_OFED Configuration
   - Loading and unloading the IB stack
       - /etc/infiniband/openib.conf controls boot-time configuration (and other options):
           # Start HCA driver upon boot
           ONBOOT=yes
           # Load IPoIB
           IPOIB
19. eckport: perform some basic tests on the specified port
   - ibchecknode: perform some basic tests on the specified node
   - ibcheckerrs: check whether the error counters of the port/node have passed some predefined thresholds
   - ibportstate: get the logical and physical port state of an IB port, or enable/disable a port
   - ibcheckportwidth: perform 1x port width check on the specified port
   - ibcheckportstate: perform port state (and physical port state) check on the specified port
   Node-based tools can be run on any machine with the OFED stack installed.
   - man pages are available for all utilities
   - -h option for online help

   The world of ibdiags: Source/Destination Path Scope
   - ibsysstat: obtain basic information for a node (hostname, cpus, memory), which may be remote
   - ibtracert: display unicast or multicast route from source to destination
   - ibping: ping/pong between IB nodes (currently using vendor MADs)

   The world of ibdiags: Subnet Scope
   - saquery: issue some SA queries
   - ibnetdiscover: scan topology
   - ibchecknet: perform port/node/errors check on the subnet (ibnetdiscover topology output)
   - ibswitches: scan the net or use an existing net topology file and list all switches
   - ibhosts: scan the net or use an existing net topology file and list all hosts
   - ibnodes: scan the net or use an existing net topology file and list all nodes
   - ibcheckwidth: perform port width check on the subnet; used to find ports with 1x link width
   - ibclearerror
20. ement Agent (SMA)
   - The SM communicates with the SMA using Subnet Management Packets (SMPs)

   OpenSM in a nutshell
   - The Subnet Manager (SM) is mandatory for setting up port ID, links and routes
   - OpenSM is an InfiniBand-compliant subnet manager included with OFED
   - Ability to run several instances of OpenSM on the cluster in a master/slave(s) configuration for redundancy
   - Partitions (P-key, similar to VLANs) support
   - QoS support
   - Enhanced routing algorithms:
       - Min-hop, up-down, fat-tree, LASH, DOR, Torus-2QoS

   Running opensm
   - Command line
       - Default (no parameters): scans and initializes the IB fabric
       - opensm -h for usage flags, e.g. to use up-down routing, use --routing_engine updn
       - Configuration file for advanced settings: -c creates a config file, -F loads a config file
       - The run is logged to two files:
           /var/log/messages: registers only major events
           /var/log/opensm.log: detailed report
   - Can also be started as a daemon using: /etc/init.d/opensmd start
   - SM detection
       - /etc/init.d/opensmd status: shows opensm runtime status on a machine
       - sminfo: shows master and standby subnet managers running on the cluster

   Running Benchmarks

   InfiniBand benchmarks
   - Bandwidth and latency performance tests
       - /usr/bin/ib_write_bw
       - /usr/bin/ib_write_lat
       - /usr/bin/ib_read_bw
       - /usr/bin/ib_read_lat
21. f each group are not proper
       - Error: if credit loops are caused by the routing

   ibdiagnet -t `pwd`/network.topo -r
   -I- Fabric qualities report
   Verifying all CA to CA paths
   -E- Unassigned LFT for lid:70 Dead end at: H-122/U1
   -E- Fail to find a path from: H-1/U1/1 to: H-24/U1/1
   -E- Found 1380 missing paths out of: 13340 paths

   Multicast disconnect and unneeded entries
   -I- Scanning all multicast groups for loops and connectivity
   Multicast Group: 0xC000 has 6 switches and 8 FullMember CA ports
   Switch S0002c90000000004/U1 has unconnected MFT entries for MLID 0xC000
   Switch S0002c90000000005/U1 has unconnected MFT entries for MLID 0xC000
   Found 2 connection groups for MLID 0xC000
   Group 1 has 4 CAs: H-[9..12]/U1
   Group 1 has 1 SWs: S0002c90000000001/U1
   Group 2 has 4 CAs: H-[13..16]/U1
   Group 2 has 1 SWs: S0002c90000000003/U1

   Credit Loops: What are these?
   - Lossless fabric = link-level flow control = a packet is not sent if there is no buffer for it
   - If traffic to DST-1 waits on traffic for DST-2, which in turn depends on traffic to DST-3, which depends on DST-1, we have a dependency loop and the fabric deadlocks
   [Figure: example fabric with hosts H-1 to H-16.]

   Credit Loops in real world
   ibdiagnet -t `pwd`/network
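The credit-loop description in item 21 amounts to a cycle in a "waits on" dependency graph. The following Python sketch detects such a cycle with a depth-first search. It illustrates the idea only and is not how ibdiagnet implements its check; the graph representation and the function name `find_cycle` are assumptions.

```python
# Sketch: find a cycle in a dependency graph where an edge A -> B means
# "traffic to A waits on traffic to B". A cycle means a potential deadlock.

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycle(depends_on: dict):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in depends_on.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path: the loop closes here.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(depends_on):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # The three-destination example from the slide.
    waits = {"DST-1": ["DST-2"], "DST-2": ["DST-3"], "DST-3": ["DST-1"]}
    print(find_cycle(waits))
```

In a real fabric the graph nodes would be switch port buffers (per virtual lane) rather than destinations, and the edges would come from the forwarding tables.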
22. ful Hint: use ibdiagnet to write out a topology file
   - Writing out the topology:
       - Use -wt network.topo to generate a reference topology of your cluster
       - Replace switch GUIDs with actual switch names for easier readability

   Link Faults
   - Bad cables need to be found during cluster bring-up
       - Error counters provided on every IB port report these issues
   - Reporting link faults across the network
       - Error: when any port counter change rate is greater than the threshold
       - Report: entire set of counters for each port on the subnet

   ibdiagnet -t `pwd`/network.topo
   -I- PM Counters Info
   -W- H-37/P1 lid=0x0087 guid=0x0002c900000000ee dev=23108
       Performance Monitor counter : Value
       port_rcv_errors : 0x307 (Increase by 34 during ibdiagnet scan.)
   -W- SL1-2/P11 lid=0x0008 guid=0x0002c90000000207 dev=47396
       Performance Monitor counter : Value
       port_rcv_errors : 0xd1 (Increase by 5 during ibdiagnet scan.)
   -W- SL1-2/P16 lid=0x0008 guid=0x0002c90000000207 dev=47396
       Performance Monitor counter : Value
       port_rcv_errors : 0x6c (Increase by 4 during ibdiagnet scan.)
   -W- SL1-4/P1 lid=0x000c guid=0x0002c9000000020b dev=47396
       Performance Monitor counter : Value
       port_xmit_discard : 0x307 (Increase by 34 during ibdiagnet scan.)
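The "counter increased during the scan" rule in item 22 can be sketched in Python as a comparison of two counter snapshots. This illustrates the rule only and is not ibdiagnet's implementation; the snapshot layout, the `rising_counters` name, and the "before" values are hypothetical (the "after" values and increases mirror the report above).

```python
# Sketch: flag port counters that grew by more than a threshold between
# two snapshots. Snapshots map (port name, counter name) -> counter value.


def rising_counters(before: dict, after: dict, threshold: int = 0):
    """Return (port, counter, new_value, delta) for counters that rose."""
    findings = []
    for (port, counter), new_value in after.items():
        # A counter missing from the first snapshot is treated as 0 here.
        delta = new_value - before.get((port, counter), 0)
        if delta > threshold:
            findings.append((port, counter, new_value, delta))
    return findings


if __name__ == "__main__":
    # Hypothetical first snapshot, chosen so the deltas match the report.
    before = {
        ("H-37/P1", "port_rcv_errors"): 0x307 - 34,
        ("SL1-2/P11", "port_rcv_errors"): 0xD1 - 5,
        ("SL1-2/P16", "port_rcv_errors"): 0x6C,  # unchanged in this example
    }
    after = {
        ("H-37/P1", "port_rcv_errors"): 0x307,
        ("SL1-2/P11", "port_rcv_errors"): 0xD1,
        ("SL1-2/P16", "port_rcv_errors"): 0x6C,
    }
    for port, counter, value, delta in rising_counters(before, after):
        print(f"{port} {counter}: {value:#x} (increase by {delta})")
```

Looking at the increase rather than the absolute value matters because counters accumulate since the last reset; a large but static value may reflect a cable that has already been replaced.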
23. [Figure, continued: 1944-node full-CBB fat-tree diagram; no text recoverable.]
24. liability, Efficiency
   - >12GB/s bandwidth, <0.7usec latency
   - Link bit encoding: 64/66
   - PCI Express 3.0
   - Forward Error Correction
   - InfiniBand router and IB-Eth/FC bridges
   - Lower power consumption

   FDR 56Gb/s InfiniBand Solutions Portfolio
   - Modular switch: SX6500
   - SX60XX: feature license
   - SX6036: 36 ports, managed
   - SX6025: 36 ports, externally managed
   - UFM Diagnostics
   - Virtual Protocol Interconnect
   Note: please check availability with your Mellanox representative.

   Cluster Topologies
   - Topologies used in large clusters:
       - Fat-tree
       - 3D torus
       - Mesh/hypercube
   - Fat-tree characteristics:
       - Use the same BW for all links (or close BW)
       - Many times use the same number of ports for all switches
       - Several blocking and full non-blocking configurations are possible
   - Main issues with fabric design:
       - Is the SM capable of routing the fabric?
       - Does it generate credit loops?
       - Are the paths evenly distributed?

   324 Node Full CBB Using 36-port Switches
   [Figure: fat-tree diagram with core switches L2-1 to L2-9.]
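The two full-CBB examples in the deck (324 nodes on 36-port switches, 1944 nodes on 36-port leaves and 648-port cores) can be checked with a small sizing sketch in Python. It assumes a two-level non-blocking fat tree in which every leaf switch splits its ports evenly between hosts and uplinks and spreads the uplinks evenly over the core switches; the function name and return layout are hypothetical.

```python
# Sketch: size a two-level, non-blocking (full CBB) fat tree.
# Assumption: each leaf uses half its ports for hosts and half for uplinks.
import math


def size_two_level_fat_tree(nodes: int, leaf_ports: int, spine_ports: int) -> dict:
    """Return leaf count, spine count and uplinks per leaf/spine pair."""
    down = leaf_ports // 2          # host-facing ports per leaf
    up = leaf_ports - down          # uplinks per leaf
    leaves = math.ceil(nodes / down)
    if leaves > spine_ports:
        raise ValueError("too many leaves for a two-level tree with these spines")
    spines = math.ceil(leaves * up / spine_ports)
    return {
        "leaves": leaves,
        "spines": spines,
        "hosts_per_leaf": down,
        "uplinks_per_leaf_per_spine": up // spines,
    }


if __name__ == "__main__":
    print(size_two_level_fat_tree(324, leaf_ports=36, spine_ports=36))
    print(size_two_level_fat_tree(1944, leaf_ports=36, spine_ports=648))
```

Under these assumptions the results line up with the figure labels: 18 host links per leaf in both cases, 2 uplinks from each leaf to each of 9 core switches for 324 nodes, and 6 uplinks from each leaf to each of 3 core switches for 1944 nodes.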
25. lified Cluster Management
       - Centralized route manager
       - In-band diagnostics and upgrades
   - Full CPU offload
       - Hardware-based transport protocol
       - Reliable transport
       - Kernel bypass (user-level applications get direct access to hardware)
   - Memory exposed to remote node access
       - RDMA-read and RDMA-write

   InfiniBand Hardware
   - Host Channel Adapters (HCA) for computing platforms; Target Channel Adapters (TCA) for specialized subsystems
       - Device that terminates an IB link, executes transport-level functions, and supports the verbs interface
   - Switch
       - A device that routes packets from one link to another of the same IB subnet
   - Router (inter-subnet)
   - Cables
       - QSFP copper or fiber cables used to connect the HCA and switch ports together
   [Figure: IB subnet with HCAs, TCAs, switches, a router, and IB links.]

   Links & Switches
   [Figure: link speed roadmap with labels 10Gb/s, 20Gb/s (2005), 40Gb/s (2008), 56Gb/s (2011), PCI Express and PCI Express 3.0, FDR InfiniBand technology.]

   FDR InfiniBand New Features and Capabilities
   Performance, Scalability, Re
26. n ServiceID, DSCP and SL)
       - Error: no common SL to be used
       - Error: no PathRecord for given ServiceID, DSCP and SL
   - Cable Reports
       - Reports vendor information, part number, cable length, etc.

   Topology matching cases
   - Case 1: remove 2 cables
       - SL2-1/P10 to SL1-5/P14
       - SL2-6/P19 to SL1-10/P23
   - Case 2: remove hosts
       - H-49, H-12
   - Case 3: remove a switch or a FRU within a switch system

   Case 1: ibdiagnet -t `pwd`/network.top
   -I- Topology matching results
   Missing cable connecting: SL2-1/P10 to: SL1-5/P14
   Missing cable connecting: SL2-6/P19 to: SL1-10/P23

   Case 2: ibdiagnet -t `pwd`/network.top
   -I- Topology matching results
   Missing System: H-12 (MT23108)
       Should be connected by cable from port: P1 (H-12/U1/P1) to: SL1-1/P12 (SL1-1/U1/P12)
   Missing System: H-49 (MT23108)
       Should be connected by cable from port: P1 (H-49/U1/P1) to: SL1-5/P1 (SL1-5/U1/P1)

   Case 3: ibdiagnet -t `pwd`/network.top
   -I- Topology matching results
   Missing System Board: SL1-1/leaf3

   Help
27. s: clear all error counters on the subnet
   - ibclearcounters: clear all port counters on the subnet
   - ibcheckstate: perform port state (and physical port state) check on the subnet
   - ibcheckerrors: perform error check on the subnet; find ports above the indicated thresholds

   ibstatus
   - Displays basic information obtained from the local IB driver
   - Output includes LID, SMLID, link width active, and port state
   > ibstatus
   Infiniband device 'mlx4_0' port 1 status:
       default gid: fe80:0000:0000:0000:0000:0000:0007:3896
       base lid: 0x3
       sm lid: 0x3
       state: 4: ACTIVE      (Active: ready to transfer data; Initialize: not configured yet)
       phys state: 5: LinkUp
       rate: 20 Gb/sec (4X DDR)
   Infiniband device 'mlx4_0' port 2 status:
       default gid: fe80:0000:0000:0000:0000:0000:0007:3897
       base lid: 0x1
       sm lid: 0x3
       state: 4: ACTIVE
       phys state: 5: LinkUp
       rate: 20 Gb/sec (4X DDR)

   ibportstate
   - Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port.
   - If the queried port is a switch port, the command can also be used to:
       - Disable, enable or reset the port
       - Validate the port's link width and speed against the peer port
   > ibportstate 56 3
   PortInfo:
   # Port info: DR path slid 65535; dlid 65535; 0 por
   Initialize
   LinkUp
   1X or 4X
   1
28. unicate
   - Path
       - Error: no common partition for the path
       - Error: mismatch between leaf switch and host partitions
       - Report: which partitions can be used for the path
       - Verbose: on each port (if enforcing), show the list of PKeys

   Two partitions with some common nodes
   -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
   -W- Missing PKey:0x8001 on remote switch of node: H-95/P1 lid=0x0089 guid=0x0002c900000001ee dev=23108
   -I- PKey:0x0001 Hosts:87 full:87 limited:0
   -I- PKey:0x0002 Hosts:84 full:84 limited:0
   -I- PKey:0x7fff Hosts:128 full:1 limited:127

   Recap: Verifying the InfiniBand interconnect
   - ibdiagnet is a simple-to-use diagnostic tool to monitor the quality of the interconnect
   - Once the cluster is built, run ibdiagnet to quickly scan and find initial problems
   - Stress the cluster: MPI communication benchmarks are best for this (e.g. IMB)
   - Use an iterative process of stressing the cluster and then scanning for errors until all fabric problems are discovered and fixed
   - Take a snapshot of the topology and use it to run periodic scans of the fabric
   - Use point tools for problem isolation
       - e.g. ibdiagpath, ibportstate, perfquery, etc.

   Questions?
