Mellanox InfiniBand Training
Contents
1. OpenSM features (continued from item 5): partition (P_Key) support, QoS support, congestion control, adaptive routing, and enhanced routing algorithms (Min-Hop, Up-Down, Fat-Tree, LASH, DOR).

Running OpenSM from the command line: by default (no parameters) it scans and initializes the IB fabric and occasionally sweeps for changes; run opensm -h for the usage flags. To start with Up-Down routing: opensm --routing_engine updn. Each run is logged to two files: /var/log/messages records only general major events, while /var/log/opensm.log holds the details of reported errors.

Start on boot: run OpenSM as a daemon with /etc/init.d/opensmd {start|stop|restart|status}; /etc/opensm.conf holds the default parameters. To start OpenSM automatically on boot, set ONBOOT=yes.

SM detection: /etc/init.d/opensmd status shows the OpenSM runtime status on a machine; sminfo shows the master and standby SMs running on the cluster.

IPoIB in a nutshell: encapsulation of IP packets over IB, using IB as the layer-2 transport for IP. It supports both UD service (up to 2 KB MTU) and RC (connected-mode) service (up to 64 KB MTU), IPv4 ARP and DHCP, multicast, and VLANs. Benefits: transparency to legacy applications and leveraging of the existing management infrastructure. Specification state: IETF draft. [Slide diagram: user/kernel IPoIB protocol stack - continued in item 9.]
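A minimal sketch of the commands above, assuming a host with MLNX_OFED's opensm and the infiniband-diags tools installed (the Up-Down routing engine is the example named in the text):

    # Run OpenSM in the foreground with default parameters (scans and
    # initializes the fabric, then sweeps for changes):
    opensm

    # Show usage flags:
    opensm -h

    # Start OpenSM with the Up-Down routing engine:
    opensm --routing_engine updn

    # Control the OpenSM daemon and check whether it is running:
    /etc/init.d/opensmd start
    /etc/init.d/opensmd status

    # Query which SM is master (and which are standby) on the subnet:
    sminfo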
2. Physical media (continued from item 3): backplane connectors, 4x QSFP (copper and fiber), 12x cable, and microGiGaCN connectors.

Link layer - packets: the packet is the routable, end-to-end unit of transfer in the fabric. Link management packets train and maintain link operation; data packets carry sends and acknowledgements. [Figure: IBA data packet format, showing the Upper Layer, Transport Layer, Network Layer, and Link Layer protocol headers.]

Link layer - payload size (Maximum Transfer Unit, MTU): the MTU is allowed to range from 256 bytes to 4 KB, while message sizes may be much larger; only packets smaller than or equal to the MTU are transmitted. A large MTU is more efficient (less overhead); a small MTU gives less jitter and is preferable since segmentation and reassembly are performed by hardware in the HCA. Routing between end nodes uses the smallest MTU of any link in the path (the path MTU).

Link layer - Virtual Lanes and Quality of Service: there are 16 Service Levels (SLs); the SL is a field in the Local Routing Header (LRH) of an InfiniBand packet that defines the requested QoS. Virtual Lanes (VLs) are a mechanism for creating multiple channels within a single physical link; each VL is associated with its own set of Tx/Rx buffers in a port and has separate flow control. A configurable arbiter controls the Tx priority of each VL (continued in item 7).
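These link-layer parameters can be inspected on a live port; a minimal sketch, assuming the infiniband-diags utilities from OFED are present (the LID 1 / port 1 arguments are illustrative):

    # Show device and port attributes of the local HCA, including the
    # active and maximum MTU:
    ibv_devinfo -v

    # Query the PortInfo attribute of a specific port (here LID 1, port 1);
    # PortInfo carries the MTU capability, link width/speed, and VL capability:
    smpquery portinfo 1 1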
3. [Figure, continued from item 11: the InfiniBand layered architecture - client transactions at the upper layer, messages at the transport layer (QPs), inter-subnet routing at the network layer (IPv6-style addressing), then the link and physical layers, across end node, switch, and router.]

Physical layer - link rate: InfiniBand uses a serial stream of bits for data transfer. Link width: 1x (one differential pair per Tx/Rx) is not used today; 4x (four differential pairs per Tx/Rx) is used on all Mellanox HCAs, switches, and cables; 12x (twelve differential pairs per Tx and per Rx) sees limited use. Link speed: Single Data Rate (SDR) is 2.5 Gb/s signaling (10 Gb/s for 4x); Double Data Rate (DDR) is 5 Gb/s signaling (20 Gb/s for 4x); Quad Data Rate (QDR) is 10 Gb/s signaling (40 Gb/s for 4x); FDR is 14 Gb/s signaling (56 Gb/s for 4x) with 64/66 encoding; EDR at 25 Gb/s per lane is coming in the near future. The link rate is the multiplication of the link width and the link speed; the most common configuration shipping today is 4x QDR (40 Gb/s).

Physical layer (cont.) - media types: PCB (several inches); copper (20 m at SDR, 10 m at DDR, 7 m at QDR); fiber (300 m at SDR, 150 m at DDR, 100-300 m at QDR); CAT6 twisted pair in the future. 8b/10b encoding is used for SDR, DDR, and QDR; 64b/66b encoding for FDR. Industry-standard components are used throughout: copper cables, connectors, optical cables, and backplane connectors (continued in item 2).
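A quick way to check the negotiated width, speed, and rate described above (for example, 4 lanes x 10 Gb/s = 40 Gb/s for 4x QDR), assuming infiniband-diags is installed:

    # Show the state, rate, and LID of the local HCA ports:
    ibstat

    # Show the negotiated width and speed of every link in the subnet
    # (useful for spotting links that trained down to 1x or SDR):
    iblinkinfo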
4. Link layer - switching (continued from item 7): switches use a Forwarding Database (FDB); based on the DLID (and SL) a packet is sent to the correct output port. Multicast destinations are supported. [Figure: outbound and inbound packet flow through a switch.]

Network layer - responsibility: the network layer describes the protocol for routing a packet between subnets. The Globally Unique ID (GUID) is a 64-bit identifier that every node must have; combined with the subnet prefix it forms the address carried in the Global Routing Header (GRH), an IPv6-type header used to route packets between different IB subnets.

Transport layer - Queue Pairs: QPs come in pairs (Send and Receive); the Work Queue is the consumer/producer interface to the fabric. The consumer/producer posts a Work Queue Element (WQE), the Channel Adapter executes the work request, and the Channel Adapter notifies on completion or errors by writing a Completion Queue Element (CQE) to a Completion Queue (CQ).

Transport layer - transfer operations: SEND reads the message from the requester's local system memory and transfers the data to the responder HCA's receive queue logic; it does not specify where the data will be written in remote memory, and an immediate-data option is available. RDMA Read has the responder HCA read its local memory and return it to the requesting HCA; it requires remote memory access rights, the remote memory start address, and the message length (continued in item 8).
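A sketch of inspecting a switch's forwarding database with standard diagnostics, assuming infiniband-diags is installed; the switch LID 2 is illustrative:

    # Find the switches (and their LIDs) in the subnet:
    ibswitches

    # Dump the unicast forwarding table (LFT) of the switch at LID 2,
    # i.e. which output port each DLID is forwarded to:
    ibroute 2

    # Dump the multicast forwarding table of the same switch:
    ibroute -M 2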
5. [Figure, continued from item 12: the OpenFabrics software stack - user-level verbs API, IP over InfiniBand, SDP (Sockets Direct Protocol), NFS over RDMA, SRP and iSER initiators, RDS (Reliable Datagram Sockets), connection managers and the CMA abstraction mid-layer, over InfiniBand and iWARP (RNIC) hardware.]

MLNX_OFED installation - pre-built RPM install:
1. mount -o rw,loop MLNX_OFED_LINUX.iso /mnt
2. cd /mnt
3. ./mlnxofedinstall

Building RPMs for unsupported kernels:
1. mount -o rw,loop MLNX_OFED_LINUX.iso /mnt
2. cd /mnt/src
3. cp OFED.tgz /root (this is the original OFED distribution tarball)
4. tar zxvf OFED.tgz
5. cd OFED
6. copy ofed.conf to the OFED directory
7. ./install.pl -c ofed.conf

OpenSM features: OpenSM (osm) is an InfiniBand-compliant subnet manager, included in the Linux OpenFabrics Enterprise Distribution. Several instances of osm can be run on the cluster in a master/slave(s) configuration for redundancy (continued in item 1).
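A minimal post-install check, assuming the pre-built RPM install above completed and that the openibd service and ofed_info utility from MLNX_OFED are available:

    # Restart the IB driver stack loaded by MLNX_OFED:
    /etc/init.d/openibd restart

    # Confirm which OFED version is installed:
    ofed_info | head -1

    # Check HCA port state; ports report Active once a subnet manager is up:
    ibstat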
6. InfiniBand for HPC - overview. HPC Advisory Council Switzerland Workshop, March 21-23, 2011. Erez Cohen, Sr. Director of Field Engineering.

The InfiniBand architecture: an industry standard defined by the InfiniBand Trade Association. It defines a System Area Network architecture, with a comprehensive specification from the physical layer up to applications. [Figure: InfiniBand specification timeline, 2000-2007.] The architecture supports Host Channel Adapters (HCAs), Target Channel Adapters (TCAs), switches, and routers, and facilitated hardware design for low latency, high bandwidth, and transport offload.

InfiniBand feature highlights: serial high-bandwidth links (56 Gb/s HCA links, up to 120 Gb/s switch-to-switch links); ultra-low latency (under 1 us application to application); a reliable, lossless, self-managing network (link-level flow control, congestion control to prevent head-of-line blocking); Quality of Service (independent channels at the adapter, Virtual Lanes at the link level); cluster scalability and flexibility (up to 48K nodes in a subnet, up to 2^128 in a fabric, parallel routes between end nodes, multiple cluster topologies possible); and simplified cluster management (continued in item 11).
7. VL arbitration (continued from item 2): a configurable arbiter controls the Tx priority of each VL, and each SL is mapped to a VL. The IB specification allows a total of 16 VLs (15 for data and 1 for management); a minimum of 1 data VL and 1 management VL is required on all links, and switch ports and HCAs may each support a different number of VLs. VL 15 is the management VL and is not subject to flow control.

Link layer - flow control: credit-based, link-level flow control assures no packet loss within the fabric, even in the presence of congestion. Link receivers grant packet receive buffer space (credits) per Virtual Lane; flow-control credits are issued in 64-byte units. Separate flow control per Virtual Lane provides alleviation of head-of-line blocking and "virtual fabrics": congestion and latency on one VL do not impact traffic with guaranteed QoS on another VL, even though they share the same physical link. [Figure: packets transmitted in one direction, link-control credits returned in the other.]

Link layer - addressing: the Local ID (LID) is a 16-bit field in the Local Routing Header (LRH) of all IB packets, used to route packets within an InfiniBand subnet. Each subnet may contain up to 48K unicast addresses and 16K multicast addresses. LIDs are assigned by the Subnet Manager at initialization and on topology changes (continued in item 4).
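To see the LID the SM assigned to a local port, a minimal sketch assuming the infiniband-diags utilities are installed:

    # Show the base LID, SM LID, and port state of the local HCA ports:
    ibstat

    # Show the GID and LID (range) of the local port in one line:
    ibaddr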
8. Transfer operations (continued from item 4): RDMA Write - the requester HCA sends data to be written into the responder's system memory; it requires remote memory access rights, the remote memory start address, and the message length. [Table: transport service types, classified as reliable vs. unreliable and connected vs. non-connected.]

Verbs: verbs are the software interface to the HCA and the IB fabric. They are not an API; rather, they define the framework while allowing flexibility in the API implementation. Some example verbs: Open/Query/Close HCA, Create Queue Pair, Query Completion Queue, Post Send Request, Post Receive Request.

Upper Layer Protocols: ULPs are applications written over the verbs interface that bridge between standard interfaces such as TCP/IP and IB, allowing legacy applications to run intact.

Management agents and interfaces: each port hosts a set of management agents (SNMP tunneling, application-specific, vendor-specific, device management, performance management, communication management manager/agent, baseboard management, subnet administration) alongside the Subnet Manager. The General Service Interface (GSI) is virtualized per port; its MADs, called GMPs, may use any VL except 15, are LID-routed, and are subject to flow control. The Subnet Management Interface (SMI) is likewise virtualized per port (continued in item 13).
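A sketch contrasting GSI and SMI traffic using standard diagnostics, assuming infiniband-diags is installed; the LID 1, port 1, and directed-route path 0,1 are illustrative:

    # A GMP example: query a port's performance counters over the
    # General Service Interface (LID 1, port 1):
    perfquery 1 1

    # SMP examples: query NodeInfo over the Subnet Management Interface,
    # first LID-routed, then directed-route out of local port 1:
    smpquery nodeinfo 1
    smpquery -D nodeinfo 0,1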
9. [Figure, continued from item 1: the driver stack - IPoIB/Ethernet protocol switch device interface, verbs, NIC driver, access layer, and hardware.]

MPI - Message Passing Interface: used for point-to-point communication (MPI_Isend, MPI_Irecv), for collective operations (MPI_Alltoall, MPI_Reduce, MPI_Barrier), and other primitives (MPI_Wait, MPI_Wtime). Ranks are IDs assigned to each process; communication groups are subdivisions of a job's processes used for collectives. Three MPI stacks are included in this release of OFED, among them MVAPICH 1.1.0 and Open MPI 1.2.8; this presentation concentrates on MVAPICH 1.1.0.

MPI example:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int numprocs, myid, x;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Barrier(MPI_COMM_WORLD);
        if (myid == 0)
            printf("Passed first barrier\n");
        srand(myid * 1284);
        x = rand();
        printf("I'm rank %d and my x is 0x%08x\n", myid, x);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (myid == 1)
            printf("My id is rank 1 and got 0x%08x from rank 0\n", x);
        if (myid == 2)
            printf("My id is rank 2 and got 0x%08x from rank 0\n", x);
        MPI_Finalize();
        return 0;
    }

Compiling: mpicc is used to compile MPI applications. mpicc is equivalent to gcc but includes all the gcc flags needed for compilation (header file paths and library paths). To see the real compilation flags (continued in item 10)
10. (continued from item 9) To see the real compilation flags, run mpicc -v. MPI applications can be linked as shared or dynamic binaries.

Launching jobs using mpirun_rsh - prerequisites for running MPI: the mpirun_rsh launcher requires automatic (password-less) login onto the remote machines. There must also be an /etc/hosts file specifying the IP addresses of all machines that MPI jobs will run on; make sure there is no loopback node (i.e. 127.0.0.1) specified for those machines in /etc/hosts, or jobs may not launch properly. Details on this procedure can be found in the Mellanox OFED User's Manual.

Basic format: mpirun_rsh -np <procs> node1 node2 node3 BINARY. Other flags: -show (show only, without launching), -paramfile (environment variables file), -hostfile (list of hosts), and ENV=VAL settings on the command line (e.g. VIADEV_RENDEZVOUS_THRESHOLD=8000). A compile-and-run sketch follows this item.

Hands-on: InfiniBand for HPC lab setup - two servers with ConnectX HCAs running SLES 11, and an 8-port QDR IB switch based on InfiniScale IV switch silicon. Steps: identify the OFED package, install the OFED package, configure the IPoIB interface, run OpenSM, check HCA status, test IPoIB (ping), run an MPI test without IB, and run bandwidth and latency tests over IB. (www.mellanox.com)
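A minimal end-to-end sketch of the hands-on steps, assuming MLNX_OFED with MVAPICH and the perftest utilities on two hosts; the host names node1/node2, the 11.0.0.x addresses, and the file name mpi_example.c are illustrative:

    # Check HCA status on each node (ports become Active once OpenSM runs):
    ibstat

    # Configure IPoIB on each node and test it with ping:
    ifconfig ib0 11.0.0.1 netmask 255.255.255.0   # use 11.0.0.2 on node2
    ping 11.0.0.2

    # Bandwidth and latency tests over IB verbs (perftest package):
    ib_write_bw           # on node1 (server side)
    ib_write_bw node1     # on node2 (client side)
    ib_write_lat          # on node1
    ib_write_lat node1    # on node2

    # Compile the MPI example and launch it on both nodes with mpirun_rsh:
    mpicc -o mpi_example mpi_example.c
    mpirun_rsh -np 2 node1 node2 ./mpi_example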
11. Feature highlights (continued from item 6): simplified cluster management (centralized route manager, in-band diagnostics and upgrades); full CPU offload (hardware-based transport protocol, reliable transport, kernel bypass - user-level applications get direct access to the hardware); and memory exposed to remote node access (RDMA read and RDMA write).

InfiniBand components: the Host Channel Adapter (HCA) is the device that terminates an IB link, executes transport-level functions, and supports the verbs interface. A switch routes packets from one link to another within the same IB subnet. A router transports packets between IBA subnets.

IB architecture layers: Physical - signal levels and frequency, media, connectors. Link - symbols and framing, credit-based flow control, how packets are routed from source to destination within a subnet. Network - how packets are routed between subnets. Transport - delivers packets to the appropriate Queue Pair; message assembly/disassembly, access rights, etc. Software transport verbs and Upper Layer Protocols - the interface between application programs and hardware, allowing support of legacy protocols such as TCP/IP and defining the methodology for management functions. [Figure: InfiniBand layered architecture - continued in item 3.]
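To list the HCAs, switches, and routers that make up a live fabric, a small sketch assuming infiniband-diags is installed:

    # Channel adapters (end nodes) discovered in the subnet:
    ibhosts

    # Switches discovered in the subnet:
    ibswitches

    # Routers (if any) discovered in the subnet:
    ibrouters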
12. [Figure labels, continued from item 13: to compute nodes; 2 x 4X QDR uplinks; 1 x 4X QDR uplinks.]

InfiniBand Linux SW stack - MLNX_OFED. The OpenFabrics Enterprise Distribution (OFED) is a complete software stack for RDMA-capable devices. It contains the low-level drivers and core, Upper Layer Protocols (ULPs), tools, and documentation. It is available from OpenFabrics.org, or as a Mellanox-supported package at http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=26&menu_section=34. Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack based on the OFED stack; it operates across all Mellanox network adapters and supports SDR/DDR/QDR/FDR InfiniBand, 10 Gb/s Ethernet (10GigE), Fibre Channel over Ethernet (FCoE), and 2.5 or 5.0 GT/s PCI Express 2.0.

[Figure: the SW stack - Mellanox-targeted and user services, Subnet Access (SA), management datagram services, SMA (Subnet Manager Agent), Subnet Manager, performance agents - continued in item 5.]
13. (continued from item 8) The Subnet Management Interface (SMI) is virtualized per port; it always uses VL 15, its MADs are called SMPs, they are LID-routed or directed-routed, and they are not subject to flow control.

Subnet management: topology discovery, initialization (LID-routed FDB initialization, directed-route vector), LID-routed and directed-route MADs, and fabric maintenance. MADs use unreliable datagrams. Each subnet must have a Subnet Manager (SM), and every entity (CA, switch, router) must support a Subnet Management Agent (SMA). Multipathing: the LMC supports multiple LIDs per port.

InfiniBand cluster topologies: two topologies are mainly in use for large clusters - the fat tree (the most popular topology in HPC) and the 3D torus. Fat-tree characteristics: the same (or close to the same) bandwidth is used for all links, and often the same number of ports for all switches. Many configurations are possible, but they are all only rearrangeably non-blocking: for any permutation of source/destination pairs there exists a non-blocking routing. The main issues with fabric design are whether the SM is capable of routing the fabric, whether it generates credit loops, and whether the paths are evenly distributed (see the sketch after this item).

[Figure: a 324-node full fat tree using MTS5030 switches (max 648 ports), with 18 links up and 18 links down to the compute nodes per edge switch - continued in item 12.]
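A sketch of checking the fabric-design questions above on a live cluster, assuming the ibnetdiscover, ibdiagnet, and dump_lfts.sh tools shipped with MLNX_OFED are installed; the exact checks and report locations vary by version:

    # Dump the discovered topology (nodes, switches, and how they are cabled):
    ibnetdiscover

    # Run the fabric diagnostic suite; its routing checks report problems
    # such as credit loops and unbalanced paths:
    ibdiagnet

    # Dump the unicast forwarding tables the SM programmed into the switches,
    # to inspect how paths are distributed:
    dump_lfts.sh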