Architecture of Parallel Computers CSC / ECE 506 OpenFabrics Alliance

Architecture of Parallel Computers CSC / ECE 506
OpenFabrics Alliance
Lecture 18
7/17/2006
Dr Steve Hunter

Outline
• InfiniBand and Ethernet Review
• DDP and RDMA
• OpenFabrics Alliance
  – IP over InfiniBand (IPoIB)
  – Sockets Direct Protocol (SDP)
  – Network File System (NFS)
  – SCSI RDMA Protocol (SRP)
  – iSCSI Extensions for RDMA (iSER)
  – Reliable Datagram Sockets (RDS)

InfiniBand Goals - Review
• Interconnect for server I/O and efficient interprocess communications
• Standard across the industry
  – backed by all the major players
    » 200+ companies
• With an architecture able to match future systems:
  – Low overhead
  – Scalable bandwidth, up and down
  – Scalable fanout, few to thousands
  – Low cost, excellent price/performance
  – Robust reliability, availability, and serviceability
  – Leverages Internet Protocol suite and paradigms

The Basic Unit: an IB Subnet - Review
• Basic whole IB system is a subnet
• Elements:
  – Endnodes
  – Links
  – Switches
• What it does: Communicate
  – endnodes with endnodes,
  – via message queues,
  – which process messages over several transport types,
  – and are SARed (segmented and reassembled) into packets,
  – which are placed on links,
  – and routed by switches.
[Figure: an IB subnet in which end nodes attach over links to a set of interconnected switches]
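The "SARed into packets" step in the subnet slide can be illustrated with a small sketch. This is not InfiniBand code: the `Packet` fields, the function names, and the 2048-byte MTU are assumptions chosen for illustration, and the sort in `reassemble` only keeps the sketch self-contained (it says nothing about IB ordering rules).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet layout, for illustration only."""
    msg_id: int     # message this packet belongs to
    seq: int        # position of this packet within the message
    last: bool      # True on the final packet of the message
    payload: bytes


def segment(msg_id: int, message: bytes, mtu: int) -> list[Packet]:
    """Segmentation: split one message into packets of at most `mtu` bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # An empty message still produces one (empty) packet.
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    return [
        Packet(msg_id, seq, seq == len(chunks) - 1, chunk)
        for seq, chunk in enumerate(chunks)
    ]


def reassemble(packets: list[Packet]) -> bytes:
    """Reassembly: rebuild the message, rejecting gaps or duplicates."""
    ordered = sorted(packets, key=lambda p: p.seq)
    complete = (
        bool(ordered)
        and ordered[-1].last
        and [p.seq for p in ordered] == list(range(len(ordered)))
    )
    if not complete:
        raise ValueError("missing, duplicated, or truncated packet sequence")
    return b"".join(p.payload for p in ordered)


# Usage: a 5000-byte message with an assumed 2048-byte MTU.
message = bytes(range(256)) * 19 + bytes(136)   # 5000 bytes
packets = segment(msg_id=1, message=message, mtu=2048)
restored = reassemble(list(reversed(packets)))
```

With these numbers the message is carried in three packets (2048, 2048, and 904 bytes), and `restored` equals the original message.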
End Node Attachment to IB - Review
• End nodes attach to IB via Channel Adapters:
  – Host CAs (HCAs)
    » O/S API/KPIs not specified
    » Queues and memory accessible via verbs
    » QP, CQ, and RDMA engines
    » Must support three IB Transports
    » Can include:
      • Dual ports – load balancing, availability (path migration)
        – Attach to same or different subnets
      • Partitioning
      • Atomics, …
  – Target CAs (TCAs)
    » Queue access method is vendor unique
    » QP and CQ engines
    » Need only support Unreliable Datagram
    » ULP can be standard or proprietary
    » In other words…
      • A smaller subset of required functions.
[Figure labels: Host (CPUs, Memory Controller, Memory), Verbs, HCA (Tables, QPs, CQs, IB Layers); Adapter, TCA (IB Layers, QPs, CQs), IO Controller]

InfiniBand Summary
• InfiniBand architecture is a very high performance, low latency interconnect technology based on an industry-standard approach to Remote Direct Memory Access (RDMA)
  – An InfiniBand fabric is built from hardware and software that are configured, monitored and operated to deliver a variety of services to users and applications
• Characteristics of the technology that differentiate it from comparable interconnects such as traditional Ethernet include:
  – End-to-end reliable delivery
  – Scalable bandwidths from 10 to 60 Gbps available today, moving to 120 Gbps in the near future
  – Scalability without performance degradation
  – Low latency between devices
  – Greatly reduced server CPU utilization for protocol processing
  – Efficient I/O channel architecture for network and storage virtualization

Advanced Ethernet - Review
[Figure: two protocol stacks side by side]
TCP/IP model (layer: examples)
  – Application: HTTP, SMTP, FTP
  – Transport: TCP, UDP
  – Network: IP
  – Link: Ethernet
  – Physical: Copper, Optical
iSER / RNIC model, shown with a SCSI application (top to bottom)
  – SCSI application
  – Internet SCSI (iSCSI)
  – iSCSI Extensions for RDMA (iSER)
  – Remote Direct Memory Access Protocol (RDMAP)
  – Direct Data Placement (DDP)
  – Markers with PDU Alignment (MPA)
  – Transmission Control Protocol (TCP)
  – Internet Protocol (IP)
  – Media Access Control (MAC)
  – Physical
Figure labels also mark the SCSI, RDMA, TCP, IP, and MAC services and the RDMA NIC (RNIC).
• It's expected the OpenFabrics effort (i.e., OpenIB / OpenRDMA merger) will bring even more advanced functions into NIC technology

Advanced Ethernet Summary
• The iWARP technology, implemented as an RDMA Network Interface Card (RNIC), achieves zero-copy, RDMA, and protocol offload over existing TCP/IP networks
  – It was demonstrated that a 10GbE based RNIC can reduce the CPU processing overhead from 80-90% to less than 10% compared with its host stack equivalent
  – Additionally, its achievable end-to-end latency is now 5 microseconds or less.
• iWARP together with the emerging low latency (low hundreds of nanoseconds) 10 GbE switches can also provide a powerful infrastructure for clustered computing, server-to-server processing, visualization and file system access
  – The advantages of the iWARP technology include its ability to leverage the widely deployed TCP/IP infrastructure, its broad knowledge base, and mature management and monitoring capabilities.
  – In addition, an iWARP infrastructure is a routable infrastructure, thereby eliminating the need for gateways to connect to the LAN or WAN internet.

DDP and RDMA
• IETF RFC http://rfc.net/rfc4296.html
• The central idea of general-purpose DDP is that a data sender will supplement the data it sends with placement information that allows the receiver's network interface to place the data directly at its final destination without any copying.
  – DDP can be used to steer received data to its final destination, without requiring layer-specific behavior for each different layer.
  – Data sent with such DDP information is said to be "tagged".
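The tagged placement idea above can be sketched in a few lines. This is a toy model under stated assumptions, not the DDP wire format or any RNIC API: `TaggedSegment`, `Receiver`, `register`, and `place` are hypothetical names, and a `bytearray` stands in for the final destination buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedSegment:
    """Data plus the placement information the sender attaches to it."""
    buffer_tag: int   # names a destination buffer at the receiver (hypothetical)
    offset: int       # where in that buffer the payload belongs
    payload: bytes


class Receiver:
    """Stands in for the receiving network interface."""

    def __init__(self) -> None:
        self._buffers: dict[int, bytearray] = {}

    def register(self, tag: int, size: int) -> bytearray:
        """Expose a destination buffer under a tag the sender can reference."""
        self._buffers[tag] = bytearray(size)
        return self._buffers[tag]

    def place(self, seg: TaggedSegment) -> None:
        """Write the payload straight into its final location."""
        buf = self._buffers[seg.buffer_tag]   # KeyError for an unknown tag
        end = seg.offset + len(seg.payload)
        if seg.offset < 0 or end > len(buf):
            raise ValueError("segment falls outside the registered buffer")
        buf[seg.offset:end] = seg.payload     # same-length slice assignment


# Usage: segments arrive out of order, yet each lands in the right place.
rx = Receiver()
destination = rx.register(tag=7, size=10)
rx.place(TaggedSegment(buffer_tag=7, offset=5, payload=b"world"))
rx.place(TaggedSegment(buffer_tag=7, offset=0, payload=b"hello"))
```

Because every segment carries its own tag and offset, the receiver in this sketch needs no intermediate staging buffer and no knowledge of the upper-layer protocol; after both calls `destination` holds `b"helloworld"`.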
• The central components of the DDP architecture are the "buffer", which is an object with beginning and ending addresses, and a method (set()), which sets the value of an octet at an address.
  – In many cases, a buffer corresponds directly to a portion of host user memory. However, DDP does not depend on this; a buffer could be a disk file, or anything else that can be viewed as an addressable collection of octets.

DDP and RDMA
• Remote Direct Memory Access (RDMA) extends the capabilities of DDP with two primary functions.
  – First, it adds the ability to read from buffers registered to a socket (RDMA Read).
    » This allows a client protocol to perform arbitrary, bidirectional data movement without involving the remote client.
    » When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the remote host CPU at all.
• Second, RDMA specifies a transport-independent untagged message service (Send) with characteristics that are both very efficient to implement in hardware, and convenient for client protocols.
  – The RDMA architecture is patterned after the traditional model for device programming, where the client requests an operation using Send-like actions (programmed I/O), the server performs the necessary data transfers for the operation (DMA reads and writes), and notifies the client of completion.
    » The programmed I/O+DMA model efficiently supports a high degree of concurrency and flexibility for both the client and server, even when operations have a wide range of intrinsic latencies.

OpenFabrics Alliance
• The OpenFabrics Alliance is an international organization comprising industry, academic and research groups that have developed a unified core of open source software stacks (OpenSTAC) leveraging RDMA architectures for both the Linux and Windows operating systems over both InfiniBand and Ethernet.
  – RDMA is a communications technique allowing data to be transmitted from the memory of one computer to the memory of another computer without passing through either device's CPU, without needing extensive buffering, and without calling into an operating system kernel
• The core OpenSTAC software supports all the well known standard upper layer protocols such as MPI, IP, SDP, NFS, SRP, iSER, and RDS on top of Ethernet and InfiniBand (IB) infrastructures
  – The OpenFabrics software and supporting services better enable low-latency InfiniBand and 10 GbE to deliver clustered computing, server-to-server processing, visualization and file system access

OpenFabrics Software Stack
[Figure: layered diagram of the OpenFabrics stack. Labels recovered from the slide, top to bottom:]
• Application level (common apps and access methods for using the OF stack): Diag Tools, Open SM, IP Based App Access, Sockets Based Access (IBM DB2), Various MPIs, Block Storage Access, Clustered DB Access (Oracle 10g RAC), Access to File Systems
• User APIs (user space): User Level MAD API, UDAPL, SDP Library, User Level Verbs / API
• Upper Layer Protocol (kernel space): IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, Cluster File Sys
• Mid-Layer: Connection Manager Abstraction (CMA), SA Client, MAD, SMA, Connection Manager (two instances), InfiniBand Verbs / API, R-NIC Driver API
• Provider: Hardware Specific Driver (two instances)
• Hardware: InfiniBand HCA, iWARP R-NIC
Key:
  – SA: Subnet Administrator
  – MAD: Management Datagram
  – SMA: Subnet Manager Agent
  – PMA: Performance Manager Agent
  – IPoIB: IP over InfiniBand
  – SDP: Sockets Direct Protocol
  – SRP: SCSI RDMA Protocol (Initiator)
  – iSER: iSCSI RDMA Protocol (Initiator)
  – RDS: Reliable Datagram Service
  – UDAPL: User Direct Access Programming Lib
  – HCA: Host Channel Adapter
  – R-NIC: RDMA NIC

IP over IB (IPoIB)
• IETF Standard for mapping Internet protocols to InfiniBand
  – IETF IPoIB Working Group
• Covers
  – Fabric initialization
  – Multicast/Broadcast
  – Address resolution (IPv4/IPv6)
  – IP Datagram encapsulation (IPv4/IPv6)
  – MIBs

IP over IB (IPoIB)
• Communication Parameters
  – Obtained from Subnet Manager (SM)
    » P_Key (Partition Key)
    » SL (Service Level)
    » Path Rate
    » Link MTU (for IPv6 can be reduced with router advert)
    » GRH parameters – TClass, Flow Label, HopLimit
  – Obtained from address resolution
    » Data Link Layer Address (GID)
      • Persistent Data Link layer address
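One way to keep the communication parameters listed above together is a small record type that checks field widths. This is a sketch under stated assumptions: the class and field names are hypothetical, the sample values are illustrative only, and the bit widths are the standard InfiniBand field sizes (16-bit P_Key, 4-bit SL, 8-bit TClass, 20-bit Flow Label, 8-bit HopLimit, 128-bit GID), which the slide itself does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrhParams:
    """Global Route Header parameters named on the slide."""
    tclass: int       # 8-bit traffic class
    flow_label: int   # 20-bit flow label
    hop_limit: int    # 8-bit hop limit


@dataclass(frozen=True)
class IPoIBPathParams:
    """Hypothetical container for the parameters an IPoIB sender needs."""
    # Obtained from the Subnet Manager (SM)
    p_key: int            # 16-bit partition key
    sl: int               # 4-bit service level
    path_rate_gbps: float
    link_mtu: int         # bytes
    grh: GrhParams
    # Obtained from address resolution
    gid: bytes            # 128-bit data link layer address

    def __post_init__(self) -> None:
        widths = [
            ("p_key", self.p_key, 16),
            ("sl", self.sl, 4),
            ("tclass", self.grh.tclass, 8),
            ("flow_label", self.grh.flow_label, 20),
            ("hop_limit", self.grh.hop_limit, 8),
        ]
        for name, value, bits in widths:
            if not 0 <= value < (1 << bits):
                raise ValueError(f"{name} does not fit in {bits} bits")
        if len(self.gid) != 16:
            raise ValueError("gid must be exactly 16 octets")
        if self.link_mtu <= 0 or self.path_rate_gbps <= 0:
            raise ValueError("link_mtu and path_rate_gbps must be positive")


# Usage with illustrative values only.
params = IPoIBPathParams(
    p_key=0xFFFF,
    sl=0,
    path_rate_gbps=10.0,
    link_mtu=2048,
    grh=GrhParams(tclass=0, flow_label=0, hop_limit=1),
    gid=bytes.fromhex("fe80" + "00" * 14),
)
```

Validating at construction time means a value that cannot fit its field, such as an SL of 16, is rejected before it is ever used to build a packet.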
