Scalable Multiprocessors Supporting Programming Models Network Interface Interconnection Network Considerations in Bluegene/L Design

Topics Scaling issues Scalable Multiprocessors Supporting programming models Network interface Interconnection network Considerations in Bluegene/L design 2 Limited Scaling of a Bus Comparing with a LAN Characteristic Bus Characteristic Bus LAN Physical Length ~ 1 ft Physical Length ~ 1 ft KM Number of Connections fixed P1 Pn Number of Connections fixed many Maximum Bandwidth fixed . Maximum Bandwidth fixed ??? Interface to Comm. medium memory $ $ Interface to Comm. medium memory peripheral Global Order arbitration Global Order arbitration ??? Protection Virtual ⇒ physical Protection Virtual ⇒ physical OS Trust total MEM I/O Trust total little OS single OS single independent comm. abstraction HW comm. abstraction HW SW No clear limit to physical scaling, little trust, no Scaling limit global order, consensus difficult to achieve. Close coupling among components Independent failure and restart 3 4 Scalable Computers Bandwidth Scalability What are the design trade-offs for the spectrum of Switches machines between? Specialize or commodity nodes? S S . S Bus Capability of node-to-network interface Supporting programming models? Xbar P M P M . P M What does scalability mean? Avoid inherent design limits on resources Router Bandwidth increases with n What fundamentally limits bandwidth? single set of wires Latency does not increase with n Must have many independent wires Cost increases slowly with n Connect modules through switches 5 6 Programming Models Realized by Protocols Network Transaction Primitive Communication Network CAD Database Scientific modeling Parallel applications serialized msg Multiprogramming Shared Message Data Programming models ° ° ° address passing parallel output buffer input buffer Compilation or library Communication abstraction User/system boundary Source Node Destination Node Operating systems support Hardware/software boundary Communication hardware one-way transfer of information from a source Physical communication medium output buffer to a dest. input buffer causes some action at the destination Network Transactions occurrence is not directly visible at source deposit data, state change, reply 7 8 Shared Address Space Abstraction The Fetch Deadlock Problem Source Destination (1) Initiate memory access Load r ← [Global address] Even if a node cannot issue a request, it must sink (2) Address translation (3) Local/remote check network transactions. (4) Request transaction Read request Read request Incoming transaction may be a request, which will (5) Remote memory access Wait Memory access generate a response. Read response Closed system (finite buffering) (6) Reply transaction Read response (7) Complete memory access Time Fundamentally a two-way request/response protocol writes have an acknowledgement Issues fixed or variable length (bulk) transfers remote virtual or physical address, where is action performed? deadlock avoidance and input buffer full 9 10 Key Properties of SAS Abstraction Message passing Source and destination data addresses are specified Bulk transfers by the source of the request Complex synchronization semantics a degree of logical coupling and trust more complex protocols no storage logically “outside the application address More complex action space(s)” • But it may employ temporary buffers for transport Synchronous Operations are fundamentally request / response Send completes after matching recv and source data Remote operation can be performed on remote sent memory Receive completes after data transfer complete from logically does not require intervention of the remote matching send processor Asynchronous Send completes after send buffer may be reused 11 12 Synchronous Message Passing Asynchronous Message Passing: Optimistic Source Destination Source Destination (1) Initiate send Recv Psrc, local VA, len (1) Initiate send (2) Address translation on Psrc Send Pdest, local VA, len (2) Address translation (3) Local/remote check Send (P , local VA, len) (3) Local/remote check dest (4) Send-ready request Send-rdy req (4) Send data (5) Remote check for (5) Remote check for posted receive; on fail, Tag match allocate data buffer posted receive Wait Tag check Data-xfer req (assume success) Allocate buffer (6) Reply transaction Recv-rdy reply (7) Bulk data transfer Source VA⌫ Dest VA or ID Recv P , local VA, len Time src Data-xfer req Time More powerful programming model Wildcard receive => non-deterministic Constrained programming model. Deterministic! What happens when threads added? Storage required within msg layer? Destination contention very limited. 13 14 Asynchronous MSG Passing: Conservative Key Features of Msg Passing Abstraction Source Destination (1) Initiate send Source knows send data address, destination knows (2) Address translation on Pdest Send Pdest, local VA, len (3) Local/remote check receive data address (4) Send-ready request Send-rdy req after handshake they both know (5) Remote check for posted receive (assume fail); Return and compute record send-ready Tag check Arbitrary storage “outside the local address spaces” may post many sends before any receives (6) Receive-ready request Recv Psrc, local VA, len (7) Bulk data reply non-blocking asynchronous sends reduces the Source VA⌫ Dest VA or ID requirement to an arbitrary number of descriptors Recv-rdy req • fine print says these are limited too Fundamentally a 3-phase transaction Data-xfer reply Time includes a request / response Where is the buffering? can use optimistic 1-phase in limited “Safe” cases Contention control? Receiver initiated protocol? Short message optimizations 15 16 Network Interface Network Performance Metrics Transfer between local memory Sender Transmission time and NIC buffers Sender Overhead (size ÷ bandwidth) SW translates VA ⇔ PA Network (processor SW initiate DMA busy) SW does buffer management Time of Transmission time Receiver Flight (size ÷ bandwidth) Overhead NIC initiates interrupts on receive NIC Receiver (processor Provides protection MM MM Transport Latency MP MM busy) Transfer between NIC buffers and I/O the network Total Latency Generate packets Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead Flow control with the network Includes header/trailer in BW calculation? 17 18 Protected User-Level Communication User Level Network ports Traditional NIC (e.g. Ethernet) requires OS kernel to initiate DMA and to manage buffers Virtual address space Net output Prevent apps from crashing OS or other apps port Net input Overhead is high (how high?) port Processor Status Multicomputer or multiprocessor NICs Registers OS maps VA to PA buffers Program counter Apps initiate DMAs using VA addresses or handles of descriptors NIC use mapped PA buffers to perform DMAs Appears to user as logical message queues plus Examples status Research: Active message, UDMA What happens if no user pop? Industry: VIA and RDMA 19 20 User Level Abstraction Generic Multiprocessor Architecture IQ IQ Interconnection Network Proc Proc OQ OQ S S . S Core Core … Core VAS VAS … Core Core Core NIC IQ IQ . Proc Proc $ MM MM OQ OQ MP MM I/O VAS VAS Any user process can post a transaction for any other Network characteristics in protection domain Network bandwidth: on-chip and off-chip interconnection network Bandwidth demands: independent and communicating threads/processes communication layer moves OQsrc –> IQdest Latency: local and remote may involve indirection: VASsrc –> VASdest 21 22 Scalable Interconnection Network Requirements from Above At core of parallel computer architecture Communication-to-computation ratio Requirements and trade-offs at many levels ⇒ bandwidth that must be sustained for given computational rate Elegant mathematical structure Traffic localized or dispersed? Bursty or uniform? Deep relationships to algorithm structure Programming Model Managing many traffic flows Protocol Electrical / optical link properties Granularity of transfer Little consensus Interconnection Network Degree of overlap (slackness) interactions across levels Performance metrics S S . S The job of a parallel machine’s interconnection network is to Cost metrics transfer information from source node to destination node in Workload support of network transactions that realize the programming . Need holistic understanding model 23 24 Characteristics of A Network Basic Definitions Topology (what) Network interface Physical interconnection structure of the network graph Communication between a node and the network Direct: node connected to every switch Links Indirect: nodes connected to specific subset of switches Bundle of wires or fibers that carries signals Routing Algorithm (which) Restricts the set of paths that messages may follow Switches Many algorithms with different properties Connects fixed number of input channels to fixed Switching Strategy (how) number of output channels How data in a message traverses a route Store and forward vs. cut through Flow Control Mechanism (when) When a message or portions of it traverse a route What happens when traffic is encountered? 25 26 Network Basics Traditional Network Media Twisted Pair: Copper, 1mm think, twisted to avoid 0110 0110 attenna effect (telephone) "Cat 5" is 4 twisted pairs in bundle Coaxial Cable: Plastic Covering Insulator Used by cable companies: Copper core Link made of some physical media high BW, good noise Braided outer conductor immunity wire, fiber, air Buffer Light: 3 parts with

Load more