Message Passing
CS252 Graduate Computer Architecture
Lecture 24: Network Interface Design, Memory Consistency Models
Prof John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs252
4/27/2009 cs252-S09, Lecture 24

Message passing
• Sending of messages under control of programmer
  – User-level/system level?
  – Bulk transfers?
• How efficient is it to send and receive messages?
  – Speed of memory bus? First-level cache?
• Communication Model:
  – Synchronous
    » Send completes after matching recv and source data sent
    » Receive completes after data transfer complete from matching send
  – Asynchronous
    » Send completes after send buffer may be reused

Synchronous Message Passing
(source issues Send Pdest, local VA, len; destination issues Recv Psrc, local VA, len)
(1) Initiate send
(2) Address translation on Psrc
(3) Local/remote check
(4) Send-ready request (Send-rdy req)
(5) Remote check for posted receive; on fail, wait (tag check)
(6) Reply transaction (Recv-rdy reply)
(7) Bulk data transfer (Data-xfer req: source VA to dest VA or ID)
• Constrained programming model.
• Deterministic! What happens when threads added?
• Destination contention very limited.
• User/System boundary?

Asynchronous Message Passing: Optimistic
(1) Initiate send (Send Pdest, local VA, len)
(2) Address translation
(3) Local/remote check
(4) Send data (Data-xfer req)
(5) Remote check for posted receive; on fail, allocate data buffer (tag match; a later Recv Psrc, local VA, len drains the buffer)
• More powerful programming model
• Wildcard receive => non-deterministic
• Storage required within msg layer?

Asynchronous Message Passing: Conservative
(1) Initiate send (Send Pdest, local VA, len)
(2) Address translation on Pdest
(3) Local/remote check
(4) Send-ready request (Send-rdy req)
(5) Remote check for posted receive (assume fail); record send-ready, return and compute (tag check)
(6) Receive-ready request (Recv Psrc, local VA, len triggers Recv-rdy req)
(7) Bulk data reply (Data-xfer reply: source VA to dest VA or ID)
• Where is the buffering?
• Contention control? Receiver-initiated protocol?
• Short message optimizations

Features of Msg Passing Abstraction
• Source knows send data address, dest. knows receive data address
  – after handshake they both know both
• Arbitrary storage “outside the local address spaces”
  – may post many sends before any receives
  – non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
    » fine print says these are limited too
• Optimistically, can be a 1-phase transaction
  – Compare to 2-phase for shared address space
  – Need some sort of flow control
    » Credit scheme?
• More conservative: 3-phase transaction
  – includes a request / response
• Essential point: combined synchronization and communication in a single package!

Active Messages
• User-level analog of network transaction
  – transfer data packet and invoke handler to extract it from the network and integrate with on-going computation
• Request/Reply (request handler at destination, reply handler at source)
• Event notification: interrupts, polling, events?
• May also perform memory-to-memory transfer

Common Challenges
• Input buffer overflow
  – N-1 queue over-commitment => must slow sources
• Options:
  – reserve space per source (credit)
    » when available for reuse? Ack or higher level
  – Refuse input when full
    » backpressure in reliable network
    » tree saturation
    » deadlock free
    » what happens to traffic not bound for congested dest?
  – Reserve ack back channel
  – drop packets
  – Utilize higher-level semantics of programming model
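The per-source credit option can be sketched as a toy sender. Each credit stands for one buffer slot reserved at the receiver; the class and method names are illustrative, not any real machine's scheme:

```python
class CreditSender:
    """Sender side of a per-destination credit scheme: a credit is one
    buffer slot reserved at the receiver, so a send can only enter the
    network when guaranteed space exists at the other end."""
    def __init__(self, credits):
        self.credits = credits        # reserved slots at the receiver
        self.pending = []             # messages waiting for a credit

    def send(self, msg, wire):
        if self.credits > 0:
            self.credits -= 1         # consume one reserved slot
            wire.append(msg)
            return True
        self.pending.append(msg)      # no space guaranteed: hold locally
        return False

    def on_credit_return(self, wire):
        # Receiver freed a buffer slot and returned the credit;
        # a held message (if any) can now go out.
        self.credits += 1
        if self.pending:
            self.send(self.pending.pop(0), wire)
```

The "when available for reuse?" question above is exactly the `on_credit_return` event: some layer (link-level ack or higher-level protocol) must tell the sender the slot is free again.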
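The conservative 3-phase handshake (send-ready, receive-ready, bulk data) can be simulated in a few lines; `ConservativeReceiver` and its methods are hypothetical names for illustration only:

```python
class ConservativeReceiver:
    """Destination side of the 3-phase protocol: a send-ready that
    arrives before a matching receive is posted is merely recorded,
    and the destination processor returns to computing."""
    def __init__(self):
        self.pending_sends = {}   # tag -> source id (recorded send-ready)

    def on_send_ready(self, tag, src):
        # Phase 1 arrives: no receive posted, so record and return.
        self.pending_sends[tag] = src

    def post_recv(self, tag):
        # Phase 2: a matching receive is posted; a receive-ready goes
        # back to the recorded source (None if no send-ready pending).
        # The source answers with the phase-3 bulk data reply.
        return self.pending_sends.pop(tag, None)
```

Only one message crosses the network per phase, and no bulk data moves until both sides know both addresses, which is exactly why only descriptor-sized storage is needed inside the message layer.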
Spectrum of Designs
• None: physical bit stream
  – blind, physical DMA: nCUBE, iPSC, …
• User/System
  – User-level port: CM-5, *T, Alewife, RAW
  – User-level handler: J-Machine, Monsoon, …
• Remote virtual address
  – Processing, translation: Paragon, Meiko CS-2
• Global physical address
  – Proc + memory controller: RP3, BBN, T3D
• Cache-to-cache
  – Cache controller: Dash, Alewife, KSR, Flash
Increasing HW support, specialization, intrusiveness, performance (???)

Net Transactions: Physical DMA
(diagram: DMA channels with Addr/Length registers, Cmd and Status/interrupt registers, Rdy signals between memory and processor on each node)
• DMA controlled by regs, generates interrupts
• Physical addresses => OS initiates transfers; sender auth
• Send-side: construct system “envelope” around user data in kernel area; includes dest addr
• Receive: receive into system buffer, since no interpretation in user space

nCUBE Network Interface
(diagram: independent input and output ports through a switch; per-channel Addr/Len/Status/Next descriptor chains in memory; processor on the memory bus)
• independent DMA channel per link direction
  – leave input buffers always open
  – segmented messages
• routing interprets envelope
  – dimension-order routing on hypercube
  – bit-serial with 36-bit cut-through
• Costs (includes interrupt): Os = 16 instructions, 260 cycles, 13 us; Or = 18 instructions, 200 cycles, 15 us

Conventional LAN NI
(diagram: NIC on the IO bus with trncv controller and TX/RX DMA engines, addr/len registers; descriptor chains in host memory; processor on the memory bus)
• Costs: marshalling, OS calls, interrupts
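The send/receive asymmetry under blind physical DMA, described under "Net Transactions: Physical DMA", can be sketched as follows: the OS wraps user data in a kernel-area system envelope before the DMA engine ships it, and the receiver can only land messages in a system buffer. The field names are hypothetical, and a dict stands in for a contiguous kernel buffer:

```python
def build_envelope(user_data: bytes, src_node: int, dest_node: int) -> dict:
    """Kernel-side send: copy user data into a system 'envelope' that
    carries the routing and length fields the hardware interprets."""
    return {"src": src_node, "dest": dest_node,
            "len": len(user_data), "payload": bytes(user_data)}

def deliver(envelope: dict, system_buffers: list) -> None:
    """Receive side: with no interpretation possible in user space,
    the NI can only DMA the whole message into a system buffer;
    the OS copies the payload out to the right user later."""
    system_buffers.append(envelope)
```

Both the extra copy through the kernel and the OS call on every transfer are why this point in the design spectrum has the highest software overhead.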
User Level Ports
(diagram: net input and net output ports mapped into the user virtual address space, with status registers and a user/system flag; Mem + P on each node)
• initiate transaction at user level
• deliver to user without OS intervention
• network port in user space
  – May use virtual memory to map physical I/O to user mode
• User/system flag in envelope
  – protection check, translation, routing, media access in src CA
  – user/sys check in dest CA, interrupt on system

Example: CM-5
(diagram: processing partitions, control partition, I/O partition, and diagnostics network on the data and control networks; node: SPARC + FPU, $, NI, MBUS, vector units, DRAM ctrl)
• Input and output FIFO for each network
• 2 data networks plus control network
• tag per message
  – index NI mapping table
• context switching?
• Alewife: integrated NI on chip
• *T and iWARP also
• Costs: Os = 50 cycles (1.5 us); Or = 53 cycles (1.6 us); interrupt 10 us

RAW Processor: Systolic Computation
• Very fast support for systolic processing
  – Streaming from one processor to another
    » Simple moves into network ports and out of network ports
  – Static router programmed at same time as processors
• Also included dynamic network for unpredictable computations (and things like cache misses)

User Level Handlers
(diagram: user/system envelope carrying data, address, and dest; Mem + P on each node)
• Hardware support to vector to address specified in message
  – On arrival, hardware fetches handler address and starts execution
• Active Messages: two options
  – Computation in background threads
    » Handler never blocks: it integrates message into computation
  – Computation in handlers (Message Driven Processing)
    » Handler does work, may need to send messages or block

J-Machine
• Each node a small message-driven processor
• HW support to queue msgs and dispatch to msg handler task

Alewife Messaging
• Send message
  – write words to special network interface registers
  – Execute atomic launch instruction
• Receive
  – Generate interrupt/launch user-level thread context
  – Examine message by reading from special network interface registers
  – Execute dispose message
  – Exit atomic section

Sharing of Network Interface
• What if user is in the middle of constructing a message and must context switch???
  – Need Atomic Send operation!
    » Message either completely in network or not at all
    » Can save/restore user’s work if necessary (think about a single set of network interface registers)
  – J-Machine mistake: after starting to send a message, must let the sender finish
    » Flits start entering network with first SEND instruction
    » Only a SENDE instruction constructs tail of message
• Receive Atomicity
  – If we want to allow user-level interrupts or polling, must give user control over network reception
    » The closer the user is to the network, the easier it is to screw it up: refuse to empty the network, etc.
    » However, must allow atomicity: a way for a good user to select when their message handlers get interrupted
  – Polling: ultimate receive atomicity, never interrupted
    » Fine as long as user keeps absorbing messages

The Fetch Deadlock Problem
• Even if a node cannot issue a request, it must sink network transactions!
  – Incoming transaction may be a request that generates a response
  – Closed system (finite buffering)
• Deadlock occurs even if network is deadlock free!

Solutions to Fetch Deadlock?
• logically independent request/reply networks
  – physical networks
  – virtual channels with separate input/output queues
• bound requests and reserve input buffer space
  – K(P-1) requests + K responses per node
  – service discipline to avoid fetch deadlock?
• NACK on input buffer full
  – NACK delivery?

Example Queue Topology: Alewife
• Message-Passing and Shared-Memory both need messages
  – Thus, can provide both!
• When deadlock detected, start storing messages to memory (out of hardware)
  – Remove deadlock by increasing available queue space
• Alewife Solution:
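The handler-vectoring idea under "User Level Handlers" can be sketched as a toy dispatch routine. A real NI fetches the handler address in hardware on arrival; the function names here are illustrative:

```python
def am_dispatch(message, handler_table, state):
    """Vector directly to the handler named in the message header and
    let it integrate the payload into the on-going computation, with
    no OS intervention on the receive path."""
    handler_id, payload = message
    handler_table[handler_id](payload, state)

def accumulate(payload, state):
    # Example non-blocking handler: fold the payload into local state
    # and return immediately (the "computation in background threads"
    # option, where the handler itself never blocks).
    state["sum"] = state.get("sum", 0) + payload
```

In the Message Driven Processing option the handler would instead do the real work, and might itself need to send messages or block.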
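The Alewife-style atomic send (stage words in NI registers, then launch) can be modeled in a few lines; register and method names are illustrative, not the actual Alewife interface:

```python
class AtomicLaunchNI:
    """Message words are staged in NI 'registers' and enter the
    network only on launch, so a context switch can never strand
    half a message in the network (contrast the J-Machine, where
    flits flow out starting with the first SEND instruction)."""
    def __init__(self):
        self.staging = []     # words written but not yet launched
        self.network = []     # messages actually in the network

    def write_word(self, word):
        self.staging.append(word)

    def launch(self):
        # Atomic commit: the whole message enters the network at once.
        self.network.append(tuple(self.staging))
        self.staging = []

    def context_switch(self):
        # Unlaunched words are just register state, so the OS can
        # save them and restore them later; nothing partial is in
        # the network at any point.
        saved, self.staging = self.staging, []
        return saved
```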
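The request/reply separation and NACK-on-full options under "Solutions to Fetch Deadlock?" can be sketched as a toy node: replies are always sunk, so the node can keep draining the network even when its bounded request queue is full (all names illustrative):

```python
class Node:
    """Fetch-deadlock avoidance sketch: logically separate treatment
    of requests and replies.  A reply needs no further buffering and
    is always consumed; a request may generate a response, so it is
    only accepted while reserved input buffer space remains."""
    def __init__(self, req_capacity):
        self.req_q = []
        self.req_capacity = req_capacity   # e.g. K(P-1) reserved slots

    def deliver(self, kind, msg):
        if kind == "reply":
            return "sunk"                  # replies always make progress
        if len(self.req_q) < self.req_capacity:
            self.req_q.append(msg)
            return "accepted"
        return "nack"                      # NACK on input buffer full
```

With capacity reserved for K(P-1) requests plus K responses per node, the NACK branch is never taken; the NACK path is the alternative when buffers are not pre-reserved, and it raises the delivery question noted above.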