CS252 Graduate Architecture
Lecture 24: Network Interface Design / Memory Consistency Models
4/27/2009
Prof John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs252

Message passing

• Sending of messages under control of programmer
  – User-level/system level?
  – Bulk transfers?
• How efficient is it to send and receive messages?
  – Speed of memory bus? First-level cache?
• Communication Model:
  – Synchronous
    » Send completes after matching recv and source data sent
    » Receive completes after data transfer complete from matching send
  – Asynchronous
    » Send completes after send buffer may be reused
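The synchronous/asynchronous completion semantics above map directly onto familiar message-passing APIs. A minimal sketch using MPI (my choice of library for illustration; the lecture does not prescribe one): MPI_Ssend returns only after a matching receive has started, while MPI_Isend returns immediately and the buffer may be reused only after MPI_Wait.

```c
/* Minimal sketch of synchronous vs. asynchronous send semantics,
 * illustrated with MPI (assumed here for concreteness).
 * Build with an MPI compiler wrapper, e.g. mpicc. */
#include <mpi.h>
#include <string.h>

void send_examples(int dest, char *buf, int len)
{
    /* Synchronous: completes only after the matching receive has
     * started, so the sender knows the rendezvous happened. */
    MPI_Ssend(buf, len, MPI_CHAR, dest, /*tag=*/0, MPI_COMM_WORLD);

    /* Asynchronous: returns immediately; the send buffer may be
     * reused only after MPI_Wait reports completion. */
    MPI_Request req;
    MPI_Isend(buf, len, MPI_CHAR, dest, /*tag=*/1, MPI_COMM_WORLD, &req);
    /* ... overlap computation here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    memset(buf, 0, len);   /* now safe to reuse the buffer */
}
```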

Synchronous Message Passing

(Figure: source/destination timeline, time increasing downward)
Destination: Recv Psrc, local VA, len; wait
Source:
  (1) Initiate send: Send Pdest, local VA, len
  (2) Address translation on Psrc
  (3) Local/remote check
  (4) Send-ready request (Send-rdy req)
Destination:
  (5) Remote check for posted receive (assume success); tag check
  (6) Reply transaction (Recv-rdy reply)
Source:
  (7) Bulk data transfer: Source VA → Dest VA or ID (Data-xfer req)

• Constrained programming model
• Deterministic! What happens when threads added?
• Destination contention very limited
• User/System boundary?

Asynch. Message Passing: Optimistic

(Figure: source/destination timeline, time increasing downward)
Source:
  (1) Initiate send: Send (Pdest, local VA, len)
  (2) Address translation
  (3) Local/remote check
  (4) Send data (Data-xfer req)
Destination:
  (5) Remote check for posted receive; on fail, allocate data buffer
      (processor action?)
      (later) Recv Psrc, local VA, len; tag match; allocate buffer

• More powerful programming model
• Wildcard receive => non-deterministic
• Storage required within msg layer?
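The receive-side decision in the optimistic protocol above — match against a posted receive, otherwise buffer the data inside the message layer — can be sketched roughly as below. All structures and names are illustrative assumptions, not from the lecture.

```c
/* Illustrative sketch of the receive side of the optimistic (eager)
 * protocol: data arrives unsolicited, so the message layer must either
 * match a posted receive or buffer the data itself. */
#include <stdlib.h>
#include <string.h>

typedef struct posted_recv {
    int src, tag;
    void *user_buf;
    size_t len;
    struct posted_recv *next;
} posted_recv_t;

static posted_recv_t *posted_list;      /* receives posted by the user */

/* Assumed helper: remember data that arrived before its Recv was posted. */
extern void enqueue_unexpected(int src, int tag, void *buf, size_t len);

void on_data_xfer_req(int src, int tag, const void *data, size_t len)
{
    /* (5) Remote check for posted receive: walk the posted-receive list. */
    for (posted_recv_t **pp = &posted_list; *pp; pp = &(*pp)->next) {
        posted_recv_t *r = *pp;
        if (r->src == src && r->tag == tag) {    /* tag match */
            memcpy(r->user_buf, data, len < r->len ? len : r->len);
            *pp = r->next;
            free(r);
            return;
        }
    }
    /* On fail: allocate a data buffer inside the message layer and hold
     * the data until a matching Recv is posted.  This is exactly the
     * "storage required within msg layer?" concern. */
    void *sys_buf = malloc(len);
    memcpy(sys_buf, data, len);
    enqueue_unexpected(src, tag, sys_buf, len);
}
```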

Asynch. Msg Passing: Conservative

(Figure: source/destination timeline, time increasing downward)
Source:
  (1) Initiate send: Send Pdest, local VA, len
  (2) Address translation on Pdest
  (3) Local/remote check
  (4) Send-ready request (Send-rdy req)
Destination:
  (5) Remote check for posted receive (assume fail); record send-ready,
      return and compute
      (later) Recv Psrc, local VA, len; tag check
  (6) Receive-ready request (Recv-rdy req)
Source:
  (7) Bulk data reply: Source VA → Dest VA or ID (Data-xfer reply)

• Where is the buffering?
• Contention control? Receiver initiated protocol?
• Short message optimizations

Features of Msg Passing Abstraction

• Source knows send data address, dest. knows receive data address
  – after handshake they both know both
• Arbitrary storage “outside the local address spaces”
  – may post many sends before any receives
  – non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
    » fine print says these are limited too
• Optimistically, can be 1-phase transaction
  – Compare to 2-phase for shared address space
  – Need some sort of flow control
    » Credit scheme? (see the sketch below)
• More conservative: 3-phase transaction
  – includes a request / response
• Essential point: combined synchronization and communication in a single package!
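One plausible form of the credit scheme mentioned above: the sender tracks per-destination credits and sends eagerly (1-phase) while credits remain, falling back to the conservative 3-phase rendezvous otherwise. A hypothetical sketch; the constants and helper functions are assumptions.

```c
/* Hypothetical credit-based flow control for eager (1-phase) sends.
 * Each destination grants CREDITS_PER_DEST buffer slots up front;
 * a credit returns when the receiver drains the corresponding buffer. */
#include <stddef.h>

#define MAX_NODES        1024
#define CREDITS_PER_DEST 8

static int credits[MAX_NODES];                    /* filled by init_credits() */

void init_credits(void)
{
    for (int i = 0; i < MAX_NODES; i++)
        credits[i] = CREDITS_PER_DEST;
}

/* Assumed lower-level primitives (not defined here). */
void eager_send(int dest, const void *buf, size_t len);       /* 1-phase */
void rendezvous_send(int dest, const void *buf, size_t len);  /* 3-phase */

void send_with_credits(int dest, const void *buf, size_t len)
{
    if (credits[dest] > 0) {
        credits[dest]--;             /* consume a reserved buffer at dest */
        eager_send(dest, buf, len);  /* optimistic: data goes immediately */
    } else {
        /* No credit left: fall back to the conservative 3-phase protocol,
         * which moves data only once the receiver is ready. */
        rendezvous_send(dest, buf, len);
    }
}

/* Called when the receiver acknowledges a drained buffer. */
void on_credit_return(int dest) { credits[dest]++; }
```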

Active Messages

(Figure: Request message invokes a request handler at the destination;
Reply message invokes a reply handler back at the source)

• User-level analog of network transaction
  – transfer data packet and invoke handler to extract it from the network and integrate with on-going computation
• Request/Reply
• Event notification: interrupts, polling, events?
• May also perform memory-to-memory transfer

Common Challenges

• Input buffer overflow
  – N-1 queue over-commitment => must slow sources
• Options:
  – reserve space per source (credit)
    » when available for reuse?
      • Ack or Higher level
  – Refuse input when full
    » backpressure in reliable network
    » tree saturation
    » deadlock free
    » what happens to traffic not bound for congested dest?
  – Reserve ack back channel
  – drop packets
  – Utilize higher-level semantics of programming model
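A rough sketch of the active-message idea: each packet carries the address of a user-level handler that the receiver invokes to pull the data out of the network. The interface below is invented for illustration (loosely in the spirit of request/reply active messages), not an API from the lecture.

```c
/* Illustrative sketch of active messages: the first word of each packet
 * names a user-level handler; the receiver invokes it immediately to
 * integrate the payload with the on-going computation.
 * Assumes code is loaded at the same address on every node, so a raw
 * handler pointer is meaningful remotely (a common early-AM assumption). */
#include <stdint.h>

typedef void (*am_handler_t)(int src, uint64_t arg0, uint64_t arg1);

typedef struct {
    am_handler_t handler;   /* invoked on arrival at the destination */
    int          src;
    uint64_t     arg0, arg1;
} am_packet_t;

/* Assumed network primitives (not defined here). */
void net_send(int dest, const am_packet_t *pkt);
int  net_poll(am_packet_t *pkt);          /* returns 1 if a packet arrived */
extern int my_node;

/* Request side: ship a handler pointer plus small arguments. */
void am_request(int dest, am_handler_t h, uint64_t a0, uint64_t a1)
{
    am_packet_t pkt = { .handler = h, .src = my_node, .arg0 = a0, .arg1 = a1 };
    net_send(dest, &pkt);
}

/* Receive side: event notification by polling here; could also be an
 * interrupt.  Each handler runs to completion and never blocks. */
void am_poll(void)
{
    am_packet_t pkt;
    while (net_poll(&pkt))
        pkt.handler(pkt.src, pkt.arg0, pkt.arg1);
}
```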

Spectrum of Designs

• None: Physical bit stream
  – blind, physical DMA                    nCUBE, iPSC, . . .
• User/System
  – User-level port                        CM-5, *T, Alewife, RAW
  – User-level handler                     J-Machine, Monsoon, . . .
• Remote virtual address
  – Processing, translation                Paragon, Meiko CS-2
• Global physical address
  – Proc + Memory controller               RP3, BBN, T3D
• Cache-to-cache
  – Cache controller                       Dash, Alewife, KSR, Flash

Increasing HW Support, Specialization, Intrusiveness, Performance (???)

Net Transactions: Physical DMA

(Figure: sender and receiver nodes, each with memory, processor, and DMA
channels programmed through Addr/Length/Cmd/Rdy/Status registers; the
message carries Data, Dest, and sender auth, and arrival raises an interrupt)

• DMA controlled by regs, generates interrupts
• Physical => OS initiates transfers
• Send-side
  – construct system “envelope” around user data in kernel area
• Receive
  – receive into system buffer, since no interpretation in user space
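A sketch of what “OS initiates transfers” means on the send side: a system call copies the user data into a kernel “envelope” (adding destination and sender-authentication fields) and then programs the DMA registers. The register layout, envelope fields, and syscall shape are all assumptions for illustration.

```c
/* Hypothetical send path for a physical-DMA network interface: the
 * kernel wraps user data in a system "envelope" and programs the NI's
 * DMA registers.  Register addresses and names are invented. */
#include <stdint.h>
#include <string.h>

struct envelope {
    uint32_t dest;        /* destination node (checked by the kernel)   */
    uint32_t sender_auth; /* sender authentication, filled in by kernel */
    uint32_t length;      /* payload length in bytes                    */
    uint8_t  payload[];   /* user data copied into kernel space         */
};

/* Memory-mapped DMA registers (addresses purely illustrative). */
volatile uint64_t *const DMA_ADDR = (uint64_t *)0xFFFF0000;
volatile uint32_t *const DMA_LEN  = (uint32_t *)0xFFFF0008;
volatile uint32_t *const DMA_CMD  = (uint32_t *)0xFFFF000C;

/* Called inside the OS on behalf of a user send() system call. */
int sys_net_send(uint32_t dest, const void *user_buf, uint32_t len,
                 struct envelope *kbuf /* kernel-area staging buffer */)
{
    kbuf->dest        = dest;
    kbuf->sender_auth = 0x252;                /* stand-in credential */
    kbuf->length      = len;
    memcpy(kbuf->payload, user_buf, len);     /* user -> kernel copy */

    *DMA_ADDR = (uint64_t)(uintptr_t)kbuf;    /* identity-mapped kernel buffer assumed */
    *DMA_LEN  = (uint32_t)sizeof(*kbuf) + len;
    *DMA_CMD  = 1;                            /* start transfer; NI interrupts when done */
    return 0;
}
```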


nCUBE Network Interface

(Figure: input and output ports feed independent DMA channels, each with
Addr/Length registers, connected through a switch to the memory bus,
memory, and processor)

• independent DMA channel per link direction
  – leave input buffers always open
  – segmented messages
• routing interprets envelope
  – dimension-order routing on hypercube
  – bit-serial with 36 bit cut-through

  Os   16 ins   260 cy   13 us
  Or   18       200 cy   15 us
  (send / receive overheads)

Conventional LAN NI

(Figure: NIC controller with trncv engine and TX/RX DMA driven by addr/len
registers and a chain of Addr/Len/Status/Next descriptors in host memory,
attached to the processor over the I/O bus and memory bus; includes interrupt)

• Costs: Marshalling, OS calls, interrupts
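The Addr/Len/Status/Next fields in the figure suggest a chained descriptor queue, the standard structure for conventional NICs. A hypothetical sketch of how a driver might post and harvest receive buffers in such a chain (field names follow the figure; everything else is assumed):

```c
/* Hypothetical chained receive-descriptor queue for a conventional LAN NI,
 * following the Addr/Len/Status/Next fields shown in the figure. */
#include <stdint.h>
#include <stddef.h>

#define DESC_OWNED_BY_NIC  0x1   /* NIC may fill this buffer            */
#define DESC_DONE          0x2   /* NIC has written a packet + length   */

struct rx_desc {
    uint64_t addr;               /* physical address of the data buffer */
    uint32_t len;                /* buffer size, or received length     */
    uint32_t status;             /* ownership / completion bits         */
    struct rx_desc *next;        /* chain pointer walked by the NIC     */
};

/* Post a buffer: hand ownership to the NIC so input stays open. */
void post_rx_buffer(struct rx_desc *d, uint64_t buf_pa, uint32_t size)
{
    d->addr   = buf_pa;
    d->len    = size;
    d->status = DESC_OWNED_BY_NIC;
}

/* Assumed upcall into the OS networking stack. */
extern void deliver(uint64_t buf_pa, uint32_t len);

/* Interrupt/poll path: harvest completed descriptors and recycle them. */
void rx_poll(struct rx_desc *head)
{
    for (struct rx_desc *d = head; d != NULL; d = d->next) {
        if (d->status & DESC_DONE) {
            deliver(d->addr, d->len);        /* hand packet to the OS stack */
            d->status = DESC_OWNED_BY_NIC;   /* repost the buffer */
        }
    }
}
```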

User Level Ports

(Figure: net output and net input ports mapped into the user virtual
address space; processor state includes Status, Program counter, and
Registers; each message carries Data, Dest, and a user/system flag)

• initiate transaction at user level
• deliver to user without OS intervention
• network port in user space
  – May use virtual memory to map physical I/O to user mode
• User/system flag in envelope
  – protection check, translation, routing, media access in src CA
  – user/sys check in dest CA, interrupt on system

Example: CM-5

(Figure: diagnostics, control, and data networks connecting processing
partitions, control processors, and an I/O partition; each node has a
SPARC + FPU, cache and NI on the MBUS, vector units, and DRAM controllers)

• Input and output FIFO for each network
• 2 data networks
• tag per message
  – index NI mapping table
• context switching?
• Alewife integrated NI on chip
• *T and iWARP also

  Os          50 cy   1.5 us
  Or          53 cy   1.6 us
  interrupt           10 us
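A sketch of what “initiate transaction at user level” can look like once the network FIFOs are mapped into the process's address space: a send is a few stores, a receive a few loads, with no OS on the fast path. Register offsets, status bits, and the header format are invented; this is not the CM-5's actual NI programming interface.

```c
/* Hypothetical user-level network port: the NI's output/input FIFOs are
 * mapped into the process's virtual address space.  All addresses,
 * offsets, and status bits are invented for illustration. */
#include <stdint.h>

#define NI_BASE        ((volatile uint64_t *)0x40000000)  /* mmap'ed NI page   */
#define NI_OUT_FIFO    (NI_BASE + 0)    /* write: push word into output port   */
#define NI_IN_FIFO     (NI_BASE + 1)    /* read: pop word from input port      */
#define NI_STATUS      (NI_BASE + 2)    /* read: FIFO-space / msg-ready bits   */

#define STATUS_OUT_SPACE  0x1
#define STATUS_IN_READY   0x2

/* Send a small message: first word carries dest plus a user/system tag. */
void port_send(uint32_t dest, uint64_t payload)
{
    while (!(*NI_STATUS & STATUS_OUT_SPACE))
        ;                                   /* spin until FIFO has space */
    *NI_OUT_FIFO = ((uint64_t)dest << 32) | 0x1;   /* 0x1 = user-level tag */
    *NI_OUT_FIFO = payload;
}

/* Poll for an incoming message; returns 1 and fills *payload if present. */
int port_recv(uint64_t *payload)
{
    if (!(*NI_STATUS & STATUS_IN_READY))
        return 0;
    uint64_t header = *NI_IN_FIFO;          /* tag word (unused here) */
    (void)header;
    *payload = *NI_IN_FIFO;
    return 1;
}
```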

RAW processor: Systolic Computation

• Very fast support for systolic processing
  – Streaming from one processor to another
    » Simple moves into network ports and out of network ports
  – Static router programmed at same time as processors
• Also included dynamic network for unpredictable computations (and things like cache misses)

User Level Handlers

(Figure: message carrying Data, Address, Dest, and a user/system flag,
delivered directly between the memory/processor pairs of two nodes)

• Hardware support to vector to address specified in message
  – On arrival, hardware fetches handler address and starts execution
• Active Messages: two options
  – Computation in background threads
    » Handler never blocks: it integrates message into computation
  – Computation in handlers (Message Driven Processing)
    » Handler does work, may need to send messages or block
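A sketch of the “vector to the address specified in the message” idea: the arrival path pulls a handler address out of the message header and jumps to it. Both handler styles above are shown. This is a hypothetical illustration, not any specific machine's mechanism.

```c
/* Hypothetical dispatch for user-level handlers: the first word of an
 * arriving message is a handler address that the NI (or a thin trap
 * handler) transfers control to. */
#include <stdint.h>

typedef void (*handler_t)(uint64_t *msg, int nwords);

/* Style 1: handler never blocks -- it only integrates the message into
 * the on-going computation, e.g. by appending to a task queue that a
 * background thread drains. */
extern void task_enqueue(uint64_t *msg, int nwords);   /* assumed helper */
void enqueue_handler(uint64_t *msg, int nwords) { task_enqueue(msg, nwords); }

/* Style 2: message-driven processing -- the handler does the work itself
 * and may send further messages (and so could block on the network). */
void compute_handler(uint64_t *msg, int nwords)
{
    /* ... perform the work named by the message, possibly send replies ... */
    (void)msg; (void)nwords;
}

/* Arrival path: fetch the handler address from the head of the message
 * and start executing it; arguments follow the header word. */
void on_message_arrival(uint64_t *msg, int nwords)
{
    handler_t h = (handler_t)(uintptr_t)msg[0];
    h(msg + 1, nwords - 1);
}
```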

J-Machine

• Each node a small msg-driven processor
• HW support to queue msgs and dispatch to msg handler task

Alewife Messaging

• Send message
  – write words to special network interface registers
  – Execute atomic launch instruction
• Receive
  – Generate interrupt/launch user-level context
  – Examine message by reading from special network interface registers
  – Execute dispose message
  – Exit atomic section
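The write-registers-then-launch sequence described for Alewife can be sketched as below. On the real machine these are dedicated instructions (atomic launch, dispose); here they are modeled as stores to invented memory-mapped registers, so everything concrete in this sketch is an assumption.

```c
/* Rough sketch of Alewife-style register-based messaging, modeled with
 * invented memory-mapped registers and command writes. */
#include <stdint.h>

#define NI_OUT_REG(i)  (((volatile uint64_t *)0x50000000)[i])  /* outgoing msg words */
#define NI_IN_REG(i)   (((volatile uint64_t *)0x50001000)[i])  /* incoming msg words */
#define NI_CMD         (*(volatile uint64_t *)0x50002000)      /* launch / dispose   */
#define CMD_LAUNCH     0x1
#define CMD_DISPOSE    0x2

/* Send: write the message into NI registers, then launch atomically so
 * the message is either completely in the network or not at all. */
void alewife_style_send(const uint64_t *words, int n)
{
    for (int i = 0; i < n; i++)
        NI_OUT_REG(i) = words[i];
    NI_CMD = CMD_LAUNCH;              /* stands in for the atomic launch instruction */
}

/* Receive: runs in the interrupt / user-level context launched on arrival. */
void alewife_style_receive(uint64_t *words, int n)
{
    for (int i = 0; i < n; i++)
        words[i] = NI_IN_REG(i);      /* examine message from NI registers */
    NI_CMD = CMD_DISPOSE;             /* dispose message, then exit atomic section */
}
```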


Sharing of Network Interface

• What if user in middle of constructing message and must context switch???
  – Need Atomic Send operation!
    » Message either completely in network or not at all
    » Can save/restore user’s work if necessary (think about single set of network interface registers; sketched below)
  – J-Machine mistake: after start sending message must let sender finish
    » Flits start entering network with first SEND instruction
    » Only a SENDE instruction constructs tail of message
• Receive Atomicity
  – If want to allow user-level interrupts or polling, must give user control over network reception
    » Closer user is to network, easier it is for him/her to screw it up: refuse to empty network, etc
    » However, must allow atomicity: way for good user to select when their message handlers get interrupted
  – Polling: ultimate receive atomicity – never interrupted
    » Fine as long as user keeps absorbing messages

The Fetch Deadlock Problem

(Figure: nodes whose input and output queues into the NETWORK are both full)

• Even if a node cannot issue a request, it must sink network transactions!
  – Incoming transaction may be a request => generate a response
  – Closed system (finite buffering)
• Deadlock occurs even if network deadlock free!
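Returning to the Atomic Send requirement above: either message construction is made uninterruptible, or the context-switch code must save and restore a partially composed message along with the rest of the user's state. A hypothetical sketch of the save/restore variant, reusing invented NI register names in the style of the earlier sketches:

```c
/* Hypothetical handling of a context switch that lands while a user is
 * mid-way through composing a message in NI output registers: save the
 * partially written descriptor and restore it before the user runs again.
 * Register names and the word-count register are invented. */
#include <stdint.h>

#define NI_NUM_OUT_REGS 16
extern volatile uint64_t NI_OUT[NI_NUM_OUT_REGS];   /* assumed NI output regs */
extern volatile uint32_t NI_OUT_COUNT;              /* words written so far   */

struct ni_saved_state {
    uint64_t out_regs[NI_NUM_OUT_REGS];
    uint32_t out_count;
};

void ni_save_on_context_switch(struct ni_saved_state *s)
{
    s->out_count = NI_OUT_COUNT;
    for (uint32_t i = 0; i < s->out_count; i++)
        s->out_regs[i] = NI_OUT[i];   /* message not yet launched: safe to copy */
}

void ni_restore_on_context_switch(const struct ni_saved_state *s)
{
    for (uint32_t i = 0; i < s->out_count; i++)
        NI_OUT[i] = s->out_regs[i];
    NI_OUT_COUNT = s->out_count;      /* user resumes composing, then launches */
}
```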

Solutions to Fetch Deadlock?

• logically independent request/reply networks
  – physical networks
  – virtual channels with separate input/output queues
• bound requests and reserve input buffer space
  – K(P-1) requests + K responses per node
  – service discipline to avoid fetch deadlock?
• NACK on input buffer full
  – NACK delivery?
• Alewife Solution:
  – Dynamically increase buffer space to memory when necessary
  – Argument: this is an uncommon case, so use software to fix

Example Queue Topology: Alewife

• Message-Passing and Shared-Memory both need messages
  – Thus, can provide both!
• When deadlock detected, start storing messages to memory (out of hardware)
  – Remove deadlock by increasing available queue space
• When network starts flowing again, relaunch queued messages
  – They take loopback path to be handled by local hardware
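A hypothetical sketch of the Alewife-style recovery path described above: when the output queue stays blocked too long, incoming messages are drained into a software queue in memory, then relaunched over the loopback path once the network drains. All structures, hooks, and thresholds are invented for illustration.

```c
/* Hypothetical software overflow buffering in the spirit of the Alewife
 * solution: divert incoming messages from the hardware input queue into
 * memory when (potential) fetch deadlock is detected, and relaunch them
 * via loopback when the network starts flowing again. */
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct swq_entry { uint64_t words[16]; int n; struct swq_entry *next; };
static struct swq_entry *swq_head, *swq_tail;    /* software overflow queue */

/* Assumed hardware hooks (not defined here). */
extern bool hw_output_blocked_too_long(void);    /* deadlock heuristic        */
extern int  hw_dequeue_input(uint64_t *w);       /* drain one msg, 0 if none  */
extern void hw_loopback_send(const uint64_t *w, int n);

void input_overflow_handler(void)
{
    if (!hw_output_blocked_too_long())
        return;                                  /* common case: do nothing  */
    uint64_t w[16];
    int n;
    while ((n = hw_dequeue_input(w)) > 0) {      /* free hardware queue space */
        struct swq_entry *e = malloc(sizeof *e);
        memcpy(e->words, w, n * sizeof(uint64_t));
        e->n = n;
        e->next = NULL;
        if (swq_tail) swq_tail->next = e; else swq_head = e;
        swq_tail = e;
    }
}

void network_drained_handler(void)               /* network flowing again */
{
    while (swq_head) {
        struct swq_entry *e = swq_head;
        swq_head = e->next;
        if (!swq_head) swq_tail = NULL;
        hw_loopback_send(e->words, e->n);        /* handled by local hardware */
        free(e);
    }
}
```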


Natural Extensions of Memory System

(Figure: three organizations of increasing scale —
  Shared Cache: P1..Pn through a switch to an interleaved first-level $
    and interleaved main memory;
  Centralized Memory, “Dance Hall”, UMA: P1..Pn with private $, an
    interconnection network, and shared Mem modules;
  Distributed Memory (NUMA): P1..Pn each with $ and local Mem, connected
    by an interconnection network)

Sequential Consistency

• Memory operations from a proc become visible (to itself and others) in program order
• There exists a total order, consistent with this partial order - i.e., an interleaving
  – the position at which a write occurs in the hypothetical total order should be the same with respect to all processors
• Said another way:
  – For any possible individual run of a program on multiple processors
  – Should be able to come up with a serial interleaving of all operations that respects
    » Program Order
    » Read-after-write orderings (locally and through network)
    » Also Write-after-read, write-after-write
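A classic two-thread litmus test makes the definition concrete (this example is an addition for illustration, not from the slides): under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible, because any interleaving that respects each thread's program order executes at least one store before the second load.

```c
/* Dekker-style litmus test.  Under sequential consistency, (r1,r2) can be
 * (1,1), (1,0), or (0,1), but never (0,0).  Real machines with weaker
 * memory models (and this plain, racy code) may still print (0,0). */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;
int r1, r2;

void *t1(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }  /* store x, load y */
void *t2(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }  /* store y, load x */

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);   /* SC forbids r1=0 r2=0 */
    return 0;
}
```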

Sequential Consistency

(Figure: processors P1..Pn issuing memory references in program order to a
single memory through a “switch” that is randomly set after each memory
reference)

• Total order achieved by interleaving accesses from different processes
  – Maintains program order, and memory operations, from all processes, appear to [issue, execute, complete] atomically w.r.t. others
  – as if there were no caches, and a single memory
• “A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979]

Sequential Consistency Example

  Processor 1      Processor 2      One Consistent Serial Order
  LD1  A → 5       LD5  B → 2       LD1  A → 5
  LD2  B → 7       ...              LD2  B → 7
  ST1  A,6         LD6  A → 6       LD5  B → 2
  ...              ST4  B,21        ST1  A,6
  LD3  A → 6       ...              LD6  A → 6
  LD4  B → 21      LD7  A → 6       ST4  B,21
  ST2  B,13        ...              LD3  A → 6
  ST3  B,4         LD8  B → 4       LD4  B → 21
                                    LD7  A → 6
                                    ST2  B,13
                                    ST3  B,4
                                    LD8  B → 4

Summary

• Many different Message-Passing styles
  – Global Address space: 2-way
  – Optimistic message passing: 1-way
  – Conservative transfer: 3-way
• “Fetch Deadlock”
  – Request => Response introduces cycle through network
  – Fix with:
    » 2 networks
    » dynamic increase in buffer space
• Network Interfaces
  – User-level access
  – DMA
  – Atomicity
