Dataflow: Passing the Token

Dataflow: Passing the Token Arvind Computer Science and Artificial Intelligence Lab M.I.T. ISCA 2005, Madison, WI June 6, 2005 1 Inspiration: Jack Dennis General purpose parallel machines based on a dataflow graph model of computation Inspired all the major players in dataflow during seventies and eighties, including Kim Gostelow and I @ UC Irvine ISCA, Madison, WI, June 6, 2005 Arvind - 2 EM4: single-chip dataflow micro, Dataflow80PESigma-1: Machinesmultiprocessor, The largest ETL, Japan Static - mostlydataflow for signal machine, processing ETL, Japan NEC - NEDIP and IPP Hughes, Hitachi, AT&T, Loral, TI, Sanyo M.I.T. Engineering model Shown at ... Supercomputing 96 K. Hiraki Dynamic Shown at Manchester (‘81) Supercomputing 91 GregM.I.T. Papadopoulos -S.TTDA Sakai, Monsoon (‘88) M.I.T./Motorola - Monsoon (‘91) (8 PEs, 8 IS) ETL - SIGMA-1 (‘88) (128 PEs,128John IS) Gurd T. ShimadaETL - EM4 (‘90) (80 PEs), EM-X (‘96) (80 PEs) Sandia - EPS88, EPS-2 Related machines: IBM - Empire Burton Smith’s ... Andy Y. KodamaChris DenelcorMonsoon HEP, Jack Horizon, Tera Boughton Joerg Costanza ISCA, Madison, WI, June 6, 2005 Arvind - 3 Software Influences • Parallel Compilers – Intermediate representations: DFG, CDFG (SSA, φ ,...) – Software pipelining Keshav Pingali, G. Gao, Bob Rao, .. • Functional Languages and their compilers • Active Messages David Culler •Compiling for FPGAs, ... Wim Bohm, Seth Goldstein... • Synchronous dataflow –Lustre, Signal Ed Lee @ Berkeley ISCA, Madison, WI, June 6, 2005 Arvind - 4 This talk is mostly about MIT work • Dataflow graphs – A clean model of parallel computation • Static Dataflow Machines – Not general-purpose enough • Dynamic Dataflow Machines – As easy to build as a simple pipelined processor • The software view – The memory model: I-structures • Monsoon and its performance •Musings ISCA, Madison, WI, June 6, 2005 Arvind - 5 Dataflow Graphs {x = a + b; y = b * 7 a b in (x-y) * (x+y)} 12+ *7 • Values in dataflow graphs are represented as tokens x ip = 3 y token < ip , p , v > p = L 3 - 4 + instruction ptr port data • An operator executes when all its input tokens are present; copies of the result token are 5 * distributed to the destination operators no separate control flow ISCA, Madison, WI, June 6, 2005 Arvind - 6 Dataflow Operators • A small set of dataflow operators can be used to define a general programming language Fork Primitive Ops Switch Merge TT T F + T F ⇒ TT T F + T F ISCA, Madison, WI, June 6, 2005 Arvind - 7 Well Behaved Schemas ••• ••• one-in-one-out & self cleaning P P ••• ••• Before After Conditional BoundedLoop Loop T F F T F p fg T F f T F Needed for resource management ISCA, Madison, WI, June 6, 2005 Arvind - 8 Outline • Dataflow graphs √ – A clean model of parallel computation • Static Dataflow Machines ← – Not general-purpose enough • Dynamic Dataflow Machines – As easy to build as a simple pipelined processor • The software view – The memory model: I-structures • Monsoon and its performance •Musings ISCA, Madison, WI, June 6, 2005 Arvind - 9 Static Dataflow Machine: Instruction Templates 1 Opcode 2 + Destination 1 3 3L 4L * Destination 2 4 - 3R 4R 5 + 5L Operand 1 * 5R Operand 2 out Each arc in the graph has a a operand slot in the program 12 + Presence bits x b 3 *7 - ISCA, Madison, WI, June 6, 2005 4 y + 5 * Arvind - 10 Static Dataflow Machine Jack Dennis, 1973 Receive Instruction Templates 1 Op dest1 dest2 p1 src1 p2 src2 2 . FU FU FU FU FU Send <s1, p1, v1>, <s2, p2, v2> • Many such processors can be connected together • Programs can be statically divided among the processor ISCA, Madison, WI, June 6, 2005 Arvind - 11 Static Dataflow: Problems/Limitations • Mismatch between the model and the implementation – The model requires unbounded FIFO token queues per arc but the architecture provides storage for one token per arc – The architecture does not ensure FIFO order in the reuse of an operand slot – The merge operator has a unique firing rule • The static model does not support – Function calls - No easy solution in – Data Structures the static framework - Dynamic dataflow provided a framework for solutions ISCA, Madison, WI, June 6, 2005 Arvind - 12 Outline • Dataflow graphs √ – A clean model of parallel computation • Static Dataflow Machines √ – Not general-purpose enough • Dynamic Dataflow Machines ← – As easy to build as a simple pipelined processor • The software view – The memory model: I-structures • Monsoon and its performance •Musings ISCA, Madison, WI, June 6, 2005 Arvind - 13 Dynamic Dataflow Architectures • Allocate instruction templates, i.e., a frame, dynamically to support each loop iteration and procedure call – termination detection needed to deallocate frames • The code can be shared if we separate the code and the operand storage a token <fp, ip, port, data> frame instruction pointer pointer ISCA, Madison, WI, June 6, 2005 Arvind - 14 A Frame in Dynamic Dataflow 1 + 1 3L, 4L Program a b 2 * 2 3R, 4R 3 - 3 5L 12+ *7 4 + 4 5R 5 * 5 out x <fp, ip, p , v> y 3 - 4 + 1 2 7 3 5 * 4 Frame 5 Need to provide storage for only one operand/operator ISCA, Madison, WI, June 6, 2005 Arvind - 15 Monsoon Processor Greg Papadopoulos op rd1,d2 Instruction ip Fetch Code fp+r Operand Fetch Token Frames Queue ALU Form Token Network Network ISCA, Madison, WI, June 6, 2005 Arvind - 16 Temporary Registers & Threads Robert Iannucci op rS1,S2 Instruction Fetch n sets of Code registers (n = pipeline Operand depth) Frames Fetch Token Queue Registers Registers evaporate ALU when an instruction thread is broken Robert IannucciForm Token Registers are also Network used for exceptions & Network interrupts ISCA, Madison, WI, June 6, 2005 Arvind - 17 Actual Monsoon Pipeline: Eight Stages 32 Instruction Instruction Fetch Memory Effective Address 3 Presence Presence Bit bits Operation User System Queue Queue 72 Frame Frame Memory Operation 72 72 144 2R, 2W ALU 144 Network Registers Form Token ISCA, Madison, WI, June 6, 2005 Arvind - 18 Instructions directly control the pipeline The opcode specifies an operation for each pipeline stage: opcode r dest1 [dest2] EA WM RegOp ALU FormToken Easy to implement; no hazard detection EA - effective address FP + r: frame relative r: absolute IP + r: code relative (not supported) WM - waiting matching Unary; Normal; Sticky; Exchange; Imperative PBs X port → PBs X Frame op X ALU inhibit Register ops: ALU: VL X VR → V’L X V’R , CC Form token: VL X VR X Tag1 X Tag2 X CC → Token1 X Token2 ISCA, Madison, WI, June 6, 2005 Arvind - 19 Procedure Linkage Operators f a1 an ... get frame extract tag change Tag 0 change Tag 1 change Tag n Like standard call/return but caller & 1: n: callee can be Fork Graph for f active simultaneously token in frame 0 change Tag 0 change Tag 1 token in frame 1 ISCA, Madison, WI, June 6, 2005 Arvind - 20 Outline • Dataflow graphs √ – A clean model of parallel computation • Static Dataflow Machines √ – Not general-purpose enough • Dynamic Dataflow Machines √ – As easy to build as a simple pipelined processor • The software view ← – The memory model: I-structures • Monsoon and its performance •Musings ISCA, Madison, WI, June 6, 2005 Arvind - 21 Parallel Language Model Tree of Global Heap of Activation Shared Objects Frames f: g: h: active threads asynchronous loop and parallel at all levels ISCA, Madison, WI, June 6, 2005 Arvind - 22 Id World implicit parallelism Id Dataflow Graphs + I-Structures + . TTDA Monsoon *T *T-Voyager ISCA, Madison, WI, June 6, 2005 Arvind - 23 Id World people • Rishiyur Nikhil, • Keshav Pingali, • Vinod Kathail, • David Culler •Ken Traub • Steve Heller, • Richard Soley, •DinartMores • Jamey Hicks, • Alex Caro, •Andy Shaw, •Boon Ang •Shail Anditya • R Paul Johnson Keshav Pingali David Culler Ken Traub •Paul Barth R.S. Nikhil •Jan Maessen •Christine Flood • Jonathan Young • Derek Chiou • Arun Iyangar •Zena Ariola • Mike Bekerle •K. Eknadham(IBM) • Wim Bohm (Colorado) • Joe Stoy (Oxford) •... Boon S. Ang Jamey Hicks Derek Chiou Steve Heller ISCA, Madison, WI, June 6, 2005 Arvind - 24 Data Structures in Dataflow • Data structures reside in a Memory structure store ⇒ tokens carry pointers P . P a • I-structures: Write-once, Read multiple times or I-fetch – allocate, write, read, ..., read, deallocate ⇒ No problem if a reader arrives before the writer at the memory location a v I-store ISCA, Madison, WI, June 6, 2005 Arvind - 25 I-Structure Storage: Split-phase operations & Presence bits <s, fp, a > I-structure s Memory I-Fetch address to be read a1 a2 v1 v t a 3 2 fp.ip ⇓ a4 forwarding address s I-Fetch <a, Read, (t,fp)> split phase t • Need to deal with multiple deferred reads • other operations: fetch/store,ISCA, Madison, take/put, WI, June 6,clear 2005 Arvind - 26 Outline • Dataflow graphs √ – A clean parallel model of computation • Static Dataflow Machines √ – Not general-purpose enough • Dynamic Dataflow Machines √ – As easy to build as a simple pipelined processor • The software view √ – The memory model: I-structures • Monsoon and its performance ← •Musings ISCA, Madison, WI, June 6, 2005 Arvind - 27 The Monsoon Project Motorola Cambridge Research Center + MIT MIT-Motorola collaboration 1988-91 Research Prototypes Monsoon 16-node I-structure Processor Fat Tree 64-bit 100 MB/sec 10M tokens/sec Unix Box 4M 64-bit words 16 2-node systems (MIT, LANL, Motorola, Colorado, Oregon, McGill, USC, ...) 2 16-node systems (MIT, LANL) Id World Software Tony Dahbura ISCA, Madison, WI, June 6, 2005 Arvind - 28 Id Applications on Monsoon @ MIT •Numerical – Hydrodynamics - SIMPLE – Global Circulation Model - GCM – Photon-Neutron Transport code -GAMTEB – N-body problem •Symbolic – Combinatorics - free tree matching,Paraffins –Id-in-Id compiler •System – I/O Library – Heap Storage Allocator on Monsoon •Fun and Games –Breakout –Life – Spreadsheet ISCA, Madison, WI, June 6, 2005 Arvind - 29 Id Run Time System (RTS) on Monsoon • Frame Manager: Allocates frame memory on processors for procedure and loop activations Derek Chiou • Heap Manager: Allocates storage in I- Structure memory or in Processor memory for heap objects.

Load more