CS184c: Computer Architecture [Parallel and Multithreaded]
Day 7: April 24, 2001
Threaded Abstract Machine (TAM)
Simultaneous Multi-Threading (SMT)

CALTECH cs184c Spring2001-- DeHon

Reading
• Shared Memory
  – Focus: H&P Ch 8
    • At least read this…
  – Retrospectives
    • Valuable and short
  – ISCA papers
    • Good primary sources

Today
• TAM
• SMT

Threaded Abstract Machine

TAM
• Parallel Assembly Language
• Fine-Grained Threading
• Hybrid Dataflow
• Scheduling Hierarchy

TL0 Model
• Activation Frame (like stack frame)
  – Variables
  – Synchronization
  – Thread stack (continuation vectors)
• Heap Storage
  – I-structures

TL0 Ops
• RISC-like ALU Ops
• FORK
• SWITCH
• STOP
• POST
• FALLOC
• FFREE
• SWAP

Scheduling Hierarchy
• Intra-frame
  – Related threads in same frame
  – Frame runs on single processor
  – Schedule together, exploit locality
    • (cache, maybe regs)
• Inter-frame
  – Only swap when exhaust work in current frame

Intra-Frame Scheduling
• Simple (local) stack of pending threads
• FORK places new PC on stack
• STOP pops next PC off stack
• Stack initialized with code to exit activation frame
  – Including schedule next frame
  – Save live registers

TL0/CM5 Intra-frame
• Fork on thread
  – Fall through: 0 inst
  – Unsynch branch: 3 inst
  – Successful synch: 4 inst
  – Unsuccessful synch: 8 inst
• Push thread onto LCV: 3-6 inst

Fib Example
• [look at how this turns into TL0 code]

Multiprocessor Parallelism
• Comes from frame allocations
• Runtime policy decides where to allocate frames
  – Maybe use work stealing?
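The intra-frame scheduling described on the slides above (FORK pushes a new PC, STOP pops the next one, and the stack is seeded with code that exits the activation frame) can be sketched roughly as follows. This is a minimal illustration, not TL0 itself: the names `Frame`, `fork`, `run`, `exit_frame`, and the three example threads are hypothetical, threads are modeled as Python functions, and synchronization counters and register saving are left out.

```python
# Hypothetical sketch of TL0-style intra-frame scheduling (not real TL0 code).

def exit_frame(frame):
    """Stands in for the exit code seeded at the bottom of the thread stack."""
    frame.trace.append("exit")
    frame.done = True


class Frame:
    """An activation frame holding a local stack of pending threads (LCV)."""

    def __init__(self):
        # The stack is initialized with the code that exits the frame,
        # so it runs only when all other pending threads are exhausted.
        self.lcv = [exit_frame]
        self.trace = []
        self.done = False

    def fork(self, thread):
        """FORK: place the new thread's 'PC' on the pending-thread stack."""
        self.lcv.append(thread)

    def run(self):
        """Run threads until the exit code fires; each return models STOP."""
        while not self.done:
            thread = self.lcv.pop()  # STOP: pop next PC off the stack
            thread(self)
        return self.trace


def thread_a(frame):
    frame.trace.append("a")
    frame.fork(thread_c)
    frame.fork(thread_b)  # pushed last, so it runs first


def thread_b(frame):
    frame.trace.append("b")


def thread_c(frame):
    frame.trace.append("c")


frame = Frame()
frame.fork(thread_a)
trace = frame.run()
```

Because the pending threads live on a stack, the most recently forked thread runs next, and the frame only exits once its local work is gone; that is the property the inter-frame level relies on when it swaps frames.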
Frame Scheduling
• Inlets to non-active frames initiate pending thread stack (RCV)
• First inlet may place frame on processor's runnable frame queue
• SWAP instruction picks next frame, branches to its enter thread

CM5 Frame Scheduling Costs
• Inlet posts on non-running thread
  – 10-15 instructions
• Swap to next frame
  – 14 instructions
• Average thread cost 7 cycles
  – Constitutes 15-30% TL0 instr

Instruction Mix
[Culler et al., JPDC, July 1993]

Cycle Breakdown
[Culler et al., JPDC, July 1993]

Speedup Example
[Culler et al., JPDC, July 1993]

Thread Stats
• Thread lengths 3 to 17
• Threads run per "quantum" 7 to 530
[Culler et al., JPDC, July 1993]

Great Project
• Develop optimized μArch for TAM
  – Hardware support/architecture for single-cycle thread-switch/post

Multithreaded Architectures

Problem
• Long latency of operations
  – Non-local memory fetch
  – Long latency operations (mpy, fp)
• Wastes processor cycles while stalled
• If processor stalls on return
  – Latency problem turns into a throughput (utilization) problem
  – CPU sits idle

Idea
• Run something else useful while stalled
• In particular, another thread
  – Another PC
• Again, use parallelism to "tolerate" latency

HEP/μUnity/Tera
• Provide a number of contexts
  – Copies of register file…
• Number of contexts ≥ operation latency
  – Pipeline depth
  – Roundtrip time to main memory
• Run each round-robin

HEP Pipeline
[figure: Arvind+Iannucci, DFVLR'87]

Strict Interleaved Threading
• Uses parallelism to get throughput
• Potentially poor single-threaded performance
  – Increases end-to-end latency of thread

SMT

Can we do both?
• Issue from multiple threads into pipeline
• No worse than (super)scalar on single thread
• More throughput with multiple threads
  – Fill in what would have been empty issue slots with instructions from different threads

SuperScalar Inefficiency
[figure: issue slots, with unused slots marked]
Recall: limited scalar IPC

SMT Promise
Fill in empty slots with other threads

SMT Estimates (ideal)
[Tullsen et al., ISCA '95]

SMT Estimates (ideal)
[Tullsen et al., ISCA '95]

SMT uArch
• Observation: exploit register renaming
  – Get small modifications to existing superscalar architecture

SMT uArch
• N.B. remarkable thing is how similar superscalar core is
[Tullsen et al., ISCA '96]

Stopped Here 4/24/01

SMT uArch
• Changes:
  – Multiple PCs
  – Control to decide how to fetch from
  – Separate return stacks per thread
  – Per-thread reorder/commit/flush/trap
  – Thread id w/ BTB
  – Larger register file
    • More things outstanding

Performance
[Tullsen et al., ISCA '96]

Optimizing: fetch freedom
• RR = Round Robin
• RR.X.Y
  – X: threads that fetch in a cycle
  – Y: instructions fetched per thread
[Tullsen et al., ISCA '96]

Optimizing: Fetch Alg.
• ICOUNT: priority to thread w/ fewest pending instrs
• BRCOUNT
• MISSCOUNT
• IQPOSN: penalize threads w/ old instrs (at front of queues)
[Tullsen et al., ISCA '96]

Throughput Improvement
• 8-issue superscalar
  – Achieves little over 2 instructions per cycle
• Optimized SMT
  – Achieves 5.4 instructions per cycle on 8 threads
• 2.5x throughput increase

Costs
[Burns+Gaudiot, HPCA'99]

Costs
[Burns+Gaudiot, HPCA'99]

Not Done, yet…
• Conventional SMT formulation is for coarse-grained threads
• Combine SMT w/ TAM?
  – Fill pipeline from multiple runnable threads in activation frame
  – Multiple activation frames?
  – Eliminate thread switch overhead?

Thought?
• Does SMT reduce need for split-phase operations?

Big Ideas
• Primitives
  – Parallel Assembly Language
  – Threads for control
  – Synchronization (post, full-empty)
• Latency Hiding
  – Threads, split-phase operation
• Exploit Locality
  – Create locality
• Scheduling quanta
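As a small illustration of the fetch policies named on the "Optimizing" slides, the sketch below contrasts round-robin thread selection with ICOUNT (priority to the threads with the fewest pending instructions). It is only a toy model under stated assumptions: the function names `rr_pick` and `icount_pick`, the tie-break by lowest thread id, and the sample `pending` counts are all hypothetical, and only the X part of RR.X.Y (how many threads fetch per cycle) is modeled.

```python
# Hypothetical sketch of two SMT fetch-selection policies (toy model).

def rr_pick(num_threads, cycle, x):
    """Round robin: rotate the starting thread each cycle and take x threads."""
    return [(cycle + i) % num_threads for i in range(x)]


def icount_pick(pending, x):
    """ICOUNT: pick the x threads with the fewest pending instructions.

    pending[t] is the number of not-yet-issued instructions for thread t.
    Ties are broken by lowest thread id (an assumption of this sketch).
    """
    order = sorted(range(len(pending)), key=lambda t: (pending[t], t))
    return order[:x]


# Made-up pending-instruction counts for four threads.
pending = [12, 3, 7, 3]

rr_choice = rr_pick(len(pending), cycle=5, x=2)
icount_choice = icount_pick(pending, x=2)
```

With these sample counts, round robin selects threads purely by position in the rotation, while ICOUNT skips the thread that already has 12 instructions waiting and fetches for the two threads with the shortest backlog.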
