CS184c: Computer Architecture [Parallel and Multithreaded]

Day 7: April 24, 2001
Threaded Abstract Machine (TAM)
Simultaneous Multi-Threading (SMT)
CALTECH cs184c Spring2001-- DeHon

Reading
• Shared Memory
  – Focus: H&P Ch 8
    • At least read this…
  – Retrospectives
    • Valuable and short
  – ISCA papers
    • Good primary sources

Today
• TAM
• SMT

Threaded Abstract Machine

TAM
• Parallel Assembly Language
• Fine-Grained Threading
• Hybrid Dataflow
• Scheduling Hierarchy

TL0 Model
• Activation Frame (like stack frame)
  – Variables
  – Synchronization
  – Thread stack (continuation vectors)
• Heap Storage
  – I-structures

TL0 Ops
• RISC-like ALU Ops
• FORK
• SWITCH
• STOP
• POST
• FALLOC
• FFREE
• SWAP

Scheduling Hierarchy
• Intra-frame
  – Related threads in same frame
  – Frame runs on single processor
  – Schedule together, exploit locality
    • (cache, maybe regs)
• Inter-frame
  – Only swap when exhaust work in current frame

Intra-Frame Scheduling
• Simple (local) stack of pending threads
• Fork places new PC on stack
• STOP pops next PC off stack
• Stack initialized with code to exit activation frame
  – Including schedule next frame
  – Save live registers

TL0/CM5 Intra-frame
• Fork on thread
  – Fall through 0 inst
  – Unsynch branch 3 inst
  – Successful synch 4 inst
  – Unsuccessful synch 8 inst
• Push thread onto LCV 3-6 inst

Fib Example
• [look at how this turns into TL0 code]

Multiprocessor Parallelism
• Comes from frame allocations
• Runtime policy where allocate frames
  – Maybe use work stealing?
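The intra-frame scheduling rules above (a local stack of pending threads, FORK pushes, STOP pops, and the stack starts with the frame's exit code) can be sketched as a toy model. This is a minimal Python illustration, not TL0 itself: the `Frame` class, the `sync_fork` entry-counter helper, and the example threads `t_add`, `t_sub`, `t_mul`, `t_exit` are all hypothetical names introduced here, and LCV is assumed to be the frame's local continuation vector.

```python
class Frame:
    """Toy activation frame: variables, sync counters, local thread stack."""

    def __init__(self, exit_thread):
        self.vars = {}
        self.sync = {}            # entry counters for synchronizing threads
        self.lcv = [exit_thread]  # stack initialized with the frame-exit code
        self.trace = []           # order in which threads actually ran

    def fork(self, thread):
        """Unsynchronized FORK: push the thread so a later STOP pops it."""
        self.lcv.append(thread)

    def sync_fork(self, thread):
        """Synchronizing FORK: enable the thread when its counter hits zero."""
        self.sync[thread] -= 1
        if self.sync[thread] == 0:
            self.lcv.append(thread)

    def run(self):
        """Each thread ends in STOP, modeled as popping the next thread."""
        while self.lcv:
            thread = self.lcv.pop()
            self.trace.append(thread.__name__)
            thread(self)


def t_add(f):
    f.vars["s"] = f.vars["a"] + f.vars["b"]
    f.sync_fork(t_mul)


def t_sub(f):
    f.vars["d"] = f.vars["a"] - f.vars["b"]
    f.sync_fork(t_mul)


def t_mul(f):
    # Runs only after both t_add and t_sub have decremented its counter.
    f.vars["result"] = f.vars["s"] * f.vars["d"]


def t_exit(f):
    # Stand-in for "exit activation frame / schedule next frame".
    f.vars["done"] = True


def demo():
    f = Frame(t_exit)
    f.vars.update(a=5, b=3)
    f.sync[t_mul] = 2  # t_mul waits for two predecessors
    f.fork(t_add)
    f.fork(t_sub)
    f.run()
    return f


if __name__ == "__main__":
    frame = demo()
    print(frame.vars["result"], frame.trace)
```

Because the pending-thread store is a stack, the most recently forked thread runs first, and the exit thread at the bottom runs only once the frame has no other enabled work. That matches the inter-frame rule of swapping only when work in the current frame is exhausted.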
Frame Scheduling
• Inlets to non-active frames initiate pending thread stack (RCV)
• First inlet may place frame on processor’s runnable frame queue
• SWAP instruction picks next frame, branches to its enter thread

CM5 Frame Scheduling Costs
• Inlet Posts on non-running thread
  – 10-15 instructions
• Swap to next frame
  – 14 instructions
• Average thread cost 7 cycles
  – Constitutes 15-30% TL0 instr

Instruction Mix
[figure: Culler et al., JPDC, July 1993]

Cycle Breakdown
[figure: Culler et al., JPDC, July 1993]

Speedup Example
[figure: Culler et al., JPDC, July 1993]

Thread Stats
• Thread lengths 3-17
• Threads run per “quantum” 7-530
[Culler et al., JPDC, July 1993]

Great Project
• Develop optimized µArch for TAM
  – Hardware support/architecture for single-cycle thread-switch/post

Multithreaded Architectures

Problem
• Long latency of operations
  – Non-local memory fetch
  – Long latency operations (mpy, fp)
• Wastes processor cycles while stalled
• If processor stalls on return
  – Latency problem turns into a throughput (utilization) problem
  – CPU sits idle

Idea
• Run something else useful while stalled
• In particular, another thread
  – Another PC
• Again, use parallelism to “tolerate” latency

HEP/µUnity/Tera
• Provide a number of contexts
  – Copies of register file…
• Number of contexts ≥ operation latency
  – Pipeline depth
  – Roundtrip time to main memory
• Run each round-robin

HEP Pipeline
[figure: Arvind+Iannucci, DFVLR’87]

Strict Interleaved Threading
• Uses parallelism to get throughput
• Potentially poor single-threaded performance
  – Increases end-to-end latency of thread

SMT

Can we do both?
• Issue from multiple threads into pipeline
• No worse than (super)scalar on single thread
• More throughput with multiple threads
  – Fill in what would have been empty issue slots with instructions from different threads

SuperScalar Inefficiency
[figure: issue slots over time, with unused slots marked]
Recall: limited Scalar IPC

SMT Promise
Fill in empty slots with other threads

SMT Estimates (ideal)
[figure: Tullsen et al., ISCA ’95]

SMT Estimates (ideal)
[figure: Tullsen et al., ISCA ’95]

SMT uArch
• Observation: exploit register renaming
  – Needs only small modifications to existing superscalar architecture

SMT uArch
• N.B. remarkable thing is how similar superscalar core is
[figure: Tullsen et al., ISCA ’96]

Stopped Here 4/24/01

SMT uArch
• Changes:
  – Multiple PCs
  – Control to decide which threads to fetch from
  – Separate return stacks per thread
  – Per-thread reorder/commit/flush/trap
  – Thread id w/ BTB
  – Larger register file
    • More things outstanding

Performance
[figure: Tullsen et al., ISCA ’96]

Optimizing: fetch freedom
• RR = Round Robin
• RR.X.Y
  – X: threads that fetch in a cycle
  – Y: instructions fetched per thread
[Tullsen et al., ISCA ’96]

Optimizing: Fetch Alg.
• ICOUNT: priority to thread w/ fewest pending instrs
• BRCOUNT
• MISSCOUNT
• IQPOSN: penalize threads w/ old instrs (at front of queues)
[Tullsen et al., ISCA ’96]

Throughput Improvement
• 8-issue superscalar
  – Achieves little over 2 instructions per cycle
• Optimized SMT
  – Achieves 5.4 instructions per cycle on 8 threads
• 2.5x throughput increase

Costs
[figure: Burns+Gaudiot, HPCA’99]

Costs
[figure: Burns+Gaudiot, HPCA’99]

Not Done, yet…
• Conventional SMT formulation is for coarse-grained threads
• Combine SMT w/ TAM?
  – Fill pipeline from multiple runnable threads in activation frame
  – Multiple activation frames?
  – Eliminate thread switch overhead?

Thought?
• Does SMT reduce need for split-phase operations?

Big Ideas
• Primitives
  – Parallel Assembly Language
  – Threads for control
  – Synchronization (post, full-empty)
• Latency Hiding
  – Threads, split-phase operation
• Exploit Locality
  – Create locality
• Scheduling quanta
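To make the fetch-policy slides concrete, here is a small Python sketch contrasting an RR.X.Y-style round-robin choice with an ICOUNT-style choice (priority to the thread with the fewest pending instructions). It is an illustration under stated assumptions, not the mechanism from the cited paper: the function names, the rotating round-robin start, the tie-break toward the lower thread id, and the example counts are all hypothetical.

```python
def round_robin_pick(n_threads, cycle, x):
    """RR.X.Y-style choice: X consecutive threads, start rotates each cycle.

    The rotation rule is an assumption made for this sketch.
    """
    return [(cycle + i) % n_threads for i in range(x)]


def icount_pick(pending, x):
    """ICOUNT-style choice: the X threads with the fewest pending instrs.

    Ties go to the lower thread id (an arbitrary choice for this sketch).
    """
    by_count = sorted(range(len(pending)), key=lambda t: (pending[t], t))
    return by_count[:x]


def fetch_plan(threads, y):
    """Give each chosen thread up to Y fetch slots this cycle."""
    return {t: y for t in threads}


if __name__ == "__main__":
    # Hypothetical per-thread counts of pending instructions.
    pending = [12, 3, 7, 3]
    print(fetch_plan(round_robin_pick(len(pending), cycle=3, x=2), y=4))
    print(fetch_plan(icount_pick(pending, x=2), y=4))
```

With these made-up counts, round robin at cycle 3 fetches for threads 3 and 0 regardless of queue state, while the ICOUNT-style choice fetches for threads 1 and 3 and skips thread 0, which already has the most instructions pending.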
