CS184c: Computer Architecture [Parallel and Multithreaded]
Day 7: April 24, 2001
Threaded Abstract Machine (TAM)
Simultaneous Multi-Threading (SMT)

Reading
• Shared Memory
  – Focus: H&P Ch 8
    • At least read this…
  – Retrospectives
    • Valuable and short
  – ISCA papers
    • Good primary sources
CALTECH cs184c Spring2001-- DeHon

Today
• TAM
• SMT

Threaded Abstract Machine

TAM
• Parallel Assembly Language
• Fine-Grained Threading
• Hybrid Dataflow
• Scheduling Hierarchy

TL0 Model
• Activation Frame (like stack frame)
  – Variables
  – Synchronization
  – Thread stack (continuation vectors)
• Heap Storage
  – I-structures

TL0 Ops
• RISC-like ALU Ops
• FORK
• SWITCH
• STOP
• POST
• FALLOC
• FFREE
• SWAP

Scheduling Hierarchy
• Intra-frame
  – Related threads in same frame
  – Frame runs on single processor
  – Schedule together, exploit locality
    • (cache, maybe regs)
• Inter-frame
  – Only swap when exhaust work in current frame

Intra-Frame Scheduling
• Simple (local) stack of pending threads
• Fork places new PC on stack
• STOP pops next PC off stack
• Stack initialized with code to exit activation frame
  – Including schedule next frame
  – Save live registers

TL0/CM5 Intra-frame
• Fork on thread
  – Fall through 0 inst
  – Unsynch branch 3 inst
  – Successful synch 4 inst
  – Unsuccessful synch 8 inst
• Push thread onto LCV 3-6 inst
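
A minimal sketch of the intra-frame scheduling idea, written in Python rather than TL0. The names here (ActivationFrame, fork, synch_fork, exit_frame, the t_* threads) are illustrative assumptions, not TL0 or CM5 runtime names. Threads are modeled as functions that run to completion; returning from one stands in for STOP, and the pending-thread stack stands in for the LCV.

```python
from typing import Callable, List


def exit_frame(frame: "ActivationFrame") -> None:
    # Stand-in for the "code to exit activation frame" at the bottom of the stack.
    frame.trace.append("exit")
    frame.done = True


class ActivationFrame:
    """Illustrative frame: variables plus a local stack of pending threads."""

    def __init__(self) -> None:
        self.vars = {}
        self.lcv: List[Callable[["ActivationFrame"], None]] = []
        self.trace: List[str] = []
        self.done = False

    def fork(self, thread) -> None:
        # FORK: place the new thread (its "PC") on the pending-thread stack.
        self.lcv.append(thread)

    def synch_fork(self, counter: str, thread) -> None:
        # Synchronizing fork: decrement an entry counter kept in the frame;
        # the thread is only enabled when the counter reaches zero.
        self.vars[counter] -= 1
        if self.vars[counter] == 0:
            self.fork(thread)

    def run(self, entry) -> List[str]:
        self.lcv = [exit_frame]      # stack initialized with the exit code
        self.done = False
        thread = entry
        while True:
            thread(self)             # run to completion; returning models STOP
            if self.done:
                break
            thread = self.lcv.pop()  # STOP pops the next PC off the stack
        return self.trace


def t_start(f: ActivationFrame) -> None:
    f.trace.append("start")
    f.vars["join_count"] = 2         # t_join waits for two arrivals
    f.fork(t_b)
    f.fork(t_a)


def t_a(f: ActivationFrame) -> None:
    f.trace.append("a")
    f.synch_fork("join_count", t_join)   # first arrival: unsuccessful synch


def t_b(f: ActivationFrame) -> None:
    f.trace.append("b")
    f.synch_fork("join_count", t_join)   # second arrival: successful synch


def t_join(f: ActivationFrame) -> None:
    f.trace.append("join")


print(ActivationFrame().run(t_start))
```

The exit thread only runs once every other pending thread in the frame has drained, which is the point at which a real runtime would go on to schedule the next frame.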

Fib Example
• [look at how this turns into TL0 code]

Multiprocessor Parallelism
• Comes from frame allocations
• Runtime policy where allocate frames
  – Maybe use work stealing?

Frame Scheduling
• Inlets to non-active frames initiate pending thread stack (RCV)
• First inlet may place frame on processor’s runnable frame queue
• SWAP instruction picks next frame, branches to its enter thread

CM5 Frame Scheduling Costs
• Inlet Posts on non-running thread
  – 10-15 instructions
• Swap to next frame
  – 14 instructions
• Average thread cost 7 cycles
  – Constitutes 15-30% TL0 instr
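
A rough Python sketch of the frame-level (inter-frame) scheduling loop. Frame, Processor, post, and the thread functions are hypothetical names for illustration only. One simplification is assumed: a post to the frame that is currently running is treated like any other post and re-queues the frame, instead of going straight onto its local stack.

```python
from collections import deque


class Frame:
    def __init__(self, name):
        self.name = name
        self.rcv = []        # threads posted by inlets while the frame is not running
        self.queued = False  # True while the frame sits on the runnable queue


class Processor:
    def __init__(self):
        self.ready = deque()  # runnable frame queue
        self.log = []         # (frame name, threads run in that quantum)

    def post(self, frame, thread):
        # Inlet post: record the pending thread; the first post to an idle
        # frame also places the frame on the runnable queue.
        frame.rcv.append(thread)
        if not frame.queued:
            frame.queued = True
            self.ready.append(frame)

    def run(self):
        while self.ready:
            frame = self.ready.popleft()    # SWAP: pick the next frame
            frame.queued = False
            lcv, frame.rcv = frame.rcv, []  # pending threads become the local stack
            ran = 0
            while lcv:                      # stay in the frame until work runs out
                thread = lcv.pop()
                thread(self, frame, lcv)
                ran += 1
            self.log.append((frame.name, ran))
        return self.log


def a1(proc, frame, lcv):
    lcv.append(a3)          # FORK inside the running frame
    proc.post(frame_b, b1)  # post to a frame that is not running


def a2(proc, frame, lcv):
    pass


def a3(proc, frame, lcv):
    pass


def b1(proc, frame, lcv):
    pass


frame_a, frame_b = Frame("A"), Frame("B")
proc = Processor()
proc.post(frame_a, a1)
proc.post(frame_a, a2)
print(proc.run())
```

Each log entry corresponds to one scheduling quantum: the number of threads run in a frame before swapping to the next one.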

Instruction Mix
[Culler et al. JPDC, July 1993]

Cycle Breakdown
[Culler et al. JPDC, July 1993]

Speedup Example
[Culler et al. JPDC, July 1993]

Thread Stats
• Thread lengths 3-17
• Threads run per “quantum” 7-530
[Culler et al. JPDC, July 1993]

Great Project
• Develop optimized µArch for TAM
  – Hardware support/architecture for single-cycle thread-switch/post

Multithreaded Architectures

Problem
• Long latency of operations
  – Non-local memory fetch
  – Long latency operations (mpy, fp)
• Wastes processor cycles while stalled
• If processor stalls on return
  – Latency problem turns into a throughput (utilization) problem
  – CPU sits idle

Idea
• Run something else useful while stalled
• In particular, another thread
  – Another PC
• Again, use parallelism to “tolerate” latency

HEP/µUnity/Tera
• Provide a number of contexts
  – Copies of register file…
• Number of contexts ≥ operation latency
  – Pipeline depth
  – Roundtrip time to main memory
• Run each round-robin

HEP Pipeline
[figure: Arvind+Iannucci, DFVLR’87]
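
A small Python model of why the number of contexts should be at least the operation latency. It is a sketch under simplifying assumptions, not a model of the actual HEP or Tera pipeline: every operation has the same latency, a context may not issue again until its previous operation completes, and the issue logic skips to the next ready context in round-robin order. The function name utilization is hypothetical.

```python
def utilization(contexts: int, latency: int, cycles: int) -> float:
    """Fraction of cycles in which some context issues an instruction."""
    ready_at = [0] * contexts  # cycle at which each context may issue again
    ptr = 0                    # round-robin pointer
    issued = 0
    for cycle in range(cycles):
        for k in range(contexts):
            idx = (ptr + k) % contexts
            if ready_at[idx] <= cycle:
                ready_at[idx] = cycle + latency  # busy until the op completes
                ptr = (idx + 1) % contexts
                issued += 1
                break
        # if no context was ready, this cycle is a pipeline bubble
    return issued / cycles


for n in (1, 4, 8, 16):
    print(n, utilization(n, latency=8, cycles=80))
```

With a latency of 8 cycles, utilization in this model grows as contexts/latency and saturates at 1.0 once there are at least 8 contexts.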

Strict Interleaved Threading
• Uses parallelism to get throughput
• Potentially poor single-threaded performance
  – Increases end-to-end latency of thread

SMT

Can we do both?
• Issue from multiple threads into pipeline
• No worse than (super)scalar on single thread
• More throughput with multiple threads
  – Fill in what would have been empty issue slots with instructions from different threads

SuperScalar Inefficiency
[figure; label: Unused Slot]
Recall: limited Scalar IPC
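
A toy Python illustration of filling otherwise-empty issue slots from a second thread. The per-cycle counts of ready instructions are made-up numbers chosen for the example, and issue_counts is a hypothetical helper; real issue logic also has to respect functional-unit and dependence constraints, which this ignores. Thread 0 is given fixed priority so that its own issue rate is no worse than when it runs alone.

```python
def issue_counts(ready, width):
    """ready[t][c] = instructions thread t could issue in cycle c.

    Returns the number of instructions each thread actually issues when
    slots are handed out in fixed priority order (thread 0 first).
    """
    cycles = len(ready[0])
    issued = [0] * len(ready)
    for c in range(cycles):
        free = width
        for t, per_thread in enumerate(ready):
            take = min(free, per_thread[c])
            issued[t] += take
            free -= take
    return issued


ready = [
    [2, 1, 0, 3],  # thread 0: limited ILP leaves slots empty
    [1, 2, 4, 0],  # thread 1: fills in the leftovers
]
width = 4

single = issue_counts(ready[:1], width)  # superscalar, one thread
smt = issue_counts(ready, width)         # both threads share the slots
total_slots = width * len(ready[0])

print("single thread:", sum(single), "of", total_slots, "slots")
print("two threads:  ", sum(smt), "of", total_slots, "slots")
```

In this example thread 0 issues the same 6 instructions either way, while the second thread raises slot usage from 6 of 16 to 13 of 16.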

SMT Promise
Fill in empty slots with other threads

SMT Estimates (ideal)
[Tullsen et al. ISCA ’95]

SMT Estimates (ideal)
[Tullsen et al. ISCA ’95]

SMT uArch
• Observation: exploit register renaming
  – Get small modifications to existing superscalar architecture

SMT uArch
• N.B. remarkable thing is how similar superscalar core is
[Tullsen et al. ISCA ’96]

Stopped Here
4/24/01

SMT uArch
• Changes:
  – Multiple PCs
  – Control to decide how to fetch from
  – Separate return stacks per thread
  – Per-thread reorder/commit/flush/trap
  – Thread id w/ BTB
  – Larger register file
    • More things outstanding
[Tullsen et al. ISCA ’96]

Performance
[Tullsen et al. ISCA ’96]

Optimizing: fetch freedom
• RR=Round Robin
• RR.X.Y
  – X – threads do fetch in cycle
  – Y – instructions fetched/thread
[Tullsen et al. ISCA ’96]

Optimizing: Fetch Alg.
• ICOUNT – priority to thread w/ fewest pending instrs
• BRCOUNT
• MISSCOUNT
• IQPOSN – penalize threads w/ old instrs (at front of queues)
[Tullsen et al. ISCA ’96]
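
A short Python sketch contrasting round-robin thread selection with an ICOUNT-style choice. The function names, the tie-break by lower thread id, and the sample pending counts are assumptions made for the example; they are not taken from the ISCA ’96 paper.

```python
def rr_pick(last: int, n_threads: int, x: int):
    """Round robin: the next x threads after the one that fetched last."""
    start = (last + 1) % n_threads
    return [(start + i) % n_threads for i in range(x)]


def icount_pick(pending, x: int):
    """ICOUNT-style: the x threads with the fewest pending instructions.

    pending[t] = instructions thread t has fetched but not yet issued.
    Ties go to the lower thread id (an arbitrary choice for this sketch).
    """
    order = sorted(range(len(pending)), key=lambda t: (pending[t], t))
    return order[:x]


pending = [12, 3, 7, 3]  # hypothetical per-thread counts
print(rr_pick(last=2, n_threads=4, x=2))
print(icount_pick(pending, x=2))
```

Round robin ignores how backed up each thread is, while the ICOUNT-style choice steers fetch bandwidth toward threads that are moving through the pipeline quickly.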

Throughput Improvement
• 8-issue superscalar
  – Achieves little over 2 instructions per cycle
• Optimized SMT
  – Achieves 5.4 instructions per cycle on 8 threads
• 2.5x throughput increase

Costs
[Burns+Gaudiot HPCA’99]

Costs
[Burns+Gaudiot HPCA’99]

Not Done, yet…
• Conventional SMT formulation is for coarse-grained threads
• Combine SMT w/ TAM ?
  – Fill pipeline from multiple runnable threads in activation frame
  – ?multiple activation frames?
  – Eliminate thread switch overhead?

Thought?
• Does SMT reduce the need for split-phase operations?

Big Ideas
• Primitives
  – Parallel Assembly Language
  – Threads for control
  – Synchronization (post, full-empty)
• Latency Hiding
  – Threads, split-phase operation
• Exploit Locality
  – Create locality
• Scheduling quanta