CS184c: Architecture [Parallel and Multithreaded]

Day 7: April 24, 2001
Threaded Abstract Machine (TAM)
Simultaneous Multi-Threading (SMT)

Reading

• Shared Memory
  – Focus: H&P Ch 8
    • At least read this…
  – Retrospectives
    • Valuable and short
  – ISCA papers
    • Good primary sources

CALTECH cs184c Spring2001-- DeHon

Today

• TAM
• SMT

Threaded Abstract Machine

TAM

• Parallel Assembly Language
• Fine-Grained Threading
• Hybrid Dataflow
• Scheduling Hierarchy

TL0 Model

• Activation Frame (like stack frame)
  – Variables
  – Synchronization
  – Thread stack (continuation vectors)
• Heap Storage
  – I-structures

TL0 Ops

• RISC-like ALU Ops
• FORK
• SWITCH
• STOP
• POST
• FALLOC
• FFREE
• SWAP

Scheduling Hierarchy

• Intra-frame
  – Related threads in same frame
  – Frame runs on single processor
  – Schedule together, exploit locality
    • (cache, maybe regs)
• Inter-frame
  – Only swap when exhaust work in current frame

Intra-Frame Scheduling

• Simple (local) stack of pending threads
• Fork places new PC on stack
• STOP pops next PC off stack
• Stack initialized with code to exit activation frame
  – Including schedule next frame
  – Save live registers

TL0/CM5 Intra-frame

• Fork on thread
  – Fall through 0 inst
  – Unsynch branch 3 inst
  – Successful synch 4 inst
  – Unsuccessful synch 8 inst
• Push thread onto LCV 3-6 inst
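
The pending-thread stack described under Intra-Frame Scheduling can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not TL0 itself: the `Frame` class, the `"EXIT"` marker, and the per-thread synchronization counters are hypothetical names chosen for illustration, and each thread is reduced to the list of threads it forks.

```python
class Frame:
    """Toy activation frame: a local stack of pending threads (the LCV)
    plus synchronization counters. All names are hypothetical."""

    def __init__(self, sync_counts):
        # The stack is initialized with the code that exits the frame.
        self.lcv = ["EXIT"]
        # thread name -> number of forks still needed before it may run
        self.sync = dict(sync_counts)
        self.trace = []

    def fork(self, thread):
        """FORK: push the thread, unless it is still waiting on a synch."""
        if thread in self.sync:
            self.sync[thread] -= 1
            if self.sync[thread] > 0:
                return  # unsuccessful synch: nothing becomes runnable
        self.lcv.append(thread)

    def run(self, code, entry):
        """Run threads until the exit marker is popped.
        code maps a thread name to the threads it forks before STOP."""
        self.lcv.append(entry)
        while True:
            thread = self.lcv.pop()  # STOP: pop the next PC off the stack
            if thread == "EXIT":
                return self.trace
            self.trace.append(thread)
            for target in code[thread]:
                self.fork(target)


# t0 forks t1 and t2; "join" needs both of them before it can run.
code = {"t0": ["t1", "t2"], "t1": ["join"], "t2": ["join"], "join": []}
trace = Frame({"join": 2}).run(code, "t0")
```

Here `trace` comes out as `['t0', 't2', 't1', 'join']`: the first fork to `join` fails its synch and pushes nothing, the second succeeds, and the frame exits once only the initial exit entry remains on the stack.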

Fib Example

• [look at how this turns into TL0 code]

Multiprocessor Parallelism

• Comes from frame allocations
• Runtime policy where allocate frames
  – Maybe use work stealing?

Frame Scheduling

• Inlets to non-active frames initiate pending thread stack (RCV)
• First inlet may place frame on processor’s runnable frame queue
• SWAP instruction picks next frame, branches to its enter thread

CM5 Frame Scheduling Costs

• Inlet Posts on non-running thread
  – 10-15 instructions
• Swap to next frame
  – 14 instructions
• Average thread cost 7 cycles
  – Constitutes 15-30% TL0 instr
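
The frame-level scheduling steps above can be sketched in Python as follows. This is a minimal model, assuming one runnable frame queue per processor and one pending-thread list per frame; `Processor`, `post`, and `swap` are hypothetical names, and thread bodies are reduced to log entries.

```python
from collections import deque


class Processor:
    """Toy model of inter-frame scheduling on one processor."""

    def __init__(self):
        self.ready = deque()  # runnable frame queue
        self.rcv = {}         # frame -> threads posted by inlets (its RCV)
        self.log = []         # (frame, thread) pairs in execution order

    def post(self, frame, thread):
        """Inlet posts a thread to a frame that is not currently running."""
        if frame not in self.rcv:
            # First inlet for this frame: make the frame runnable.
            self.rcv[frame] = []
            self.ready.append(frame)
        self.rcv[frame].append(thread)

    def swap(self):
        """SWAP: pick the next frame and run all of its pending threads."""
        if not self.ready:
            return None
        frame = self.ready.popleft()
        pending = self.rcv.pop(frame)
        while pending:
            self.log.append((frame, pending.pop()))
        return frame


p = Processor()
p.post("A", "t1")
p.post("B", "t3")
p.post("A", "t2")
first = p.swap()
```

After the single `swap()`, `first` is `'A'` and `p.log` is `[('A', 't2'), ('A', 't1')]`: both threads posted to frame A run back to back before frame B gets the processor, which is the locality the scheduling hierarchy is meant to exploit.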

Instruction Mix

[Culler et al. JPDC, July 1993]

Cycle Breakdown

[Culler et al. JPDC, July 1993]

Speedup Example

[Culler et al. JPDC, July 1993]

Thread Stats

• Thread lengths 3 to 17
• Threads run per “quantum” 7 to 530

[Culler et al. JPDC, July 1993]

Great Project

• Develop optimized uArch for TAM
  – Hardware support/architecture for single-cycle thread-switch/post

Multithreaded Architectures

Problem

• Long latency of operations
  – Non-local memory fetch
  – Long latency operations (mpy, fp)
• Wastes processor cycles while stalled
• If processor stalls on return
  – Latency problem turns into a throughput (utilization) problem
  – CPU sits idle

Idea

• Run something else useful while stalled
• In particular, another thread
  – Another PC
• Again, use parallelism to “tolerate” latency

HEP/μUnity/Tera

• Provide a number of contexts
  – Copies of …
• Number of contexts ≥ operation latency
  – Pipeline depth
  – Roundtrip time to main memory
• Run each round-robin

HEP Pipeline

[figure: Arvind+Iannucci, DFVLR’87]
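
To see why the number of contexts should be at least the operation latency, here is a small Python sketch of strict round-robin issue. It is a simplified model, assuming every instruction has the same fixed latency and a context cannot issue again until its previous instruction completes; the function name and parameters are hypothetical.

```python
def utilization(contexts, latency, cycles=1000):
    """Fraction of cycles in which an instruction issues under strict
    round-robin interleaving (assumed uniform instruction latency)."""
    ready_at = [0] * contexts  # cycle at which each context may issue again
    issued = 0
    for cycle in range(cycles):
        ctx = cycle % contexts  # strict round-robin: one fixed turn per cycle
        if ready_at[ctx] <= cycle:
            issued += 1
            ready_at[ctx] = cycle + latency
        # otherwise the slot is wasted: the context is still waiting
    return issued / cycles


full = utilization(contexts=8, latency=8)
half = utilization(contexts=4, latency=8)
single = utilization(contexts=1, latency=8)
```

With a latency of 8 cycles, 8 contexts keep the pipeline full (`full` is 1.0), 4 contexts fill half the slots (`half` is 0.5), and a single context issues only one cycle in eight (`single` is 0.125).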

Strict Interleaved Threading

• Uses parallelism to get throughput
• Potentially poor single-threaded performance
  – Increases end-to-end latency of thread

SMT

Can we do both?

• Issue from multiple threads into pipeline
• No worse than (super)scalar on single thread
• More throughput with multiple threads
  – Fill in what would have been empty issue slots with instructions from different threads

SuperScalar Inefficiency

[figure labels: Unused Slot; Recall: limited Scalar IPC]
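
The slot-filling idea can be made concrete with a short Python sketch. It is an idealized model, not a pipeline simulator: the per-cycle counts of ready instructions are made-up inputs, instructions that do not issue are not carried over to the next cycle, and threads are served in fixed priority order so the first thread never does worse than it would alone.

```python
def issued_per_cycle(ready, width):
    """Average instructions issued per cycle.

    ready[t][c] is the number of instructions thread t could issue in
    cycle c (hypothetical input); width is the number of issue slots.
    """
    cycles = len(ready[0])
    total = 0
    for c in range(cycles):
        free = width
        for thread in ready:  # earlier threads have priority
            take = min(free, thread[c])
            free -= take      # later threads fill the slots left over
        total += width - free
    return total / cycles


thread_a = [2, 0, 1, 3]  # limited ILP leaves slots unused
thread_b = [1, 2, 0, 2]
alone = issued_per_cycle([thread_a], width=4)
shared = issued_per_cycle([thread_a, thread_b], width=4)
```

On this made-up trace a 4-wide machine issues 1.5 instructions per cycle from thread A alone and 2.5 when thread B fills the otherwise empty slots.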

SMT Promise

Fill in empty slots with other threads

SMT Estimates (ideal)

[Tullsen et al. ISCA ’95]

SMT Estimates (ideal)

[Tullsen et al. ISCA ’95]

SMT uArch

• Observation: exploit
  – Get small modifications to existing superscalar architecture

SMT uArch

• N.B. remarkable thing is how similar superscalar core is

[Tullsen et al. ISCA ’96]

Stopped Here 4/24/01

SMT uArch

• Changes:
  – Multiple PCs
  – Control to decide how to fetch from
  – Separate return stacks per thread
  – Per-thread reorder/commit/flush/trap
  – Thread id w/ BTB
  – Larger register file
    • More things outstanding

[Tullsen et al. ISCA ’96]

Performance

[Tullsen et al. ISCA ’96]

Optimizing: fetch freedom

• RR=Round Robin
• RR.X.Y
  – X – threads do fetch in cycle
  – Y – instructions fetched/thread

[Tullsen et al. ISCA ’96]

Optimizing: Fetch Alg.

• ICOUNT – priority to thread w/ fewest pending instrs
• BRCOUNT
• MISSCOUNT
• IQPOSN – penalize threads w/ old instrs (at front of queues)

[Tullsen et al. ISCA ’96]
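
As a rough illustration of how thread selection differs between round-robin and ICOUNT, the Python sketch below picks X threads to fetch from in one cycle. It covers only the selection step: the function names, the tie-breaking rule (lower thread id wins), and the pending-instruction counts are assumptions for illustration and are not taken from the paper.

```python
def rr_pick(cycle, n_threads, x):
    """Round robin: rotate through the threads, x of them per cycle."""
    return [(cycle * x + i) % n_threads for i in range(x)]


def icount_pick(pending, x):
    """ICOUNT: the x threads with the fewest pending instructions.
    Ties go to the lower thread id (an assumption of this sketch)."""
    order = sorted(range(len(pending)), key=lambda t: (pending[t], t))
    return order[:x]


pending = [12, 3, 7, 3]  # made-up pending-instruction counts per thread
rr_choice = rr_pick(cycle=1, n_threads=4, x=2)
icount_choice = icount_pick(pending, x=2)
```

In cycle 1, round robin fetches from threads `[2, 3]` regardless of queue state, while ICOUNT picks threads `[1, 3]`, the two with the fewest pending instructions, so a thread that is clogging the queues is passed over.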

Throughput Improvement

• 8-issue superscalar
  – Achieves little over 2 instructions per cycle
• Optimized SMT
  – Achieves 5.4 instructions per cycle on 8 threads
• 2.5x throughput increase

Costs

[Burns+Gaudiot HPCA’99]

Costs

[Burns+Gaudiot HPCA’99]

Not Done, yet…

• Conventional SMT formulation is for coarse-grained threads
• Combine SMT w/ TAM?
  – Fill pipeline from multiple runnable threads in activation frame
  – Multiple activation frames?
  – Eliminate thread switch overhead?

Thought?

• SMT reduce need for split-phase operations?

Big Ideas

• Primitives
  – Parallel Assembly Language
  – Threads for control
  – Synchronization (post, full-empty)
• Latency Hiding
  – Threads, split-phase operation
• Exploit Locality
  – Create locality
• Scheduling quanta
