
EECS 470 Lecture 24
Chip Multiprocessors and Simultaneous Multithreading
Fall 2007
Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs470

Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin.

Announcements

• HW 6 Posted, due 12/7

• Project due 12/10

◦ In-class presentations (~8 minutes + questions)

Base Snoopy Organization

(Figure: base snoopy cache organization. A processor-side controller and a bus-side snoop controller each have their own copy of the tags and state; the data RAM is shared; a write-back buffer has its own tag and comparator; Addr/Cmd/Data connect to the processor on one side and to the system bus on the other.)
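The duplicated tag/state arrays in the figure are the key point: the bus-side controller answers snoops from its own copy of the tags without stealing lookup bandwidth from the processor. A minimal sketch of that idea (all structure and names are illustrative, not taken from any particular design):

```cpp
// Illustrative sketch only: the data RAM is shared, but tags and state are
// duplicated so processor lookups and bus snoops can proceed in parallel
// without contending for the same tag port.
#include <array>
#include <cstdint>
#include <iostream>

enum class State { Invalid, Shared, Exclusive, Modified };

struct TagState {              // one entry of a tag/state RAM
    uint64_t tag = 0;
    State state = State::Invalid;
};

struct CacheSet {
    TagState procTags;         // processor-side copy (P-side controller)
    TagState snoopTags;        // bus-side copy (snoop controller)
    std::array<uint8_t, 64> data{};   // shared data RAM: one 64-byte line
};

int main() {
    CacheSet set;

    // Processor-side fill: both tag copies must be kept consistent, because
    // the snoop controller answers the bus from its own copy.
    set.procTags.tag = 0xABC;
    set.procTags.state = State::Exclusive;
    set.snoopTags = set.procTags;

    // A bus snoop (e.g., BusRd for tag 0xABC) probes only the snoop-side tags,
    // leaving the processor-side tag port free for the CPU's own access.
    const uint64_t snoopTag = 0xABC;
    bool snoopHit = (set.snoopTags.tag == snoopTag &&
                     set.snoopTags.state != State::Invalid);
    std::cout << "snoop hit: " << std::boolalpha << snoopHit << "\n";
}
```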

Non-Atomic State Transitions

Operations involve multiple actions

◦ Look up cache tags

◦ Bus arbitration

◦ Check for writeback

◦ Even if bus is atomic, overall set of actions is not

◦ Race conditions among multiple operations

Suppose P1 and P2 attempt to write cached block A

◦ Each decides to issue BusUpgr to allow S → M

Issues

◦ Handle requests for other blocks while waiting to acquire bus

◦ Must handle requests for this block A
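A hedged sketch of the race above: both controllers sit in a transient "waiting for bus grant" state, and the loser must notice the winner's bus transaction and convert its planned BusUpgr into a full BusRdX. The state names and controller interface are hypothetical; a real controller handles many more cases.

```cpp
// Hypothetical controller sketch for the S -> M race: P1 and P2 both try to
// upgrade block A. Whoever wins bus arbitration invalidates the other, so the
// loser must notice, while in its transient state, that it needs a BusRdX.
#include <iostream>
#include <string>

enum class State { I, S, SM_waiting, M };   // SM_waiting = transient state

struct Controller {
    std::string name;
    State state = State::S;

    void requestWrite() {                   // PrWr while in S: request the bus
        if (state == State::S) state = State::SM_waiting;
    }
    void observeOtherWriter() {             // another processor's BusUpgr/BusRdX
        // If we were M we would also flush the dirty data here.
        state = State::I;
    }
    void busGranted() {
        if (state == State::SM_waiting) {   // still hold S data: BusUpgr suffices
            state = State::M;
        } else if (state == State::I) {     // lost the race: need the data again
            std::cout << name << " lost the race, reissues as BusRdX\n";
            state = State::M;               // after the BusRdX completes
        }
    }
};

int main() {
    Controller p1{"P1"}, p2{"P2"};
    p1.requestWrite();
    p2.requestWrite();            // both now in SM_waiting, both want the bus
    p2.observeOtherWriter();      // P1 wins arbitration; P2 snoops P1's upgrade
    p1.busGranted();              // P1: S -> M via BusUpgr
    p2.busGranted();              // P2: falls back to BusRdX
    p1.observeOtherWriter();      // P2's BusRdX invalidates P1's copy
    std::cout << std::boolalpha
              << "P1 modified: " << (p1.state == State::M)
              << ", P2 modified: " << (p2.state == State::M) << "\n";
}
```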

Non-Atomicity → Transient States

Two types of states

• Stable (e.g., MESI)

• Transient or intermediate (e.g., S → M while waiting for a bus grant)

Increases complexity

(Figure: MESI state diagram augmented with transient states I → S,E, I → M, and S → M; transitions are labeled with processor and bus actions such as PrRd/—, PrWr/BusReq, BusGrant/BusUpgr, BusGrant/BusRdX, BusRd/Flush, and BusRdX/Flush′.)

Scalability Problems of Snoopy Coherence

• Prohibitive bus bandwidth

◦ Required bandwidth grows with # CPUs…

◦ …but available BW per bus is fixed

◦ Adding busses makes serialization/ordering hard

• Prohibitive processor snooping bandwidth

◦ All caches do tag lookup when ANY processor accesses memory

◦ Inclusion limits this to L2, but still lots of lookups

• Upshot: bus-based coherence doesn't scale beyond 8–16 CPUs

Scalable Cache Coherence

• Scalable cache coherence: two-part solution

• Part I: bus bandwidth

◦ Replace non-scalable bandwidth substrate (bus)…

◦ …with scalable bandwidth one (point-to-point network, e.g., mesh)

• Part II: processor snooping bandwidth

◦ Interesting: most snoops result in no action

◦ Replace non-scalable broadcast protocol (spam everyone)…

◦ …with scalable directory protocol (only spam processors that care)

Directory Coherence Protocols

• Observe: physical address space statically partitioned

+ Can easily determine which memory module holds a given line

▪ That memory module sometimes called the "home"

– Can't easily determine which processors have line in their caches

◦ Bus-based protocol: broadcast events to all processors/caches

± Simple and fast, but non-scalable

• Directories: non-broadcast coherence protocol

◦ Extend memory to track caching information

◦ For each physical cache line whose home this is, track:

▪ Owner: which processor has a dirty copy (i.e., M state)

▪ Sharers: which processors have clean copies (i.e., S state)

◦ Processor sends coherence event to home directory

▪ Home directory only sends events to processors that care
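A minimal sketch of the bookkeeping described above, with hypothetical names: one directory entry per line whose home this is, plus read-miss and write-miss handlers that contact only the owner and sharers recorded in the entry.

```cpp
// Minimal sketch (all names hypothetical) of the per-line state a home
// directory keeps, and of how it forwards coherence events only to the
// processors that actually hold the line.
#include <bitset>
#include <iostream>
#include <optional>

constexpr int kNumProcs = 16;

struct DirEntry {
    std::optional<int>     owner;     // processor with the dirty (M) copy, if any
    std::bitset<kNumProcs> sharers;   // processors with clean (S) copies
};

// Read miss from 'proc': if some owner has a dirty copy, ask it to write back
// or forward the data; otherwise memory supplies it. Either way, add a sharer.
void handleReadMiss(DirEntry& e, int proc) {
    if (e.owner) {
        std::cout << "  forward/downgrade request to owner P" << *e.owner << "\n";
        e.sharers.set(*e.owner);      // owner keeps a clean copy
        e.owner.reset();
    }
    e.sharers.set(proc);
}

// Write miss from 'proc': invalidate exactly the current sharers/owner,
// then record 'proc' as the new owner.
void handleWriteMiss(DirEntry& e, int proc) {
    if (e.owner && *e.owner != proc)
        std::cout << "  invalidate owner P" << *e.owner << "\n";
    for (int p = 0; p < kNumProcs; ++p)
        if (e.sharers.test(p) && p != proc)
            std::cout << "  invalidate sharer P" << p << "\n";
    e.sharers.reset();
    e.owner = proc;
}

int main() {
    DirEntry a;                                              // line A, home = here
    std::cout << "P1 reads A:\n";  handleReadMiss(a, 1);     // A: Shared, {1}
    std::cout << "P2 writes A:\n"; handleWriteMiss(a, 2);    // A: Modified, 2
}
```

The two calls in main reproduce the traces on the next two slides: after P1's read the entry is Shared, {#1}; after P2's write it is Modified, owned by #2.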

Read Processing

(Sequence: Node #1 takes a read miss on block A and sends a read request to A's home directory; the directory supplies the data and records the entry A: Shared, {#1}.)

Write Processing

(Sequence: block A starts out Shared at Node #1. Node #2 misses on a write of A; the directory invalidates #1's copy, supplies the data, and updates the entry to A: Modified, {#2}.)

Trade-offs:
• Longer accesses (3-hop between processor, directory, other processor)
• Lower bandwidth → no snoops necessary

Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP > 32 nodes)

Chip Performance Scalability: History & Expectations

(Chart: chip performance in MIPS from 1970 to 2010 on a log scale, progressing from the first microprocessors through RISC, superscalar, out-of-order, and SMT/SMP on chip toward "NUMA on chip?", with a goal of 1 tera-instruction/sec by 2010.)

Piranha: Performance and Managed Complexity

Large-scale server based on CMP nodes

CMP architecture

◦ excellent platform for exploiting thread-level parallelism

◦ inherently emphasizes replication over monolithic complexity

Design methodology reduces implementation complexity

◦ novel simulation methodology

◦ use ASIC physical design

Piranha: 2x performance advantage with team size of approximately 20 people

Piranha Processing Node

Next few slides from Luiz Barroso's ISCA 2000 presentation of "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing"; the processing node is built up feature by feature:

Alpha core: 1-issue, in-order, 500 MHz
L1 caches: I & D, 64 KB, 2-way
Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
L2 cache: shared, 1 MB, 8-way
Memory controller (MC): RDRAM, 12.8 GB/sec (8 banks @ 1.6 GB/sec)
Protocol engines (HE & RE): μprogrammed, 1K μinstructions, even/odd interleaving
System interconnect: 4-port crossbar router, topology independent, 4 links @ 8 GB/s, 32 GB/sec total bandwidth

(Figure: 8 CPUs, each with private I$ and D$, connected through the ICS to the shared L2 banks, memory controllers, home and remote protocol engines, and the on-chip router.)

Single-Chip Piranha Performance

(Chart: normalized execution time, broken into CPU, L2-hit, and L2-miss components, for P1 (1-issue, 500 MHz), INO (1-issue, 1 GHz), OOO (4-issue, 1 GHz), and P8 (8-CPU Piranha, 1-issue, 500 MHz each) on OLTP and DSS.)

Piranha's performance margin: 3x for OLTP and 2.2x for DSS
Piranha has more outstanding misses → better utilizes memory system

Performance And Utilization

• Performance (IPC) important

• Utilization (actual IPC / peak IPC) important too

• Even moderate superscalars (e.g., 4-way) not fully utilized

◦ Average sustained IPC: 1.5–2 → <50% utilization

▪ Mis-predicted branches

▪ Cache misses, especially L2

▪ Data dependences

• Multi‐threading (MT)

◦ Improve utilization by multiplexing multiple threads on a single CPU

◦ One thread cannot fully utilize CPU? Maybe 2, 4 (or 100) can

Latency vs Throughput

• MT trades (single‐thread) latency for throughput

– Sharing processor degrades latency of individual threads

+ But improves aggregate latency of both threads

+ Improves utilization

• Example (worked through in the sketch after this list)

◦ Thread A: individual latency = 10s, latency with thread B = 15s

◦ Thread B: individual latency = 20s, latency with thread A = 25s

◦ Sequential latency (first A then B, or vice versa): 30s

◦ Parallel latency (A and B simultaneously): 25s

– MT slows each thread by 5s

+ But improves total latency by 5s

• Different workloads have different parallelism

◦ SpecFP has lots of ILP (can use an 8-wide machine)

◦ Server workloads have TLP (can use multiple threads)
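A tiny sketch of the arithmetic in the example above (numbers copied from the slide): each thread gets 5 seconds slower, but the pair finishes 5 seconds sooner.

```cpp
// Multithreading trades per-thread latency for aggregate throughput:
// each thread is slower when sharing the core, but the pair finishes sooner.
#include <algorithm>
#include <iostream>

int main() {
    double a_alone = 10, b_alone = 20;      // seconds, each thread run by itself
    double a_shared = 15, b_shared = 25;    // seconds, when sharing the core

    double sequential = a_alone + b_alone;              // 30 s, one after the other
    double parallel   = std::max(a_shared, b_shared);   // 25 s, run together

    std::cout << "per-thread slowdown A: " << a_shared - a_alone << " s\n";   // 5 s
    std::cout << "per-thread slowdown B: " << b_shared - b_alone << " s\n";   // 5 s
    std::cout << "total latency saved:   " << sequential - parallel << " s\n"; // 5 s
}
```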

Core Sharing

Time sharing:

◦ Run one thread

◦ On a long-latency operation (e.g., cache miss), switch

◦ Also known as "switch-on-miss" multithreading

◦ E.g., Niagara (UltraSPARC T1/T2)

Space sharing:

◦ Across pipeline depth

▪ Fetch and issue each cycle from a different thread

◦ Both across pipeline width and depth

▪ Fetch and issue each cycle from multiple threads

▪ Policy to decide which to fetch gets complicated

▪ Also known as "simultaneous" multithreading

▪ E.g., IBM POWER5

Instruction Issue


Reduced function unit utilization due to dependencies

Superscalar Issue


Superscalar leads to more performance, but lower utilization

Predicated Issue


Adds to function unit utilization, but results are thrown away

Chip Multiprocessor


Limited utilization when only running one thread

Coarse-grain Multithreading


Preserves single-thread performance, but can only hide long latencies (i.e., main memory accesses)

Coarse-Grain Multithreading (CGMT)

• Coarse‐Grain Multi‐Threading (CGMT)

+ Sacrifices very little single thread performance (of one thread)

– Tolerates only long latencies (e.g., L2 misses)

◦ Thread scheduling policy

▪ Designate a "preferred" thread (e.g., thread A)

▪ Switch to thread B on thread A L2 miss

▪ Switch back to A when A's L2 miss returns

◦ Pipeline partitioning

▪ None, flush on switch

– Can't tolerate latencies shorter than twice pipeline depth

▪ Need short in-order pipeline for good performance

◦ Example: IBM Northstar/Pulsar
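A rough sketch of the switch-on-miss policy just described, with illustrative latencies: run the preferred thread A, switch to B when A misses in the L2, and switch back when the miss data returns. The flush cost is only a constant here; real designs differ.

```cpp
// Illustrative CGMT scheduler: one preferred thread, switch on L2 miss,
// switch back when the miss returns. All numbers are made up for the sketch.
#include <iostream>

int main() {
    const int kMissLatency  = 100;   // cycles until A's L2 miss data returns
    const int kFlushPenalty = 7;     // short in-order pipeline refill cost

    int running    = 0;              // 0 = preferred thread A, 1 = thread B
    int missReturn = -1;             // cycle at which A's outstanding miss returns

    for (int cycle = 0; cycle < 300; ++cycle) {
        if (running == 0 && cycle == 50) {          // thread A takes an L2 miss
            missReturn = cycle + kMissLatency;
            running = 1;                            // switch to B, flush the pipeline
            std::cout << cycle << ": A misses in L2, switch to B (+"
                      << kFlushPenalty << " flush cycles)\n";
        }
        if (running == 1 && cycle == missReturn) {  // A's miss data is back
            running = 0;                            // switch back to the preferred thread
            std::cout << cycle << ": A's miss returns, switch back to A\n";
        }
    }
}
```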

CGMT

(Figure: a single-threaded pipeline (I$, branch predictor, register file, D$) versus the CGMT version with one register file per thread and a thread scheduler triggered by an "L2 miss?" signal.)

Fine Grained Multithreading


Saturated workload → lots of threads

Unsaturated workload → lots of stalls

Intra-thread dependencies still limit performance

Fine-Grain Multithreading (FGMT)

• Fine‐Grain Multithreading (FGMT)

– Sacrifices significant single thread performance

+ Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)

◦ Thread scheduling policy

▪ Switch threads every cycle (round-robin), L2 miss or no

◦ Pipeline partitioning

▪ Dynamic, no flushing

▪ Length of pipeline doesn't matter

– Need a lot of threads

◦ Extreme example: Denelcor HEP

▪ So many threads (100+), it didn't even need caches

▪ Failed commercially

◦ Not popular today

▪ Many threads → many register files

Fine-Grain Multithreading

• FGMT

◦ (Many) more threads

◦ Multiple threads in pipeline at once

(Figure: the FGMT pipeline (I$, branch predictor, D$) with a thread scheduler and one register file per thread.)

Simultaneous Multithreading


Maximum utilization of function units by independent operations

Simultaneous Multithreading (SMT)

• Can we multithread an out‐of‐order machine?

◦ Don't want to give up performance benefits

◦ Don't want to give up natural tolerance of D$ (L1) miss latency

• Simultaneous multithreading (SMT)

+ Tolerates all latencies (e.g., L2 misses, mispredicted branches)

± Sacrifices some single thread performance

◦ Thread scheduling policy

▪ Round-robin (just like FGMT)

◦ Pipeline partitioning

▪ Dynamic, hmmm…

◦ Example: Pentium 4 (hyper-threading): 5-way issue, 2 threads

◦ Another example: Alpha 21464: 8-way issue, 4 threads (canceled)

Simultaneous Multithreading (SMT)

• SMT

◦ Replicate map table, share physical register file

(Figure: the baseline pipeline (I$, branch predictor, map table, register file, D$) versus the SMT version with per-thread map tables, a thread scheduler, and a shared physical register file.)

Issues for SMT

• Cache interference

◦ General concern for all MT variants

◦ Can the working sets of multiple threads fit in the caches?

◦ Shared-memory SPMD threads help here

+ Same insns → share I$

+ Shared data → less D$ contention

▪ MT is good for "server" workloads

◦ To keep miss rates low, SMT might need a larger L2 (which is OK)

▪ Out-of-order tolerates L1 misses

• Large map table and physical register file

◦ #mt-entries = (#threads * #arch-regs)

◦ #phys-regs = (#threads * #arch-regs) + #in-flight insns
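A back-of-the-envelope check of the two formulas above, using illustrative parameters (4 threads, 32 architectural registers, 128 in-flight instructions); these numbers are assumptions, not from any specific machine.

```cpp
// Sizing the rename structures for SMT: each thread needs its own map table,
// and the shared physical register file must hold every thread's committed
// state plus all in-flight renamed values.
#include <iostream>

int main() {
    int threads  = 4;     // SMT thread contexts (illustrative)
    int archRegs = 32;    // architectural registers per thread
    int inFlight = 128;   // maximum in-flight (renamed but uncommitted) insns

    int mapEntries = threads * archRegs;             // one map table per thread
    int physRegs   = threads * archRegs + inFlight;  // committed state + renames

    std::cout << "map-table entries:  " << mapEntries << "\n";   // 128
    std::cout << "physical registers: " << physRegs << "\n";     // 256
}
```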

SMT Resource Partitioning

• How are ROB/MOB, RS partitioned in SMT?

◦ Depends on what you want to achieve

• Static partitioning

◦ Divide ROB/MOB, RS into T static equal-sized partitions

+ Ensures that low-IPC threads don't starve high-IPC ones

▪ Low-IPC threads stall and occupy ROB/MOB, RS slots

– Low utilization

• Dynamic partitioning

◦ Divide ROB/MOB, RS into dynamically resizing partitions

◦ Let threads fight for resources amongst themselves

+ High utilization

– Possible starvation

◦ ICOUNT: fetch policy prefers thread with fewest in-flight insns
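A sketch of the ICOUNT idea (not the actual 21464 or Pentium 4 implementation): each cycle, fetch from the thread with the fewest instructions in the front end and issue queues, which automatically throttles threads that are clogging shared resources.

```cpp
// ICOUNT-style fetch selection sketch: pick the thread with the lowest
// in-flight instruction count. Occupancy numbers below are illustrative.
#include <algorithm>
#include <iostream>
#include <vector>

int pickFetchThread(const std::vector<int>& inFlightCount) {
    // Lowest in-flight count wins; ties broken by lower thread id.
    return static_cast<int>(std::min_element(inFlightCount.begin(),
                                             inFlightCount.end())
                            - inFlightCount.begin());
}

int main() {
    std::vector<int> inFlight = {42, 7, 19, 30};   // per-thread queue occupancy
    std::cout << "fetch from thread " << pickFetchThread(inFlight) << "\n";  // 1
}
```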

Power Implications of MT

• Is MT (of any kind) power efficient?

◦ Static power? Yes

▪ Dissipated regardless of utilization

◦ Dynamic power? Less clear, but probably yes

▪ Highly utilization dependent

▪ Major factor is additional cache activity

▪ Some debate here

◦ Overall? Yes

▪ Static power relatively increasing

SMT vs. CMP

• If you wanted to run multiple threads would you build a…

◦ Chip multiprocessor (CMP): multiple separate pipelines?

◦ A multithreaded processor (SMT): a single larger pipeline?

• Both will get you throughput on multiple threads

◦ CMP will be simpler, possibly faster clock

◦ SMT will get you better performance (IPC) on a single thread

▪ SMT is basically an ILP engine that converts TLP to ILP

▪ CMP is mainly a TLP engine

• Again, do both

◦ Sun's Niagara (UltraSPARC T1)

◦ 8 processors, each with 4 threads (coarse-grained threading)

◦ 1 GHz clock, in-order, short pipeline (6 stages or so)

◦ Designed for power-efficient "throughput computing"

Niagara: 32 Threads on Chip

8 six-stage in-order pipelines
Each pipeline 4-way fine-grain multithreaded
Small L1 write-through caches
Shared L2

Screams for Web, OLTP
Shipping @ 1.2 GHz clock

Alpha 21464 Architecture Overview [Slides from Shubu Mukherjee @ Intel]

8-wide out-of-order superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA
◦ for up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)

Basic Out-of-order Pipeline

Stages: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire

(Figure: the baseline pipeline with a single PC, register map, register files, Icache, and Dcache, all thread-blind.)

SMT Pipeline

Stages: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire

(Figure: the same pipeline with per-thread PCs and register maps; the register files, caches, and execution resources are shared.)

Changes for SMT

Basic pipeline – unchanged

Replicated resources

◦ Program counters

◦ Register maps

Shared resources

◦ Register file (size increased)

◦ Instruction queue

◦ First and second level caches

◦ Translation buffers

◦ …

Fetch Policy (Key)

Simple fetch (e.g., round-robin) policies don't work. Why?
• Instructions from slow threads hog the pipeline
• Once instructions are placed in the IQ, they cannot be removed until they execute

Fetch optimization:
• ICOUNT
• Favor faster threads
• Count retirement rate and bias accordingly
• Must avoid starvation (tricky)

Multiprogrammed Workload

(Chart: performance for 1T/2T/3T/4T configurations on SpecInt, SpecFP, and mixed Int/FP multiprogrammed workloads; vertical axis 0–250%.)

Decomposed SPEC95 Applications

(Chart: performance for 1T/2T/3T/4T on the decomposed applications Turb3d, Swm256, and Tomcatv; vertical axis 0–250%.)

Multithreaded Applications

(Chart: performance for 1T/2T/4T on the multithreaded applications Barnes, Chess, Sort, and TP; vertical axis 0–300%.)
