© Wenisch 2007 -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470 Lecture 24: Chip Multiprocessors and Simultaneous Multithreading
Fall 2007
Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin.

Announcements
• HW 6 Posted, due 12/7
• Project due 12/10
In‐class presentations (~8 minutes + questions)
Base Snoopy Organization
[Figure: base snoopy cache organization. A processor-side controller and a bus-side (snoop) controller share the cache's tag and state arrays (tags and state duplicated for snooping) and the cache data RAM; a write-back buffer with its own tag comparator and snoop state sits between them and the system bus, which carries Addr/Cmd/Data.]
Non-Atomic State Transitions
Operations involve multiple actions
Look up cache tags
Bus arbitration
Check for writeback
Even if bus is atomic, overall set of actions is not
Race conditions among multiple operations
Suppose P1 and P2 attempt to write cached block A
Each decides to issue BusUpgr to allow S –> M
Issues
Handle requests for other blocks while waiting to acquire bus
Must handle requests for this block A
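A minimal C sketch of how a controller might resolve this S→M race; the transient-state and function names below are illustrative assumptions, not the lecture's actual protocol tables:

```c
/* Sketch of the S -> M upgrade race, assuming one outstanding request
 * per cache line.  State and function names are illustrative. */
typedef enum { I, S, E, M, SM_WAIT_BUS, IM_WAIT_BUS } line_state_t;

typedef struct { line_state_t state; } cache_line_t;

/* A processor write reaches this block: request the bus. */
void start_write(cache_line_t *l) {
    if (l->state == S)      l->state = SM_WAIT_BUS;  /* will issue BusUpgr */
    else if (l->state == I) l->state = IM_WAIT_BUS;  /* will issue BusRdX  */
}

/* Another processor's invalidating request (BusUpgr/BusRdX) for this
 * block is snooped while we are still waiting for our bus grant. */
void snoop_invalidation(cache_line_t *l) {
    if (l->state == SM_WAIT_BUS)
        /* Our S copy just died, so the pending BusUpgr is no longer
         * sufficient; it must go out as a full BusRdX instead. */
        l->state = IM_WAIT_BUS;
    else if (l->state == S)
        l->state = I;
}

/* Bus grant arrives: SM_WAIT_BUS issues BusUpgr, IM_WAIT_BUS issues
 * BusRdX; once the data/acks return, the line is M either way. */
void bus_granted(cache_line_t *l) {
    if (l->state == SM_WAIT_BUS || l->state == IM_WAIT_BUS)
        l->state = M;
}
```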
Non-Atomicity → Transient States
Two types of states:
• Stable (e.g., MESI)
• Transient or Intermediate (e.g., S→M, I→M, I→S,E)
Increases complexity

[Figure: MESI state diagram augmented with transient states. A PrWr in S enters the S→M transient state and issues a BusReq; on BusGrant it issues BusUpgr, or BusRdX if the S copy was invalidated by another processor's request while waiting. Analogous transient states I→M and I→S,E handle write and read misses, and snooped BusRd/BusRdX requests are still serviced (Flush/Flush′) while a request is pending.]
Scalability problems of Snoopy Coherence
• Prohibitive bus bandwidth
Required bandwidth grows with # CPUs…
… but available BW per bus is fixed
Adding buses makes serialization/ordering hard
• Prohibitive processor snooping bandwidth
All caches do tag lookup when ANY processor accesses memory
Inclusion limits this to L2, but still lots of lookups
• Upshot: bus-based coherence doesn't scale beyond 8–16 CPUs
Scalable Cache Coherence
• Scalable cache coherence: two part solution
• Part I: bus bandwidth
Replace non-scalable bandwidth substrate (bus)…
…with scalable bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
Interesting: most snoops result in no action
Replace non‐scalable broadcast protocol (spam everyone)…
…with scalable directory protocol (only spam processors that care)
Directory Coherence Protocols
• Observe: physical address space statically partitioned
+ Can easily determine which memory module holds a given line
That memory module sometimes called “home”
– Can’t easily determine which processors have line in their caches
Bus-based protocol: broadcast events to all processors/caches
± Simple and fast, but non-scalable
• Directories: non-broadcast coherence protocol
Extend memory to track caching information
For each physical cache line whose home this is, track:
Owner: which processor has a dirty copy (i.e., M state)
Sharers: which processors have clean copies (i.e., S state)
Processor sends coherence event to home directory
Home directory only sends events to processors that care
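A minimal C sketch of a directory entry and the home node's handling of read and write misses; the bit-vector format, 64-node count, and function names are illustrative assumptions, not from the slides. The next two slides walk the same read and write flows through the directory.

```c
/* Sketch of a directory entry and the home node's request handling,
 * assuming a 64-node machine and a full bit-vector of sharers. */
#include <stdint.h>

#define NUM_NODES 64
#define NO_OWNER  (-1)

typedef struct {
    uint64_t sharers;  /* bit i set => node i holds a clean (S) copy   */
    int      owner;    /* node holding the dirty (M) copy, or NO_OWNER */
} dir_entry_t;

/* Physical address space is statically partitioned, so the home node
 * of a block is a simple function of its address. */
int home_node(uint64_t block_addr) {
    return (int)(block_addr % NUM_NODES);
}

/* Read miss arrives at the home.  If some node owns a dirty copy, the
 * home forwards the request to it (the 3-hop path: requester -> home
 * -> owner); otherwise it replies with data itself.  Either way the
 * requester is recorded as a sharer. */
void handle_read_miss(dir_entry_t *e, int requester) {
    if (e->owner != NO_OWNER) {
        /* forward to e->owner: it supplies data and drops to S */
        e->sharers |= 1ull << e->owner;
        e->owner = NO_OWNER;
    }
    e->sharers |= 1ull << requester;
}

/* Write miss (or upgrade) arrives at the home: invalidate exactly the
 * nodes that care -- the current sharers -- rather than broadcasting,
 * then hand ownership to the requester. */
void handle_write_miss(dir_entry_t *e, int requester) {
    uint64_t others = e->sharers & ~(1ull << requester);
    int pending_acks = 0;
    for (int n = 0; n < NUM_NODES; n++)
        if (others & (1ull << n))
            pending_acks++;     /* send an invalidation to node n      */
    (void)pending_acks;         /* reply to requester once acks arrive */
    e->sharers = 0;
    e->owner   = requester;
}
```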
Read Processing
[Figure: read processing. Node #1 misses on a read of A and sends the request to A's home directory; the directory supplies the data and records the entry as "A: Shared, #1".]
Write Processing
[Figure: write processing, continuing from the read. Node #2 misses on a write of A; the home directory invalidates Node #1's copy (the recorded sharer) and updates the entry from "A: Shared, #1" to "A: Mod., #2".]
Trade-offs:
• Longer accesses (3-hop between processor, directory, other processor)
• Lower bandwidth → no snoops necessary
Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP > 32 nodes)

Chip Performance Scalability: History & Expectations

[Figure: chip performance (MIPS, log scale) versus year, 1970–2010, marking the first microprocessors, RISC, superscalar, out-of-order, SMP on chip, SMT, and "NUMA on chip?". Goal: 1 Tera inst/sec by 2010!]
Piranha: Performance and Managed Complexity
Large-scale server based on CMP nodes
CMP architecture
excellent platform for exploiting thread‐level parallelism
inherently emphasizes replication over monolithic complexity
Design methodology reduces implementation complexity
novel simulation methodology
use ASIC physical design process
Piranha: 2x performance advantage with team size of approximately 20 people
Piranha Processing Node
Next few slides from Luiz Barroso's ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. They build up the processing-node block diagram (eight CPU cores with private L1 caches and L2 banks around an intra-chip switch, plus memory controllers, protocol engines, and a router) one component at a time:

Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec, 8 banks @ 1.6GB/sec
Protocol Engines (HE & RE): μprogrammed, 1K μinstructions, even/odd interleaving
System Interconnect: 4-port Xbar router, topology independent, 4 links @ 8GB/s, 32GB/sec total bandwidth
Single-Chip Piranha Performance
[Figure: normalized execution time (CPU / L2 hit / L2 miss breakdown) on OLTP and DSS for a 500MHz 1-issue single core (P1), a 1GHz 1-issue in-order core (INO), a 1GHz 4-issue out-of-order core (OOO, the 100 baseline), and the 8-core 500MHz Piranha (P8), which reaches roughly 34 on OLTP and 44 on DSS.]

Piranha's performance margin: 3x for OLTP and 2.2x for DSS
Piranha has more outstanding misses → better utilizes memory system
Performance And Utilization
• Performance (IPC) important
• Utilization (actual IPC / peak IPC) important too
• Even moderate superscalars (e.g., 4-way) not fully utilized
Average sustained IPC: 1.5–2 → <50% utilization
Mis‐predicted branches
Cache misses, especially L2
Data dependences
• Multi‐threading (MT)
Improve utilization by multiplexing multiple threads on a single CPU
One thread cannot fully utilize CPU? Maybe 2, 4 (or 100) can
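A quick back-of-the-envelope check of the utilization claim above (the 1.75 sustained IPC is just the midpoint of the quoted 1.5–2 range):

```c
/* Back-of-the-envelope utilization for the numbers above. */
#include <stdio.h>

int main(void) {
    double peak_ipc      = 4.0;   /* 4-way superscalar            */
    double sustained_ipc = 1.75;  /* midpoint of the 1.5-2 range  */
    printf("utilization = %.0f%%\n", 100.0 * sustained_ipc / peak_ipc);
    /* prints: utilization = 44%  -> well under 50%                */
    return 0;
}
```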
Latency vs Throughput
• MT trades (single‐thread) latency for throughput
– Sharing processor degrades latency of individual threads
+ But improves aggregate latency of both threads
+ Improves utilization
• Example
Thread A: individual latency=10s, latency with thread B=15s
Thread B: individual latency=20s, latency with thread A=25s
Sequential latency (first A then B or vice versa): 30s
Parallel latency (A and B simultaneously): 25s
– MT slows each thread by 5s
+ But improves total latency by 5s
• Different workloads have different parallelism
SpecFP has lots of ILP (can use an 8-wide machine)
Server workloads have TLP (can use multiple threads)

Core Sharing
Time sharing:
Run one thread
On a long‐latency operation (e.g., cache miss), switch
Also known as “switch‐on‐miss” multithreading
E.g., Niagara (UltraSPARC T1/T2)
Space sharing:
Across pipeline depth
Fetch and issue each cycle from a different thread
Both across pipeline width and depth
Fetch and issue each cycle from multiple threads
Policy to decide which to fetch gets complicated
Also known as “simultaneous” multithreading
E.g., Alpha 21464, IBM POWER5

Instruction Issue
Reduced function unit utilization due to dependencies
Superscalar Issue
Superscalar leads to more performance, but lower utilization
Predicated Issue
Adds to function unit utilization, but results are thrown away
Chip Multiprocessor
Limited utilization when only running one thread
Coarse-grain Multithreading
Preserves single-thread performance, but can only hide long latencies (i.e., main memory accesses)
Coarse-Grain Multithreading (CGMT)
• Coarse‐Grain Multi‐Threading (CGMT)
+ Sacrifices very little single thread performance (of one thread)
– Tolerates only long latencies (e.g., L2 misses)
Thread scheduling policy
Designate a “preferred” thread (e.g., thread A)
Switch to thread B on thread A L2 miss
Switch back to A when A L2 miss returns
Pipeline partitioning
None, flush on switch
– Can’t tolerate latencies shorter than twice pipeline depth
Need short in-order pipeline for good performance
Example: IBM Northstar/Pulsar
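A minimal C sketch of switch-on-miss scheduling with a preferred thread, as described above; the two-thread setup and names are illustrative assumptions:

```c
/* Sketch of coarse-grain ("switch-on-miss") scheduling with a
 * designated preferred thread; two threads assumed. */
#include <stdbool.h>

enum { THREAD_A = 0, THREAD_B = 1 };

typedef struct {
    int  running;           /* thread currently in the pipeline  */
    int  preferred;         /* designated preferred thread       */
    bool miss_pending[2];   /* outstanding L2 miss per thread    */
} cgmt_t;

/* L2 miss from the running thread: flush the pipeline (there is no
 * partitioning) and switch to the other thread. */
void on_l2_miss(cgmt_t *c) {
    c->miss_pending[c->running] = true;
    c->running = 1 - c->running;        /* flush + refill pipeline */
}

/* An outstanding miss returns: switch back if it belongs to the
 * preferred thread. */
void on_l2_fill(cgmt_t *c, int thread) {
    c->miss_pending[thread] = false;
    if (thread == c->preferred)
        c->running = c->preferred;      /* flush + refill pipeline */
}
```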
CGMT
[Figure: baseline pipeline (I$, branch predictor, register file, D$) versus the CGMT pipeline, which adds a thread scheduler and one register file per thread; an "L2 miss?" signal triggers the thread switch.]

Fine Grained Multithreading
Saturated workload → lots of threads
Unsaturated workload → lots of stalls
Intra‐thread dependencies still limit performance
Fine-Grain Multithreading (FGMT)
• Fine‐Grain Multithreading (FGMT)
– Sacrifices significant single thread performance
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)
Thread scheduling policy
Switch threads every cycle (round‐robin), L2 miss or no
Pipeline partitioning
Dynamic, no flushing
Length of pipeline doesn’t matter
– Need a lot of threads
Extreme example: Denelcor HEP
So many threads (100+), it didn’t even need caches
Failed commercially
Not popular today
Many threads → many register files

Fine-Grain Multithreading
• FGMT
(Many) more threads
Multiple threads in pipeline at once
[Figure: FGMT pipeline with a thread scheduler and one register file per thread feeding the shared I$/D$/branch-predictor pipeline.]
Simultaneous Multithreading
Maximum utilization of function units by independent operations
Simultaneous Multithreading (SMT)
• Can we multithread an out‐of‐order machine?
Don’t want to give up performance benefits
Don't want to give up natural tolerance of D$ (L1) miss latency
• Simultaneous multithreading (SMT)
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches)
± Sacrifices some single thread performance
Thread scheduling policy
Round‐robin (just like FGMT)
Pipeline partitioning
Dynamic, hmmm…
Example: Pentium4 (hyper‐threading): 5‐way issue, 2 threads
Another example: Alpha 21464: 8‐way issue, 4 threads (canceled)
Simultaneous Multithreading (SMT)
• SMT
Replicate map table, share physical register file

[Figure: baseline out-of-order pipeline with a single map table and register file versus the SMT pipeline, which adds a thread scheduler and per-thread map tables in front of one shared physical register file, I$, D$, and branch predictor.]
Issues for SMT
• Cache interference
General concern for all MT variants
Can the working sets of multiple threads fit in the caches?
Shared memory SPMD threads help here
+ Same insns → share I$
+ Shared data → less D$ contention
MT is good for “server” workloads
To keep miss rates low, SMT might need a larger L2 (which is OK)
Out‐of‐order tolerates L1 misses
• Large map table and physical register file
#mt‐entries = (#threads * #arch‐regs)
#phys‐regs = (#threads * #arch‐regs) + #in‐flight insns
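A quick check of these sizing formulas for an assumed 4-thread SMT with 32 architectural registers and 128 in-flight instructions (the specific numbers are illustrative, not from the slides):

```c
/* Plugging illustrative numbers into the two formulas above. */
#include <stdio.h>

int main(void) {
    int threads   = 4;    /* SMT threads                    */
    int arch_regs = 32;   /* architectural registers/thread */
    int in_flight = 128;  /* maximum in-flight instructions */

    int map_entries = threads * arch_regs;             /* 4*32  = 128 */
    int phys_regs   = threads * arch_regs + in_flight; /* 128+128=256 */

    printf("map table entries:  %d\n", map_entries);
    printf("physical registers: %d\n", phys_regs);
    return 0;
}
```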
SMT Resource Partitioning
• How are ROB/MOB, RS partitioned in SMT?
Depends on what you want to achieve
• Static partitioning
Divide ROB/MOB, RS into T static equal‐sized partitions
+ Ensures that low-IPC threads don't starve high-IPC ones
Low‐IPC threads stall and occupy ROB/MOB, RS slots
– Low utilization
• Dynamic partitioning
Divide ROB/MOB, RS into dynamically resizing partitions
Let threads fight amongst themselves
+ High utilization
– Possible starvation
ICOUNT: fetch policy prefers thread with fewest in‐flight insns
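A minimal C sketch contrasting the two dispatch-time admission checks; the per-thread occupancy-limit scheme is an illustrative assumption, not a specific real design:

```c
/* Sketch of the two dispatch-time admission checks for a shared ROB. */
#include <stdbool.h>

#define THREADS  4
#define ROB_SIZE 256

typedef struct {
    int occupancy[THREADS];  /* in-flight entries per thread */
    int total;               /* total entries in use         */
} rob_t;

/* Static partitioning: each thread owns ROB_SIZE/THREADS entries.
 * A stalled low-IPC thread cannot crowd out the others, but its
 * unused slots are wasted (low utilization). */
bool can_dispatch_static(const rob_t *r, int thread) {
    return r->occupancy[thread] < ROB_SIZE / THREADS;
}

/* Dynamic partitioning: all threads fight for the whole ROB.
 * Utilization is high, but a thread stalled on a long miss can hoard
 * entries and starve the rest -- hence fetch policies like ICOUNT. */
bool can_dispatch_dynamic(const rob_t *r, int thread) {
    (void)thread;
    return r->total < ROB_SIZE;
}
```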
Power Implications of MT
• Is MT (of any kind) power efficient?
Static power? Yes
Dissipated regardless of utilization
Dynamic power? Less clear, but probably yes
Highly utilization dependent
Major factor is additional cache activity
Some debate here
Overall? Yes
Static power is a growing fraction of total power
SMT vs. CMP
• If you wanted to run multiple threads would you build a…
Chip multiprocessor (CMP): multiple separate pipelines?
A multithreaded processor (SMT): a single larger pipeline? • Both will get you throughput on multiple threads
CMP will be simpler, possibly faster clock
SMT will get you better performance (IPC) on a single thread
SMT is basically an ILP engine that converts TLP to ILP
CMP is mainly a TLP engine
• Again, do both
Sun's Niagara (UltraSPARC T1)
8 processors, each with 4 threads (fine-grained threading)
1Ghz clock, in‐order, short pipeline (6 stages or so)
Designed for power‐efficient “throughput computing”
Niagara: 32 Threads on Chip
8 six-stage in-order pipelines
Each pipeline 4-way fine-grain multithreaded
Small L1 write-through caches
Shared L2
Screams for Web, OLTP
Shipping @ 1.2GHz clock
Alpha 21464 Architecture Overview [Slides from Shubu Mukherjee @ Intel]

8-wide out-of-order superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA for up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)
Basic Out-of-order Pipeline
[Figure: basic out-of-order pipeline stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire, with a single PC, register map, register files, Icache, and Dcache. Every structure is thread-blind.]
SMT Pipeline
[Figure: the same pipeline with SMT support: multiple PCs and register maps at the front end, while the queue, register files, execution units, Icache, and Dcache are shared across threads.]
Changes for SMT
Basic pipeline – unchanged
Replicated resources
Program counters
Register maps
Shared resources
Register file (size increased)
Instruction queue
First and second level caches
Translation buffers
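A rough C sketch of which structures end up replicated per thread versus shared under this scheme; sizes and field names are illustrative assumptions:

```c
/* Sketch of per-thread vs. shared structures in the SMT core above. */
#include <stdint.h>

#define SMT_THREADS 4
#define ARCH_REGS   32

typedef struct {
    /* Replicated per thread */
    uint64_t pc[SMT_THREADS];                    /* program counters */
    int      map_table[SMT_THREADS][ARCH_REGS];  /* register maps    */

    /* Shared by all threads (entries just carry a thread id) */
    uint64_t phys_regs[512];   /* enlarged physical register file;
                                  512 is an illustrative size */
    /* instruction queue, L1/L2 caches, and TLBs are likewise shared */
} smt_core_t;
```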
Fetch Policy (Key)
Simple fetch (e.g., round-robin) policies don't work. Why?
• Instructions from slow threads hog the pipeline
• Once instructions are placed in the IQ, they cannot be removed until they execute

Fetch optimization:
• i-count (sketched below)
• Favor faster threads
• Count retirement rate and bias accordingly
• Must avoid starvation (tricky)
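A minimal C sketch of an ICOUNT-style selection function (each cycle, fetch from the thread with the fewest in-flight instructions in the front end); details such as the starvation bound are omitted and the names are illustrative:

```c
/* Sketch of ICOUNT-style thread selection: each cycle, fetch from the
 * thread with the fewest instructions in the front end / issue queue. */
#define THREADS 4

int icount_pick(const int in_flight[THREADS]) {
    int best = 0;
    for (int t = 1; t < THREADS; t++)
        if (in_flight[t] < in_flight[best])
            best = t;           /* slow threads stop hogging the IQ */
    return best;
}
```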
Multiprogrammed workload
[Figure: relative throughput (100% = one thread) for 1, 2, 3, and 4 threads on SpecInt, SpecFP, and mixed Int/FP multiprogrammed workloads; y-axis up to 250%.]
Decomposed SPEC95 Applications
[Figure: relative throughput (100% = one thread) for 1, 2, 3, and 4 threads on decomposed SPEC95 applications Turb3d, Swm256, and Tomcatv; y-axis up to 250%.]
Multithreaded Applications
[Figure: relative throughput (100% = one thread) for 1, 2, and 4 threads on the multithreaded applications Barnes, Chess, Sort, and TP; y-axis up to 300%.]