
EECS 470 Lecture 24
Chip Multiprocessors and Simultaneous Multithreading
Fall 2007
Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs470

Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin.

Announcements

• HW 6 Posted, due 12/7

• Project due 12/10

◦ In-class presentations (~8 minutes + questions)

Base Snoopy Organization

(Figure: base snoopy cache organization. A processor-side controller and a bus-side snoop controller each have their own copy of the tags and state; the data RAM is shared; a write-back buffer has its own tag and comparator; Addr/Cmd/Data connect to the processor on one side and to the system bus on the other.)
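The duplicated tag/state arrays in the figure are the key point: the bus-side controller answers snoops from its own copy of the tags without stealing lookup bandwidth from the processor. A minimal sketch of that idea (all structure and names are illustrative, not taken from any particular design):

```cpp
// Illustrative sketch only: the data RAM is shared, but tags and state are
// duplicated so processor lookups and bus snoops can proceed in parallel
// without contending for the same tag port.
#include <array>
#include <cstdint>
#include <iostream>

enum class State { Invalid, Shared, Exclusive, Modified };

struct TagState {              // one entry of a tag/state RAM
    uint64_t tag = 0;
    State state = State::Invalid;
};

struct CacheSet {
    TagState procTags;         // processor-side copy (P-side controller)
    TagState snoopTags;        // bus-side copy (snoop controller)
    std::array<uint8_t, 64> data{};   // shared data RAM: one 64-byte line
};

int main() {
    CacheSet set;

    // Processor-side fill: both tag copies must be kept consistent, because
    // the snoop controller answers the bus from its own copy.
    set.procTags.tag = 0xABC;
    set.procTags.state = State::Exclusive;
    set.snoopTags = set.procTags;

    // A bus snoop (e.g., BusRd for tag 0xABC) probes only the snoop-side tags,
    // leaving the processor-side tag port free for the CPU's own access.
    const uint64_t snoopTag = 0xABC;
    bool snoopHit = (set.snoopTags.tag == snoopTag &&
                     set.snoopTags.state != State::Invalid);
    std::cout << "snoop hit: " << std::boolalpha << snoopHit << "\n";
}
```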

Non-Atomic State Transitions

Operations involve multiple actions

◦ Look up cache tags

◦ Bus arbitration

◦ Check for writeback

◦ Even if bus is atomic, overall set of actions is not

◦ Race conditions among multiple operations

Suppose P1 and P2 attempt to write cached block A

◦ Each decides to issue BusUpgr to allow S → M

Issues

◦ Handle requests for other blocks while waiting to acquire bus

◦ Must handle requests for this block A
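A hedged sketch of the race above: both controllers sit in a transient "waiting for bus grant" state, and the loser must notice the winner's bus transaction and convert its planned BusUpgr into a full BusRdX. The state names and controller interface are hypothetical; a real controller handles many more cases.

```cpp
// Hypothetical controller sketch for the S -> M race: P1 and P2 both try to
// upgrade block A. Whoever wins bus arbitration invalidates the other, so the
// loser must notice, while in its transient state, that it needs a BusRdX.
#include <iostream>
#include <string>

enum class State { I, S, SM_waiting, M };   // SM_waiting = transient state

struct Controller {
    std::string name;
    State state = State::S;

    void requestWrite() {                   // PrWr while in S: request the bus
        if (state == State::S) state = State::SM_waiting;
    }
    void observeOtherWriter() {             // another processor's BusUpgr/BusRdX
        // If we were M we would also flush the dirty data here.
        state = State::I;
    }
    void busGranted() {
        if (state == State::SM_waiting) {   // still hold S data: BusUpgr suffices
            state = State::M;
        } else if (state == State::I) {     // lost the race: need the data again
            std::cout << name << " lost the race, reissues as BusRdX\n";
            state = State::M;               // after the BusRdX completes
        }
    }
};

int main() {
    Controller p1{"P1"}, p2{"P2"};
    p1.requestWrite();
    p2.requestWrite();            // both now in SM_waiting, both want the bus
    p2.observeOtherWriter();      // P1 wins arbitration; P2 snoops P1's upgrade
    p1.busGranted();              // P1: S -> M via BusUpgr
    p2.busGranted();              // P2: falls back to BusRdX
    p1.observeOtherWriter();      // P2's BusRdX invalidates P1's copy
    std::cout << std::boolalpha
              << "P1 modified: " << (p1.state == State::M)
              << ", P2 modified: " << (p2.state == State::M) << "\n";
}
```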

Non-Atomicity → Transient States

Two types of states

• Stable (e.g., MESI)

• Transient or intermediate (e.g., S → M while waiting for a bus grant)

Increases complexity

(Figure: MESI state diagram augmented with transient states I → S,E, I → M, and S → M; transitions are labeled with processor and bus actions such as PrRd/—, PrWr/BusReq, BusGrant/BusUpgr, BusGrant/BusRdX, BusRd/Flush, and BusRdX/Flush′.)

Scalability Problems of Snoopy Coherence

• Prohibitive bus bandwidth

◦ Required bandwidth grows with # CPUs…

◦ …but available BW per bus is fixed

◦ Adding busses makes serialization/ordering hard

• Prohibitive processor snooping bandwidth

◦ All caches do tag lookup when ANY processor accesses memory

◦ Inclusion limits this to L2, but still lots of lookups

• Upshot: bus-based coherence doesn't scale beyond 8–16 CPUs

Scalable Cache Coherence

• Scalable cache coherence: two-part solution

• Part I: bus bandwidth

◦ Replace non-scalable bandwidth substrate (bus)…

◦ …with scalable bandwidth one (point-to-point network, e.g., mesh)

• Part II: processor snooping bandwidth

◦ Interesting: most snoops result in no action

◦ Replace non-scalable broadcast protocol (spam everyone)…

◦ …with scalable directory protocol (only spam processors that care)

Directory Coherence Protocols

• Observe: physical address space statically partitioned

+ Can easily determine which memory module holds a given line

▪ That memory module sometimes called the "home"

– Can't easily determine which processors have line in their caches

◦ Bus-based protocol: broadcast events to all processors/caches

± Simple and fast, but non-scalable

• Directories: non-broadcast coherence protocol

◦ Extend memory to track caching information

◦ For each physical cache line whose home this is, track:

▪ Owner: which processor has a dirty copy (i.e., M state)

▪ Sharers: which processors have clean copies (i.e., S state)

◦ Processor sends coherence event to home directory

▪ Home directory only sends events to processors that care
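A minimal sketch of the bookkeeping described above, with hypothetical names: one directory entry per line whose home this is, plus read-miss and write-miss handlers that contact only the owner and sharers recorded in the entry.

```cpp
// Minimal sketch (all names hypothetical) of the per-line state a home
// directory keeps, and of how it forwards coherence events only to the
// processors that actually hold the line.
#include <bitset>
#include <iostream>
#include <optional>

constexpr int kNumProcs = 16;

struct DirEntry {
    std::optional<int>     owner;     // processor with the dirty (M) copy, if any
    std::bitset<kNumProcs> sharers;   // processors with clean (S) copies
};

// Read miss from 'proc': if some owner has a dirty copy, ask it to write back
// or forward the data; otherwise memory supplies it. Either way, add a sharer.
void handleReadMiss(DirEntry& e, int proc) {
    if (e.owner) {
        std::cout << "  forward/downgrade request to owner P" << *e.owner << "\n";
        e.sharers.set(*e.owner);      // owner keeps a clean copy
        e.owner.reset();
    }
    e.sharers.set(proc);
}

// Write miss from 'proc': invalidate exactly the current sharers/owner,
// then record 'proc' as the new owner.
void handleWriteMiss(DirEntry& e, int proc) {
    if (e.owner && *e.owner != proc)
        std::cout << "  invalidate owner P" << *e.owner << "\n";
    for (int p = 0; p < kNumProcs; ++p)
        if (e.sharers.test(p) && p != proc)
            std::cout << "  invalidate sharer P" << p << "\n";
    e.sharers.reset();
    e.owner = proc;
}

int main() {
    DirEntry a;                                              // line A, home = here
    std::cout << "P1 reads A:\n";  handleReadMiss(a, 1);     // A: Shared, {1}
    std::cout << "P2 writes A:\n"; handleWriteMiss(a, 2);    // A: Modified, 2
}
```

The two calls in main reproduce the traces on the next two slides: after P1's read the entry is Shared, {#1}; after P2's write it is Modified, owned by #2.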

Read Processing

(Sequence: Node #1 takes a read miss on block A and sends a read request to A's home directory; the directory supplies the data and records the entry A: Shared, {#1}.)

Write Processing

(Sequence: block A starts out Shared at Node #1. Node #2 misses on a write of A; the directory invalidates #1's copy, supplies the data, and updates the entry to A: Modified, {#2}.)

Trade-offs:
• Longer accesses (3-hop between processor, directory, other processor)
• Lower bandwidth → no snoops necessary

Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP > 32 nodes)

Chip Performance Scalability: History & Expectations

(Chart: chip performance in MIPS from 1970 to 2010 on a log scale, progressing from the first microprocessors through RISC, superscalar, out-of-order, and SMT/SMP on chip toward "NUMA on chip?", with a goal of 1 tera-instruction/sec by 2010.)

Piranha: Performance and Managed Complexity

Large-scale server based on CMP nodes

CMP architecture

◦ excellent platform for exploiting thread-level parallelism

◦ inherently emphasizes replication over monolithic complexity

Design methodology reduces implementation complexity

◦ novel simulation methodology

◦ use ASIC physical design

Piranha: 2x performance advantage with team size of approximately 20 people

Piranha Processing Node

Next few slides from Luiz Barroso's ISCA 2000 presentation of "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing"; the processing node is built up feature by feature:

Alpha core: 1-issue, in-order, 500 MHz
L1 caches: I & D, 64 KB, 2-way
Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
L2 cache: shared, 1 MB, 8-way
Memory controller (MC): RDRAM, 12.8 GB/sec (8 banks @ 1.6 GB/sec)
Protocol engines (HE & RE): μprogrammed, 1K μinstructions, even/odd interleaving
System interconnect: 4-port crossbar router, topology independent, 4 links @ 8 GB/s, 32 GB/sec total bandwidth

(Figure: 8 CPUs, each with private I$ and D$, connected through the ICS to the shared L2 banks, memory controllers, home and remote protocol engines, and the on-chip router.)

Single-Chip Piranha Performance

(Chart: normalized execution time, broken into CPU, L2-hit, and L2-miss components, for P1 (1-issue, 500 MHz), INO (1-issue, 1 GHz), OOO (4-issue, 1 GHz), and P8 (8-CPU Piranha, 1-issue, 500 MHz each) on OLTP and DSS.)

Piranha's performance margin: 3x for OLTP and 2.2x for DSS
Piranha has more outstanding misses → better utilizes memory system

Performance And Utilization

• Performance (IPC) important

• Utilization (actual IPC / peak IPC) important too

• Even moderate superscalars (e.g., 4-way) not fully utilized

◦ Average sustained IPC: 1.5–2 → <50% utilization

▪ Mis-predicted branches

▪ Cache misses, especially L2

▪ Data dependences

• Multi‐threading (MT)

◦ Improve utilization by multiplexing multiple threads on a single CPU

◦ One thread cannot fully utilize CPU? Maybe 2, 4 (or 100) can

Latency vs Throughput

• MT trades (single‐thread) latency for throughput

– Sharing processor degrades latency of individual threads

+ But improves aggregate latency of both threads

+ Improves utilization

• Example (worked through in the sketch after this list)

◦ Thread A: individual latency = 10s, latency with thread B = 15s

◦ Thread B: individual latency = 20s, latency with thread A = 25s

◦ Sequential latency (first A then B, or vice versa): 30s

◦ Parallel latency (A and B simultaneously): 25s

– MT slows each thread by 5s

+ But improves total latency by 5s

• Different workloads have different parallelism

◦ SpecFP has lots of ILP (can use an 8-wide machine)

◦ Server workloads have TLP (can use multiple threads)
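A tiny sketch of the arithmetic in the example above (numbers copied from the slide): each thread gets 5 seconds slower, but the pair finishes 5 seconds sooner.

```cpp
// Multithreading trades per-thread latency for aggregate throughput:
// each thread is slower when sharing the core, but the pair finishes sooner.
#include <algorithm>
#include <iostream>

int main() {
    double a_alone = 10, b_alone = 20;      // seconds, each thread run by itself
    double a_shared = 15, b_shared = 25;    // seconds, when sharing the core

    double sequential = a_alone + b_alone;              // 30 s, one after the other
    double parallel   = std::max(a_shared, b_shared);   // 25 s, run together

    std::cout << "per-thread slowdown A: " << a_shared - a_alone << " s\n";   // 5 s
    std::cout << "per-thread slowdown B: " << b_shared - b_alone << " s\n";   // 5 s
    std::cout << "total latency saved:   " << sequential - parallel << " s\n"; // 5 s
}
```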

Core Sharing

Time sharing:

◦ Run one thread

◦ On a long-latency operation (e.g., cache miss), switch

◦ Also known as "switch-on-miss" multithreading

◦ E.g., Niagara (UltraSPARC T1/T2)

Space sharing:

◦ Across pipeline depth

▪ Fetch and issue each cycle from a different thread

◦ Both across pipeline width and depth

▪ Fetch and issue each cycle from multiple threads

▪ Policy to decide which to fetch gets complicated

▪ Also known as "simultaneous" multithreading

▪ E.g., IBM POWER5

Instruction Issue


Reduced function unit utilization due to dependencies

Superscalar Issue


Superscalar leads to more performance, but lower utilization

Predicated Issue


Adds to function unit utilization, but results are thrown away

Chip Multiprocessor


Limited utilization when only running one thread

Coarse-grain Multithreading


Preserves single-thread performance, but can only hide long latencies (i.e., main memory accesses)

Coarse-Grain Multithreading (CGMT)

• Coarse‐Grain Multi‐Threading (CGMT)

+ Sacrifices very little single thread performance (of one thread)

– Tolerates only long latencies (e.g., L2 misses)

◦ Thread scheduling policy

▪ Designate a "preferred" thread (e.g., thread A)

▪ Switch to thread B on thread A L2 miss

▪ Switch back to A when A's L2 miss returns

◦ Pipeline partitioning

▪ None, flush on switch

– Can't tolerate latencies shorter than twice pipeline depth

▪ Need short in-order pipeline for good performance

◦ Example: IBM Northstar/Pulsar
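A rough sketch of the switch-on-miss policy just described, with illustrative latencies: run the preferred thread A, switch to B when A misses in the L2, and switch back when the miss data returns. The flush cost is only a constant here; real designs differ.

```cpp
// Illustrative CGMT scheduler: one preferred thread, switch on L2 miss,
// switch back when the miss returns. All numbers are made up for the sketch.
#include <iostream>

int main() {
    const int kMissLatency  = 100;   // cycles until A's L2 miss data returns
    const int kFlushPenalty = 7;     // short in-order pipeline refill cost

    int running    = 0;              // 0 = preferred thread A, 1 = thread B
    int missReturn = -1;             // cycle at which A's outstanding miss returns

    for (int cycle = 0; cycle < 300; ++cycle) {
        if (running == 0 && cycle == 50) {          // thread A takes an L2 miss
            missReturn = cycle + kMissLatency;
            running = 1;                            // switch to B, flush the pipeline
            std::cout << cycle << ": A misses in L2, switch to B (+"
                      << kFlushPenalty << " flush cycles)\n";
        }
        if (running == 1 && cycle == missReturn) {  // A's miss data is back
            running = 0;                            // switch back to the preferred thread
            std::cout << cycle << ": A's miss returns, switch back to A\n";
        }
    }
}
```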

CGMT

(Figure: a single-threaded pipeline (I$, branch predictor, register file, D$) versus the CGMT version with one register file per thread and a thread scheduler triggered by an "L2 miss?" signal.)

Fine Grained Multithreading


Saturated workload → lots of threads

Unsaturated workload → lots of stalls

Intra-thread dependencies still limit performance

Fine-Grain Multithreading (FGMT)

• Fine‐Grain Multithreading (FGMT)

– Sacrifices significant single thread performance

+ Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)

◦ Thread scheduling policy

▪ Switch threads every cycle (round-robin), L2 miss or no

◦ Pipeline partitioning

▪ Dynamic, no flushing

▪ Length of pipeline doesn't matter

– Need a lot of threads

◦ Extreme example: Denelcor HEP

▪ So many threads (100+), it didn't even need caches

▪ Failed commercially

◦ Not popular today

▪ Many threads → many register files

Fine-Grain Multithreading

• FGMT

◦ (Many) more threads

◦ Multiple threads in pipeline at once

(Figure: the FGMT pipeline (I$, branch predictor, D$) with a thread scheduler and one register file per thread.)

Simultaneous Multithreading


Maximum utilization of function units by independent operations

Simultaneous Multithreading (SMT)

• Can we multithread an out‐of‐order machine?

◦ Don't want to give up performance benefits

◦ Don't want to give up natural tolerance of D$ (L1) miss latency

• Simultaneous multithreading (SMT)

+ Tolerates all latencies (e.g., L2 misses, mispredicted branches)

± Sacrifices some single thread performance

◦ Thread scheduling policy

▪ Round-robin (just like FGMT)

◦ Pipeline partitioning

▪ Dynamic, hmmm…

◦ Example: Pentium 4 (hyper-threading): 5-way issue, 2 threads

◦ Another example: Alpha 21464: 8-way issue, 4 threads (canceled)

Simultaneous Multithreading (SMT)

• SMT

◦ Replicate map table, share physical register file

(Figure: the baseline pipeline (I$, branch predictor, map table, register file, D$) versus the SMT version with per-thread map tables, a thread scheduler, and a shared physical register file.)

Issues for SMT

• Cache interference

◦ General concern for all MT variants

◦ Can the working sets of multiple threads fit in the caches?

◦ Shared-memory SPMD threads help here

+ Same insns → share I$

+ Shared data → less D$ contention

▪ MT is good for "server" workloads

◦ To keep miss rates low, SMT might need a larger L2 (which is OK)

▪ Out-of-order tolerates L1 misses

• Large map table and physical register file

◦ #mt-entries = (#threads * #arch-regs)

◦ #phys-regs = (#threads * #arch-regs) + #in-flight insns
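A back-of-the-envelope check of the two formulas above, using illustrative parameters (4 threads, 32 architectural registers, 128 in-flight instructions); these numbers are assumptions, not from any specific machine.

```cpp
// Sizing the rename structures for SMT: each thread needs its own map table,
// and the shared physical register file must hold every thread's committed
// state plus all in-flight renamed values.
#include <iostream>

int main() {
    int threads  = 4;     // SMT thread contexts (illustrative)
    int archRegs = 32;    // architectural registers per thread
    int inFlight = 128;   // maximum in-flight (renamed but uncommitted) insns

    int mapEntries = threads * archRegs;             // one map table per thread
    int physRegs   = threads * archRegs + inFlight;  // committed state + renames

    std::cout << "map-table entries:  " << mapEntries << "\n";   // 128
    std::cout << "physical registers: " << physRegs << "\n";     // 256
}
```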

SMT Resource Partitioning

• How are ROB/MOB, RS partitioned in SMT?

◦ Depends on what you want to achieve

• Static partitioning

◦ Divide ROB/MOB, RS into T static equal-sized partitions

+ Ensures that low-IPC threads don't starve high-IPC ones

▪ Low-IPC threads stall and occupy ROB/MOB, RS slots

– Low utilization

• Dynamic partitioning

◦ Divide ROB/MOB, RS into dynamically resizing partitions

◦ Let threads fight for resources amongst themselves

+ High utilization

– Possible starvation

◦ ICOUNT: fetch policy prefers thread with fewest in-flight insns
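A sketch of the ICOUNT idea (not the actual 21464 or Pentium 4 implementation): each cycle, fetch from the thread with the fewest instructions in the front end and issue queues, which automatically throttles threads that are clogging shared resources.

```cpp
// ICOUNT-style fetch selection sketch: pick the thread with the lowest
// in-flight instruction count. Occupancy numbers below are illustrative.
#include <algorithm>
#include <iostream>
#include <vector>

int pickFetchThread(const std::vector<int>& inFlightCount) {
    // Lowest in-flight count wins; ties broken by lower thread id.
    return static_cast<int>(std::min_element(inFlightCount.begin(),
                                             inFlightCount.end())
                            - inFlightCount.begin());
}

int main() {
    std::vector<int> inFlight = {42, 7, 19, 30};   // per-thread queue occupancy
    std::cout << "fetch from thread " << pickFetchThread(inFlight) << "\n";  // 1
}
```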

Power Implications of MT

• Is MT (of any kind) power efficient?

◦ Static power? Yes

▪ Dissipated regardless of utilization

◦ Dynamic power? Less clear, but probably yes

▪ Highly utilization dependent

▪ Major factor is additional cache activity

▪ Some debate here

◦ Overall? Yes

▪ Static power relatively increasing

SMT vs. CMP

• If you wanted to run multiple threads would you build a…

◦ Chip multiprocessor (CMP): multiple separate pipelines?

◦ A multithreaded processor (SMT): a single larger pipeline?

• Both will get you throughput on multiple threads

◦ CMP will be simpler, possibly faster clock

◦ SMT will get you better performance (IPC) on a single thread

▪ SMT is basically an ILP engine that converts TLP to ILP

▪ CMP is mainly a TLP engine

• Again, do both

◦ Sun's Niagara (UltraSPARC T1)

◦ 8 processors, each with 4 threads (coarse-grained threading)

◦ 1 GHz clock, in-order, short pipeline (6 stages or so)

◦ Designed for power-efficient "throughput computing"

Niagara: 32 Threads on Chip

8 six-stage in-order pipelines
Each pipeline 4-way fine-grain multithreaded
Small L1 write-through caches
Shared L2

Screams for Web, OLTP
Shipping @ 1.2 GHz clock

Alpha 21464 Architecture Overview [Slides from Shubu Mukherjee @ Intel]

8-wide out-of-order superscalar
Large on-chip L2 cache
Direct RAMBUS interface
On-chip router for system interconnect
Glueless, directory-based, ccNUMA
◦ for up to 512-way multiprocessing
4-way simultaneous multithreading (SMT)

Basic Out-of-order Pipeline

Stages: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire

(Figure: the baseline pipeline with a single PC, register map, register files, Icache, and Dcache, all thread-blind.)

SMT Pipeline

Stages: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire

(Figure: the same pipeline with per-thread PCs and register maps; the register files, caches, and execution resources are shared.)

Changes for SMT

Basic pipeline – unchanged

Replicated resources

◦ Program counters

◦ Register maps

Shared resources

◦ Register file (size increased)

◦ Instruction queue

◦ First and second level caches

◦ Translation buffers

◦ …

Fetch Policy (Key)

Simple fetch (e.g., round-robin) policies don't work. Why?
• Instructions from slow threads hog the pipeline
• Once instructions are placed in the IQ, they cannot be removed until they execute

Fetch optimization:
• ICOUNT
• Favor faster threads
• Count retirement rate and bias accordingly
• Must avoid starvation (tricky)

Multiprogrammed Workload

(Chart: performance for 1T/2T/3T/4T configurations on SpecInt, SpecFP, and mixed Int/FP multiprogrammed workloads; vertical axis 0–250%.)

Decomposed SPEC95 Applications

(Chart: performance for 1T/2T/3T/4T on the decomposed applications Turb3d, Swm256, and Tomcatv; vertical axis 0–250%.)

Multithreaded Applications

(Chart: performance for 1T/2T/4T on the multithreaded applications Barnes, Chess, Sort, and TP; vertical axis 0–300%.)
