
Multicore and Multithreaded Processors

• Why multicore?
• Thread-level parallelism
• Multithreaded cores
• Multiprocessors
• Design issues
• Examples

Multithreaded Cores

• So far, our core executes one thread at a time
• Multithreaded core: execute multiple threads at a time
  • Old idea … but made a big comeback fairly recently
• How do we execute multiple threads on the same core? (see the sketch after this slide)
  • Coarse-grain switching
  • Fine-grain switching
  • Simultaneous multithreading (SMT) → "hyperthreading"
• Benefits?
  • Better instruction throughput
  • Greater resource utilization
  • Tolerates long-latency events (e.g., misses)
  • Cheaper than multiple complete cores (does this matter any more?)
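To make the fine-grain switching idea concrete, here is a minimal sketch in C (not from the slides; the thread count, miss pattern, and miss latency are made-up illustration values) of a core that holds several thread contexts and, each cycle, issues from the next thread that is not stalled on a miss, so one thread's long-latency event does not idle the pipeline.

```c
/* Minimal sketch of fine-grain multithreading: round-robin issue among
 * thread contexts, skipping threads stalled on a (pretend) cache miss. */
#include <stdio.h>

#define NTHREADS 4
#define CYCLES   32

int main(void) {
    int stall_until[NTHREADS] = {0};   /* cycle at which each thread becomes ready */
    int issued[NTHREADS]      = {0};   /* instructions issued per thread */
    int rr = 0;                        /* round-robin pointer */

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int issued_this_cycle = 0;
        for (int i = 0; i < NTHREADS && !issued_this_cycle; i++) {
            int t = (rr + i) % NTHREADS;
            if (cycle >= stall_until[t]) {           /* thread t is ready */
                issued[t]++;
                if (issued[t] % 5 == 0)              /* pretend every 5th instr misses */
                    stall_until[t] = cycle + 10;     /* 10-cycle miss latency */
                rr = (t + 1) % NTHREADS;
                issued_this_cycle = 1;
            }
        }
        /* if no thread is ready, the cycle is wasted (rare with enough threads) */
    }

    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d issued %d instructions\n", t, issued[t]);
    return 0;
}
```

With only one thread context the same miss pattern would leave the pipeline idle for most cycles; with four contexts almost every cycle issues an instruction.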


Multiprocessors

• Multiprocessors have been around a long time … just not on a single chip
  • Mainframes and servers with 2-64 processors
  • Supercomputers with 100s or 1000s of processors
• Now, multiprocessor on a single chip
  • "Chip multiprocessor" (CMP) or "multicore"
• Why does "single chip" matter so much?
  • ICQ: What's fundamentally different about having a multiprocessor that fits on one chip vs. on multiple chips?

Multicore and Multithreaded Processors

• Why multicore?
• Thread-level parallelism
• Multithreaded cores
• Multiprocessors
• Design issues
• Examples

Multiprocessor Microarchitecture

• Many design issues unique to multiprocessors
  • Interconnection network
  • Communication between cores
  • Memory system design
  • Others?

Multiprocessor Interconnection Networks

• Networks have many design aspects
  • We focus on one here (topology) → see ECE 259 for more on this
• Topology is the structure of the interconnect
  • Geometric property → topology has nice mathematical properties
• Direct vs. indirect networks
  • Direct: all switches attached to host nodes (e.g., mesh)
  • Indirect: many switches not attached to host nodes (e.g., tree)


Direct Topologies: k-ary d-cubes

• Often called k-ary n-cubes
• General class of regular, direct topologies
  • Subsumes rings, tori, cubes, etc.
• d dimensions
  • 1 for ring
  • 2 for mesh or torus
  • 3 for cube
  • Can choose arbitrarily large d, except for cost of switches
• k switches in each dimension, so N = k^d and k = N^(1/d)
  • Note: k can be different in each dimension (e.g., a 2,3,4-ary 3-cube)

Examples of k-ary d-cubes

• 1D ring = k-ary 1-cube
  • d = 1 [always]
  • k = N [always] = 4 [here]
  • Ave dist = ?
• 2D torus = k-ary 2-cube
  • d = 2 [always]
  • k = N^(1/2) [always] = 3 [here]
  • Ave dist = ? (computed by the sketch after this slide)
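The "Ave dist = ?" questions can be answered by brute force. This small sketch (not from the slides) averages the shortest-path hop count over all ordered pairs of distinct nodes, using wrap-around (torus) distance in each dimension; the sizes match the slide examples, a 4-node ring and a 3x3 torus.

```c
/* Brute-force average distance for small k-ary d-cubes (d = 1 or 2). */
#include <stdio.h>

/* shortest distance between coordinates a and b on a ring of k nodes */
static int ring_dist(int a, int b, int k) {
    int d = a > b ? a - b : b - a;
    return d < k - d ? d : k - d;
}

/* average shortest-path distance over all ordered pairs of distinct nodes */
static double avg_dist(int k, int d) {
    long long total = 0, pairs = 0;
    int n = (d == 1) ? k : k * k;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            int dist = (d == 1)
                ? ring_dist(i, j, k)
                : ring_dist(i % k, j % k, k) + ring_dist(i / k, j / k, k);
            total += dist;
            pairs++;
        }
    }
    return (double)total / (double)pairs;
}

int main(void) {
    printf("4-node ring: average distance = %.3f\n", avg_dist(4, 1));
    printf("3x3 torus:   average distance = %.3f\n", avg_dist(3, 2));
    return 0;
}
```

For the 4-node ring this works out to 4/3 ≈ 1.33 hops, and for the 3x3 torus to 1.5 hops.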

k-ary d-cubes in Real World

• Compaq Alpha 21364 (and 21464, R.I.P.)
  • 2D torus (k-ary 2-cube)
• Cray T3D and T3E
  • 3D torus (k-ary 3-cube)
• Intel Larrabee (Knight's Ferry/Corner/Landing)
  • 1D ring
• Tilera Tile64
  • 2D torus

Indirect Topologies

• Indirect topology – most switches not attached to nodes
• Some common indirect topologies
  • Crossbar
  • Tree
  • Butterfly
• Each of the above topologies comes in many flavors


Indirect Topologies: Crossbar

• Crossbar = single switch that directly connects n inputs to m outputs
• Logically equivalent to m n:1 muxes (see the sketch after this slide)
• Very useful component that is used frequently

[Figure: crossbar connecting inputs in0-in3 to outputs out0-out4]

Indirect Topologies: Trees

• Indirect topology – most switches not attached to nodes
• Tree: send message up from leaf to closest common ancestor, then down to recipient
• N host nodes at leaves
• k = branching factor of tree (k = 2 → binary tree)
• d = height of tree = log_k N
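The "m n:1 muxes" view of a crossbar can be written directly as code. This is a minimal sketch (not from the slides): each output independently selects one of the n inputs, and the 4-input / 5-output sizes echo the crossbar figure above; the select settings are arbitrary illustration values.

```c
/* A crossbar modeled as one n:1 mux per output: out[j] = in[select[j]]. */
#include <stdio.h>

#define N_IN  4
#define N_OUT 5

static void crossbar(const int in[N_IN], const int select[N_OUT], int out[N_OUT]) {
    for (int j = 0; j < N_OUT; j++)
        out[j] = in[select[j]];          /* each output picks exactly one input */
}

int main(void) {
    int in[N_IN]      = {100, 101, 102, 103};  /* values arriving on in0..in3 */
    int select[N_OUT] = {2, 0, 3, 1, 0};       /* which input each output takes */
    int out[N_OUT];

    crossbar(in, select, out);
    for (int j = 0; j < N_OUT; j++)
        printf("out%d <- in%d (value %d)\n", j, select[j], out[j]);
    return 0;
}
```

Because every output has its own mux, any mapping of inputs to outputs (including any permutation) can be set up without internal conflicts.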

Indirect Topologies: Fat Trees

• Problem with trees: too much contention at or near the root
• Fat tree: same as a tree, but with more bandwidth near the root (by adding multiple roots and higher-order switches)

[Figure: CM-5 "thinned" fat tree]

Indirect Topologies: Butterflies

• Multistage: nodes at the ends, switches in the middle
• Exactly one path between each pair of nodes (see the routing sketch after this slide)
• Each node sees a tree rooted at itself
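One common way to see the "exactly one path" property is destination-tag routing in a radix-2 butterfly: the port chosen at each stage is simply one bit of the destination address, so the path depends only on the destination, not the source. This sketch (not from the slides; it only prints the port choices rather than modeling the switch wiring) illustrates that for a 3-stage, 8-node butterfly.

```c
/* Destination-tag routing in a radix-2 butterfly with STAGES stages. */
#include <stdio.h>

#define STAGES 3   /* 2^3 = 8 inputs and 8 outputs */

static void route(int src, int dst) {
    printf("route %d -> %d: ports ", src, dst);
    for (int stage = 0; stage < STAGES; stage++) {
        int port = (dst >> (STAGES - 1 - stage)) & 1;  /* MSB-first destination bit */
        printf("%d ", port);
    }
    printf("\n");
}

int main(void) {
    route(0, 5);   /* two different sources ...            */
    route(6, 5);   /* ... choose the same port sequence    */
    return 0;      /*     for the same destination         */
}
```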


Indirect Topologies: More Butterflies

• In general, called k-ary n-flies
  • n stages of radix-k switches
• Have many nice features, esp. logarithmic distances
• But conflicts cause tree saturation
  • How can we spread the traffic more evenly?
• Benes (pronounced "BEN-ish") network: a butterfly followed by a reversed butterfly
  • Routes all permutations without conflict
  • Notice similarity to a fat tree (fold it in half)
  • Randomization is the major breakthrough

Indirect Networks in Real World

• Thinking Machines CM-5 (old … older than you)
  • Fat tree
• Sun UltraEnterprise E10000 (about as old as you)
  • 4 trees (interleaved by address)

Multiprocessor Microarchitecture

• Many design issues unique to multiprocessors
  • Interconnection network
  • Communication between cores
  • Memory system design
  • Others?

Communication Between Cores (Threads)

• How should threads communicate with each other?
• Two popular options (contrasted in the sketch after this slide)
  • Shared memory
    • Perform loads and stores to shared addresses
    • Requires synchronization (can't read before write)
  • Message passing
    • Send messages between threads (cores)
    • No shared address space
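Here is a minimal sketch (not from the slides) contrasting the two options with POSIX threads: a producer hands one integer to a consumer, once through a shared variable guarded by a mutex and condition variable (shared memory, where the "can't read before write" rule is the explicit synchronization), and once through a pipe (message passing: no shared location, just send and receive). Compile with `cc -pthread example.c`.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* --- shared-memory version: loads/stores + explicit synchronization --- */
static int shared_value;
static int shared_ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *sm_consumer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!shared_ready)                 /* can't read before the write */
        pthread_cond_wait(&cond, &lock);
    printf("shared memory: got %d\n", shared_value);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* --- message-passing version: send/receive, no shared address --------- */
static int pipefd[2];

static void *mp_consumer(void *arg) {
    (void)arg;
    int v;
    read(pipefd[0], &v, sizeof v);        /* receive blocks until data arrives */
    printf("message passing: got %d\n", v);
    return NULL;
}

int main(void) {
    pthread_t t;

    /* shared memory: store the data, then signal that it is ready */
    pthread_create(&t, NULL, sm_consumer, NULL);
    pthread_mutex_lock(&lock);
    shared_value = 42;
    shared_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);

    /* message passing: explicitly send the data */
    pipe(pipefd);
    pthread_create(&t, NULL, mp_consumer, NULL);
    int v = 42;
    write(pipefd[1], &v, sizeof v);
    pthread_join(t, NULL);
    return 0;
}
```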


What is (Hardware) Shared Memory?

• Take multiple processors
• Implement a memory system with a single global physical address space (usually)
• Communication assist HW does the "magic"
• Goal 1: minimize memory latency
  • Use co-location & caches
• Goal 2: maximize memory bandwidth
  • Use parallelism & caches

Some Memory System Options

[Figure: four organizations — (a) shared cache: processors P1..Pn share an interleaved first-level cache and interleaved main memory; (b) bus-based shared memory: per-processor caches on a shared bus to memory and I/O devices; (c) dancehall: per-processor caches and interleaved main memory on opposite sides of an interconnection network; (d) distributed memory: memory attached to each processor node, nodes joined by an interconnection network]

Cache Coherence

• According to Webster's dictionary …
  • Cache: a secure place of storage
  • Coherent: logically consistent
  • Cache coherence: keep storage logically consistent
• Coherence requires enforcement of 2 properties
  1) Write propagation
     • All writes eventually become visible to other processors
  2) Write serialization
     • All processors see writes to the same block in the same order

Why Cache Coherent Shared Memory?

• Pluses
  • For applications – looks like a multitasking uniprocessor
  • For OS – only evolutionary extensions required
  • Easy to do communication without the OS
  • Software can worry about correctness first and then performance
• Minuses
  • Proper synchronization is complex
  • Communication is implicit, so it may be harder to optimize
  • More work for hardware designers (i.e., us!)
• Result
  • Cache-coherent shared memory machines are the most successful parallel machines ever


In More Detail

• Efficient naming
  • Virtual to physical mapping with TLBs
• Easy and efficient caching
  • Caching is natural and well-understood
  • Can be done in HW automatically

Symmetric Multiprocessors (SMPs)

• Multiple cores
  • Each has a cache (or multiple caches in a hierarchy)
• Connect with a logical bus (totally-ordered broadcast)
  • Physical bus = set of shared wires
  • Logical bus = functional equivalent of a physical bus
• Implement a snooping cache coherence protocol
  • Broadcast all cache misses on the bus
  • All caches "snoop" the bus and may act (e.g., respond with data)
  • Memory responds otherwise

Cache Coherence Problem (Steps 1 and 2)

[Figures: P1 and P2, each with a private cache, connected by an interconnection network to main memory holding location x. Step 1: P1 executes "ld r2, x" and caches a copy of x. Step 2: P2 executes "ld r2, x" and caches its own copy of x.]

Cache Coherence Problem (Step 3)

[Figure: one of the processors then executes "add r1, r2, r4" and "st x, r1", updating its cached copy of x; the other processor's cached copy is now stale (sketched in code after the next slide).]

Snooping Cache-Coherence Protocols

• Each cache controller "snoops" all bus transactions
  • A transaction is relevant if it is for a block this cache contains
  • Take action to ensure coherence
    • Invalidate
    • Update
    • Supply value to requestor if Owner
  • Actions depend on the state of the block and the protocol
• Main memory controller also snoops on the bus
  • If no cache is the owner, then memory is the owner
• Simultaneous operation of independent controllers
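To make the Step 1-3 problem concrete, here is a minimal sketch (not from the slides) of two private caches with no coherence protocol: both processors load x, one of them (P2 is chosen arbitrarily here as the writer) adds to it and stores it, and the other keeps reading its stale cached copy.

```c
/* Two private, uncoherent one-entry "caches" of location x. */
#include <stdio.h>

static int memory_x = 5;                 /* main memory copy of x */

struct cache { int valid; int value; };  /* trivially small private cache */

static int load_x(struct cache *c) {     /* hit if valid, else fetch from memory */
    if (!c->valid) { c->value = memory_x; c->valid = 1; }
    return c->value;
}

static void store_x(struct cache *c, int v) {  /* no invalidations are sent */
    c->value = v;
    c->valid = 1;
    memory_x = v;                        /* even writing through doesn't fix P1's copy */
}

int main(void) {
    struct cache p1 = {0, 0}, p2 = {0, 0};

    printf("Step 1: P1 loads x = %d\n", load_x(&p1));
    printf("Step 2: P2 loads x = %d\n", load_x(&p2));
    store_x(&p2, load_x(&p2) + 3);       /* Step 3: P2 adds and stores x */
    printf("Step 3: P2 stored x = %d\n", p2.value);
    printf("P1 reloads x = %d  <-- stale without coherence!\n", load_x(&p1));
    return 0;
}
```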

Simple 2-State Invalidate Snooping Protocol

• 2-state protocol
  • Write-through, no-write-allocate cache
  • + Simple hardware and protocol
  • – Uses lots of bandwidth (every write goes on the bus!)
• Proc actions: Load, Store
• Bus actions: GETS, GETX
• [State diagram: Valid and Invalid, with transitions labeled Load / --, Store / OwnGETX, Load / OwnGETS, OtherGETS / --, OtherGETX / --]
• Notation: observed event / action taken (a code sketch of this protocol follows this slide)

A 3-State Write-Back Invalidation Protocol

• 3-state protocol (MSI)
  • Modified
    • One cache exclusively has the valid (modified) copy → Owner
    • Memory is stale
  • Shared
    • >= 1 cache and memory have a valid copy (memory = owner)
  • Invalid
    • Only memory has a valid copy, and memory is the owner
• Must invalidate all other copies before entering the Modified state
  • Requires a bus transaction (order and invalidate)
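As a sketch (not the slides' exact figure), the 2-state write-through, no-write-allocate protocol can be written as one cache controller's next-state function, using the slide's "observed event / action taken" notation.

```c
/* 2-state (Valid/Invalid) write-through, no-write-allocate snooping protocol. */
#include <stdio.h>

typedef enum { INVALID, VALID } State;
typedef enum { LOAD, STORE, OTHER_GETS, OTHER_GETX } Event;

/* Returns next state; *bus_action is "OwnGETS", "OwnGETX", or "--". */
static State next_state(State s, Event e, const char **bus_action) {
    *bus_action = "--";
    switch (e) {
    case LOAD:
        if (s == INVALID) { *bus_action = "OwnGETS"; return VALID; }
        return VALID;                       /* hit: stay Valid */
    case STORE:
        *bus_action = "OwnGETX";            /* every write goes on the bus */
        return s;                           /* no-write-allocate: state unchanged */
    case OTHER_GETS:
        return s;                           /* another core reading doesn't hurt us */
    case OTHER_GETX:
        return INVALID;                     /* another core is writing: invalidate */
    }
    return s;
}

int main(void) {
    const char *st_names[] = { "Invalid", "Valid" };
    const char *ev_names[] = { "Load", "Store", "OtherGETS", "OtherGETX" };
    const char *act;
    State s = INVALID;

    Event trace[] = { LOAD, STORE, OTHER_GETX, LOAD };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        State ns = next_state(s, trace[i], &act);
        printf("%s --(%s / %s)--> %s\n", st_names[s], ev_names[trace[i]], act, st_names[ns]);
        s = ns;
    }
    return 0;
}
```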


MSI Processor and Bus Actions

• Processor:
  • Load
  • Store
  • Writeback on replacement of a modified block
• Bus:
  • GetShared (GETS): get without intent to modify; data could come from memory or another cache
  • GetExclusive (GETX): get with intent to modify; must invalidate all other caches' copies
  • PutExclusive (PUTX): cache controller puts contents on the bus, and memory is updated
• Definition: a cache-to-cache transfer occurs when another cache satisfies a GETS or GETX request
• Let's draw it!

MSI State Diagram

• [State diagram: M, S, and I, with transitions labeled Load / --, Store / --, Store / OwnGETX, Load / OwnGETS, Writeback / OwnPUTX, Writeback / --, plus transitions taken on observed OtherGETS and OtherGETX]
• Note: we never take any action on an OtherPUTX (a code version of these transitions follows this slide)
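Here is a sketch of one cache block's MSI controller, matching the transitions described on these slides (real implementations differ in details such as exactly who supplies data): given the current state and an observed event, it returns the next state and the bus action taken.

```c
/* MSI next-state function for a single cache block. */
#include <stdio.h>

typedef enum { I, S, M } State;
typedef enum { LOAD, STORE, REPLACE, OTHER_GETS, OTHER_GETX } Event;

static State msi_next(State s, Event e, const char **action) {
    *action = "--";
    switch (e) {
    case LOAD:                                   /* processor load */
        if (s == I) { *action = "OwnGETS"; return S; }
        return s;                                /* S or M: hit */
    case STORE:                                  /* processor store */
        if (s == M) return M;                    /* already exclusive owner */
        *action = "OwnGETX";                     /* invalidate all other copies */
        return M;
    case REPLACE:                                /* eviction of the block */
        if (s == M) *action = "OwnPUTX";         /* write modified data back */
        return I;
    case OTHER_GETS:                             /* another cache reads */
        if (s == M) { *action = "supply data"; return S; }  /* owner responds */
        return s;
    case OTHER_GETX:                             /* another cache wants to write */
        if (s == M) *action = "supply data";
        return I;                                /* M or S -> I; I stays I */
    }
    return s;
}

int main(void) {
    const char *st_names[] = { "I", "S", "M" };
    const char *ev_names[] = { "Load", "Store", "Replace", "OtherGETS", "OtherGETX" };
    const char *act;
    /* Replay part of the example on the next slide from P3's point of view:
       P3 load u, P3 store u, then P1's load of u seen as OtherGETS. */
    Event trace[] = { LOAD, STORE, OTHER_GETS };
    State s = I;
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        State ns = msi_next(s, trace[i], &act);
        printf("%s --(%s / %s)--> %s\n", st_names[s], ev_names[trace[i]], act, st_names[ns]);
        s = ns;
    }
    return 0;
}
```

The last transition (M to S on an OtherGETS, with a cache-to-cache transfer) is exactly the "Why Modified to Shared in line 4?" question on the next slide.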

An MSI Protocol Example

  Proc Action     P1 state   P2 state   P3 state   Bus Act   Data from
  initially       I          I          I
  1. P1 load u    I -> S     I          I          GETS      Memory
  2. P3 load u    S          I          I -> S     GETS      Memory
  3. P3 store u   S -> I     I          S -> M     GETX      Memory or P1 (?)
  4. P1 load u    I -> S     I          M -> S     GETS      P3's cache
  5. P2 load u    S          I -> S     S          GETS      Memory

• Single-writer, multiple-reader protocol
• Why Modified to Shared in line 4?
• What if the block is not in any cache? Memory responds
• Read then write produces 2 bus transactions
  • Slow and wasteful of bandwidth for a common sequence of actions

Multicore and Multithreaded Processors

• Why multicore?
• Thread-level parallelism
• Multithreaded cores
• Multiprocessors
• Design issues
• Examples


Some Real-World Multicores

• Intel/AMD 2/4/8-core chips
  • Pretty standard
• Tilera Tile64
• Sun's Niagara (UltraSPARC T1-T3)
  • 4-16 simple, in-order, multithreaded cores
• [D.O.A.] Sun's Rock processor: 16 cores
• Cell Broadband Engine: in PlayStation 3
• Intel's Larrabee: 80 simple cores in a ring
• Cisco CRS-1 processor: 188 in-order cores
• Graphics processing units (GPUs): hundreds of "cores"
