
Multicore and Multithreaded Processors (outline)
• Why multicore?
• Thread-level parallelism
• Multithreaded cores
• Multiprocessors
• Design issues
• Examples

Multithreaded Cores
• So far, our core executes one thread at a time
• Multithreaded core: execute multiple threads at a time
• Old idea … but made a big comeback fairly recently
• How do we execute multiple threads on same core?
  – Coarse-grain switching
  – Fine-grain switching
  – Simultaneous multithreading (SMT), "hyperthreading" (Intel)
• Benefits?
  – Better instruction throughput
  – Greater resource utilization
  – Tolerates long-latency events (e.g., cache misses)
  – Cheaper than multiple complete cores (does this matter any more?)

Multicore and Multithreaded Processors (outline)
• Why multicore?
• Thread-level parallelism
• Multithreaded cores
• Multiprocessors
• Design issues
• Examples

Multiprocessors
• Multiprocessors have been around a long time … just not on a single chip
  – Mainframes and servers with 2-64 processors
  – Supercomputers with 100s or 1000s of processors
• Now, multiprocessor on a single chip
  – "Chip multiprocessor" (CMP) or "multicore processor"
• Why does "single chip" matter so much?
• ICQ: What's fundamentally different about having a multiprocessor that fits on one chip vs. on multiple chips?

Multiprocessor Microarchitecture
• Many design issues unique to multiprocessors
  – Interconnection network
  – Communication between cores
  – Memory system design
  – Others?

Interconnection Networks
• Networks have many design aspects; we focus on one here (topology) – see ECE 259 for more on this
• Topology is the structure of the interconnect
  – Geometric property, so topology has nice mathematical properties
• Direct vs. indirect networks
  – Direct: all switches attached to host nodes (e.g., mesh)
  – Indirect: many switches not attached to host nodes (e.g., tree)

Direct Topologies: k-ary d-cubes
• Often called k-ary n-cubes
• General class of regular, direct topologies
  – Subsumes rings, tori, cubes, etc.
• d dimensions
  – 1 for ring
  – 2 for mesh or torus
  – 3 for cube
  – Can choose arbitrarily large d, except for cost of switches
• k switches in each dimension
  – Note: k can be different in each dimension (e.g., 2,3,4-ary 3-cube)

Examples of k-ary d-cubes
• 1D ring = k-ary 1-cube
  – d = 1 [always]
  – k = N [always] = 4 [here]
  – Ave dist = ?
• 2D torus = k-ary 2-cube
  – d = 2 [always]
  – k = N^(1/d) [always] = 3 [here]
  – Ave dist = ? (a brute-force calculation sketch appears after the next two slides)

k-ary d-cubes in the Real World
• Compaq Alpha 21364 (and 21464, R.I.P.)
  – 2D torus (k-ary 2-cube)
• Cray T3D and T3E
  – 3D torus (k-ary 3-cube)
• Intel Larrabee (Knight's Ferry/Corner/Landing)
  – 1D ring
• Tilera64
  – 2D torus

Indirect Topologies
• Indirect topology – most switches not attached to nodes
• Some common indirect topologies
  – Crossbar
  – Tree
  – Butterfly
• Each of the above topologies comes in many flavors
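To answer the "Ave dist = ?" questions above for any k-ary d-cube torus, a small brute-force program can enumerate every source/destination pair and average the hop counts. This is a minimal sketch, not from the slides; it assumes bidirectional links with wrap-around in every dimension, and the helper name ringDist is mine.

// Minimal sketch (not from the slides): average hop distance in a k-ary d-cube
// torus, computed by enumerating all source/destination pairs.
// Assumes bidirectional links with wrap-around in every dimension.
#include <algorithm>
#include <cstdio>
#include <cstdlib>

// Hop distance between positions a and b on a k-node ring.
static int ringDist(int a, int b, int k) {
    int diff = std::abs(a - b);
    return std::min(diff, k - diff);
}

int main(int argc, char** argv) {
    int k = (argc > 1) ? std::atoi(argv[1]) : 3;   // switches per dimension
    int d = (argc > 2) ? std::atoi(argv[2]) : 2;   // number of dimensions
    long long n = 1;
    for (int i = 0; i < d; ++i) n *= k;            // N = k^d nodes

    // A node's coordinates are its base-k digits; the hop count between two
    // nodes is the sum of per-dimension ring distances.
    long long totalHops = 0, pairs = 0;
    for (long long src = 0; src < n; ++src) {
        for (long long dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            long long s = src, t = dst;
            int hops = 0;
            for (int dim = 0; dim < d; ++dim) {
                hops += ringDist(static_cast<int>(s % k), static_cast<int>(t % k), k);
                s /= k;
                t /= k;
            }
            totalHops += hops;
            ++pairs;
        }
    }
    std::printf("k=%d d=%d N=%lld avg distance=%.3f\n",
                k, d, n, static_cast<double>(totalHops) / pairs);
    return 0;
}

Running it with arguments 4 1 (the 4-node ring) or 3 2 (the 3x3 torus) computes the average distances asked for on the examples slide.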
Indirect Topologies: Crossbar
• Crossbar = single switch that directly connects n inputs to m outputs
• Logically equivalent to m n:1 muxes
• Very useful component that is used frequently
• [Figure: crossbar with inputs in0–in3 and outputs out0–out4]

Indirect Topologies: Trees
• Indirect topology – most switches not attached to nodes
• Tree: send message up from leaf to closest common ancestor, then down to recipient
• N host nodes at leaves
• k = branching factor of tree (k=2 for a binary tree)
• d = height of tree = log_k N

Indirect Topologies: Fat Trees
• Problem with trees: too much contention at or near the root
• Fat tree: same as tree, but with more bandwidth near the root (by adding multiple roots and higher-order switches)
• [Figure: CM-5 "thinned" fat tree]

Indirect Topologies: Butterflies
• Multistage: nodes at ends, switches in middle
• Exactly one path between each pair of nodes
• Each node sees a tree rooted at itself

Indirect Topologies: More Butterflies
• In general, called k-ary n-flies
  – n stages of radix-k switches
• Have many nice features, esp. logarithmic distances
• But conflicts cause tree saturation
  – How can we spread the traffic more evenly?
• Benes (pronounced "BEN-ish") network
  – Routes all permutations w/o conflict
  – [Figure: N nodes, a reversed butterfly, then a butterfly, then N nodes]
  – Notice similarity to fat tree (fold in half)
  – Randomization is major breakthrough

Indirect Networks in the Real World
• Thinking Machines CM-5 (old … older than you)
  – Fat tree
• Sun UltraEnterprise E10000 (about as old as you)
  – 4 trees (interleaved by address)

Multiprocessor Microarchitecture
• Many design issues unique to multiprocessors
  – Interconnection network
  – Communication between cores
  – Memory system design
  – Others?

Communication Between Cores (Threads)
• How should threads communicate with each other?
• Two popular options (a shared-memory sketch follows the next two slides)
  – Shared memory
    · Perform loads and stores to shared addresses
    · Requires synchronization (can't read before write)
  – Message passing
    · Send messages between threads (cores)
    · No shared address space

What is (Hardware) Shared Memory?
• Take multiple microprocessors
• Implement a memory system with a single global physical address space (usually)
• Communication assist HW does the "magic" of cache coherence
• Goal 1: minimize memory latency
  – Use co-location & caches
• Goal 2: maximize memory bandwidth
  – Use parallelism & caches

Some Memory System Options
• [Figure: four organizations of processors P1…Pn, caches ($), and memory]
  – (a) Shared cache: processors share an interleaved first-level cache and interleaved main memory through a switch
  – (b) Bus-based shared memory: per-processor caches on a shared bus with main memory and I/O devices
  – (c) Dancehall: per-processor caches connected through an interconnection network to memory banks
  – (d) Distributed-memory: each processor has its own cache and local memory; nodes connected by an interconnection network
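As a concrete illustration of the shared-memory option from the "Communication Between Cores (Threads)" slide, the following minimal sketch (not from the slides) has two threads communicate through ordinary loads and stores to a shared variable; an atomic flag provides the synchronization that keeps the consumer from reading before the producer writes. The names producer, consumer, and sharedData are mine.

// Minimal sketch (not from the slides): shared-memory communication between
// two threads. The consumer must not read sharedData before the producer
// writes it, so an atomic flag provides the required synchronization.
#include <atomic>
#include <cstdio>
#include <thread>

int sharedData = 0;                  // ordinary shared address
std::atomic<bool> ready{false};      // synchronization flag

void producer() {
    sharedData = 42;                                    // store to the shared address
    ready.store(true, std::memory_order_release);       // publish the write
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // spin until the write is visible
    std::printf("consumer read %d\n", sharedData);      // load now sees 42
}

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
    return 0;
}

A message-passing version would instead send the value as an explicit message between threads or cores, with no shared address space.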
Cache Coherence
• According to Webster's dictionary …
  – Cache: a secure place of storage
  – Coherent: logically consistent
• Cache coherence: keep storage logically consistent
• Coherence requires enforcement of 2 properties
  1) Write propagation
     – All writes eventually become visible to other processors
  2) Write serialization
     – All processors see writes to same block in same order

Why Cache Coherent Shared Memory?
• Pluses
  – For applications – looks like multitasking uniprocessor
  – For OS – only evolutionary extensions required
  – Easy to do communication without OS
  – Software can worry about correctness first and then performance
• Minuses
  – Proper synchronization is complex
  – Communication is implicit so may be harder to optimize
  – More work for hardware designers (i.e., us!)
• Result
  – Cache coherent shared memory machines are the most successful parallel machines ever

In More Detail
• Efficient naming
  – Virtual to physical mapping with TLBs
• Easy and efficient caching
  – Caching is natural and well-understood
  – Can be done in HW automatically

Symmetric Multiprocessors (SMPs)
• Multiple cores
  – Each has a cache (or multiple caches in a hierarchy)
• Connect with logical bus (totally-ordered broadcast)
  – Physical bus = set of shared wires
  – Logical bus = functional equivalent of physical bus
• Implement snooping cache coherence protocol
  – Broadcast all cache misses on bus
  – All caches "snoop" bus and may act (e.g., respond with data)
  – Memory responds otherwise

Cache Coherence Problem (Steps 1-3)
• [Figure, step 1: one processor executes ld r2, x and pulls x from main memory into its cache over the interconnection network]
• [Figure, step 2: the other processor also executes ld r2, x; both caches now hold copies of x]
• [Figure, step 3: the first processor executes add r1, r2, r4 and st x, r1, modifying x; the other cache's copy of x is now stale]

Snooping Cache-Coherence Protocols
• Each cache controller "snoops" all bus transactions
  – Transaction is relevant if it is for a block this cache contains
  – Take action to ensure coherence
    · Invalidate
    · Update
    · Supply value to requestor if owner
  – Actions depend on the state of the block and the protocol
• Main memory controller also snoops on bus
  – If no cache is owner, then memory is owner
• Simultaneous operation of independent controllers

Simple 2-State Invalidate Snooping Protocol
• Write-through, no-write-allocate cache
• Proc actions: Load, Store
• Bus actions: GETS, GETX
• [State diagram with states Valid and Invalid; notation: observed event / action taken]
  – Invalid: Load / OwnGETS (to Valid); Store / OwnGETX; OtherGETS / --; OtherGETX / --
  – Valid: Load / --; Store / OwnGETX; OtherGETS / --; OtherGETX / -- (to Invalid)

A 3-State Write-Back Invalidation Protocol
• 2-State Protocol
  + Simple hardware and protocol
  – Uses lots of bandwidth (every write goes on bus!)
• 3-State Protocol (MSI)
  – Modified
    · One cache exclusively has valid (modified) copy → owner
    · Memory is stale
  – Shared
    · ≥ 1 cache and memory have valid copy (memory = owner)
  – Invalid (only memory has valid copy and memory is owner)
• Must invalidate all other copies before entering Modified state
  – Requires bus transaction (order and invalidate)

MSI Processor and Bus Actions
• Processor:
  – Load
  – Store
  – Writeback on replacement of modified block
• Bus:
  – GetShared (GETS): get without intent to modify; data could come from …

MSI State Diagram
• [State diagram, cut off in this extract: states M, S, I with transitions labeled observed event / action taken, e.g., M: Load / --, Store / --; a code sketch of the MSI transitions follows below]
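The MSI behavior described above can be summarized as a per-block state machine. The following is a minimal sketch, not from the slides: it keeps the slides' state and bus-action names (Modified/Shared/Invalid, GETS, GETX), but the struct and method names are mine, and data movement, replacement, and writebacks are omitted.

// Minimal sketch (not from the slides): MSI state transitions for one block
// in one cache controller. Processor-side calls return which bus transaction,
// if any, this controller must broadcast; snooped transactions from other
// caches arrive via otherGETS()/otherGETX().
#include <cstdio>

enum class State { Invalid, Shared, Modified };
enum class BusMsg { None, GETS, GETX };

struct Block {
    State state = State::Invalid;

    // Processor-side events.
    BusMsg load() {
        if (state == State::Invalid) { state = State::Shared; return BusMsg::GETS; }
        return BusMsg::None;                  // hit in Shared or Modified
    }
    BusMsg store() {
        if (state == State::Modified) return BusMsg::None;    // already the owner
        state = State::Modified;              // must hold the only valid copy
        return BusMsg::GETX;                  // bus transaction orders and invalidates others
    }

    // Bus-side (snooped) events from other caches.
    void otherGETS() {
        if (state == State::Modified) state = State::Shared;  // supply data, demote to Shared
    }
    void otherGETX() {
        state = State::Invalid;               // another cache takes exclusive ownership
    }
};

int main() {
    Block b;
    b.load();        // Invalid -> Shared, broadcasts GETS
    b.store();       // Shared  -> Modified, broadcasts GETX
    b.otherGETS();   // snooped read:  Modified -> Shared
    b.otherGETX();   // snooped write: Shared   -> Invalid
    std::printf("final state = %d\n", static_cast<int>(b.state));  // 0 = Invalid
    return 0;
}

A real controller would also supply data when it is the owner and write a Modified block back on replacement, as the slides note.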