Shared Memory? Some Memory System Options

Multiprocessor Microarchitecture Communication Between Cores (Threads) • Many design issues unique to multiprocessors • How should threads communicate with each other? • Interconnection network • Two popular options • Communication between cores • Shared memory • Memory system design • Perform loads and stores to shared addresses • Others? • Requires synchronization (can’t read before write) • Message passing • Send messages between threads (cores) • No shared address space 27 © 2009 Daniel J. Sorin © 2009 Daniel J. Sorin ECE 152 What is (Hardware) Shared Memory? Some Memory System Options P1 Pn • Take multiple microprocessors Switch P1 Pn (Interleaved) First-level $ • Implement a memory system with a single global physical $ $ Bus address space (usually) (Interleaved) Main memory • Communication assist HW does the “magic” of cache coherence Mem I/O devices (a) Shared cache (b) Bus-based shared memory • Goal 1: Minimize memory latency P1 Pn P • Use co-location & caches P1 n $ $ $ $ Mem Mem • Goal 2: Maximize memory bandwidth Interconnection network • Use parallelism & caches Interconnection network Mem Mem (c) Dancehall (d) Distributed-memory © 2009 Daniel J. Sorin 29 © 2009 Daniel J. Sorin 30 Cache Coherence Why Cache Coherent Shared Memory? • According to Webster’s dictionary … • Pluses • Cache : a secure place of storage • For applications - looks like multitasking uniprocessor • Coherent : logically consistent • For OS - only evolutionary extensions required • Easy to do communication without OS • Cache Coherence: keep storage logically consistent • Software can worry about correctness first and then performance • Coherence requires enforcement of 2 properties • Minuses • Proper synchronization is complex • Communication is implicit so may be harder to optimize 1) Write propagation • More work for hardware designers (i.e., us!) • All writes eventually become visible to other processors • Result 2) Write serialization • Symmetric Multiprocessors (SMPs) are the most successful parallel • All processors see writes to same block in same order machines ever • And the first with multi-billion-dollar markets! © 2009 Daniel J. Sorin 31 © 2009 Daniel J. Sorin 32 In More Detail Symmetric Multiprocessors (SMPs) • Efficient naming • Multiple cores • Virtual to physical mapping with TLBs • Easy and efficient caching • Each has a cache (or multiple caches in a hierarchy) • Caching is natural and well-understood • Can be done in HW automatically • Connect with logical bus (totally-ordered broadcast) • Physical bus = set of shared wires • Logical bus = functional equivalent of physical bus • Implement Snooping Cache Coherence Protocol • Broadcast all cache misses on bus • All caches “snoop” bus and may act (e.g., respond with data) • Memory responds otherwise © 2009 Daniel J. Sorin 33 © 2009 Daniel J. Sorin 34 Cache Coherence Problem (Step 1) Cache Coherence Problem (Step 2) P1 P2 ld r2, x P1 P2 ld r2, x ld r2, x Time Time Interconnection Network Interconnection Network x x Main Memory Main Memory © 2009 Daniel J. Sorin 35 © 2009 Daniel J. Sorin 36 Cache Coherence Problem (Step 3) Snooping Cache-Coherence Protocols • Each cache controller “snoops” all bus transactions • Transaction is relevant if it is for a block this cache contains P1 P2 ld r2, x • Take action to ensure coherence • Invalidate ld r2, x • Update add r1, r2, r4 Time st x, r1 • Supply value to requestor if Owner • Actions depend on the state of the block and the protocol Interconnection Network • Main memory controller also snoops on bus • If no cache is owner, then memory is owner x • Simultaneous operation of independent controllers Main Memory © 2009 Daniel J. Sorin 37 © 2009 Daniel J. Sorin 38 Simple 2-State Invalidate Snooping Protocol A 3-State Write-Back Invalidation Protocol Load / -- Store / OwnGETX • Write-through, • 2-State Protocol no-write-allocate + Simple hardware and protocol OtherGETS / -- cache • Uses lots of bandwidth (every write goes on bus!) Valid OtherGETX/ -- • 3-State Protocol (MSI) • Proc actions: • Modified Load, Store • One cache exclusively has valid (modified) copy Owner • Memory is stale Load / OwnGETS • Shared Invalid • Bus actions: Store / OwnGETX • >= 1 cache and memory have valid copy (memory = owner) GETS, GETX • Invalid (only memory has valid copy and memory is owner) OtherGETS / -- OtherGETX / -- • Must invalidate all other copies before entering modified state Notation: observed event / action taken • Requires bus transaction (order and invalidate) © 2009 Daniel J. Sorin 39 © 2009 Daniel J. Sorin 40 MSI Processor and Bus Actions MSI State Diagram • Processor: Load /-- Store / -- • Load • Store M • Writeback on replacement of modified block • Bus -/OtherGETS • GetShared ( GETS ): Get without intent to modify, data could come Store / OwnGETX -/OtherGETX from memory or another cache S • GetExclusive ( GETX) : Get with intent to modify, must invalidate all Store / OwnGETX Writeback / OwnPUTX other caches’ copies OtherBusRdX / -- • PutExclusive ( PUTX ): cache controller puts contents on bus and Writeback / -- Load / OwnGETS Load / -- memory is updated -/OtherGETS I • Definition: cache-to-cache transfer occurs when another cache satisfies GETS or GETX request • Let’s draw it! Note: we never take any action on an OtherPUTX © 2009 Daniel J. Sorin 41 © 2009 Daniel J. Sorin 42 An MSI Protocol Example Multicore and Multithreaded Processors Proc Action P1 State P2 state P3 state Bus Act Data from • Why multicore? initially I I I • Thread-level parallelism 1. P1 load u I S I I GETS Memory 2. P3 load u S I I S GETS Memory • Multithreaded cores 3. P3 store u S I I S M GETX Memory or P1 (?) • Multiprocessors 4. P1 load u I S I M S GETS P3’s cache • Design issues 5. P2 load u S I S S GETS Memory • Examples • Single writer, multiple reader protocol • Why Modified to Shared in line 4? • What if not in any cache? Memory responds • Read then Write produces 2 bus transactions • Slow and wasteful of bandwidth for a common sequence of actions © 2009 Daniel J. Sorin 43 © 2009 Daniel J. Sorin 44 Some Real-World Multicores • Intel/AMD 2/4/8-core chips • Pretty standard • Sun’s Niagara (UltraSPARC T1-T3) • 4-16 simple, in-order, multithreaded cores • [D.O.A] Sun’s Rock processor: 16 cores • Cell Broadband Engine: in PlayStation 3 • Intel’s Larrabee: 80 simple x86 cores in a ring • Cisco CRS-1 Processor: 188 in-order cores • Graphics processing units (GPUs): hundreds of “cores” © 2009 Daniel J. Sorin ECE 152.

Shared Memory? Some Memory System Options

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support