Parallelism and the Memory Hierarchy
Todd C. Mowry & Dave Eckhardt

I. Brief History of Parallel Processing

II. Impact of Parallelism on Memory Protection

III. Coherence

A Brief History of Parallel Processing

• Initial Focus (starting in 1970’s): “Supercomputers” for Scientific Computing

C.mmp at CMU (1971): 16 PDP-11 processors
Cray X-MP (circa 1984): 4 vector processors
Thinking Machines CM-2 (circa 1987): 65,536 1-bit processors + 2048 floating-point co-processors

Blacklight at the Pittsburgh Supercomputing Center

SGI UV 1000 cc-NUMA (today): 4096 cores

A Brief History of Parallel Processing

• Initial Focus (starting in 1970’s): “Supercomputers” for Scientific Computing
• Another Driving Application (starting in early 90’s): Databases

Sun Enterprise 10000 (circa 1997): 64 UltraSPARC-II processors
Oracle SuperCluster M6-32 (today): 32 SPARC M6 processors

A Brief History of Parallel Processing

• Initial Focus (starting in 1970’s): “Supercomputers” for Scientific Computing
• Another Driving Application (starting in early 90’s): Databases
• Inflection point in 2004: Intel hits the Power Density Wall

[Chart: projected microprocessor power density; Pat Gelsinger, ISSCC 2001]

Impact of the Power Density Wall

• The real “Moore’s Law” continues
  – i.e., # of transistors per chip continues to increase exponentially
• But thermal limitations prevent us from scaling up clock rates
  – otherwise the chips would start to melt, given practical heat sink technology

• How can we deliver more performance to individual applications?
  → increasing numbers of cores per chip

• Caveat:
  – in order for a given application to run faster, it must exploit parallelism

Parallel Machines Today

Examples from Apple’s product line:

iPad Retina: 2 A6X cores (+ 4 graphics cores)
iPhone 5s: 2 A7 cores
Mac Pro: 12 Intel Xeon E5 cores

iMac: 4 Intel Core i5 cores
MacBook Pro Retina 15": 4 Intel Core i7 cores

(Images from apple.com)

Example “Multicore” Processor: Intel Core i7

Intel Core i7-980X (6 cores)

• Cores: six 3.33 GHz Nehalem processors (with 2-way “Hyper-Threading”)
• Caches: 64KB L1 (private), 256KB L2 (private), 12MB L3 (shared)

Impact of Parallel Processing on the Kernel (vs. Other Layers)

[System layers, top to bottom: Programmer, Programming Language, Compiler, User-Level Run-Time SW, Kernel, Hardware]

• Kernel itself becomes a parallel program
  – avoid bottlenecks when accessing data structures
    • lock contention, communication, load balancing, etc.
  – use all of the standard parallel programming tricks
• Thread scheduling gets more complicated
  – parallel programmers usually assume:
    • all threads running simultaneously
      – load balancing, avoiding synchronization problems
    • threads don’t move between processors
      – for optimizing communication and cache locality
• Primitives for naming, communicating, and coordinating need to be fast
  – Shared Address Space: virtual memory management across threads
  – Message Passing: low-latency send/receive primitives

Case Study: Protection

• One important role of the OS:
  – provide protection so that buggy processes don’t corrupt other processes

• Shared Address Space:
  – access permissions for virtual memory pages
    • e.g., set pages to read-only during copy-on-write optimization (see the sketch below)

• Message Passing:
  – ensure that target thread of send() message is a valid recipient
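As a concrete illustration of the shared-address-space case (the copy-on-write bullet above), here is a minimal user-level sketch of dropping a page’s permissions to read-only with the POSIX mmap()/mprotect() calls. This is our illustration, not part of the lecture; a real kernel manipulates page-table entries directly rather than calling mprotect().

    #include <sys/mman.h>
    #include <unistd.h>

    int demo(void) {
        size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);

        /* Map one anonymous, writable page. */
        char *page = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return -1;

        page[0] = 'x';                      /* writing is allowed here */

        /* Drop write permission, as a COW scheme would before sharing the page. */
        if (mprotect(page, pagesize, PROT_READ) != 0)
            return -1;

        /* A write to page[0] now faults; a COW fault handler would copy the
         * page and restore write permission on the private copy. */

        munmap(page, pagesize);
        return 0;
    }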

How Do We Propagate Changes to Access Permissions?

• What if parallel machines (and VM management) simply looked like this:

[Diagram: CPU 0, CPU 1, CPU 2, …, CPU N all access memory directly (no caches or TLBs).]

[Diagram: a “Set Read-Only” operation updates the page table in memory; one entry changes from Read-Write to Read-Only (frame f029), another is already Read-Only (frame f835).]

• Updates to the page tables (in shared memory) could be read by other threads
• But this would be a very slow machine!
  – Why?

VM Management in a Multicore Processor

[Diagram: CPUs 0..N, each with its own TLB, L1 cache, and L2 cache; all share an L3 cache and main memory, which holds the page table. A “Set Read-Only” operation changes one page table entry (frame f029) from Read-Write to Read-Only; frame f835 is already Read-Only.]

“TLB Shootdown”:
  – relevant entries in the TLBs of other processor cores need to be flushed

TLB Shootdown

[Diagram: CPUs 0..N with TLBs 0..N; the page table entry being changed (Read-Write → Read-Only, frame f029) is locked, and cores 0, 1, 2, …, N may be caching it in their TLBs; frame f835 is Read-Only.]

1. Initiating core triggers OS to lock the corresponding Page Table Entry (PTE)
2. OS generates a list of cores that may be using this PTE (erring conservatively)
3. Initiating core sends an Inter-Processor Interrupt (IPI) to those other cores
   – requesting that they invalidate their corresponding TLB entries
4. Initiating core invalidates local TLB entry; waits for acknowledgements
5. Other cores receive interrupts, execute interrupt handler which invalidates TLBs
   – send an acknowledgement back to the initiating core
6. Once initiating core receives all acknowledgements, it unlocks the PTE
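The six steps above, sketched as code from the initiating core’s point of view. Every type and helper name here (pte_t, cores_possibly_using(), send_ipi(), and so on) is a hypothetical stand-in chosen for illustration; it is not the interface of any particular kernel.

    /* Hypothetical helpers; declarations only, to keep the sketch self-contained. */
    typedef struct pte pte_t;
    typedef unsigned long cpumask_t;
    #define NUM_CPUS 8

    extern int       this_cpu(void);
    extern void      lock_pte(pte_t *pte);
    extern void      unlock_pte(pte_t *pte);
    extern cpumask_t cores_possibly_using(pte_t *pte);         /* conservative list */
    extern void      send_ipi(int cpu, void *vaddr);           /* Inter-Processor Interrupt */
    extern void      invalidate_local_tlb_entry(void *vaddr);  /* e.g., invlpg on x86 */
    extern int       ack_received_from(int cpu);
    extern void      send_ack(int cpu);
    extern int       initiating_core(void);

    static int in_mask(cpumask_t m, int cpu) { return (int)((m >> cpu) & 1UL); }

    void tlb_shootdown(pte_t *pte, void *vaddr)
    {
        lock_pte(pte);                                     /* step 1 */
        cpumask_t targets = cores_possibly_using(pte);     /* step 2 */

        for (int cpu = 0; cpu < NUM_CPUS; cpu++)           /* step 3: send IPIs */
            if (cpu != this_cpu() && in_mask(targets, cpu))
                send_ipi(cpu, vaddr);

        invalidate_local_tlb_entry(vaddr);                 /* step 4 */

        for (int cpu = 0; cpu < NUM_CPUS; cpu++)           /* step 4: wait for acks */
            if (cpu != this_cpu() && in_mask(targets, cpu))
                while (!ack_received_from(cpu))
                    ;                                      /* spin */

        unlock_pte(pte);                                   /* step 6 */
    }

    /* Step 5: runs in the IPI handler on each target core. */
    void handle_tlb_invalidate_ipi(void *vaddr)
    {
        invalidate_local_tlb_entry(vaddr);
        send_ack(initiating_core());
    }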

TLB Shootdown Timeline

• Image from: Villavieja et al., “DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory.” In Proceedings of PACT 2011.

Performance of TLB Shootdown

• Expensive operation
  – e.g., over 10,000 cycles on 8 or more cores
• Gets more expensive with increasing numbers of cores
Image from: Baumann et al., “The Multikernel: A New OS Architecture for Scalable Multicore Systems.” In Proceedings of SOSP ’09.

Now Let’s Consider Consistency Issues with the Caches

[Diagram: CPUs 0..N, each with its own TLB, L1 cache, and L2 cache; all share an L3 cache and main memory.]

Caches in a Single-Processor Machine (Review from 213)

• Ideally, memory would be arbitrarily fast, large, and cheap
  – unfortunately, you can’t have all three (e.g., fast → small, large → slow)
  – cache hierarchies are a hybrid approach
    • if all goes well, they will behave as though they are both fast and large
• Cache hierarchies work due to locality
  – temporal locality → even relatively small caches may have high hit rates
  – spatial locality → move data in blocks (e.g., 64 bytes)
• Locating the data:
  – Main memory: directly addressed
    • we know exactly where to look: at the unique location corresponding to the address
  – Cache: may or may not be somewhere in the cache
    • need tags to identify the data
    • may need to check multiple locations, depending on the degree of associativity

Cache Read

• Locate set (S = 2^s sets)
• Check if any line in the set has a matching tag (E = 2^e lines per set)
• Yes + line valid: hit
• Locate data starting at offset

Address of word: t bits (tag) | s bits (set index) | b bits (block offset)
[Diagram: each cache line holds a valid bit, a tag, and B = 2^b bytes of data (bytes 0, 1, 2, …, B-1); the data begins at the block offset.]
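To make the t/s/b split concrete, here is a small self-contained sketch that pulls the tag, set index, and block offset out of an address. The parameter values (64-byte blocks, 64 sets) and the address are examples we chose, not the lecture’s.

    #include <stdint.h>
    #include <stdio.h>

    #define B_BITS 6   /* b: 2^6 = 64-byte blocks (example value) */
    #define S_BITS 6   /* s: 2^6 = 64 sets        (example value) */

    int main(void) {
        uint64_t addr   = 0x7f1234567890ULL;                 /* arbitrary example */
        uint64_t offset = addr & ((1ULL << B_BITS) - 1);
        uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
        uint64_t tag    = addr >> (B_BITS + S_BITS);

        printf("offset=%llu set=%llu tag=0x%llx\n",
               (unsigned long long)offset,
               (unsigned long long)set,
               (unsigned long long)tag);
        return 0;
    }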

Intel Quad Core i7 Cache Hierarchy

[Diagram: processor package with Core 0 … Core 3; each core has registers, a private L1 I-cache and D-cache, and a private L2 unified cache; all cores share an L3 unified cache in front of main memory.]
• L1 I-cache and D-cache: 32 KB, 8-way, access: 4 cycles
• L2 unified cache: 256 KB, 8-way, access: 11 cycles
• L3 unified cache: 8 MB, 16-way, access: 30-40 cycles (shared by all cores)
• Block size: 64 bytes for all caches.

Simple Multicore Example: Core 0 Loads X

load r1 ← X   (r1 = 17)

[Diagram: Core 0’s load misses in its L1-D, its L2, and the shared L3; X (= 17) is retrieved from main memory and fills the caches on the way back.]

Example Continued: Core 3 Also Does a Load of X

load r2 ← X   (r2 = 17)

[Diagram: Core 3’s load misses in its L1-D and L2 but hits in the shared L3, filling its L2 and L1; main memory still holds X = 17.]
• Example of constructive sharing: Core 3 benefited from Core 0 bringing X into the L3 cache
• Fairly straightforward with only loads. But what happens when stores occur?

Review: How Are Stores Handled in Uniprocessor Caches?

store 5 → X

[Diagram: the store changes X from 17 to 5 in the L1-D, but the L2, L3, and main memory still hold X = 17. What happens here? We need to make sure we don’t have a consistency problem.]

Options for Handling Stores in Uniprocessor Caches

Option 1: Write-Through Caches → propagate immediately

store 5 → X
[Diagram: the store changes X from 17 to 5 in the L1-D, and the new value is immediately propagated to the L2, the L3, and main memory.]
• What are the advantages and disadvantages of this approach?

Options for Handling Stores in Uniprocessor Caches

• Option 1: Write-Through Caches
  – propagate immediately
• Option 2: Write-Back Caches
  – defer propagation until eviction
  – keep track of dirty status in tags

store 5 → X
[Diagram: with write-through, the new value (5) is propagated immediately through the L2, L3, and main memory; with write-back, only the L1-D holds X = 5 (marked Dirty) while the L2, L3, and main memory still hold X = 17.]
• Upon eviction, if data is dirty, write it back.
• Write-back is more commonly used in practice (due to bandwidth limitations).
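A minimal sketch of the write-back bookkeeping just described, for one cache line. The structure and function names are ours, chosen only to illustrate the dirty bit and the write-back-on-eviction rule.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 64

    struct cache_line {
        bool     valid;
        bool     dirty;                   /* "dirty status in tags" */
        uint64_t tag;
        uint8_t  data[BLOCK_SIZE];
    };

    /* Write-back: update only the cached copy and mark it dirty. */
    void write_byte(struct cache_line *line, int offset, uint8_t value)
    {
        line->data[offset] = value;
        line->dirty = true;               /* memory is now stale */
    }

    /* On eviction, propagate the block to memory only if it is dirty. */
    void evict(struct cache_line *line, uint8_t *memory_block)
    {
        if (line->valid && line->dirty)
            memcpy(memory_block, line->data, BLOCK_SIZE);
        line->valid = false;
        line->dirty = false;
    }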

Resuming Multicore Example

1. store 5 → X   (on Core 0)
2. load r2 ← X   (on Core 3; r2 = 17)

[Diagram: Core 0’s store leaves its L1-D copy of X = 5, marked Dirty; Core 3’s load hits in its own L1-D and still sees X = 17; both L2s, the L3, and main memory also still hold X = 17.]
• Hmm… What is supposed to happen in this case?
• Is it incorrect behavior for r2 to equal 17?
  – if not, when would it be?
• (Note: core-to-core communication often takes tens of cycles.)

What is Correct Behavior for a Parallel Memory Hierarchy?

• Note: side-effects of writes are only observable when reads occur
  – so we will focus on the values returned by reads

• Intuitive answer:
  – reading a location should return the latest value written (by any thread)

• Hmm… what does “latest” mean exactly?
  – within a thread, it can be defined by program order
  – but what about across threads?
    • the most recent write in physical time?
      – hopefully not, because there is no way that the hardware can pull that off
        » e.g., if it takes >10 cycles to communicate between processors, there is no way that processor 0 can know what processor 1 did 2 clock ticks ago
    • most recent based upon something else?
      – Hmm…

Refining Our Intuition

Thread 0:  // write evens to X
           for (i = 0; i < …; i += 2) { X = i; … }
Thread 1:  // write odds to X
           for (j = 1; j < …; j += 2) { X = j; … }
Thread 2:  … A = X; … B = X; … C = X; …

• What would be some clearly illegal combinations of (A,B,C)?
• How about: (4,8,1)?  (9,12,3)?  (7,19,31)?
• What can we generalize from this?
  – writes from any particular thread must be consistent with program order
    • in this example, observed even numbers must be increasing (ditto for odds)
  – across threads: writes must be consistent with a valid interleaving of threads
    • not physical time! (programmer cannot rely upon that)
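One way to mechanize the generalization above (our illustration, not the lecture’s): among the three observed values, the evens must appear in non-decreasing order and so must the odds, since each writer produces increasing values and its writes can never be observed out of program order. A small self-contained check:

    #include <stdbool.h>
    #include <stdio.h>

    /* True if observations (A, B, C), read in that order, are consistent with
     * Thread 0 writing increasing evens and Thread 1 writing increasing odds. */
    bool plausible(const int obs[3])
    {
        int last_even = -1, last_odd = -1;
        for (int i = 0; i < 3; i++) {
            if (obs[i] % 2 == 0) {            /* written by the "evens" thread */
                if (obs[i] < last_even) return false;
                last_even = obs[i];
            } else {                          /* written by the "odds" thread */
                if (obs[i] < last_odd) return false;
                last_odd = obs[i];
            }
        }
        return true;
    }

    int main(void) {
        int ex1[3] = {4, 8, 1}, ex2[3] = {9, 12, 3}, ex3[3] = {7, 19, 31};
        printf("%d %d %d\n", plausible(ex1), plausible(ex2), plausible(ex3));
        /* Prints "1 0 1": (9,12,3) is illegal because the odd 3 follows the odd 9. */
        return 0;
    }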

Visualizing Our Intuition

(Same Thread 0 / Thread 1 / Thread 2 code as above.)

[Diagram: the three threads feed their memory accesses, one at a time, through a single port to memory.]

• Each thread proceeds in program order
• Memory accesses interleaved (one at a time) to a single-ported memory
  – rate of progress of each thread is unpredictable

Correctness Revisited

(Same Thread 0 / Thread 1 / Thread 2 code as above.)

[Diagram: again, accesses are interleaved one at a time through a single port to memory.]

Recall: “reading a location should return the latest value written (by any thread)”
→ “latest” means consistent with some interleaving that matches this model
  – this is a hypothetical interleaving; the machine didn’t necessarily do this!

Two Parts to Memory Hierarchy Correctness

1. “Cache Coherence”
  – do all loads and stores to a given cache block behave correctly?
    • i.e., are they consistent with our interleaving intuition?
    • important: separate cache blocks have independent, unrelated interleavings!

2. “Memory Consistency Model”
  – do all loads and stores, even to separate cache blocks, behave correctly?
    • builds on top of cache coherence
    • especially important for synchronization, causality assumptions, etc.

Cache Coherence (Easy Case)

• One easy case: a physically shared cache
[Diagram: Intel Core i7 with Core 0 … Core 3, each with a private L1-D and L2; X = 17 lives in the L3.]
• The L3 cache is physically shared by the on-chip cores

Cache Coherence (Easy Case)

• One easy case: a physically shared cache

[Diagram: Intel Core i7 with 2-way Hyper-Threading: Thread 0 and Thread 1 both run on Core 0, sharing its L1-D and L2.]
• The L1 cache (plus L2, L3) is physically shared by Thread 0 and Thread 1

Cache Coherence: Beyond the Easy Case

• How do we implement L1 & L2 cache coherence between the cores?

store 5 → X
[Diagram: Core 0 stores 5 to X while Core 0 and Core 3 each hold X = 17 in their private L1-D and L2 caches, and the shared L3 also holds X = 17. What happens to the other copies?]

• Common approaches: update or invalidate protocols

One Approach: Update Protocol

• Basic idea: upon a write, propagate the new value to shared copies in peer caches

store 5 → X
[Diagram: Core 0’s store changes X from 17 to 5 in its own L1-D and L2; an “X = 5” update message changes Core 3’s copies from 17 to 5 as well, and the shared L3 is also updated to 5.]

Another Approach: Invalidate Protocol

• Basic idea: upon a write, delete any shared copies in peer caches

store 5 → X
[Diagram: Core 0’s store changes X from 17 to 5 in its own L1-D and L2; a “Delete X” invalidate message marks Core 3’s copies Invalid, and the shared L3 is updated to 5.]

Update vs. Invalidate

• When is one approach better than the other?
  – (hint: the answer depends upon program behavior)

• Key question:
  – Is a block written by one processor read by others before it is rewritten?

  – if so, then update may win:
    • readers that already had copies will not suffer cache misses
  – if not, then invalidate wins:
    • avoids useless updates (including to dead copies)

• Which one is used in practice?
  – invalidate (due to hardware complexities of update)
    • although some machines have supported both (configurable per-page by the OS)

How Invalidation-Based Cache Coherence Works (Short Version)

• Cache tags contain additional coherence state (“MESI” example below; see the state sketch after this list):
  – Invalid:
    • nothing here (often as the result of receiving an invalidation message)
  – Shared (Clean):
    • matches the value in memory; other processors may have shared copies also
    • I can read this, but cannot write it until I get an exclusive/modified copy
  – Exclusive (Clean):
    • matches the value in memory; I have the only copy
    • I can read or write this (a write causes a transition to the Modified state)
  – Modified (aka Dirty):
    • has been modified, and does not match memory; I have the only copy
    • I can read or write this; I must supply the block if another processor wants to read it

• The hardware keeps track of this automatically
  – using either broadcast (if the interconnect is a bus) or a directory of sharers
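The four states above, boiled down to what happens on this core’s own reads and writes (remote requests also change state, but are omitted here). The enum and function names are ours, for illustration only:

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* State of a block after this core reads it. */
    mesi_t after_local_read(mesi_t s)
    {
        if (s == INVALID)       /* read miss: fetch a copy */
            return SHARED;      /* (or EXCLUSIVE, if no other cache holds it) */
        return s;               /* S, E, M: a read changes nothing */
    }

    /* State of a block after this core writes it. */
    mesi_t after_local_write(mesi_t s)
    {
        (void)s;                /* from I or S we must first invalidate other  */
        return MODIFIED;        /* copies; either way we end up Modified       */
    }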

Performance Impact of Invalidation-Based Cache Coherence

• Invalidations result in a new source of cache misses!

• Recall that uniprocessor cache misses can be categorized as:
  – (i) cold/compulsory misses, (ii) capacity misses, (iii) conflict misses

• Due to the sharing of data, parallel machines also have misses due to:
  – (iv) true sharing misses
    • e.g., Core A reads X → Core B writes X → Core A reads X again (cache miss)
      – nothing surprising here; this is true communication
  – (v) false sharing misses
    • e.g., Core A reads X → Core B writes Y → Core A reads X again (cache miss)
      – What???
      – where X and Y unfortunately fell within the same cache block

Beware of False Sharing!

Thread 0:  while (TRUE) { X += …; … }
Thread 1:  while (TRUE) { Y += …; … }

[Diagram: X and Y fall within the same cache block (“1 block”), which is replicated in Core 0’s and Core 1’s L1-D caches; each write causes invalidations on one core and cache misses on the other.]

• It can result in a devastating ping-pong effect with a very high miss rate
  – plus wasted communication bandwidth
• Pay attention to data layout:
  – the threads above appear to be working independently, but they are not
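A user-level sketch (ours, not the lecture’s) of how the ping-pong effect can be observed: two pthreads increment adjacent counters that almost certainly share one 64-byte block. Padding each counter into its own block, as the next slides do in the kernel, typically makes this loop run several times faster.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000L

    /* Adjacent longs: very likely in the same cache block (false sharing).
     * For comparison, pad each counter to its own block, e.g.
     * struct { volatile long v; char pad[56]; } counters[2];            */
    volatile long counters[2];

    void *worker(void *arg)
    {
        long idx = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counters[idx]++;        /* each write invalidates the other core's copy */
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)0L);
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%ld %ld\n", counters[0], counters[1]);
        return 0;
    }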

How False Sharing Might Occur in the OS

• Operating systems contain lots of counters (to count various types of events)
  – many of these counters are frequently updated, but infrequently read
• Simplest implementation: a centralized counter

    // Number of calls to get_tid
    int get_tid_ctr = 0;

    int get_tid(void) {
        atomic_add(&get_tid_ctr, 1);
        return (running->tid);
    }

    int get_tid_count(void) {
        return (get_tid_ctr);
    }

• Perfectly reasonable on a sequential machine.
• But it performs very poorly on a parallel machine. Why?
  → each update of get_tid_ctr invalidates it from other processors’ caches

“Improved” Implementation: An Array of Counters

• Each processor updates its own counter
• To read the overall count, sum up the counter array

    int get_tid_ctr[NUM_CPUs] = {0};

    int get_tid(void) {
        get_tid_ctr[running->CPU]++;   /* no longer need a lock around this */
        return (running->tid);
    }

    int get_tid_count(void) {
        int cpu, total = 0;
        for (cpu = 0; cpu < NUM_CPUs; cpu++)
            total += get_tid_ctr[cpu];
        return (total);
    }

• Eliminates lock contention, but may still perform very poorly. Why?
  → False sharing!

Faster Implementation: Padded Array of Counters

• Put each private counter in its own cache block.
  – (any downsides to this?)
  – Even better: replace PADDING with other useful per-CPU counters.

    struct {
        int get_tid_ctr;
        int PADDING[INTS_PER_CACHE_BLOCK-1];
    } ctr_array[NUM_CPUs];

    int get_tid(void) {
        ctr_array[running->CPU].get_tid_ctr++;
        return (running->tid);
    }

    int get_tid_count(void) {
        int cpu, total = 0;
        for (cpu = 0; cpu < NUM_CPUs; cpu++)
            total += ctr_array[cpu].get_tid_ctr;
        return (total);
    }
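The lecture leaves INTS_PER_CACHE_BLOCK undefined; one plausible definition, assuming the 64-byte block size used throughout these slides, is sketched below, along with a C11 alternative that asks the compiler for the alignment directly (both are our suggestions, not the lecture’s):

    #define CACHE_BLOCK_SIZE      64                               /* bytes, assumed */
    #define INTS_PER_CACHE_BLOCK  (CACHE_BLOCK_SIZE / sizeof(int)) /* 16 if int is 4 bytes */

    /* C11 alternative: let the compiler place each element on its own block. */
    struct padded_ctr {
        _Alignas(CACHE_BLOCK_SIZE) int get_tid_ctr;
    };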

True Sharing Can Benefit from Spatial Locality

Thread 0:  X = …;  Y = …;  …
Thread 1:  …;  r1 = X;  (cache miss)  r2 = Y;  (cache hit!)

[Diagram: X and Y sit in the same cache block (“1 block”); Thread 1’s miss on X brings both X and Y into Core 1’s L1-D in one miss, so the subsequent read of Y hits.]

• With true sharing, spatial locality can result in a prefetching benefit
• Hence data layout can help or harm sharing-related misses in parallel software
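Two tiny layout sketches (ours, not the lecture’s) that capture both sides of the data-layout point: pack together what is read together, and keep apart what is written independently. The 64-byte block size is assumed.

    /* True sharing: X and Y are produced together and consumed together,
     * so placing them in one struct (one block) means one miss fetches both. */
    struct sample {
        int x;
        int y;
    };

    /* False sharing: per-thread counters are written independently, so give
     * each its own cache block to keep writes from ping-ponging the block. */
    struct per_thread_ctr {
        _Alignas(64) long count;
    };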

Options for Building Even Larger Shared Address Machines

Symmetric Multiprocessor (SMP):

[Diagram: CPUs 0..N connected through an interconnection network to memories 0..N; every memory is equally distant from every CPU.]

Non-Uniform Memory Access (NUMA):

[Diagram: each CPU is paired with its own local memory (CPU 0 + Memory 0, …, CPU N + Memory N), and the CPU/memory pairs are connected by an interconnection network.]
• Tradeoffs?
• NUMA is more commonly used at large scales (e.g., Blacklight at the PSC)

Summary

• Case study: memory protection on a parallel machine
  – TLB shootdown
    • involves Inter-Processor Interrupts to flush TLBs
• Part 1 of Memory Correctness: Cache Coherence
  – reading the “latest” value does not correspond to physical time!
    • corresponds to latest in a hypothetical interleaving of accesses
  – new sources of cache misses due to invalidations:
    • true sharing misses
    • false sharing misses

• Looking ahead: Part 2 of Memory Correctness, plus Scheduling Revisited

Omron Luna88k Workstation

Jan 1 00:28:40 luna88k vmunix: CPU0 is attached
Jan 1 00:28:40 luna88k vmunix: CPU1 is attached
Jan 1 00:28:40 luna88k vmunix: CPU2 is attached
Jan 1 00:28:40 luna88k vmunix: CPU3 is attached
Jan 1 00:28:40 luna88k vmunix: LUNA boot: memory from 0x14a000 to 0x2000000
Jan 1 00:28:40 luna88k vmunix: Kernel virtual space from 0x00000000 to 0x79ec000.
Jan 1 00:28:40 luna88k vmunix: OMRON LUNA-88K Mach 2.5 Vers 1.12 Thu Feb 28 09:32 1991 (GENERIC)
Jan 1 00:28:40 luna88k vmunix: physical memory = 32.00 megabytes.
Jan 1 00:28:40 luna88k vmunix: available memory = 26.04 megabytes.
Jan 1 00:28:40 luna88k vmunix: using 819 buffers containing 3.19 megabytes of memory
Jan 1 00:28:40 luna88k vmunix: CPU0 is configured as the master and started.

• From circa 1991: ran a parallel Mach kernel (on the desks of lucky CMU CS Ph.D. students) four years before the Pentium/Windows 95 era began.
