Parallelism and the Memory Hierarchy
Todd C. Mowry & Dave Eckhardt
I. Brief History of Parallel Processing
II. Impact of Parallelism on Memory Protection
III. Cache Coherence
Carnegie Mellon 15-410: Parallel Mem Hierarchies, Todd Mowry & Dave Eckhardt

A Brief History of Parallel Processing
• Initial Focus (starting in 1970’s): “Supercomputers” for Scientific Computing
– C.mmp at CMU (1971): 16 PDP-11 processors
– Cray XMP (circa 1984): 4 vector processors
– Thinking Machines CM-2 (circa 1987): 65,536 1-bit processors + 2048 floating-point co-processors
– Blacklight at the Pittsburgh Supercomputer Center
– SGI UV 1000cc-NUMA (today): 4096 processor cores
• Another Driving Application (starting in early 90’s): Databases
– Sun Enterprise 10000 (circa 1997): 64 UltraSPARC-II processors
– Oracle SuperCluster M6-32 (today): 32 SPARC M2 processors
• Inflection point in 2004: Intel hits the Power Density Wall
Pat Gelsinger, ISSCC 2001
Impact of the Power Density Wall
• The real “Moore’s Law” continues
  – i.e., # of transistors per chip continues to increase exponentially
• But thermal limitations prevent us from scaling up clock rates
  – otherwise the chips would start to melt, given practical heat sink technology
• How can we deliver more performance to individual applications?
  → increasing numbers of cores per chip
• Caveat:
  – in order for a given application to run faster, it must exploit parallelism
Parallel Machines Today
Examples from Apple’s product line:
– iPad Retina: 2 A6X cores (+ 4 graphics cores)
– iPhone 5s: 2 A7 cores
– Mac Pro: 12 Intel Xeon E5 cores
– iMac: 4 Intel Core i5 cores
– MacBook Pro Retina 15”: 4 Intel Core i7 cores
(Images from apple.com)
Example “Multicore” Processor: Intel Core i7
Intel Core i7-980X (6 cores)
• Cores: six 3.33 GHz Nehalem processors (with 2-way “Hyper-Threading”)
• Caches: 64KB L1 (private), 256KB L2 (private), 12MB L3 (shared)
Impact of Parallel Processing on the Kernel (vs. Other Layers)
(System layers: Programmer, Programming Language, Compiler, User-Level Run-Time SW, Kernel, Hardware)
• Kernel itself becomes a parallel program
  – avoid bottlenecks when accessing data structures
    • lock contention, communication, load balancing, etc.
  – use all of the standard parallel programming tricks
• Thread scheduling gets more complicated
  – parallel programmers usually assume:
    • all threads running simultaneously
      – load balancing, avoiding synchronization problems
    • threads don’t move between processors
      – for optimizing communication and cache locality
• Primitives for naming, communicating, and coordinating need to be fast
  – Shared Address Space: virtual memory management across threads
  – Message Passing: low-latency send/receive primitives
Case Study: Protection
• One important role of the OS:
  – provide protection so that buggy processes don’t corrupt other processes
• Shared Address Space:
  – access permissions for virtual memory pages
    • e.g., set pages to read-only during copy-on-write optimization
• Message Passing:
  – ensure that target thread of send() message is a valid recipient
How Do We Propagate Changes to Access Permissions?
• What if parallel machines (and VM management) simply looked like this:
[Figure: CPUs 0 through N, with no caches or TLBs, all directly accessing a shared memory holding the page table; one CPU sets entry f029 from Read-Write to Read-Only, while entry f835 remains Read-Only]
• Updates to the page tables (in shared memory) could be read by other threads
• But this would be a very slow machine!
  – Why?
VM Management in a Multicore Processor
[Figure: CPUs 0 through N, each with its own TLB and private L1 and L2 caches, sharing an L3 cache and a memory holding the page table; one core sets entry f029 from Read-Write to Read-Only]
“TLB Shootdown”: – relevant entries in the TLBs of other processor cores need to be flushed
TLB Shootdown
[Figure: the initiating core locks the page-table entry (now Read-Only) while TLBs 0 through N each hold a stale copy of that entry that must be invalidated]
1. Initiating core triggers OS to lock the corresponding Page Table Entry (PTE)
2. OS generates a list of cores that may be using this PTE (erring conservatively)
3. Initiating core sends an Inter-Processor Interrupt (IPI) to those other cores
   – requesting that they invalidate their corresponding TLB entries
4. Initiating core invalidates local TLB entry; waits for acknowledgements
5. Other cores receive interrupts, execute interrupt handler which invalidates TLBs
   – send an acknowledgement back to the initiating core
6. Once initiating core receives all acknowledgements, it unlocks the PTE
TLB Shootdown Timeline
• Image from: Villavieja et al., “DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory.” In Proceedings of PACT 2011.
Performance of TLB Shootdown
• Image from: Baumann et al., “The Multikernel: A New OS Architecture for Scalable Multicore Systems.” In Proceedings of SOSP ’09.
• Expensive operation
  – e.g., over 10,000 cycles on 8 or more cores
• Gets more expensive with increasing numbers of cores
Now Let’s Consider Consistency Issues with the Caches
[Figure: CPUs 0 through N, each with a TLB and private L1 and L2 caches, sharing an L3 cache and main memory]
Carnegie Mellon 15-410: Intro to Parallelism 15 Todd Mowry & Dave Eckhardt Caches in a Single-Processor Machine (Review from 213)
• Ideally, memory would be arbitrarily fast, large, and cheap
  – unfortunately, you can’t have all three (e.g., fast → small, large → slow)
  – cache hierarchies are a hybrid approach
    • if all goes well, they will behave as though they are both fast and large
• Cache hierarchies work due to locality
  – temporal locality → even relatively small caches may have high hit rates
  – spatial locality → move data in blocks (e.g., 64 bytes)
• Locating the data:
  – Main memory: directly addressed
    • we know exactly where to look: at the unique location corresponding to the address
  – Cache: may or may not be somewhere in the cache
    • need tags to identify the data
    • may need to check multiple locations, depending on the degree of associativity
Cache Read
• Locate set
• Check if any line in the set has a matching tag
• Yes + line valid: hit
• Locate data starting at the block offset
Address of word: t tag bits | s set-index bits | b block-offset bits. S = 2^s sets, E = 2^e lines per set, B = 2^b bytes per cache block (the data). Each line holds a valid bit, a tag, and the data block; the requested word begins at the block offset within the line.

Intel Quad Core i7 Cache Hierarchy
• L1 I-cache and D-cache (per core): 32 KB, 8-way, access: 4 cycles
• L2 unified cache (per core): 256 KB, 8-way, access: 11 cycles
• L3 unified cache (shared by all cores): 8 MB, 16-way, access: 30-40 cycles
• Block size: 64 bytes for all caches
Simple Multicore Example: Core 0 Loads X
load r1 ← X  (r1 = 17)
[Figure: Core 0’s load misses in its L1-D, its L2, and the shared L3; X = 17 is retrieved from main memory, filling the caches along the way]
Example Continued: Core 3 Also Does a Load of X
load r2 ← X  (r2 = 17)
[Figure: Core 3’s load misses in its L1-D and L2 but hits in the shared L3, filling its L2 and L1]
Example of constructive sharing:
• Core 3 benefited from Core 0 bringing X into the L3 cache
Fairly straightforward with only loads. But what happens when stores occur?
Review: How Are Stores Handled in Uniprocessor Caches?
store 5 → X
[Figure: the store updates X from 17 to 5 in the L1-D, while the L2, the L3, and main memory still hold X = 17]
What happens here? We need to make sure we don’t have a consistency problem.
Options for Handling Stores in Uniprocessor Caches
Option 1: Write-Through Caches → propagate immediately
store 5 → X
[Figure: the store updates X from 17 to 5 in the L1-D, and the new value is immediately propagated down through the L2, the L3, and main memory]
What are the advantages and disadvantages of this approach?
Option 2: Write-Back Caches
• defer propagation until eviction
• keep track of dirty status in tags
store 5 → X
[Figure: under write-through, the store updates every level down to memory; under write-back, only the L1-D line is updated (17 → 5) and marked Dirty, while the L2, the L3, and memory still hold X = 17]
• Upon eviction, if data is dirty, write it back
• Write-back is more commonly used in practice (due to bandwidth limitations)
Resuming Multicore Example
1. store 5 → X  (Core 0)
2. load r2 ← X  (r2 = 17)  (Core 3)
[Figure: Core 0’s store hits in its L1-D, updating X from 17 to 5 and marking the line Dirty; Core 3’s load hits on the stale X = 17 in its own L1-D; both L2s, the L3, and main memory still hold X = 17]
Hmm… what is supposed to happen in this case? Is it incorrect behavior for r2 to equal 17?
• if not, when would it be?
(Note: core-to-core communication often takes tens of cycles.)
What is Correct Behavior for a Parallel Memory Hierarchy?
• Note: side-effects of writes are only observable when reads occur
  – so we will focus on the values returned by reads
• Intuitive answer:
  – reading a location should return the latest value written (by any thread)
• Hmm… what does “latest” mean exactly?
  – within a thread, it can be defined by program order
  – but what about across threads?
    • the most recent write in physical time?
      – hopefully not, because there is no way that the hardware can pull that off
        » e.g., if it takes >10 cycles to communicate between processors, there is no way that processor 0 can know what processor 1 did 2 clock ticks ago
    • most recent based upon something else?
      – Hmm…
Refining Our Intuition
Thread 0 Thread 1 Thread 2
// write evens to X // write odds to X
… for (i=0; i