Parallel Processors and Cache Coherency

Intro Cache Coherency Notes

Computer Architecture II
Syed Asad Alam
School of Computer Science and Statistics

Introduction

Flynn’s Taxonomy
Introduced by M. J. Flynn
- SISD → Single processor with a single instruction stream and a single memory (uniprocessors)
- SIMD → Single machine instruction with multiple processing elements operating on independent sets of data
- MISD → Multiple processors, each of which executes a different instruction sequence on a single sequence of data
- MIMD → Multiple processors execute different instruction sequences on different data sets
Text sourced from: Computer Organization and Architecture: Designing for Performance, William Stallings, Chapters 17 & 18
[Two figures: Flynn’s Taxonomy. Source: W. Stallings]

Multi-Computer/Cluster – Loosely Coupled
[Figure: Cluster Computer Architecture. Source: W. Stallings]

Example: IITAC Cluster [in Lloyd building]
- 346 x IBM e326 compute nodes, each with 2 x 2.4GHz 64-bit AMD Opteron 250 CPUs, 4GB RAM, ... 80GB SATA scratch disk, 2 x Broadcom BCM5704 Gbit Ethernet and PCI-X 10Gb InfiniBand connect
- 2 x Voltaire 288-port InfiniBand switches + 512-port Gbit switch
- Debian [Linux]
- Theoretical peak performance 3.4 TFlops [Linpack 2.7 TFlops]
- Ranked 345th most powerful supercomputer [in 2006]
Other examples → A distributed system, cloud, ...

Multiprocessor – Tightly Coupled
- CPUs physically share memory and I/O
- Cache coherency can be handled by the CPU
- Inter-processor communication via shared memory
- Symmetric multiprocessing possible [i.e. single shared copy of operating system]
- Critical sections protected by locks/semaphores
- Processes can easily migrate between CPUs

Symmetric Multiprocessors
Key characteristics
- There are two or more similar processors of comparable capability
- These processors share the same main memory and I/O facilities and are interconnected by a bus or other internal connection scheme, such that memory access time is approximately the same for each processor
- All processors share access to I/O devices
- All processors can perform the same functions (hence the term symmetric)
- The system is controlled by an integrated operating system that provides interaction between processors and their programs at the job, task, file, and data element levels

Example – IBM z990 Multiprocessor
Example – Intel i7
Example – ARM11 MPCore Processor

Cache Coherency

Multiprocessor Cache Coherency
- What is the problem? → Guarantee that a CPU always reads the up-to-date value
- Why is this even a problem? → Multiple copies of the same data in different caches

Writing Protocol
- Write back → Line only updated in one cache unless replaced
- Write through → Line and memory updated together; inconsistency unless other caches monitor memory traffic or receive updates

Solutions
Software Solutions
- Compiler based coherence mechanisms
- Analysis of code
- Determine data items unsafe for caching
- Prevent shared data variables from being cached (too conservative)
- Enforce coherence during critical periods and determine safe periods
Hardware Solutions
- Cache coherence protocols
- Dynamic recognition of potential inconsistency conditions
- Two categories
  - Directory protocols
  - Snoopy protocols

Directory Protocols
- Based on a centralized controller
- Collects and maintains information about data in cache in a directory
- To write
  - Processor → Requests exclusive access
  - Controller → Forces other processors to invalidate entries
- To read
  - Requesting processor → Miss notification to controller
  - Modifying processor → Writes modified block to memory
- Drawbacks
  - Central bottleneck
  - Communication overhead

Snoopy Protocols
- Distributed control with all cache controllers
- Responsibility on each cache controller to recognize shared data
- Updates → Broadcast to all caches
- Each cache controller → Snoop (observe) broadcast notifications
- Two approaches
  - Write invalidate
  - Write update/broadcast
- Each approach has its benefits
- Adaptive protocols exist
- Pentium 4
  - Write invalidate
  - Four states → Modified, exclusive, shared, invalid
  - Two extra bits in the cache tag
  - Referred to as MESI protocol
- Other protocols → Dragon, Firefly, Berkeley, Illinois, write-once, write through

Write Through (source: J. Jones)
- Write through caches
- Observed bus write → Invalidate entry
- Write hit to dirty line → Fetch data from memory
www.scss.tcd.ie/Jeremy.Jones/VivioJS/caches/writeThroughHelp.htm

Write Once
- Write back caches (but with one write through update)
- Cache lines in one of four states
  - Invalid → Cache line updated elsewhere
  - Valid → Cache(s) ≡ Memory
  - Reserved → Cache line updated once, updated through to memory
  - Dirty → Multiple updates, only one copy
[Figure: Write Once. Source: J. Jones]
- Read miss
  - Dirty in another cache → Copy from cache, update to valid
  - Nowhere → Copy from memory, set to valid
- Write hit
  - Dirty → Proceed locally
  - Reserved → Proceed locally, update to dirty
  - Valid → Write through to memory, update to reserved. Invalid in other caches. No write back on replacement
  - Invalid → Read from other cache, update to reserved, other caches invalid
- Write miss
  - Nowhere → Copy from memory, set to reserved
  - Dirty in another cache → Copy from cache, set to dirty/reserved, other caches invalid
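The write-once transitions listed above can be traced with a small simulation. What follows is a minimal sketch, not the lecture's own model: the names `Bus`, `Cache`, `snoop_read` and `invalidate_others` are hypothetical, each cache line holds a single value, and line replacement is not modelled. Two assumptions go beyond the slides: a write miss always ends in Reserved (the slides say "dirty/reserved"), and a Reserved copy drops to Valid when another cache reads the line, as in the standard write-once protocol.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    VALID = "V"
    RESERVED = "R"
    DIRTY = "D"


class Bus:
    """Hypothetical shared bus: main memory plus every snooping cache."""

    def __init__(self):
        self.memory = {}   # addr -> value
        self.caches = []

    def snoop_read(self, requester, addr):
        """Other caches observe a bus read for addr."""
        for cache in self.caches:
            if cache is requester:
                continue
            st = cache.state(addr)
            if st is State.DIRTY:
                # The only up-to-date copy is written back to memory first.
                self.memory[addr] = cache.lines[addr][1]
            if st in (State.DIRTY, State.RESERVED):
                # Assumption: Reserved is also demoted, since the line is now shared.
                cache.lines[addr][0] = State.VALID

    def invalidate_others(self, requester, addr):
        """Other caches observe a bus write for addr and invalidate their copy."""
        for cache in self.caches:
            if cache is not requester and addr in cache.lines:
                cache.lines[addr][0] = State.INVALID


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}    # addr -> [State, value]
        bus.caches.append(self)

    def state(self, addr):
        return self.lines[addr][0] if addr in self.lines else State.INVALID

    def read(self, addr):
        if self.state(addr) is State.INVALID:
            # Read miss: fetch the line (after any dirty copy is flushed), set Valid.
            self.bus.snoop_read(self, addr)
            self.lines[addr] = [State.VALID, self.bus.memory.get(addr, 0)]
        return self.lines[addr][1]

    def write(self, addr, value):
        if self.state(addr) in (State.RESERVED, State.DIRTY):
            # Second or later write: proceed locally, line becomes (or stays) Dirty.
            self.lines[addr] = [State.DIRTY, value]
        else:
            # Valid hit or miss: fetch, write through once, invalidate the others.
            self.bus.snoop_read(self, addr)
            self.bus.invalidate_others(self, addr)
            self.bus.memory[addr] = value
            self.lines[addr] = [State.RESERVED, value]


# Usage: two caches sharing one address.
bus = Bus()
a, b = Cache(bus), Cache(bus)
bus.memory[0x10] = 1
a.read(0x10)       # a: Valid
b.read(0x10)       # b: Valid
a.write(0x10, 2)   # first write goes through to memory; a: Reserved, b: Invalid
a.write(0x10, 3)   # second write stays local; a: Dirty, memory still holds 2
b.read(0x10)       # a writes back 3 and drops to Valid; b: Valid
```

The single write-through on the first write is what tells the other caches to invalidate; later writes to the same line generate no bus traffic until another cache misses on it.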
