EEL 5764 Graduate Computer Architecture Chapter 4

Chapter 4 - Multiprocessors and TLP
Ann Gordon-Ross
Electrical and Computer Engineering, University of Florida
http://www.ann.ece.ufl.edu/
These slides are provided by: David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley
Modifications/additions have been made from the originals
11/11/09

Outline
• MP Motivation
• SISD v. SIMD v. MIMD
• Centralized vs. Distributed Memory
• Challenges to Parallel Programming
• Consistency, Coherency, Write Serialization
• Snoopy Cache
• Directory-based protocols and examples

Uniprocessor Performance (SPECint) - Revisited….yet again
[Figure: uniprocessor performance (SPECint) by year, with a 3X gap marked. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present

Déjà vu all over again?
• “… today’s processors … are nearing an impasse as technologies approach the speed of light..” David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer had bad timing (Uniprocessor performance↑) ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs) ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs

Manufacturer/Year    AMD/’05   Intel/’06   IBM/’04   Sun/’05
Processors/chip      2         2           2         8
Threads/Processor    1         2           2         4
Threads/chip         2         4           4         32

Other Factors Pushing Multiprocessors
• Growth in data-intensive applications
  – Data bases, file servers, …
  – Inherently parallel - SMT can’t fully exploit
• Growing interest in servers, server perf.
  – Internet
• Increasing desktop perf. less important (outside of graphics)
  – Don’t need to run Word any faster
  – But near unbounded performance increase has led to terrible programming

Other Factors Pushing Multiprocessors (continued)
• Lessons learned:
  – Improved understanding in how to use multiprocessors effectively
    » Especially in servers, where there is significant natural TLP
  – Advantages in replication rather than unique design
    » In uniprocessor, redesign every few years = tremendous R&D
      • Or many designs for different customer demands (Celeron vs. Pentium)
    » Shift efforts to multiprocessor
      • Simply add more processors for more performance

Flynn’s Taxonomy
M.J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, V 54, 1900-1909, Dec. 1966.
• Flynn divided the world into two streams in 1966 = instruction and data

                        Single Data                  Multiple Data
Single Instruction      SISD (Uniprocessor)          SIMD (single PC: Vector, CM-2)
Multiple Instruction    MISD (????)                  MIMD (Clusters, SMP servers)

• SIMD ⇒ Data Level Parallelism
• MIMD ⇒ Thread Level Parallelism
• MIMD popular because
  – Flexible: N pgms and 1 multithreaded pgm
  – Cost-effective: same MPU in desktop & MIMD

Back to Basics
• A parallel computer is…
  – … a collection of processing elements that cooperate and communicate to solve large problems fast.
• How do we build a parallel architecture?
  – Computer Architecture + Communication Architecture
• 2 classes of multiprocessors WRT memory:
  1. Centralized Memory Multiprocessor
    • Take a single design and just keep adding more processors/cores
    • few dozen processor chips (and < 100 cores) in 2006
    • Small enough to share single, centralized memory
    • But interconnect is becoming a bottleneck…..
  2. Physically Distributed-Memory multiprocessor
    • Can have larger number of chips and cores
    • BW demands are met by distributing memory among processors

Centralized vs. Distributed Memory
[Figure: two block diagrams of processors P1 … Pn, each with a cache ($). Centralized Memory: the caches reach the memory modules through an interconnection network; all memory is far. Distributed Memory: each processor has its own memory module and the nodes are joined by an interconnection network; close memory and far memory. Other labels in the figure: Intel, AMD, Scale, "Logically connected but on different banks"]

Centralized Memory Multiprocessor
• Also called symmetric multiprocessors (SMPs)
  • main memory has a symmetric relationship to all processors
  • All processors see same access time to memory
• Reducing interconnect bottleneck
  • Large caches ⇒ single memory can satisfy memory demands of small number of processors
• How big can the design realistically be?
  • Scale to a few dozen processors by using a switch and by using many memory banks
  • Scaling beyond that is technically conceivable but….it becomes less attractive as the number of processors sharing centralized memory increases
    • Longer wires = longer latency
    • Higher load = higher power
    • More contention = bottleneck for shared resource

Distributed Memory Multiprocessor
• Distributed memory is a “must have” for big designs
• Pros:
  • Cost-effective way to scale memory bandwidth
    • If most accesses are to local memory
  • Reduces latency of local memory accesses
• Cons:
  • Communicating data between processors more complex
  • More complicated
  • Software aware
    • Must change software to take advantage of increased memory BW

2 Models for Communication and Memory Architecture
1. message-passing multiprocessors
  • Communication occurs by explicitly passing messages among the processors
2. shared memory multiprocessors
  • Communication occurs through a shared address space (via loads and stores): either
    • UMA (Uniform Memory Access time) for shared address, centralized memory MP
    • NUMA (Non Uniform Memory Access time multiprocessor) for shared address, distributed memory MP
• In past, confusion whether “sharing” means sharing physical memory (Symmetric MP) or sharing address space

Challenges of Parallel Processing
• First challenge is % of program inherently sequential
• Suppose 80X speedup from 100 processors. What fraction of original program can be sequential?
  a. 10%
  b. 5%
  c. 1%
  d. <1%

Amdahl’s Law Answers
• Speedup = 1 / (f_seq + (1 - f_seq) / 100) = 80
• 1/80 = 0.0125 = 0.01 + 0.99 x f_seq, so f_seq = 0.0025 / 0.99 ≈ 0.25%
• Answer: d. <1%

Challenges of Parallel Processing
• Second challenge is long latency to remote memory
• Suppose 32 CPU MP, 2GHz, 200 ns remote memory (400 clock cycles), all local accesses hit memory hierarchy and base CPI is 0.5.
• What is the performance impact if 0.2% instructions involve remote access?
  a. 1.5X
  b. 2.0X
  c. 2.5X

CPI Equation
• CPI = Base CPI + Remote request rate x Remote request cost
• CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3
• The machine with no remote communication is 1.3/0.5 = 2.6 times faster than the one in which 0.2% of instructions involve a remote access

Challenges of Parallel Processing
1. Need new advances in algorithms
  • Application parallelism
2. New programming languages
  • Hard to program parallel applications
3. How to deal with long remote latency impact
  • both by architect and by the programmer
    – For example, reduce frequency of remote accesses either by
      » Caching shared data (HW)
      » Restructuring the data layout to make more accesses local (SW)

Symmetric Shared-Memory Architectures - UMA
• From multiple boards on a shared bus to multiple processors inside a single chip
• Equal access time for all processors to memory via shared bus
• Each processor will cache both
  – Private data are used by a single processor
  – Shared data are used by multiple processors
• Advantage of caching shared data
  – Reduces latency to shared data, memory bandwidth for shared data, and interconnect bandwidth
  – But adds cache coherence problem

Example Cache Coherence Problem
[Figure: processors P1, P2, P3, each with a private cache ($), share a bus with memory and I/O devices. Memory initially holds u:5. Numbered events 1 to 5 mark accesses to u; two caches hold u:5, one processor sets u = 7, and two later reads are marked "u = ?"]
  – Processors see different values for u after event 3
  – With write-back caches, the value in memory depends on which cache flushes first
    » Processes accessing main memory may see a very stale value
  – Unacceptable for programming, and it’s frequent!

Not Just Cache Coherency….
• Getting single variable values coherent isn’t the only issue
  – Coherency alone doesn’t lead to correct program execution
• Also deals with synchronization of different variables that interact
  – Shared data values not only need to be coherent, but the order of access to those values must be protected
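The stale-value behavior in the Example Cache Coherence Problem slide can be reproduced with a toy model. The sketch below is illustrative only: the `Cache` class and `run_example` function are hypothetical names, and the event order (1: P1 reads u, 2: P3 reads u, 3: P3 writes u = 7, 4: P1 reads u, 5: P2 reads u) is an assumption about the figure, whose labels did not survive extraction cleanly. The caches are write-back and deliberately have no coherence protocol.

```python
class Cache:
    """Private write-back cache with NO coherence protocol (deliberately broken)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory: dict of address -> value
        self.lines = {}        # locally cached copies: address -> value
        self.dirty = set()     # addresses modified locally but not yet written back

    def read(self, addr):
        if addr not in self.lines:             # miss: fetch the value from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                # hit: return the local copy, stale or not

    def write(self, addr, value):
        self.lines[addr] = value               # write-back: memory is not updated yet
        self.dirty.add(addr)

    def flush(self, addr):
        if addr in self.dirty:                 # write the dirty line back to memory
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)


def run_example():
    """Replay the assumed event order and record what each processor sees."""
    memory = {"u": 5}
    p1, p2, p3 = Cache(memory), Cache(memory), Cache(memory)
    seen = {}
    seen["event1_P1"] = p1.read("u")           # P1 caches u = 5
    seen["event2_P3"] = p3.read("u")           # P3 caches u = 5
    p3.write("u", 7)                           # event 3: only P3's cache holds 7
    seen["event4_P1"] = p1.read("u")           # hit on P1's stale copy
    seen["event5_P2"] = p2.read("u")           # miss, but memory is still stale
    p3.flush("u")                              # P3 eventually writes back
    seen["memory_after_flush"] = memory["u"]
    seen["P1_after_flush"] = p1.read("u")      # P1 still hits its old copy
    return seen


if __name__ == "__main__":
    for name, value in run_example().items():
        print(name, value)
```

In this model P1 and P2 both read 5 after P3 has written 7, and P1 keeps reading 5 even after the flush, because nothing invalidates or updates its copy. That missing invalidate or update step is what a coherence protocol supplies.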
Intuitive Memory Model
• Reading an address should return the last value written to that address
  – Easy in uniprocessors
  – In multiprocessors, more complicated than just seeing the last value written
    » How do you define write order between different processes
[Figure: a processor P above a memory hierarchy in which address 100 has a different value at each level: L1 100:67, L2 100:35, Memory, Disk 100:34]

Example
This process should see value written immediately

P1                                          P2
/*Assume initial value of A and flag is 0*/
A = 1;                                      while (flag == 0); /*spin idly*/
flag = 1;                                   print A;

• expect memory to respect order between accesses to different locations issued by a given process
  – to
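One way to see why the order of P1's two writes matters in the flag example is to enumerate every interleaving of the two processes by brute force. The sketch below is a simplified model under stated assumptions, not a description of any real memory system: the function `printable_values` is a hypothetical helper, each access is treated as atomic, and P2's spin loop is reduced to a single step that may only happen once flag is nonzero.

```python
from itertools import combinations


def printable_values(p1_writes):
    """Return the set of values P2 can print, given the order in which
    P1's writes become visible in memory (a list of (variable, value) pairs)."""
    p2_steps = ["exit_spin", "read_A"]          # P2: leave the spin loop, then read A
    total = len(p1_writes) + len(p2_steps)
    results = set()
    # Pick which schedule slots belong to P1; the rest belong to P2.
    # Each process keeps its own order inside the merged schedule.
    for p1_slots in combinations(range(total), len(p1_writes)):
        mem = {"A": 0, "flag": 0}               # initial values from the slide
        p1_iter, p2_iter = iter(p1_writes), iter(p2_steps)
        feasible, printed = True, None
        for slot in range(total):
            if slot in p1_slots:
                var, val = next(p1_iter)
                mem[var] = val                  # P1's write becomes visible
            elif next(p2_iter) == "exit_spin":
                if mem["flag"] == 0:            # P2 would still be spinning here
                    feasible = False
                    break
            else:
                printed = mem["A"]              # P2 executes "print A"
        if feasible:
            results.add(printed)
    return results


# Writes become visible in program order: A first, then flag.
in_order = printable_values([("A", 1), ("flag", 1)])
# Assumed reordering: flag becomes visible before A.
reordered = printable_values([("flag", 1), ("A", 1)])

if __name__ == "__main__":
    print("in order:", in_order)
    print("reordered:", reordered)
```

With writes visible in program order the only printable value is 1. If the write to flag can overtake the write to A, P2 can also print 0, even though every individual location is coherent.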
