Amdahl's Law in the Multicore Era

Amdahl’s Law in the Multicore Era
Mark D. Hill and Michael R. Marty
University of Wisconsin-Madison
August 2008 @ Semiahmoo Workshop

IBM’s Dr. Thomas Puzak: Everyone knows Amdahl’s Law, but quickly forgets it!

© 2008 Multifacet Project, University of Wisconsin-Madison

Executive Summary
• Develop a Corollary to Amdahl’s Law
– Simple Model of Multicore Hardware
– Complements Amdahl’s software model
– Fixed chip resources for cores
– Core performance improves sub-linearly with resources
• Research Implications
(1) Need Dramatic Increases in Parallelism (No Surprise)
• 99% parallel limits 256 cores to speedup 72
• New Moore’s Law: Double Parallelism Every Two Years?
(2) Many larger chips need increased core performance
(3) HW/SW for asymmetric designs (one/few cores enhanced)
(4) HW/SW for dynamic designs (serial ↔ parallel)

8/6/2008 Wisconsin Multifacet Project

Outline
• Multicore Motivation & Research Paper Trends
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up

How has Architecture Research Prepared?
[Figure: Percent Multiprocessor Papers in ISCA (y-axis, 0 to 100) by year (x-axis, 1973 to 2007). Annotations: SMP Bulge; Lead up to Multicore; What Next? Source: Hill & Rajwar, The Rise & Fall of Multiprocessor Papers in ISCA (3/2001), http://www.cs.wisc.edu/~markhill/mp2001.html]

How has Architecture Research Prepared?
[Figure: Percent Multiprocessor Papers in ISCA (y-axis, 0 to 100) by year (x-axis, 1973 to 2007). Annotations: Multicore Ramp; Reacted?; Will Architecture Research Overreact? Source: Hill, 2/2008]

What About PL/Compilers (PLDI) Research?
[Figure: Percent Multiprocessor Papers (y-axis, 0% to 100%) by year (x-axis, 1988 to 2007). Annotations: PLDI Begins; End of Small SMP Bulge?; Lead up to Multicore; Gentle Multicore Ramp; What Next? Source: Steve Jackson, 3/2008]

What About Systems (SOSP/OSDI) Research?
[Figure: Percent Multiprocessor Papers (y-axis, 0% to 100%) by year (x-axis, 1967 to 2007; SOSP odd years only, then OSDI even & SOSP odd). Annotations: Small SMP Bulge; Lead up to Multicore; NO Multicore Ramp (Yet); What Next? Source: Michael Swift, 3/2008]

Recall Amdahl’s Law
• Begins with Simple Software Assumption (Limit Arg.)
– Fraction F of execution time perfectly parallelizable
– No Overhead for Scheduling, Communication, Synchronization, etc.
– Fraction 1 – F Completely Serial
• Time on 1 core = (1 – F) / 1 + F / 1 = 1
• Time on N cores = (1 – F) / 1 + F / N

Recall Amdahl’s Law [1967]
$$\text{Amdahl's Speedup} = \frac{1}{\frac{1 - F}{1} + \frac{F}{N}}$$
• For mainframes, Amdahl expected 1 - F = 35%
– For a 4-processor speedup = 2
– For infinite-processor speedup < 3
– Therefore, stay with mainframes with one/few processors
• Amdahl’s Law applied to Minicomputer to PC Eras
• What about the Multicore Era?

Designing Multicore Chips Hard
• Designers must confront single-core design options
– Instruction fetch, wakeup, select
– Execution unit configuration & operand bypass
– Load/queue(s) & data cache
– Checkpoint, log, runahead, commit.
• As well as additional design degrees of freedom
– How many cores? How big each?
– Shared caches: levels? How many banks?
– Memory interface: How many banks?
– On-chip interconnect: bus, switched, ordered?

Want Simple Multicore Hardware Model To Complement Amdahl’s Simple Software Model
(1) Chip Hardware Roughly Partitioned into
– Multiple Cores (with L1 caches)
– The Rest (L2/L3 cache banks, interconnect, pads, etc.)
– Changing Core Size/Number does NOT change The Rest
(2) Resources for Multiple Cores Bounded
– Bound of N resources per chip for cores
– Due to area, power, cost ($$$), or multiple factors
– Bound = Power? (but our pictures use Area)

Want Simple Multicore Hardware Model, cont.
(3) Micro-architects can improve single-core performance using more of the bounded resource
• A Simple Base Core
– Consumes 1 Base Core Equivalent (BCE) resources
– Provides performance normalized to 1
• An Enhanced Core (in same process generation)
– Consumes R BCEs
– Performance as a function Perf(R)
• What does function Perf(R) look like?
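Amdahl’s formula recalled above can be checked numerically. The following is a minimal Python sketch, not code from the talk; the function name `amdahl_speedup` is a hypothetical choice for illustration. It evaluates the mainframe case on the slide (1 - F = 35%) and the executive summary’s case of a 99% parallel workload on 256 cores.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Amdahl's speedup for parallel fraction f on n processors.

    Illustrative helper; the name and signature are assumptions,
    not something defined in the slides.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    # The serial part runs at rate 1; the parallel part is split over n processors.
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    f_mainframe = 0.65  # Amdahl's mainframe expectation: 1 - F = 35%
    print(f"4 processors:        {amdahl_speedup(f_mainframe, 4):.2f}")
    print(f"infinite processors: {1.0 / (1.0 - f_mainframe):.2f}")
    print(f"F=0.99, 256 cores:   {amdahl_speedup(0.99, 256):.1f}")
```

The infinite-processor case is computed directly as 1 / (1 - F), since the F / N term vanishes as N grows.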
More on Enhanced Cores
• (Performance Perf(R) consuming R BCEs resources)
• If Perf(R) > R, always enhance core
• Cost-effectively speeds up both sequential & parallel
• Therefore, Equations Assume Perf(R) < R
• Graphs Assume Perf(R) = Square Root of R
– 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
– Why? Models diminishing returns with “no coefficients”
– Alpha EV4/5/6 [Kumar 11/2005] & Intel’s Pollack’s Law
• How to speed up enhanced core?
– <Insert favorite or TBD micro-architectural ideas here>

How Many (Symmetric) Cores per Chip?
• Each Chip Bounded to N BCEs (for all cores)
• Each Core consumes R BCEs
• Assume Symmetric Multicore = All Cores Identical
• Therefore, N/R Cores per Chip, since (N/R)*R = N
• For an N = 16 BCE Chip:
– Sixteen 1-BCE cores
– Four 4-BCE cores
– One 16-BCE core

Performance of Symmetric Multicore Chips
• Serial Fraction 1-F uses 1 core at rate Perf(R)
• Serial time = (1 – F) / Perf(R)
• Parallel Fraction uses N/R cores at rate Perf(R) each
• Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
• Therefore, w.r.t. one base core:
$$\text{Symmetric Speedup} = \frac{1}{\frac{1 - F}{\mathrm{Perf}(R)} + \frac{F \cdot R}{\mathrm{Perf}(R) \cdot N}}$$
• Implications? Enhanced Cores speed Serial & Parallel

Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (y-axis, 0 to 16) vs. R BCEs (x-axis: 1, 2, 4, 8, 16, i.e. 16, 8, 4, 2, 1 cores). Curve: F=0.5. Callout: F=0.5, R=16, Cores=1, Speedup=4]
F=0.5, Opt. Speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16))
Need to increase parallelism to make multicore optimal!
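The symmetric speedup formula above can be evaluated directly. The following is a minimal Python sketch under the slides’ graphing assumption Perf(R) = sqrt(R); the function names `perf` and `symmetric_speedup` are hypothetical, chosen here for illustration. It sweeps the N = 16 design points for F = 0.5.

```python
import math


def perf(r: float) -> float:
    """Core performance for r BCEs, assuming Perf(R) = sqrt(R) as in the graphs."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: float, r: float) -> float:
    """Speedup, relative to one base core, of an n-BCE chip built from
    n/r identical cores of r BCEs each. Names are illustrative assumptions."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    serial_time = (1.0 - f) / perf(r)        # one core at rate Perf(R)
    parallel_time = f * r / (perf(r) * n)    # n/r cores at rate Perf(R) each
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    n = 16
    for r in (1, 2, 4, 8, 16):
        s = symmetric_speedup(0.5, n, r)
        print(f"R={r:2d}  cores={n // r:2d}  speedup(F=0.5)={s:.2f}")
```

For F = 0.5 the sweep peaks at R = 16, a single large core, which matches the slide’s point that more parallelism is needed before multicore becomes the best choice.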
Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (y-axis, 0 to 16) vs. R BCEs (x-axis: 1, 2, 4, 8, 16). Curves: F=0.5, F=0.9. Callouts: F=0.9, R=2, Cores=8, Speedup=6.7; F=0.5, R=16, Cores=1, Speedup=4]
At F=0.9, Multicore optimal, but speedup limited
Need to obtain even more parallelism!

Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (y-axis, 0 to 16) vs. R BCEs (x-axis: 1, 2, 4, 8, 16). Curves: F=0.5, F=0.9, F=0.975, F=0.99, F=0.999. Callout: F→1, R=1, Cores=16, Speedup→16]
F matters: Amdahl’s Law applies to multicore chips
MANY Researchers should target parallelism F first

Need a Third “Moore’s Law?”
• Technologist’s Moore’s Law
– Double Transistors per Chip every 2 years
– Slows or stops: TBD
• Microarchitect’s Moore’s Law
– Double Performance per Core every 2 years
– Slowed or stopped: Early 2000s
• Multicore’s Moore’s Law
– Double Cores per Chip every 2 years
– & Double Parallelism per Workload every 2 years
– & Aided by Architectural Support for Parallelism
– = Double Performance per Chip every 2 years
– Starting now
• Software as Producer, not Consumer, of Performance Gains!

Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (y-axis, 0 to 16) vs. R BCEs (x-axis: 1, 2, 4, 8, 16). Curves: F=0.5, F=0.9, F=0.975, F=0.99, F=0.999. Callout: Recall F=0.9, R=2, Cores=8, Speedup=6.7]
As Moore’s Law enables N to go from 16 to 256 BCEs, More cores? Enhance cores? Or both?

Symmetric Multicore Chip, N = 256 BCEs
[Figure: Symmetric Speedup (y-axis, 0 to 250) vs. R BCEs (x-axis: 1, 2, 4, 8, 16, 32, 64, 128, 256). Curves: F=0.5, F=0.9, F=0.975, F=0.99, F=0.999, F→1. Callouts:
F=0.999: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16). MORE CORES!
F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9). MORE CORES & ENHANCE CORES!
F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7). ENHANCE CORES!]
As Moore’s Law increases N, often need enhanced core designs
Some arch.
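The N = 16 and N = 256 callouts above can be reproduced with a brute-force sweep over integer core sizes. The following is a minimal Python sketch, assuming Perf(R) = sqrt(R) as in the graphs; `symmetric_speedup` and `best_r` are hypothetical names introduced here for illustration. The model allows a fractional number of cores N/R, so the core counts printed are rounded down.

```python
import math


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Symmetric multicore speedup w.r.t. one base core,
    assuming Perf(R) = sqrt(R). Illustrative helper, not from the slides."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def best_r(f: float, n: int) -> int:
    """Integer core size R in 1..n that maximizes symmetric speedup.

    A plain exhaustive sweep; on ties the smallest R is returned.
    """
    return max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r))


if __name__ == "__main__":
    for f in (0.9, 0.99, 0.999):
        for n in (16, 256):
            r = best_r(f, n)
            s = symmetric_speedup(f, n, r)
            print(f"F={f}  N={n:3d}  best R={r:2d}  cores={n // r:3d}  speedup={s:.1f}")
```

Under these assumptions, moving from N = 16 to N = 256 shifts the best core size from R = 2 to R = 28 at F = 0.9, while at F = 0.999 base cores (R = 1) remain the best choice.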
