
Revisiting the Future ☺

Erik Hagersten, Uppsala University, Sweden

DARK2 in a nutshell

1. Memory Systems (caches, VM, DRAM, microbenchmarks, …)
2. Multiprocessors (TLP, coherence, interconnects, scalability, clusters, …)
3. CPUs (pipelines, ILP, scheduling, superscalars, VLIWs, embedded, …)
4. Future: (physical limitations, TLP+ILP in the CPU, …)

DARK 2009

Dept of Information Technology | www.it.uu.se | ©Erik Hagersten | user.it.uu.se/~eh

How do we get good performance?

Creating and exploring:
1) Locality
   a) Spatial locality
   b) Temporal locality
   c) Geographical locality
2) Parallelism
   a) Instruction level (ILP)
   b) Thread level (TLP)
   c) Memory level (MLP)
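The locality notions above can be made concrete with a toy traversal example (my own sketch, not from the lecture; the function names are invented). In a row-major layout, walking along rows touches consecutive addresses (spatial locality), while column order strides across the whole array:

```python
# Toy illustration of spatial locality (hypothetical example, not from the slides).
# Row-major traversal touches consecutive elements; column-major traversal
# strides by a whole row, so each access tends to land on a new cache line.
N = 1000
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def row_major_sum(m):
    total = 0
    for row in m:              # consecutive elements: good spatial locality
        for x in row:
            total += x
    return total

def col_major_sum(m):
    total = 0
    for j in range(N):         # stride-N accesses: poor spatial locality
        for i in range(N):
            total += m[i][j]
    return total

assert row_major_sum(matrix) == col_major_sum(matrix)
```

In C or Fortran with flat arrays the two orders can differ severalfold in runtime; Python's boxed lists only approximate the access pattern, so treat this as an illustration of the idea, not a benchmark.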

Ray Kurzweil pictures: www.KurzweilAI.net/pps/WorldHealthCongress/

Doubling (or Halving) times

Dynamic RAM memory (bits per dollar): 1.5 years
Average price: 1.6 years
Cost per transistor cycle: 1.1 years
Total bits shipped: 1.1 years
Performance in MIPS: 1.8 years
Transistors (in Intel microprocessors): 2.0 years
Microprocessor clock speed: 2.7 years

Old Trend 1: Deeper pipelines, exploring ILP (instruction-level parallelism)

[Pipeline diagram: instruction fetch (I), register read (R), PC and register file; data dependences limit how deep the pipeline can usefully get. Memory: 1 GB, ~150 cycles away.]

Old Trend 2: Wider pipelines, exploring more ILP

More pipelines + deeper pipelines: the issue logic needs to find more independent instructions. [Diagram: multi-issue pipeline fed by Thread 1; memory: 1 GB, ~150 cycles away.]

Old Trend 3: Deeper cache hierarchies, exploring access locality

[Diagram: four pipelines (I R B M M W) fed by Thread 1's issue logic, backed by a cache hierarchy: 2 kB at 2 cycles, 64 kB at 10 cycles, 2 MB at 30 cycles, and 1 GB of memory at ~150 cycles.]

Are we hitting the wall now? Pop quiz: Can the transistors be made even smaller and faster?

[Performance over time, log scale: "business as usual" rides Moore's Law from 1 to 100; a possible path to 1000 exists, but requires a paradigm shift.]

Pop answer: Well, uhm… the transistors can be made smaller and faster, but there are other problems.

Microprocessors today: whatever it takes to run one program fast.

Exploring ILP (instruction-level parallelism):
Faster clocks
Deep pipelines
Superscalar pipelines
Branch prediction
Out-of-order execution
Trace caches
Speculation
Predicated execution
Advanced Load Address Table
Return Address Stack
…

Bad News #1: We have explored most of the ILP (instruction-level parallelism).

Bad News #2: Looong wire delay → slow CPUs (a future-technology problem)

[Chart: latency for a 5 mm trace (RC model) vs. cycle time, 2000-2015. Cycle time falls from 830 ps (1.2 GHz) through 390 ps and 170 ps down to 70 ps (14 GHz), so a cross-chip wire costs more and more cycles.]

Quantitative data and trends according to V. Agarwal et al., ISCA 2000, based on the 1999 SIA (Semiconductor Industry Association) prediction.

Bad News #2 (cont.): How much of the chip area can you reach in one cycle?

[Chart: span, the fraction of chip reachable in one cycle, falls from ~100% in 2000 to 0.4-1.4% by 2015 as the feature size shrinks from .25 to .035 micron (SIA).]

Bad News #3: Memory latency/bandwidth is the bottleneck… Slooow memory

[Diagram: CPU with L1, L2 and L3/memory controller computing A = B + C.]

A = B + C:
Read B: 0.3-100 ns
Read C: 0.3-100 ns
Add B & C: 0.3 ns
Write A: 0.3-100 ns

Bad News #4: Power is the limit

Power consumption is the bottleneck:
Cooling servers is hard
Battery lifetime for mobile computers
Energy is money

Dynamic power is proportional to:

P_dyn ~ Capacitance * Frequency * Voltage^2
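A back-of-the-envelope sketch of what this formula implies (my numbers; it idealizes by letting voltage scale with frequency, which real chips only approximate):

```python
# Hypothetical DVFS arithmetic using P_dyn ~ C * f * V^2.
def p_dyn(cap, freq, volt):
    return cap * freq * volt ** 2

base = p_dyn(1.0, 1.0, 1.0)       # one core at full frequency and voltage

# Halving f usually permits a lower V; assume V halves too (idealized):
half = p_dyn(1.0, 0.5, 0.5)
print(half / base)                # 0.125: the "cube law" of voltage/frequency scaling

# Two half-frequency cores: twice the capacitance, same peak op rate:
two_half = p_dyn(2.0, 0.5, 0.5)
print(two_half / base)            # 0.25: similar throughput potential at a quarter of the power
```

This is why the answer to the power wall later in the deck is "lower the frequency, lower the voltage" and add cores.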

Now What?

#1: Running out of ILP

#2: Wire delay is starting to hurt

#3: Memory is the bottleneck

#4: Power is the limit

Solving all the problems: exploring thread parallelism

#1: Running out of ILP → feed one CPU with instructions from many threads

#2: Wire delay is starting to hurt → multiple small CPUs with private L1$

#3: Memory is the bottleneck → memory accesses from many threads (MLP)

#4: Power is the limit → lower the frequency, lower the voltage

Bad News #1: Not enough ILP, 1(2) → feed one CPU with instructions from many threads. Slooow memory.

[One CPU with L1 and issue logic, fed by a single thread's instruction stream:]
LD X, R2
ST R1, R2, R3
LD X, R2
ST R1, R2, R3
…

Bad News #1: Not enough ILP, 2(2) → feed one CPU with instructions from many threads

[The same CPU, now fed by two interleaved instruction streams; the threads' instructions are independent of each other:]
Thread A: LD X, R2 / ST R1, R2, R3 / …
Thread B: LD X, R2 / ST R1, R2, R3 / …

SMT: Simultaneous Multithreading. "Combine TLP & ILP to find independent instructions."

[Diagram: N threads, each with its own PC and registers, share the issue logic, the four pipelines (I R B M M W), the cache hierarchy and memory.]

Thread-interleaved

Historical examples: Denelcor HEP, Tera Computer [B. Smith], 1984
Each thread executes every n:th cycle in a round-robin fashion
-- Poor single-thread performance
-- Expensive (due to early adoption)

Intel's Hyperthreading (2002?)
-- Poor implementation
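The round-robin scheme can be sketched in a few lines (my own toy model, not from the slides; the instruction strings are placeholders):

```python
# Toy model of thread-interleaved execution: with n threads, each thread
# issues an instruction every n:th cycle, round-robin.
def interleave(threads, cycles):
    """threads: list of instruction lists; returns a (cycle, tid, instr) trace."""
    trace = []
    pcs = [0] * len(threads)
    for cycle in range(cycles):
        tid = cycle % len(threads)              # round-robin selection
        if pcs[tid] < len(threads[tid]):
            trace.append((cycle, tid, threads[tid][pcs[tid]]))
            pcs[tid] += 1
    return trace

t0 = ["LD X,R2", "ST R1,R2,R3"]
t1 = ["LD Y,R4", "ST R5,R4,R6"]
trace = interleave([t0, t1], 4)
# With two threads, each thread gets every 2nd cycle, which is exactly why
# single-thread performance is poor: a lone thread still only issues every n:th cycle.
print(trace)
```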

Bad News #2: wire delay → multiple small CPUs with private L1$

[Diagram: four CPUs, each with a private L1 cache, sharing L2/L3 cache levels, some of them external.]

CMP: Chip Multiprocessor (more TLP & geographical locality)

[Diagram: N CPUs, each with its own PC, registers, issue logic and pipelines (I R B M M W) plus private caches, sharing the lower cache levels and memory.]

Bad News #3: memory latency/bandwidth → memory accesses from many threads (MLP). Sloooow memory.

[Diagram: Thread 1 and Thread 2 each compute A = B + C; their memory accesses overlap in time.]

TLP → MLP: Thread-Level Parallelism creates Memory-Level Parallelism.
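A small sketch of the MLP effect (my own illustration, not from the slides; `time.sleep` stands in for a cache miss, scaled up from nanoseconds to something observable):

```python
# Overlapping "memory accesses" from several threads hides latency (MLP).
import time
from concurrent.futures import ThreadPoolExecutor

MISS_LATENCY = 0.05  # seconds; a stand-in for a ~100 ns DRAM access

def load(addr):
    time.sleep(MISS_LATENCY)   # the simulated memory access
    return addr * 2

addrs = range(8)

start = time.perf_counter()
serial = [load(a) for a in addrs]                  # one thread: misses serialize
t_serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:    # eight threads: misses overlap
    parallel = list(pool.map(load, addrs))
t_parallel = time.perf_counter() - start

assert serial == parallel
print(f"serial {t_serial:.2f}s, overlapped {t_parallel:.2f}s")
```

The same work is done either way; only the number of outstanding "misses" differs, which is the whole point of attacking the memory wall with threads.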

Bad News #4: Power consumption → lower the frequency, lower the voltage

P_dyn = C * f * V^2 ≈ area * freq * voltage^2

One CPU at freq = f vs. two CPUs at freq = f/2 (the lower frequency allows a lower voltage V' < V):
P_dyn(C, f, V) = C f V^2   vs.   P_dyn(2C, f/2, V') = 2C (f/2) V'^2 = C f V'^2

One CPU at freq = f vs. one CPU at freq = f/2:
P_dyn(C, f, V) = C f V^2   vs.   P_dyn(C, f/2, V') = C (f/2) V'^2

Throughput < 2x

Example: Frequency Scaling

                          freq   speed   pow
20% higher freq.:         1.20   1.13    1.51
20% lower freq.:          0.80   0.87    0.51
20% lower freq., 2 cores: 0.80   0.87    0.51  (per core)

Multicore CPUs:

Ageia PhysX, a multi-core physics processing unit.
Ambric Am2045, a 336-core Massively Parallel Processor Array (MPPA).
AMD Athlon 64, Athlon 64 FX and Athlon 64 X2 family, dual-core desktop processors; Opteron, dual- and quad-core server/workstation processors; Phenom, triple- and quad-core desktop processors; Sempron X2, dual-core entry-level processors; Turion 64 X2, dual-core laptop processors; Radeon and FireStream multi-core GPU/GPGPU (10 cores, 16 5-issue-wide superscalar stream processors per core).
ARM MPCore, a fully synthesizable multicore container for ARM9 and ARM11 processor cores, intended for high-performance embedded and entertainment applications.
Azul Systems Vega 2, a 48-core processor.
Broadcom SiByte SB1250, SB1255 and SB1455.
Cradle Technologies CT3400 and CT3600, both multi-core DSPs.
Cavium Networks Octeon, a 16-core MIPS MPU.
HP PA-8800 and PA-8900, dual-core PA-RISC processors.
IBM POWER4, the world's first dual-core processor, released in 2001; POWER5, a dual-core processor, released in 2004; POWER6, a dual-core processor, released in 2007; PowerPC 970MP, a dual-core processor, used in the Apple Power Mac G5; Xenon, a triple-core, SMT-capable PowerPC microprocessor used in the Microsoft Xbox 360 game console.
IBM, Sony, and Toshiba Cell processor, a nine-core processor with one general-purpose PowerPC core and eight specialized SPUs (Synergistic Processing Units) optimized for vector operations, used in the Sony PlayStation 3.
Infineon Danube, a dual-core, MIPS-based home gateway processor.
Intel Celeron Dual-Core, the first dual-core processor for the budget/entry-level market; Core Duo, a dual-core processor; Core 2 Duo, a dual-core processor; Core 2 Quad, a quad-core processor; Core i7, a quad-core processor, the successor of the Core 2 Duo and the Core 2 Quad; Itanium 2, a dual-core processor; Pentium D, a dual-core processor; Teraflops Research Chip (Polaris), a 3.16 GHz, 80-core processor prototype, which the company says will be released within the next five years; Xeon dual-, quad- and hexa-core processors.
IntellaSys seaForth24, a 24-core processor.
Nvidia GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core); GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core); Tesla multi-core GPGPU (8 cores, 16 scalar stream processors per core).
Parallax Propeller P8X32, an eight-core microcontroller.
picoChip PC200 series, 200-300 cores per device, for DSP & wireless.
Rapport Kilocore KC256, a 257-core microcontroller with a PowerPC core and 256 8-bit "processing elements".
Raza Microelectronics XLR, an eight-core MIPS MPU.
Sun Microsystems UltraSPARC IV and UltraSPARC IV+, dual-core processors; UltraSPARC T1, an eight-core, 32-thread processor.

[source: Wikipedia]

High-performance single-core CPUs:

Many shapes and forms...

[Diagrams: AMD (cores with private caches and on-chip memory interfaces), Intel x86 (cores with private caches behind a front-side bus and external memory interface), IBM Cell (one general-purpose core plus eight small cores with local memories). Each may run more than one thread...]

Darling, I shrunk the computer

Sequential execution (one program): Mainframes ≈ Super Minis ≈ Microprocessor (Mem + CPU).

Paradigm shift: the Chip Multiprocessor (CMP), a multiprocessor on a chip! Parallel execution (TLP).

How to create threads?

Sequential program + parallelizing compiler: HARD!! (rocket science), and not ideal for multicores.
Parallel program: written as Thread1, Thread2, Thread3, …
Several programs (Prog1, Prog2, …) + a multitasking OS.
Modeling.

Throughput Computing: "multitasking", virtualization, concurrency, …
Capability Computing: weather simulation, computer games, …

Design Examples: CMPs

Erik Hagersten, Uppsala University, Sweden

Intel Core2 Quad, 45 nm

[Intel Core2 Quad: South Bridge (I/O) and North Bridge (DRAM) behind the front-side bus (FSB). The module holds two chips; each chip has a 6 MB L2$ shared by two CPUs, and each CPU has 64 kB D$ and 64 kB I$.]

Intel: Dunnington, 45 nm

[Intel Dunnington: South Bridge (I/O); North Bridge with two DDR2 DRAM channels on the front-side bus (FSB); an 8+ MB L3 behind a crossbar (X-bar); three 2 MB L2$, each shared by two CPUs; each of the six CPUs has 64 kB D$ and 64 kB I$.]

Intel Dunnington, 45 nm [die photo]

AMD Barcelona, 65 nm

[AMD Barcelona: HyperTransport and DDR2 interfaces; a 2 MB L3 behind a crossbar (X-bar); four 512 kB L2$; four CPUs, each with 64 kB D$ and 64 kB I$.]

AMD MC System Architecture

Coherent "multi-socket", i.e., NUMA. [Diagram: several sockets linked together, with I/O attached to two of them.]

AMD Shanghai, 45 nm

[AMD Shanghai: HyperTransport and DDR2 DRAM interfaces; an 8 MB L3 behind a crossbar (X-bar); four 512 kB L2$; four CPUs, each with 64 kB D$ and 64 kB I$.]

AMD Istanbul, 6 cores [die photo]

Intel: Nehalem, Core i7, 45 nm, Q1 2009 (4 cores)

[Nehalem: QuickPath Interconnect; 3x DDR3 DRAM channels; an 8 MB L3 behind a crossbar (X-bar); four 256 kB L2$; four CPUs, each with 64 kB D$, 64 kB I$ and 2 threads. Up to 4 cores x 2 threads.]

Nehalem "Core i7", 45 nm [die photo]

AMD Magny-Cours, 2010 (??)

[Magny-Cours: an MCM (multi-chip module) of two ~Istanbul dice: twelve CPUs (C), each with L1 (¢) and L2 ($); two L3s (€); four DDR3 channels and four HyperTransport (HT) links.]

Intel: Core i7, 2010 (??)

[Eight-core Core i7: QuickPath Interconnect; 4x DDR3; a 24 MB L3 behind a crossbar (X-bar); 256 kB L2$ per core; eight CPUs, each with 64 kB D$, 64 kB I$ and 2 threads. 8 cores x 2 threads.]

Sun Niagara, 2005!!

[Niagara: four DDR2 memory controllers = 25 GB/s (!); a shared, banked L2 behind a 134 GB/s crossbar (Xbar); eight CPUs, each with write-through L1I and L1D and four interleaved threads.]

Now: Victoria Falls: 16 cores with 16 threads each

Niagara 2

[Niagara 2: the same layout: four DDR2 controllers = 25 GB/s; banked L2 behind a 134 GB/s crossbar; eight CPUs (8 threads each), now each with its own FP unit.]

Now: Niagara 3, 16 cores, …

Niagara Chip [die photo: Sun Microsystems]

TILERA Architecture

64 cores connected in a mesh
Local L1 + L2 caches
Shared distributed L3 cache
Linux + ANSI C
New libraries
New IDE
Stream computing
...

IBM Cell Processor

[IBM Cell: one general-purpose core with L1 (¢) and L2 ($) plus eight small cores (C) with local memories (m), sharing a memory interface (Mem I/F).]

So-called accelerators

Sits on the I/O bus (!!)

GP graphics processors, aka GPGPU? [e.g. NVIDIA, AMD/ATI]

Specialized accelerators? [e.g., FPGA/Mitrionics, ASIC/ClearSpeed]

Specialized languages for the above? [CUDA, Ct, RapidMind, OpenCL, …]

So-called accelerators

Sits on the I/O bus (!!)

My view: not very general purpose yet
• Fits well for a few VERY IMPORTANT app domains!
• Limited applicability?
• Programmer productivity?
• Application lifetime?
• Better architectures around the corner: Fermi by NVIDIA, Larrabee by Intel
• SW standards emerging, e.g. OpenCL

Intel Larrabee, 2010 (??)

32 simple in-order cores
4 threads/core
16-wide SIMD (single precision) / 8-wide (double)

Rumors on the net; not based on actual facts.

Multicore CPUs: (the same Wikipedia-sourced list as above)

Design Issues for Multicores

Erik Hagersten, Uppsala University, Sweden

CMP bottlenecks / points of optimization

Performance per memory byte? Performance per bandwidth? Performance per $? …

How large a fraction of a CMP system's cost is the CPU chip? Should execution (MIPS/FLOPS) be viewed as a scarce resource?

DRAM issues

"Rock will have more than 1000 memory chips per Rock chip" [M. Tremblay, Sun Fellow, at ICS 2006] Memory will dominate cost?

Fewer “open pages” accessed due to interleaving of several threads

Far memory: long latency & low per-pin BW

Pushing dense memory technology/packaging?


Bandwidth issues

#pins is a scarce resource! → every pin should run at maximum speed

External memory controllers?
Off-chip cache?
Is there room for multi-CMP? Is this maybe a case for multi-CMP?

Shared Bottlenecks

[Diagram: eight CPUs with private L1 caches around a shared L2 cache and shared off-chip bandwidth: the shared resources.]

Thread Interaction

[Diagram: eight CPUs with private L1 caches around a shared L2 cache.]

• Coherence traffic
• Communication
• Utilization
• Load imbalance
• Synchronization
• ...

Example: Thread Interaction (True Communication)

More cache sharing → more misses? [Diagram: a datum A bounces between L2 caches as CPUs behind different caches write A := 1, A := 2, A := 3. Communication?]

Multicore Challenges: Non-uniform Communication

[Diagram: eight CPUs with private L1s; L2 caches shared pair-wise under a shared L3. Communication cost depends on which caches the threads share: L2 sharing (here: pair-wise).]

Multicore Challenges (Multisocket): Non-uniform Communication

[Diagram: two sockets, each with eight CPUs, pair-wise shared L2s and a per-socket L3; L2 sharing (here: pair-wise), with an extra cost for crossing sockets.]

Multicore Challenges (Multisocket): Non-uniform Memory

[Diagram: the same two sockets, each with its own local memory: memory latency depends on which socket the data lives in.]

Example: Poor Throughput Scaling! Example: 470.lbm

[Chart: 470.lbm throughput stays near 1.0 as the number of cores used goes from 1 to 4.]

Throughput (as defined by SPEC): the amount of work performed per time unit when several instances of the application are executed simultaneously.
Our TP study: compare the TP improvement when you go from 1 core to 4 cores.
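The metric can be written down directly (my formulation of the SPEC-rate idea; the timing numbers below are invented for illustration):

```python
# SPEC-style throughput scaling: run n identical instances simultaneously;
# throughput = n / elapsed time, normalized to the single-instance case.
def throughput_scaling(t1, tn, n):
    """t1: time for 1 instance running alone; tn: elapsed time for n instances."""
    return (n / tn) / (1 / t1)

# 470.lbm-like behavior: 4 instances fight over memory bandwidth, so the
# elapsed time roughly quadruples and scaling stays near 1:
print(throughput_scaling(t1=100, tn=390, n=4))   # ~1.03: poor scaling

# A cache-friendly app: 4 instances barely slow each other down:
print(throughput_scaling(t1=100, tn=110, n=4))   # ~3.6: good scaling
```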

Throughput Scaling, More Apps

SPEC CPU2006 FP throughput improvements on 4 cores

[Bar chart: per-benchmark throughput improvement going from 1 to 4 cores, ranging from about 1.0 to 4.5 with an average around 2.0; benchmarks include 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf and 482.sphinx3.]

Intel X5365 3 GHz, 1333 MHz FSB, 8 MB L2. (Based on data from the SPEC web.)

Nerd Curve: 470.lbm

[Chart: cache miss rate vs. cache size (excluding HW prefetch effects), with utilization, i.e., the fraction of fetched cache data actually used, on the right-hand scale, plus the possible miss rate if the utilization problem were fixed. Running four threads: 5.0% miss rate; running one thread: 3.5%. Four threads do less work per memory byte moved.]

Better Memory Usage! Example: 470.lbm modified to promote better cache utilization

[Chart: the modified 470.lbm's throughput scales from 1 toward 3 as the cores used go from 1 to 4, while the original code stays flat.]

Nerd Curve (again)

[Chart: miss rate vs. cache size, running four threads: the original application misses at 5.0%, the optimized application at 2.5%: twice the amount of work per memory byte moved.]
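One common way to double the work per byte moved, in the spirit of this fix (an illustrative sketch, not the actual lbm modification; the sizes are assumptions), is to store the hot field contiguously instead of inside wide records:

```python
# Array-of-structs vs. struct-of-arrays: cache utilization arithmetic.
LINE = 64          # bytes per cache line (assumed)
REC = 64           # bytes per record, say 8 fields of 8 bytes (assumed)
FIELD = 8          # the inner loop reads only one 8-byte field per record

def bytes_moved_aos(n):
    # Each record fills a line; the whole line is fetched for one useful field.
    return n * LINE            # utilization = FIELD / LINE = 12.5%

def bytes_moved_soa(n):
    # The needed field is stored contiguously: fetched lines are fully used.
    return n * FIELD           # utilization = 100%

n = 1_000_000
print(bytes_moved_aos(n) / bytes_moved_soa(n))   # 8.0x less memory traffic with SoA
```

For a bandwidth-bound code like 470.lbm, cutting the bytes moved translates almost directly into throughput scaling, which is what the modified version's curve shows.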

More transistors → More Threads

Warning:
# transistors grows exponentially → # threads can grow exponentially
Can memory BW keep up?


BW in the Future?

#Cores ~ #Transistors

[Chart: computation vs. bandwidth, #T * T_freq / #P * P_freq, growing from about 1 in 2007 to about 6 by 2015.]

Source: International Technology Roadmap for Semiconductors (ITRS)
See also: Karlsson et al., "Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution", IPDPS, March 2007.
See also the IDC presentation from ISC, June 2007.

Capacity or Capability Computing?

Capacity (≈ several sequential jobs) or capability (≈ one parallel job)?

Issues: memory requirement? Sharing in cache? Memory bandwidth requirement?

Memory: the major cost of a CMP system! How do we utilize it best? Once the working set is in memory, work like crazy!

Capability computing suits CMPs the best (in general).

Fat or narrow cores?

Fat: fewer cores, but… wide issue? Out-of-order (O-O-O)?

Narrow: more cores, but… narrow issue? In-order? Have you ever heard of Amdahl? SMT, run-ahead, execute-ahead… to cure the shortcomings?

Read: "Maximizing CMP Throughput with Mediocre Cores", Davis, Laudon and Olukotun, PACT 2006.
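The Amdahl point above, spelled out (the standard formula, not taken from the slides; the parallel fractions are invented):

```python
# Amdahl's law: with parallel fraction p and n cores,
# speedup = 1 / ((1 - p) + p / n).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Many narrow cores only pay off if p is high:
print(amdahl_speedup(p=0.9, n=32))   # ~7.8: 32 mediocre cores, capped by the serial 10%
print(amdahl_speedup(p=0.5, n=32))   # ~1.9: a single fat core could beat this
```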


Cores vs. caches

Depends on your target applications…

Niagara's answer: go for cores. In-order 5-stage pipeline; 8 cores with 4 SMT threads each = 32 threads; 3 MB shared L2 cache (96 kB/thread); SMT to hide memory latency; memory bandwidth: 25 GB/s. Will this approach scale with technology?

Others' answer: go for cache. 2-4 cores for now.

Cache Interference in a Shared Cache

Cache sharing strategies:
1. Fight it out!
2. Fair share: 50% of the cache each
3. Maximize throughput: who will benefit the most?
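Strategy 3 can be sketched as a tiny search over split points (my own sketch; the miss-rate curves are invented to mimic the shapes of applications like A and B on this slide):

```python
# "Maximize throughput" cache partitioning: give each way of the shared
# cache to whichever split minimizes the combined miss rate.
def best_partition(miss_a, miss_b, total):
    """miss_x[k]: miss rate of app x given k cache units. Returns (k_a, k_b)."""
    k_a = min(range(total + 1), key=lambda k: miss_a[k] + miss_b[total - k])
    return k_a, total - k_a

# App A (ammp/art-like): needs a large share before its miss rate drops.
miss_a = [0.30, 0.30, 0.29, 0.29, 0.28, 0.10, 0.05, 0.04, 0.03]
# App B (vpr_place/vortex-like): benefits from the very first cache units.
miss_b = [0.30, 0.10, 0.05, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03]

print(best_partition(miss_a, miss_b, 8))   # (6, 2): give B a little, A a lot
```

A fair 50/50 split (strategy 2) would land at (4, 4) and leave A stuck at a 0.28 miss rate, which is exactly why the throughput-maximizing split differs from the fair one.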

[Chart: single-threaded cache miss-rate profiles vs. total cache size. Application A (examples: ammp, art, …) only benefits from a large cache share; application B (examples: vpr_place, vortex, …) benefits from the first small share. Strategies 1-3 pick different split points.]

Read: "STATSHARE: A Statistical Model for Managing Cache Share via Decay", Pavlos Petoumenos et al., MoBS workshop at ISCA 2006.
"Predicting the inter-thread cache contention on a CMP", Chandra et al., HPCA 2005.

Hiding Memory Latency

Out-of-order execution (O-O-O)
HW prefetching
SMT
Run-ahead / execute-ahead


Trends (my guess!)

[Six trend sketches over time, from "now": Threads/Chip, Transistors/Thread, Memory/Chip, Cache/Thread, Bandwidth/Thread, Thread Communication Cost (temporal).]

Questions for the Future

What applications?
How to get parallelism and data locality?
Will funky languages see a renaissance?
Will automatic parallelization return?
Are we buying compute power, memory capacity, or memory bandwidth?
Will the CPU market diverge into desktop/capacity CPUs again?
How to debug?
…

A non-question: will it happen?