The End of Moore’s Law & Faster General Purpose Computing, and a Road Forward

John Hennessy, Stanford University, March 2019

The End of an Era

• 40 years of stunning progress in microprocessor design
• 1.4x annual performance improvement for 40+ years ≈ 10^6 x faster (throughput)! (check below)
• Three architectural innovations:
  • Width: 8 -> 16 -> 64 bit (~4x)
  • Instruction level parallelism: 4-10 cycles per instruction to 4+ instructions per cycle (~10-20x)
  • Multicore: one processor to 32 cores (~32x)

• Clock rate: 3 MHz to 4 GHz (through technology & architecture)
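A quick arithmetic check (mine, not from the slide) that 1.4x per year really does compound to roughly a million-fold:

```python
# 1.4x per year for 40 years (the throughput bullet above).
gain = 1.4 ** 40
print(f"{gain:,.0f}x")   # ~700,000x, i.e. on the order of 10^6
```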

• Made possible by IC technology:
  • Moore's Law: growth in transistor count
  • Dennard scaling: power/transistor shrinks as speed & density increase
    • Dynamic power ≈ frequency x C x V^2 (sketch below)
    • Energy expended per computation was decreasing
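A hedged sketch of the standard scaling argument behind these bullets; the scale factor k and the starting values are illustrative, not figures from the talk:

```python
# Dynamic power P = alpha * C * V^2 * f. Under classic Dennard scaling, C and V
# shrink by k each generation while f rises by k, so power per transistor falls
# by k^2 -- exactly offsetting the k^2 rise in transistor density, keeping power
# per unit area (and improving energy per computation) until V stopped scaling.
alpha, C, V, f = 0.1, 1e-9, 1.0, 3e9     # illustrative values, not measurements
k = 1.4                                  # one process generation (~0.7x linear shrink)

P        = alpha * C * V**2 * f
P_scaled = alpha * (C / k) * (V / k)**2 * (k * f)

print(f"power per transistor: {P_scaled / P:.3f}x  (= 1/k^2 = {1 / k**2:.3f})")
print(f"power per unit area:  {(P_scaled / P) * k**2:.3f}x  (density rises by k^2)")
```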

THREE CHANGES CONVERGE

• Technology
  • End of Dennard scaling: power becomes the key constraint
  • Slowdown in Moore's Law: transistors cost (even unused ones)

• Architectural
  • Limitations and inefficiencies in exploiting instruction level parallelism end the uniprocessor era
  • Amdahl's Law and its implications end the "easy" multicore era

• Application focus shifts
  • From the desktop to individual mobile devices, ultrascale cloud computing, and IoT: new constraints

UNIPROCESSOR PERFORMANCE (SINGLE CORE)

[Chart: uniprocessor performance over time. Performance = highest SPECint by year; from Hennessy & Patterson [2018].]

MOORE'S LAW IN DRAMS

THE TECHNOLOGY SHIFTS

MOORE'S LAW SLOWDOWN IN PROCESSORS

[Chart: transistors per processor chip vs. the Moore's Law trajectory; the gap has grown to roughly 10X. Cost per transistor is improving even more slowly, due to fab costs.]

TECHNOLOGY, POWER, AND DENNARD SCALING

[Chart, 2000-2020: technology node (nanometers, left axis) vs. relative power per nm^2 (right axis); as feature size shrinks, power density rises. Energy scaling for a fixed task is better, since transistors become more numerous and faster. Based on models in Esmaeilzadeh [2011].]

ENERGY EFFICIENCY IS THE NEW METRIC

Battery lifetime determines effectiveness!

The LCD is the biggest consumer; the CPU is close behind.

“Always on” assistants likely to increase CPU demand.

AND IN THE CLOUD

[Chart: effective operating cost of a datacenter = capital costs (with amortization) + operating costs. Capital: amortized servers, amortized networking equipment, other amortized infrastructure (shell and land, power and cooling infrastructure). Operating: power (electricity), staff, networking.]

END OF DENNARD SCALING IS A CRISIS

• Energy consumption has become more important to users
  • For mobile, IoT, and for large clouds

• Processors have reached their power limit
  • Thermal dissipation is maxed out (chips turn off to avoid overheating!)
  • Even with better packaging: heat and battery are limits

• Architectural advances must increase energy efficiency
  • Reduce power or improve performance for the same power

• But most architectural techniques have reached their limits in energy efficiency!
  • 1982-2005: Instruction level parallelism
    • Compiler and processor find parallelism
  • 2005-2017: Multicore
    • Programmer identifies parallelism
  • Caches: diminishing returns (small incremental improvements)

Instruction Level Parallelism Era: 1982-2005

• Instruction level parallelism achieves significant performance advantages

• Pipelining: 5 stages to 15+ stages to allow faster clock rates (energy neutralized by Dennard scaling)
• Multiple issue: <1 instruction/clock to 4+ instructions/clock
  • Significant increase in transistors to increase issue rate
• Why did it end?
  • Diminishing returns in efficiency

Getting More ILP

• Branches and memory aliasing are a major limit:
  • 4 instructions/clock x 15-deep pipeline -> need more than 60 instructions "in flight"
  • Speculation was introduced to allow this
• Speculation involves predicting program behavior
  • Predict branches & predict matching memory addresses
  • If the prediction is accurate, proceed
  • If the prediction is inaccurate, undo the work and restart
• How good must branch prediction be? Very good! (sketch below)
  • 15-deep pipeline, ~4 branches in flight: keeping ~94% of speculated work requires ≈98.7% per-branch accuracy
  • 60 instructions in flight, ~15 branches: keeping ~90% requires ≈99% per-branch accuracy
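A minimal sketch (mine) of the arithmetic behind the two bullets above; the slide's percentages round slightly differently:

```python
def per_branch_accuracy(overall_target: float, branches_in_flight: int) -> float:
    """Per-branch prediction accuracy needed so that all in-flight branches are
    predicted correctly with probability `overall_target` (assumes independence)."""
    return overall_target ** (1.0 / branches_in_flight)

if __name__ == "__main__":
    # ~4 branches in a 15-deep pipeline, 94% of speculated work kept:
    print(f"{per_branch_accuracy(0.94, 4):.1%}")   # ~98.5% (slide quotes ~98.7%)
    # ~15 branches among 60 in-flight instructions, 90% kept:
    print(f"{per_branch_accuracy(0.90, 15):.1%}")  # ~99.3% (slide quotes ~99%)
```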

WASTED WORK ON THE I7

[Bar chart: wasted work as a fraction of total work on the Intel Core i7, per SPEC2006 benchmark (GCC, MCF, LBM, BZIP2, SJENG, MILC, ASTAR, NAMD, DEALII, SOPLEX, GOBMK, HMMER, H264REF, POVRAY, SPHINX3, OMNETPP, PERLBENCH, XALANCBMK, LIBQUANTUM); y-axis 0-40%.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.

The Multicore Era, 2005-2017

• Make the programmer responsible for identifying parallelism via threads

• Exploit the threads on multiple cores
  • Increase cores if more transistors: easy scaling!
  • Energy ≈ transistor count ≈ active cores
  • So we need: performance ≈ active cores
  • But Amdahl's Law says this is highly unlikely

AMDAHL'S LAW LIMITS PERFORMANCE GAINS FROM PARALLEL PROCESSING

[Chart: speedup versus % "serial" processing time — speedup vs. processor count (up to 64+) for serial fractions of 1%, 2%, 4%, 6%, 8%, and 10%.]
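A minimal sketch of Amdahl's Law, the relationship plotted above (serial fraction s, n processors):

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup = 1 / (s + (1 - s) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

if __name__ == "__main__":
    for s in (0.01, 0.02, 0.04, 0.06, 0.08, 0.10):
        print(f"serial {s:4.0%}: speedup on 64 processors = "
              f"{amdahl_speedup(s, 64):5.1f}x")
    # Even a 1% serial fraction limits 64 processors to roughly 39x.
```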

SECOND CHALLENGE TO HIGHER PERFORMANCE FROM MULTICORE: END OF DENNARD SCALING

• End of Dennard scaling means multicore scaling ends
  • Full scaling would mean “dark silicon,” with cores OFF
• Example
  • Today: 14 nm process, largest Intel multicore
    • Intel E7-8890: 24 cores, 2.2 GHz, TDP = 165 W (power limited)
    • Turbo (one core): 3.4 GHz; all cores @ 3.4 GHz = 255 W
  • A 7 nm process could yield (estimates):
    • 64 cores, power unconstrained: 6 GHz & 365 W
    • 64 cores, power constrained: 4 GHz & 250 W

Power limit    Active cores
180 W          46/64
200 W          51/64
220 W          56/64

PUTTING THE CHALLENGES TOGETHER: DENNARD SCALING + AMDAHL'S LAW

[Chart: speedup versus % "serial" processing time — speedup vs. processors (cores) for serial fractions of 1%, 2%, 4%, and 8%, combining the Amdahl's Law limit with the end of Dennard scaling.]
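A hedged illustration (mine, not the model behind the chart) of how the two limits compound: Amdahl's Law applied to only the cores a power cap lets you keep on, using the table on the previous slide and an assumed 2% serial fraction:

```python
if __name__ == "__main__":
    active_cores = {180: 46, 200: 51, 220: 56}   # watts -> active cores out of 64
    s = 0.02                                     # assumed serial fraction (illustrative)
    for watts, cores in active_cores.items():
        speedup = 1.0 / (s + (1.0 - s) / cores)  # Amdahl's Law over active cores only
        print(f"{watts} W cap, {cores}/64 cores active: ~{speedup:.1f}x "
              f"(ideal 64-core Amdahl limit: {1.0 / (s + (1.0 - s) / 64):.1f}x)")
```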

SIDEBAR: INSTRUCTION SET EFFICIENCY

• RISC ideas were about improving efficiency:
  • 1980s: efficiency in use of transistors
  • Less significant in CPUs in the 1990s: processors dominated by other things & Moore/Dennard in full operation
• RISC comeback, driven by the mobile world in the 2000s:
  • Energy efficiency crucial: small batteries, all-day operation
  • Si efficiency important for cost!
• 2020 and on: add design efficiency
  • With a growing spectrum of designs targeted to specific applications, efficiency of design/verification becomes increasingly important

What OPPORTUNITIES Are Left?

▪ SW-centric
  - Modern scripting languages are interpreted, dynamically typed, and encourage reuse
  - Efficient for programmers but not for execution
▪ HW-centric
  - Only path left is Domain Specific Architectures
  - Just do a few tasks, but extremely well
▪ Combination
  - Domain Specific Languages & Architectures

WHAT'S THE OPPORTUNITY?

Matrix Multiply: relative speedup to a Python version (18-core Intel)

[Chart: successive optimized versions give step gains of roughly 9X, 20X, 7X, and 50X, compounding to about 63,000X over the naive Python version.]

From: “There’s Plenty of Room at the Top,” Leiserson et al., to appear.
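A small, self-contained illustration (mine, not the benchmark from Leiserson et al.) of the kind of gap the chart describes: the same matrix multiply as interpreted Python loops vs. a call into an optimized BLAS kernel via NumPy:

```python
import time
import numpy as np

def matmul_naive(a, b):
    """Textbook triple loop in pure Python."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c

if __name__ == "__main__":
    n = 256
    a, b = np.random.rand(n, n), np.random.rand(n, n)

    t0 = time.perf_counter()
    matmul_naive(a.tolist(), b.tolist())
    t_naive = time.perf_counter() - t0

    t0 = time.perf_counter()
    a @ b                        # optimized, vectorized, multithreaded BLAS
    t_blas = time.perf_counter() - t0

    print(f"naive {t_naive:.2f}s vs. BLAS {t_blas:.5f}s: "
          f"~{t_naive / t_blas:,.0f}x")          # typically thousands of x
```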

DOMAIN SPECIFIC ARCHITECTURES (DSAs)

• Achieve higher efficiency by tailoring the architecture to the characteristics of the domain
  • Not one application, but a domain of applications
    • Different from a strict ASIC
  • Requires more domain-specific knowledge than GP processors need
• Design DSAs and processors for targeted environments
  • More variability than in GP processors
• Examples:
  • Neural network processors for machine learning
  • GPUs for graphics, virtual reality
• Good news: demand for performance is focused on such domains

WHERE DOES THE ENERGY GO IN GP CPUs? CAN DSAs DO BETTER?

Energy per operation (from Horowitz [2016]):

Function                                  Energy in picojoules
8-bit add                                 0.03
32-bit add                                0.1
FP multiply, 16-bit                       1.1
FP multiply, 32-bit                       3.7
Register file access*                     6
Control (per instruction, superscalar)    20-40
L1 cache access                           10
L2 cache access                           20
L3 cache access                           100
Off-chip DRAM access                      1,300-2,600

* Increasing the size or number of ports increases energy roughly proportionally.

INSTRUCTION ENERGY BREAKDOWN

[Pie charts: energy breakdown for two instructions.
 Load register (from L1 cache): control 53%, L1 I-cache access 18%, L1 D-cache access 18%, register file access 11%.
 FP multiply (32-bit) from registers: control 60%, L1 I-cache access 20%, register file access 12%, 32-bit FP multiply 8%.]
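A hedged sketch showing how the percentages above follow from the Horowitz table; the per-event costs come from the table, while the grouping of events into "one load" and "one FP multiply" is my own illustrative accounting:

```python
ENERGY_PJ = {            # from the table above (control uses the 20-40 pJ midpoint)
    "control": 30.0,
    "l1_access": 10.0,   # any L1 (I- or D-) cache access
    "regfile": 6.0,
    "fp_mul_32": 3.7,
}

def breakdown(title: str, events: dict) -> None:
    total = sum(events.values())
    print(title)
    for name, pj in sorted(events.items(), key=lambda kv: -kv[1]):
        print(f"  {name:12s} {pj:5.1f} pJ  {pj / total:5.1%}")

if __name__ == "__main__":
    e = ENERGY_PJ
    breakdown("Load register (from L1 cache):",
              {"control": e["control"], "I-cache": e["l1_access"],
               "D-cache": e["l1_access"], "regfile": e["regfile"]})
    breakdown("FP multiply (32-bit) from registers:",
              {"control": e["control"], "I-cache": e["l1_access"],
               "regfile": e["regfile"], "fp multiply": e["fp_mul_32"]})
```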

WHY DSAs CAN WIN (NO MAGIC): TAILOR THE ARCHITECTURE TO THE DOMAIN

• Simpler parallelism for a specific domain (less control HW):
  • SIMD vs. MIMD
  • VLIW vs. speculative, out-of-order
• More effective use of memory bandwidth (on/off chip)
  • User-controlled memories versus caches
  • Processor + memory structures versus the traditional organization
  • Program prefetching to off-chip memory when needed
• Eliminate unneeded accuracy (sketch below)
  • IEEE FP replaced by lower-precision FP
  • 32-bit/64-bit integers reduced to 8-16 bits
• Domain-specific programming model matches the application to the processor architecture
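A minimal sketch of "eliminate unneeded accuracy": symmetric linear quantization of 32-bit floats to 8-bit integers, the kind of reduced precision inference DSAs exploit. The scheme and names are illustrative, not any particular chip's format:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 with a single per-tensor scale (illustrative)."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

if __name__ == "__main__":
    w = np.random.randn(256, 256).astype(np.float32)
    x = np.random.randn(256).astype(np.float32)
    qw, sw = quantize_int8(w)
    qx, sx = quantize_int8(x)
    # 8-bit multiplies, accumulated in wider integers, rescaled at the end.
    y_int8 = (qw.astype(np.int32) @ qx.astype(np.int32)) * (sw * sx)
    y_fp32 = w @ x
    rel_err = np.max(np.abs(y_int8 - y_fp32)) / np.max(np.abs(y_fp32))
    print(f"max relative error from 8-bit arithmetic: {rel_err:.2%}")  # small, ~1%
```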

DOMAIN SPECIFIC LANGUAGES

DSAs require targeting high-level operations to the architecture
● Hard to start with a C- or Python-like language and recover the structure (sketch below)
● Need matrix, vector, or sparse matrix operations
● Domain Specific Languages specify these operations:
  ○ OpenGL, TensorFlow, P4
● If DSL programs retain architecture independence, interesting compiler challenges exist:
  ○ XLA

“XLA - TensorFlow, Compiled”, XLA Team, March 6, 2017
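A small illustration (mine) of the point above: the same computation written as explicit Python loops hides its structure, while the whole-matrix form states one operation that a DSL/compiler stack such as TensorFlow + XLA can map directly onto a matrix unit:

```python
import numpy as np

def layer_loops(w, x, b):
    """Structure buried in scalar loops: hard for a compiler to recover."""
    y = [0.0] * len(w)
    for i in range(len(w)):
        for j in range(len(x)):
            y[i] += w[i][j] * x[j]
        y[i] = max(0.0, y[i] + b[i])
    return y

def layer_ops(w, x, b):
    """Structure explicit: one matrix-vector product plus elementwise ops."""
    return np.maximum(w @ x + b, 0.0)

if __name__ == "__main__":
    w, x, b = np.random.randn(4, 3), np.random.randn(3), np.random.randn(4)
    assert np.allclose(layer_loops(w.tolist(), x.tolist(), b.tolist()),
                       layer_ops(w, x, b))
```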

Deep learning is causing a machine learning revolution

From “A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution,” Dean, J., Patterson, D., & Young, C. (2018). IEEE Micro, 38(2), 21-29.

TPU 1: High-Level Chip Architecture for DNN Inference

▪ Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
▪ 700 MHz clock rate
▪ Peak: 92T operations/second
  ▪ 65,536 * 2 * 700M (check below)
▪ >25X as many MACs vs. GPU
▪ >100X as many MACs vs. CPU
▪ 4 MiB of Accumulator memory
▪ 24 MiB of on-chip Unified Buffer (activation memory)
▪ 3.5X as much on-chip memory vs. GPU
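A quick check (my arithmetic) of the peak-throughput bullet: each of the 65,536 MACs contributes a multiply and an add per cycle at 700 MHz:

```python
macs = 256 * 256            # 65,536 8-bit multiply-accumulate units
ops_per_mac = 2             # one multiply + one add
clock_hz = 700e6
print(f"{macs * ops_per_mac * clock_hz / 1e12:.1f}T ops/s")  # ~91.8T, quoted as 92T
```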

HOW IS SILICON USED: TPU-1 VS. CPU?

TPU-1 (excluding pads):
• Memory: 44%
• Compute: 39%
• Interface: 15%
• Control: 2%

CPU (Skylake core):
• Cache: 33%
• Control: 30%
• Compute: 21%
• Memory management: 12%
• Misc: 4%

[Animated figure, slides 28-35: the Matrix Unit (MXU) as a systolic array (Kung & Leiserson), computing Y = WX for a 3x3 weight matrix W and a batch of three input vectors X. Weights stay resident in the array; inputs flow in from one side, partial sums accumulate as they pass from cell to cell, and the outputs Y emerge on the far side. Slides courtesy of Cliff Young @ Google.]

Advantages of the systolic Matrix Unit in the TPU:
• For the 64K Matrix Unit, each operand is used up to 256 times once loaded
• Nearest-neighbor communication replaces register-file access
• Eliminates many reads/writes and reduces long wire delays
• The energy saved by eliminating register accesses exceeds the energy of the Matrix Unit itself!
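A small software model (my illustration, not TPU code) of the weight-stationary idea in the figure: weights are loaded once, inputs stream through, and partial sums accumulate cell to cell, so each weight is reused for every input vector without a register-file read per multiply:

```python
import numpy as np

def systolic_matmul(w: np.ndarray, x_batch: np.ndarray) -> np.ndarray:
    """Y = W @ X, accumulated the way partial sums flow through a
    weight-stationary systolic array (functional model, not cycle-accurate)."""
    n, k = w.shape
    y = np.zeros((n, x_batch.shape[1]))
    for col in range(x_batch.shape[1]):           # input vectors stream in
        partial = np.zeros(n)
        for j in range(k):                        # one wavefront per column of W
            partial += w[:, j] * x_batch[j, col]  # weights stay put; sums move on
        y[:, col] = partial
    return y

if __name__ == "__main__":
    w = np.arange(9, dtype=float).reshape(3, 3)   # 3x3 weights, loaded once
    x = np.arange(9, dtype=float).reshape(3, 3)   # batch of three input vectors
    assert np.allclose(systolic_matmul(w, x), w @ x)
    # Each weight is reused once per input vector; on the TPU's 256x256 array
    # that reuse reaches up to 256 uses per loaded operand.
```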

Performance/Watt on Inference: TPU-1 vs. CPU & GPU

Important caveat:
• TPU-1 uses 8-bit integers
• GPU uses FP

Log Rooflines for CPU, GPU, TPU

[Chart: log-log roofline plots (performance vs. arithmetic intensity) for the CPU, GPU, and TPU.]
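A minimal sketch of the roofline model behind the plot: attainable performance is the lesser of peak compute and memory bandwidth times arithmetic intensity. The peak and bandwidth numbers below are placeholders for illustration, not the measured values from the talk:

```python
def roofline(peak_ops_per_s: float, mem_bw_bytes_per_s: float,
             arithmetic_intensity: float) -> float:
    """Attainable ops/s = min(peak, bandwidth * ops-per-byte)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * arithmetic_intensity)

if __name__ == "__main__":
    peak, bw = 92e12, 34e9          # illustrative: ~92 Tops/s peak, ~34 GB/s DRAM
    for ai in (1, 10, 100, 1000, 3000):
        print(f"AI {ai:5d} ops/byte -> {roofline(peak, bw, ai) / 1e12:7.2f} Tops/s")
```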

Training: A Much More Intensive Problem

[Chart: FP operations/second required to complete training in 3.5 months.]

Rapid Innovation

TPU v1 (deployed 2015): 92 teraops; inference only.

Cloud TPU (v2; Cloud GA 2017, Pod Alpha 2018): 180 teraflops, 64 GB HBM; training and inference; generally available (GA).

Cloud TPU (v3; Cloud Beta 2018): 420 teraflops, 128 GB HBM; training and inference; Beta.

Enabled by simpler design, compatibility at DSL level, ease of verification.

Enabling Massive Computing Cycles for Training

Cloud TPU Pod (v2, 2017): 11.5 petaflops (64 x TPU v2), 4 TB HBM, 2-D toroidal mesh network; training and inference; Alpha.

TPU v3 Pod (2018): >100 petaflops (256 x TPU v3), 32 TB HBM, liquid cooled; new chip architecture + larger-scale system.
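A quick sanity check (my arithmetic) that the pod numbers are just the per-chip figures scaled up:

```python
print(f"v2 pod: {64 * 180 / 1000:.1f} petaflops")    # 64 x 180 TF  -> 11.5 PF
print(f"v3 pod: {256 * 420 / 1000:.1f} petaflops")   # 256 x 420 TF -> ~107.5 PF (">100")
```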

CHALLENGES AND OPPORTUNITIES

• Design of DSAs and DSLs
  • Optimizing the mapping to a DSA for portability & performance
  • DSAs & DSLs for new fields
  • Open problem: dealing with sparse data
• Make HW development more like software:
  • Prototyping, reuse, abstraction
  • Open HW stacks (ISA to IP libraries)
  • Role of ML in CAD?
• Technology:
  • Silicon: extend Dennard scaling and Moore's Law
  • Packaging: use optics, enhance cooling
  • Beyond Si: carbon nanotubes, quantum?

CONCLUDING THOUGHTS: EVERYTHING OLD IS NEW AGAIN

• Dave Kuck, software architect for the Illiac IV (circa 1975):
  “What I was really frustrated about was the fact, with Illiac IV, programming the machine was very difficult and the architecture probably was not very well suited to some of the applications we were trying to run. The key idea was that I did not think we had a very good match in Illiac IV between applications and architecture.”
• Achieving cost-performance in this era of DSAs will require matching the applications, languages, and architecture, and reducing design cost.

Source: Dave Kuck, ACM Oral History.

WHY DRAMS ARE HARD AND CRITICAL!

INTEL CORE I7: Theoretical CPI = 0.25; Achieved CPI

[Bar chart: achieved CPI on the Intel Core i7, per SPEC2006 benchmark (GCC, MCF, MILC, LBM, BZIP2, SJENG, ASTAR, NAMD, DEALII, SOPLEX, GOBMK, HMMER, H264REF, POVRAY, SPHINX3, OMNETPP, PERLBENCH, XALANCBMK, LIBQUANTUM); y-axis 0-3.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.