The IEEE Rebooting Initiative: Lessons Learned and The Road Ahead

Tom Conte
Co-Chair, IEEE Rebooting Computing Initiative
Vice Chair, International Roadmap for Devices and Systems
Schools of CS & ECE, Georgia Institute of Technology
[email protected]

A history of modern computing: How we got here

1945: Von Neumann’s EDVAC (draft) report
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360); Moore’s Law, first version
1975: Moore’s Law update; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC hitches a ride on Moore’s law
1995: Slowdown in CMOS wires: superscalar era begins

In 1995, wire delays impact pipelining: Superscalar begins

[Figure: processor performance compared with Moore’s law. Source: Sanjay Patel, UIUC (used with permission)]

We hid parallelism extraction with microarchitectures

[Figure: superscalar microarchitecture. Labels: branch predictor, instruction cache, instruction fetch, decode & dispatch, register file, schedule (reorder instructions), issue N independent instructions, execute in parallel (ALU, ALU, ... ALU), data cache]

Very few of these “tricks” are energy efficient
Hidden by Dennard scaling, until that ended

How we got here, part 2

1945: Von Neumann’s EDVAC (draft) report
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360)
1975: Moore’s Law; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC hitches a ride on Moore’s law
1995: Slowdown in CMOS wires: superscalar era begins
2005: The Power Wall: Single thread exponential scaling ends (Intel Prescott)
…

Multicore era begins

Dilemma: Could not clock a single core aggressively, yet continued to get more transistors/chip

Solution: Clock multiple cores conservatively

IEEE Rebooting Computing

Goal: Rethink Everything: Turing & Von Neumann to now
Why IEEE? Encompasses the whole computing stack

Circuits & Systems Society

Council on Electronic Design Automation

Co-Chairs: Elie Track, Tom Conte, Erik DeBenedictis, Dejan Milojicic, Bruce Kraemer

Summit 1: 2013 Dec. 12-13 (summary online)
– Invitation only
– Three Pillars:
  . Energy Efficiency
  . Security
  . Applications/HCI
[Figure: Rebooting Computing diagram. Labels: Security, Energy Efficiency, Applications/HCI]

Summit 2: 2014 May 14-16
– Engines of Computation
  . Adiabatic/Reversible Computing
  . Approximate Computing
  . Neuromorphic Computing
  . Augmentation of CMOS
[Figure: Rebooting Computing diagram. Labels: Security, Energy Efficiency, Applications/HCI, Engine Room]

Summit 3: 2014 Oct. 23-24
– Algorithms and Architectures
  . ITRS joins forces with RCI
[Figure: Rebooting Computing diagram. Labels: Algorithms & Architectures, Security, Energy Efficiency, Applications/HCI, Engine Room]

Summit 4: 2015 Dec. 10-11
Goal: coordinating efforts between:
– Industry (HP, Intel, NVIDIA)
– US: DOE, DARPA, IARPA, NSF
Goal 2: How to roadmap the future
[Figure: Rebooting Computing diagram. Labels: Algorithms & Architectures, Security, Energy Efficiency, Applications/HCI, Engine Room]

Moving forward…

1/22/2018

RCI: “Software drives the computer industry”

Questions for computer industry:
– How valuable is legacy software?
– What computing resources do the emerging applications need?
– How long and how much investment will it take to train a new generation of programmers?
Degrees of Pain vs. Gain…

Potential Approaches vs. Disruption in Computing Stack

[Figure: layers of the computing stack (Algorithm, Language, API, Architecture, ISA, Microarchitecture, FU, logic, device) plotted against four approaches. Level 1: “More Moore”; Level 2: Hidden changes; Level 3: Architectural changes; Level 4: Non von Neumann computing. Legend: No Disruption to Total Disruption]

Level 1: More Moore

Software impact: Legacy code works without issue
New switch candidates:
– Logic examples: Tunneling FET, CNFET, superconducting electronics
– Memory examples: MRAM, memristor, PCM, …

More Moore: A better switch?

[Figure courtesy of Dimitri Nikonov and Ian Young]

CMOS device structure evolution – IRDS 2017 MM chapter

[Figure: CMOS device structures by node and period (N7: 2017-2019; N7: 2019-2021; N5: 2021-2024; >N3: >2024). Structures shown: FinFET, FDSOI, Lateral Gate-All-Around (LGAA) as L-Nanowire and L-Nanosheet, Vertical GAA (VGAA), Sequential 3D]

– FinFET: still the leading device option until 2021
– Lateral Gate-All-Around (LGAA) is expected to be introduced in 2021
– Beyond 2024: 3D stacking needed for functional scaling

Level 1: More Moore

Predictions:
– Industry will go to monolithic 3D
– Moore’s law* won’t end for a while
(*if correctly defined)

Level 2: Not CMOS, but hidden

Software impact: Legacy code works, but may require performance tuning
Lessons learned from superscalar in 1995
Next: Microarchitectural changes to
– Use unreliable switch logic, and/or
– Use cryogenic superconducting
– Reversible computing

CPU Trends

• Power $\propto V^2 f$
• Therefore, reduce supply voltage.
• But…

Source: ITRS / Asif Khan, PhD thesis, University of California, Berkeley, 2015

Vdd hasn’t reduced much below 1 V because devices become unreliable
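The relationship between power, supply voltage, and frequency can be illustrated with a short sketch. It assumes the standard dynamic power model P = C V^2 f; the capacitance value and the operating points are hypothetical numbers chosen for illustration, not figures from the talk.

```python
def dynamic_power(voltage, frequency, capacitance=1.0):
    """Dynamic switching power, P = C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical operating points, chosen only for illustration.
baseline = dynamic_power(voltage=1.0, frequency=1.0)
scaled = dynamic_power(voltage=0.8, frequency=1.0)

# Lowering Vdd by 20% at the same clock cuts power to 0.8^2 = 64%.
relative_power = scaled / baseline
```

The quadratic term is why lowering Vdd is so attractive, and why a floor near 1 V limits further savings from this route.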

Computational Error Correction

• Traditional coding fixes errors in data stored or transmitted, not in computation
• Redundancy can be in space and/or time. Tradeoffs.
• What if there are errors in the control path?
  • Bypass logic, instruction decode
Proof-of-concept RRNS Core
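As a rough illustration of how a redundant residue number system (RRNS) can catch errors in computation rather than only in storage, here is a minimal sketch. The moduli, the helper names (`encode`, `add`, `decode`, `is_consistent`), and the injected fault are all assumptions made for this example; they do not describe the proof-of-concept core mentioned on the slide.

```python
from math import prod

INFO_MODULI = (3, 5, 7)          # legitimate values: 0 .. 104
REDUNDANT_MODULI = (11, 13)      # extra residues, used only for checking
MODULI = INFO_MODULI + REDUNDANT_MODULI
LEGIT_RANGE = prod(INFO_MODULI)  # 105


def encode(x):
    """Represent x by its residue modulo every modulus."""
    return tuple(x % m for m in MODULI)


def add(a, b):
    """Add two encoded values; each residue lane works independently."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, MODULI))


def decode(residues):
    """Chinese Remainder Theorem reconstruction over all moduli."""
    total = prod(MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        partial = total // m
        x += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return x % total


def is_consistent(residues):
    """A result outside the legitimate range signals a computation error."""
    return decode(residues) < LEGIT_RANGE


result = add(encode(17), encode(25))   # 17 + 25 stays below 105

# Simulate a fault in one residue lane (the modulo-5 lane).
faulty = list(result)
faulty[1] = (faulty[1] + 1) % MODULI[1]
faulty = tuple(faulty)
```

Because the arithmetic itself runs on the residues, a fault inside the adder shows up as an out-of-range decoded value; with these moduli any single corrupted lane is detected.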

Superconducting: smaller, lower power, same performance

[Figure: same-scale comparison of the two systems; scale marker 2' x 2']

            | Titan at ORNL (#2 of Top500)                           | Superconducting Supercomputer        | Ratio
Performance | 17.6 PFLOP/s (#2 in world*)                            | 20 PFLOP/s                           | ~1x
Memory      | 710 TB (0.04 B/FLOPS)                                  | 5 PB (0.25 B/FLOPS)                  | 7x
Power       | 8,200 kW avg. (not included: cooling, storage memory)  | 80 kW total power (includes cooling) | 0.01x
Space       | 4,350 ft2 (404 m2, not including cooling)              | ~200 ft2 (includes cooling)          | 0.05x
Cooling     | additional power, space and infrastructure required    | All cooling shown                    |
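The ratios in the comparison above can be re-derived from the quoted figures. The sketch below is only an arithmetic check; it assumes decimal units (5 PB = 5000 TB) and uses hypothetical dictionary keys.

```python
# Figures as quoted in the comparison (Titan vs. superconducting design).
titan = {"pflops": 17.6, "memory_tb": 710, "power_kw": 8200, "space_ft2": 4350}
superconducting = {"pflops": 20, "memory_tb": 5000, "power_kw": 80, "space_ft2": 200}

# Superconducting value divided by Titan value, per row.
ratios = {key: superconducting[key] / titan[key] for key in titan}


def bytes_per_flops(system):
    """Memory capacity in bytes divided by peak rate in FLOP/s."""
    return system["memory_tb"] * 1e12 / (system["pflops"] * 1e15)
```

Rounded, the ratios match the slide: about 7x memory, 0.01x power, and 0.05x floor space at roughly equal performance.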

Courtesy of M. Manheimer, IARPA Cryogenic Computing Complexity (C3) Program

Level 2: Not CMOS, but hidden

Potential to make exascale orders of magnitude lower power
Key is co-design of devices and architectures

Level 3: Architectural changes

Software impact: new programming required
GPU already an example of this
– Inexpensive parallelism available, but need to reprogram to use it
Use special purpose accelerators for
– Critical kernels,
– Digital neuromorphic, etc.
Approximate computing
And/or use memory-centric (e.g., Emu, The Machine) to move the computation to the data

Accelerators (and reconfigurable)

Idea has been around for a long time
– IBM 7030 Project STRETCH attached stream processor (Harvest) in 1961
– Various FP accelerators in the 70s/80s (FP-164)
Speedup via “gate-level parallelism”
– Hardware duplication to support computation
Energy savings via elimination of instruction fetch & decode
Programming options: Compiler extraction, APIs, DSLs

Performance Trends in

[Figure: performance trend plot. Trendline: 1.9x per year. From: IRDS Applications Benchmarking chapter]

Approximate computing

Building acceptable systems out of unreliable/inaccurate hardware and software components
[Figure: tradeoff between efficiency and performance on one side and output accuracy on the other]

Many uses:
– Most start and/or end with human perception (images, video, control, etc.) or near-optimal search

Approximate computing challenges

Algorithms & programming languages
– Work continues here
Ensuring quality of output
– Step function: great…good…good-ish…ok…unacceptable
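To make the efficiency-versus-accuracy tradeoff concrete, here is a minimal sketch of loop perforation, one well-known approximate computing technique. It is not named in the slides and is used here only as an illustration; the signal, the stride, and the function names are hypothetical.

```python
def exact_mean(values):
    """Reference result: visit every element."""
    return sum(values) / len(values)


def perforated_mean(values, stride):
    """Approximate the mean by visiting only every `stride`-th element."""
    sampled = values[::stride]
    return sum(sampled) / len(sampled)


def relative_error(approx, exact):
    """Output quality metric: relative deviation from the exact result."""
    return abs(approx - exact) / abs(exact)


# Hypothetical smooth "sensor" signal: a slow ramp from 100 up to 199.9.
signal = [100 + 0.1 * i for i in range(1000)]

exact = exact_mean(signal)
approx = perforated_mean(signal, stride=4)   # about 4x fewer additions
error = relative_error(approx, exact)
```

On smooth inputs like this one the error stays small, but nothing guarantees that for arbitrary data, which is the quality-of-output problem the slide raises.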

Level 3: Architectural changes

Architectures can be built now; software and programmers are the challenge

Level 4: Non-von Neumann

1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.), analog computing

Native Neuromorphic

Direct analog (memristor, etc.) neuromorphic has orders of magnitude better energy efficiency than digital approaches
Virtuous cycle of neuroscience informing neuromorphic, and neuromorphic serving as a modeling platform to advance neuroscience

[Figure: cycle between neuromorphic computing and neuroscience research]

Neuromorphic challenges

Guarantees
– Quality of results, quality of service, reliability, etc.
Security concerns:
– For example: a neuromorphic network used for authentication, with an intruder’s identity trained into the network
  . Virtually impossible to detect tampering
Learning
– Supervised learning (today) has two phases: training and inferencing (use)
  . Training is highly computationally expensive
– Unsupervised learning is maturing

Quantum

Two varieties: gate-based and quantum annealing
Quantum annealing (e.g., D-Wave)
– Convergence time is a function of noise floor
– Classical annealing may be more power efficient
Gate-level quantum
– Many proposed qubit devices: quantum dots, Transmon, Ion trap, etc.
– Current coherence times: 10-100 µs
  . Need to be several orders of magnitude longer
– Solution: Redundancy: 1 virtual qubit = 1000 physical qubits
– Power needs per virtual qubit ~ 10 kW
  . Most of the power is for waveform generators, interfacing
  . Cooling is a small percentage of the power
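The redundancy and power figures for gate-level quantum computing scale linearly with the number of virtual qubits, which a short calculation makes tangible. The two constants come from the slide; the machine size of 100 virtual qubits and the function name are hypothetical.

```python
PHYSICAL_PER_VIRTUAL = 1000   # from the slide: 1 virtual qubit = 1000 physical qubits
KW_PER_VIRTUAL = 10           # from the slide: ~10 kW per virtual qubit


def quantum_overhead(virtual_qubits):
    """Return (physical qubit count, power in MW) for a given machine size."""
    physical = virtual_qubits * PHYSICAL_PER_VIRTUAL
    power_mw = virtual_qubits * KW_PER_VIRTUAL / 1000
    return physical, power_mw


# Hypothetical machine size chosen only for illustration.
physical, power_mw = quantum_overhead(virtual_qubits=100)
```

Under these assumptions a 100-virtual-qubit machine needs 100,000 physical qubits and about 1 MW, mostly for control electronics rather than cooling.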

Is there a third way?

Non-neuromorphic, non-quantum, non-von-Neumann computing?
Potentials:
– Massive memorization (e.g., HPE The Machine)
– Analog(-ous) computing / Thermodynamic computing

Courtesy: Todd Hylton, UCSD

Level 4: Non-von Neumann

System software nonexistent
Very immature, risky technology
Large investments needed

IEEE Rebooting Computing: Summary

Levels of RC:
1. More Moore (New switch/3D)
2. Microarchitecture changes
3. Architecture changes
4. Non-von Neumann
Direct pain / gain tradeoffs
New software R&D desperately needed
IRDS: Applications-driven roadmapping is identifying needed devices

rebootingcomputing.ieee.org
irds.ieee.org
icrc.ieee.org