The IEEE Rebooting Computing Initiative: Lessons Learned and The Road Ahead
Tom Conte
Co-Chair, IEEE Rebooting Computing Initiative
Vice Chair, International Roadmap for Devices and Systems
Schools of CS & ECE, Georgia Institute of Technology
[email protected]

A history of modern computing: How we got here
1945: Von Neumann’s EDVAC (draft) report
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360), Moore #1
1975: Moore’s Law update; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC hitches a ride on Moore’s Law
1995: Slowdown in CMOS wires: superscalar era begins
In 1995, wire delays impact pipelining: the superscalar era begins
[Figure: processor performance vs. Moore’s law over time]
Source: Sanjay Patel, UIUC (used with permission)

We hid parallelism extraction with Superscalar Processor Microarchitectures
[Figure: superscalar pipeline — branch predictor, instruction fetch, instruction cache, decode & dispatch, register file, scheduler issuing N independent instructions, parallel ALUs with data cache, and instruction reorder]
…Very few of these “tricks” are energy efficient
Hidden by Dennard scaling, until that ended

How we got here, part 2
1945: Von Neumann’s EDVAC (draft) report
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360)
1975: Moore’s Law; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC hitches a ride on Moore’s Law
1995: Slowdown in CMOS wires: superscalar era begins
2005: The Power Wall: single-thread exponential scaling ends (Intel Prescott)
Multicore era begins
Dilemma: a single core could no longer be clocked aggressively, yet transistor counts per chip kept growing
Solution: Clock multiple cores conservatively
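The multicore tradeoff can be sketched with the standard dynamic-power model. This is a minimal illustration, assuming power scales as V²f and that supply voltage must track frequency (so single-core power grows roughly as f³); the functions and numbers are illustrative, not from the slides.

```python
# Sketch of the multicore tradeoff, assuming dynamic power P ~ C * V^2 * f
# and that supply voltage scales roughly with clock frequency (V ~ f),
# so single-core power grows roughly as f^3. Numbers are illustrative only.

def relative_power(cores: int, freq: float) -> float:
    """Relative dynamic power for `cores` cores at relative frequency `freq`."""
    return cores * freq**3  # P ~ V^2 * f with V ~ f

def relative_throughput(cores: int, freq: float) -> float:
    """Idealized throughput (perfectly parallel workload)."""
    return cores * freq

# One core clocked 2x as fast vs. two cores at the original clock:
fast_single = (relative_throughput(1, 2.0), relative_power(1, 2.0))
dual_core   = (relative_throughput(2, 1.0), relative_power(2, 1.0))

print(fast_single)  # (2.0, 8.0) -> 2x throughput for 8x the power
print(dual_core)    # (2.0, 2.0) -> same throughput for 2x the power
```

This is why clocking multiple cores conservatively wins once Dennard scaling stops lowering V with each generation.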
IEEE Rebooting Computing
Goal: Rethink everything: Turing & von Neumann to now
Why IEEE? Encompasses the whole computing stack
(e.g., Circuits & Systems Society, Council on Electronic Design Automation)
Co-Chairs: Elie Track, Tom Conte, Erik DeBenedictis, Dejan Milojicic, Bruce Kraemer

IEEE Rebooting Computing
Summit 1: 2013 Dec. 12-13 (summary online), invitation only
– Three Pillars: Energy Efficiency, Security, Applications/HCI
IEEE Rebooting Computing
Summit 2: 2014 May 14-16, Engines of Computation
– Adiabatic/Reversible Computing
– Approximate Computing
– Neuromorphic Computing
– Augmentation of CMOS
IEEE Rebooting Computing
Summit 3: 2014 Oct. 23-24, Algorithms and Architectures
– ITRS joins forces with RCI
IEEE Rebooting Computing
Summit 4: 2015 Dec. 10-11
Goal: coordinating efforts between:
– Industry (HP, Intel, NVIDIA)
– US agencies: DOE, DARPA, IARPA, NSF
Goal 2: how to roadmap the future
Moving forward…
1/22/2018

RCI: “Software drives the computer industry”
Questions for the computer industry:
– How valuable is legacy software?
– What computing resources do the emerging applications need?
– How long, and how much investment, will it take to train a new generation of programmers?

Degrees of Pain vs. Gain…
Potential Approaches vs. Disruption in Computing Stack

Stack layers (top to bottom): Algorithm, Language, API, Architecture, ISA, Microarchitecture, FU, device.
– Level 1 “More Moore”: changes at the device level only (no disruption to software)
– Level 2 “Hidden logic changes”: changes up through the FU/microarchitecture
– Level 3 “Architectural changes”: changes up through the ISA/architecture
– Level 4 “Non-von Neumann computing”: changes through API, language, and algorithm (total disruption)

Level 1: More Moore
Software impact: legacy code works without issue
New switch candidates:
– Logic examples: tunneling FET, CNFET, superconducting electronics
– Memory examples: MRAM, memristor, PCM, …
More Moore: A better switch?
Courtesy Dimitri Nikonov and Ian Young

CMOS Device structure evolution – IRDS 2017 MM chapter
[Figure: device structure evolution across nodes: FinFET on bulk Si or FDSOI (2017-2021), lateral GAA nanowire/nanosheet (2021-2024), vertical GAA and sequential 3D (beyond 2024)]
FinFET is still the leading device option until 2021
Lateral Gate-All-Around (LGAA) is expected to be introduced in 2021
Beyond 2024, 3D stacking is needed for functional scaling
Level 1: More Moore
Software impact: legacy code works without issue
New switch candidates:
– Logic examples: tunneling FET, CNFET, superconducting electronics
– Memory examples: MRAM, memristor, PCM
Predictions:
– Industry will go to monolithic 3D
– Moore’s Law* won’t end for a while (*if correctly defined)
Level 2: Not CMOS, but hidden
Software impact: legacy code works, but may require performance tuning
Lessons learned from superscalar in 1995
Next: microarchitectural changes to
– Use unreliable switch logic, and/or
– Use cryogenic superconducting electronics
– Reversible computing
CPU Trends
• Power ∝ V²f
• Therefore, reduce supply voltage.
• But…
ITRS / Asif Khan, PhD Thesis, University of California Berkeley, 2015
Vdd hasn’t reduced much below 1 V because devices become unreliable
Computational Error Correction
• Traditional coding fixes errors in data stored or transmitted, not in computation
• Redundancy can be in space and/or time; there are tradeoffs
• What if there are errors in the control path?
• Bypass logic, instruction decode

Proof-of-concept RRNS Core
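A minimal sketch of the redundant residue number system (RRNS) idea behind such a core, assuming small illustrative moduli (the real proof-of-concept core’s parameters are not given here): arithmetic proceeds independently in each residue channel, and redundant residues expose a fault in any one channel.

```python
from math import prod

# RRNS error detection sketch. Moduli choices and the error-injection
# scenario are illustrative, not taken from the proof-of-concept core.

MODULI = [3, 5, 7]     # non-redundant: legal range is 3*5*7 = 105
REDUNDANT = [11, 13]   # redundant check moduli

def encode(x):
    return [x % m for m in MODULI + REDUNDANT]

def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction (moduli pairwise coprime)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return x % M

def add(a, b):
    # Arithmetic happens independently per residue channel (no carries).
    return [(ra + rb) % m for ra, rb, m in zip(a, b, MODULI + REDUNDANT)]

def check(residues):
    """Reconstruct from the non-redundant residues, then verify that the
    redundant residues agree; disagreement signals a computation error."""
    x = crt(residues[:len(MODULI)], MODULI)
    return all(x % m == r for m, r in zip(REDUNDANT, residues[len(MODULI):]))

s = add(encode(17), encode(23))
assert crt(s[:3], MODULI) == 40 and check(s)   # correct sum, no error flagged

s[1] = (s[1] + 1) % 5   # inject a fault into one residue channel
assert not check(s)      # the redundancy catches it
```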
Superconducting: smaller, lower power, same performance

[Figure: same-scale comparison of Titan at ORNL and a roughly 2' x 2' superconducting supercomputer]

             Titan at ORNL (#2 of Top500)                Superconducting supercomputer   Ratio
Performance  17.6 PFLOP/s                                20 PFLOP/s                      ~1x
Memory       710 TB (0.04 B/FLOPS)                       5 PB (0.25 B/FLOPS)             7x
Power        8,200 kW avg. (not incl. cooling, storage)  80 kW total (incl. cooling)     0.01x
Space        4,350 ft² (404 m², not incl. cooling)       ~200 ft² (incl. cooling)        0.05x
Cooling      additional power, space, infrastructure     all cooling included
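The derived columns in the comparison can be checked with quick arithmetic:

```python
# Quick arithmetic check of the comparison's derived figures.
TB, PB = 1e12, 1e15

titan_bf = 710 * TB / 17.6e15   # bytes per FLOP/s for Titan
sc_bf    = 5 * PB / 20e15       # bytes per FLOP/s for the superconducting design

print(round(titan_bf, 2), round(sc_bf, 2))  # 0.04 and 0.25 B/FLOPS
print(round(80 / 8200, 2))                  # power ratio, ~0.01x
```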
Courtesy of M. Manheimer, IARPA Cryogenic Computing Complexity (C3) Program

Level 2: Not CMOS, but hidden
Software impact: legacy code works, but may require performance tuning
Lessons learned from superscalar in 1995
Next: microarchitectural changes to
– Use unreliable switch logic, and/or
– Use cryogenic superconducting electronics
– Reversible computing
Potential to make exascale supercomputers orders of magnitude lower power
Key is co-design of devices and architectures
Level 3: Architectural changes
Software impact: new programming required
GPU is already an example of this
– Inexpensive parallelism available, but need to reprogram to use it
Use special-purpose accelerators for critical kernels, digital neuromorphic, etc.
Approximate computing
And/or use memory-centric architectures (e.g., Emu, The Machine) to move the computation to the data
Accelerators (and reconfigurable)
The idea has been around for a long time
– IBM 7030 Project STRETCH attached stream processor (Harvest) in 1961
– Various FP accelerators for minicomputers in the 70s/80s (e.g., FPS-164)
Speedup via “gate-level parallelism”
– Hardware duplication to support computation
Energy savings via elimination of instruction fetch & decode
Programming options: compiler extraction, APIs, DSLs
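The fetch/decode-elimination argument can be sketched with a toy energy model; the per-operation energies below are assumptions for illustration, not measurements from any particular core.

```python
# Illustrative energy model for why accelerators win: in a general-purpose
# core, instruction fetch/decode/scheduling overhead often dominates the
# energy of the arithmetic itself. All numbers are assumed, for illustration.

OVERHEAD_PJ = 70.0   # assumed fetch/decode/control energy per instruction
ALU_PJ = 5.0         # assumed energy of the arithmetic operation itself

def cpu_energy(ops):
    """Energy (pJ) on a general-purpose core: overhead paid per operation."""
    return ops * (OVERHEAD_PJ + ALU_PJ)

def accelerator_energy(ops):
    """Fixed-function datapath: no per-op fetch/decode, just the arithmetic."""
    return ops * ALU_PJ

ops = 1_000_000
print(cpu_energy(ops) / accelerator_energy(ops))  # 15.0x energy advantage
```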
Performance Trends in Machine Learning
From: IRDS Applications Benchmarking chapter
[Figure: ML performance trendline, 1.9x per year]

Approximate computing
Building acceptable systems out of unreliable/inaccurate hardware and software components
Tradeoff: efficiency and performance vs. output accuracy
Many uses:
– Most start and/or end with human perception (images, video, control, etc.) or near-optimal search
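One concrete approximate-computing technique is loop perforation: skip a fraction of loop iterations and accept a small output error. A minimal sketch (the example data and stride are illustrative):

```python
# Loop perforation: trade output accuracy for work saved by examining
# only a subset of loop iterations. Illustrative example, not from the slides.

def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, stride=4):
    # Only every `stride`-th element is examined: ~4x less work here.
    sample = xs[::stride]
    return sum(sample) / len(sample)

data = [float(i % 100) for i in range(100_000)]
exact = mean_exact(data)
approx = mean_perforated(data)
print(exact, approx, abs(exact - approx) / exact)  # ~3% relative error
```

The challenge the next slide raises is exactly this: guaranteeing that the error stays in the “acceptable” region of the quality step function.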
Approximate computing challenges
Algorithms & programming languages
– Work continues here
Ensuring quality of output
– Step function: great… good… good-ish… ok… unacceptable
Level 3: Architectural changes
Software impact: new programming required
GPU is already an example of this
– Inexpensive parallelism available, but need to reprogram to use it
Use special-purpose accelerators for critical kernels, digital neuromorphic, etc.
Approximate computing
And/or use memory-centric architectures (e.g., Emu, The Machine) to move the computation to the data
Architectures can be built now; software and programmers are the challenge
Level 4: Non-von Neumann
1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.), analog computing
Native Neuromorphic
Direct analog (memristor, etc.) neuromorphic has orders-of-magnitude better energy efficiency than digital approaches
Virtuous cycle: neuroscience informs neuromorphic computing, and neuromorphic serves as a modeling platform to advance neuroscience
Neuromorphic challenges
Guarantees
– Quality of results, quality of service, reliability, etc.
Security concerns
– For example: a network used for authentication with an intruder’s identity trained in; virtually impossible to detect the tampering
Learning
– Supervised learning (today) has two phases: training and inferencing (use); training is highly computationally expensive
– Unsupervised learning is maturing
Quantum
Two varieties: gate-based and quantum annealing
Quantum annealing (e.g., D-Wave)
– Convergence time is a function of the noise floor
– Classical annealing may be more power efficient
Gate-level quantum
– Many proposed qubit devices: quantum dots, transmon, ion trap, etc.
– Current coherence times: 10-100 µs; need to be several orders of magnitude longer
– Solution: redundancy, with 1 virtual qubit = 1000 physical qubits
– Power needs per virtual qubit ~10 kW; most of the power is for waveform generators and interfacing, while cooling is a small percentage
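The slide’s redundancy and power figures imply simple back-of-envelope machine sizes; this sketch just multiplies out those stated ratios:

```python
# Back-of-envelope from the slide's figures: ~1000 physical qubits per
# error-corrected (virtual) qubit, and ~10 kW per virtual qubit (mostly
# waveform generation and interfacing, not cooling).

PHYSICAL_PER_VIRTUAL = 1000
KW_PER_VIRTUAL = 10

def machine_requirements(virtual_qubits):
    """Return (physical qubits, power in kW) for a given logical machine size."""
    return (virtual_qubits * PHYSICAL_PER_VIRTUAL,
            virtual_qubits * KW_PER_VIRTUAL)

# A modest 100-virtual-qubit machine:
print(machine_requirements(100))  # (100000, 1000): 100k qubits, 1 MW
```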
Is there a third way?
Non-neuromorphic, non-quantum, non-von-Neumann computing?
Potentials:
– Massive memorization (e.g., HPE The Machine)
– Analog(-ous) computing / thermodynamic computing
Courtesy: Todd Hylton, UCSD
Level 4: Non-von Neumann
1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.), analog computing
System software is nonexistent
Very immature, risky technology
Large investments needed
IEEE Rebooting Computing: Summary
Levels of RC:
1. More Moore (new switch / 3D)
2. Microarchitecture changes
3. Architecture changes
4. Non-von Neumann
Direct pain/gain tradeoffs
New software R&D desperately needed
IRDS: applications-driven roadmapping is identifying needed devices
rebootingcomputing.ieee.org
irds.ieee.org
icrc.ieee.org