The IEEE Rebooting Computing Initiative: Lessons Learned and the Road Ahead

Tom Conte
Co-Chair, IEEE Rebooting Computing Initiative
Vice Chair, International Roadmap for Devices and Systems
Schools of CS & ECE, Georgia Institute of Technology

A history of modern computing: How we got here
– 1945: Von Neumann’s EDVAC (draft) report
– 1955: Manchester Transistor Computer, IBM 709T
– 1965: Software industry begins (IBM 360), Moore #1
– 1975: Moore’s Law update; Dennard’s geometric scaling rule
– 1985: “Killer micros”: HPC hitches a ride on Moore’s law
– 1995: Slowdown in CMOS wires: superscalar era begins

In 1995, wire delays impact pipelining: Superscalar begins
[Figure: processor performance vs. Moore’s law. Source: Sanjay Patel, UIUC (used with permission)]

We hid parallelism extraction with Superscalar Processor Microarchitectures
[Figure: superscalar pipeline. Branch predictor and instruction cache; instruction fetch; decode & dispatch; register file; schedule; issue N independent instructions; execute in parallel on multiple ALUs; data cache; reorder instructions]
– Very few of these “tricks” are energy efficient
– Hidden by Dennard scaling, until that ended

How we got here, part 2
– 1945: Von Neumann’s EDVAC (draft) report
– 1955: Manchester Transistor Computer, IBM 709T
– 1965: Software industry begins (IBM 360)
– 1975: Moore’s Law; Dennard’s geometric scaling rule
– 1985: “Killer micros”: HPC hitches a ride on Moore’s law
– 1995: Slowdown in CMOS wires: superscalar era begins
– 2005: The Power Wall: Single thread exponential scaling ends (Intel Prescott)
– …

Multicore era begins
– Dilemma: Could not clock a single core aggressively AND continued to get more transistors/chip
– Solution: Clock multiple cores conservatively

IEEE Rebooting Computing
Goal: Rethink Everything: Turing & Von Neumann to now
Why IEEE?
– Encompasses the whole computing stack
  – Circuits & Systems Society
  – Council on Electronic Design Automation
Co-Chairs: Elie Track, Tom Conte, Erik DeBenedictis, Dejan Milojicic, Bruce Kraemer

IEEE Rebooting Computing Summit 1: 2013 Dec. 12-13 (summary online)
– Invitation only
– Three Pillars:
  – Energy Efficiency
  – Security
  – Applications/HCI
[Figure labels: Rebooting Computing; Security; Energy Efficiency; Applications/HCI]

IEEE Rebooting Computing Summit 2: 2014 May 14-16
– Engines of Computation
  – Adiabatic/Reversible Computing
  – Approximate Computing
  – Neuromorphic Computing
  – Augmentation of CMOS
[Figure labels: Rebooting Computing; Security; Energy Efficiency; Applications/HCI; Engine Room]

IEEE Rebooting Computing Summit 3: 2014 Oct. 23-24
– Algorithms and Architectures
– ITRS joins forces with RCI
[Figure labels: Algorithms & Architectures; Rebooting Computing; Security; Energy Efficiency; Applications/HCI; Engine Room]

IEEE Rebooting Computing Summit 4: 2015 Dec. 10-11
– Goal: coordinating efforts between:
  – Industry (HP, Intel, NVIDIA)
  – US: DOE, DARPA, IARPA, NSF
– Goal 2: How to roadmap the future
[Figure labels: Algorithms & Architectures; Rebooting Computing; Security; Energy Efficiency; Applications/HCI; Engine Room]

Moving forward…

RCI: “Software drives the computer industry”
Questions for computer industry:
– How valuable is legacy software?
– What computing resources do the emerging applications need?
– How long and how much investment will it take to train a new generation of programmers?
Degrees of Pain vs. Gain…

Potential Approaches vs. Disruption in Computing Stack
[Figure: approaches ranked by disruption to the computing stack (Algorithm, Language, API, Architecture, ISA, Microarchitecture, FU, logic, device). Level 1: “More Moore”; Level 2: hidden changes; Level 3: architectural changes; Level 4: non von Neumann computing. Legend: no disruption to total disruption.]

Level 1: More Moore
– Software impact: Legacy code works without issue
– New switch candidates:
  – Logic examples: Tunneling FET, CNFET, superconducting electronics
  – Memory examples: MRAM, memristor, PCM, …

More Moore: A better switch?
Courtesy Dimitri Nikonov and Ian Young

CMOS Device structure evolution (IRDS 2017 MM chapter)
[Figure: device cross-sections by node. N7: 2017-2019; N7: 2019-2021; N5: 2021-2024; >N3: >2024. Structures shown: FDSOI, FinFET, Lateral Gate-All-Around (LGAA) as L-Nanowire and L-Nanosheet, Vertical GAA (VGAA), Sequential 3D]
– FinFET: still the leading device option until 2021
– Lateral Gate-All-Around (LGAA) is expected to be introduced in 2021
– Beyond 2024: 3D stacking needed for functional scaling

Level 1: More Moore
– Software impact: Legacy code works without issue
– New switch candidates:
  – Logic examples: Tunneling FET, CNFET, superconducting electronics
  – Memory examples: MRAM, memristor, PCM
– Predictions:
  – Industry will go to monolithic 3D
  – Moore’s law* won’t end for a while (*if correctly defined)

Potential Approaches vs. Disruption in Computing Stack
[Figure: approaches ranked by disruption to the computing stack (Algorithm, Language, API, Architecture, ISA, Microarchitecture, FU, logic, device). Level 1: “More Moore”; Level 2: hidden changes; Level 3: architectural changes; Level 4: non von Neumann computing. Legend: no disruption to total disruption.]

Level 2: Not CMOS, but hidden
– Software impact: Legacy code works, but may require performance tuning
– Lessons learned from superscalar in 1995
– Next: Microarchitectural changes to
  – Use unreliable switch logic, and/or
  – Use cryogenic superconducting
  – Reversible computing

CPU Trends
– Power ∝ CV²f (dynamic power)
– Therefore, reduce supply voltage.
– But… Vdd hasn’t reduced much below 1V because devices become unreliable
[Figure source: ITRS / Asif Khan, PhD Thesis, University of California, Berkeley, 2015]

Computational Error Correction
– Traditional coding fixes errors in data stored or transmitted, not in computation
– Redundancy can be in space and/or time. Tradeoffs.
– What if there are errors in the control path?
  – Bypass logic, instruction decode
– Proof-of-concept RRNS Core

Superconducting: smaller, lower power, same performance
Same-scale comparison of Titan at ORNL (#2 of Top500) and a superconducting supercomputer:
– Performance: Titan 17.6 PFLOP/s (#2 in world*); superconducting 20 PFLOP/s; ratio ~1x
– Memory: Titan 710 TB (0.04 B/FLOPS); superconducting 5 PB (0.25 B/FLOPS); ratio 7x
– Power: Titan 8,200 kW avg. (not included: cooling, storage memory); superconducting 80 kW total power (includes cooling); ratio 0.01x
– Space: Titan 4,350 ft² (404 m², not including cooling); superconducting ~200 ft² (includes cooling); ratio 0.05x
– Cooling: Titan requires additional power, space and infrastructure; superconducting figures show all cooling
Courtesy of M. Manheimer, IARPA Cryogenic Computing Complexity (C3) Program

Level 2: Not CMOS, but hidden
– Software impact: Legacy code works, but may require performance tuning
– Lessons learned from superscalar in 1995
– Next: Microarchitectural changes to
  – Use unreliable switch logic, and/or
  – Use cryogenic superconducting
  – Reversible computing
– Potential to make exascale supercomputers orders of magnitude lower power
– Key is co-design of devices and architectures

Potential Approaches vs. Disruption in Computing Stack
[Figure: approaches ranked by disruption to the computing stack (Algorithm, Language, API, Architecture, ISA, Microarchitecture, FU, logic, device). Level 1: “More Moore”; Level 2: hidden changes; Level 3: architectural changes; Level 4: non von Neumann computing. Legend: no disruption to total disruption.]

Level 3: Architectural changes
– Software impact: new programming required
– GPU already an example of this
  – Inexpensive parallelism available, but need to reprogram to use it
– Use special purpose accelerators for critical kernels, digital neuromorphic, etc.
– Approximate computing
– And/or use memory-centric (e.g., Emu, The Machine) to move the computation to the data

Accelerators (and reconfigurable)
– Idea has been around for a long time
  – IBM 7030 Project STRETCH attached stream processor (Harvest) in 1961
  – Various FP accelerators for minicomputers in 70s/80s (FP-164)
– Speedup via “gate-level parallelism”
  – Hardware duplication to support computation
– Energy savings via elimination of instruction fetch & decode
– Programming options: Compiler extraction, APIs, DSLs

Performance Trends in Machine Learning
[Figure from the IRDS Applications Benchmarking chapter. Trendline: 1.9x per year]

Approximate computing
– Building acceptable systems out of unreliable/inaccurate hardware and software components
– Efficiency and performance vs. output accuracy
– Many uses:
  – Most start and/or end with human perception (images, video, control, etc.) or near-optimal search

Approximate computing challenges
– Algorithms & programming languages
  – Work continues here
– Ensuring quality of output
  – Step function: great…good…good-ish…ok…unacceptable

Level 3: Architectural changes
– Software impact: new programming required
– GPU already an example of this
  – Inexpensive parallelism available, but need to reprogram to use it
– Use special purpose accelerators for critical kernels, digital neuromorphic, etc.
– Approximate computing
– And/or use memory-centric (e.g., Emu, The Machine) to move the computation to the data
– Architectures can be built now: software and programmers are the challenge

Potential Approaches vs. Disruption in Computing Stack
[Figure: approaches ranked by disruption to the computing stack (Algorithm, Language, API, Architecture, ISA, Microarchitecture, FU, logic, device). Level 1: “More Moore”; Level 2: hidden changes; Level 3: architectural changes; Level 4: non von Neumann computing. Legend: no disruption to total disruption.]

Level 4: Non-von Neumann
1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.), analog computing

Native Neuromorphic
– Direct analog (memristor, etc.) neuromorphic has orders of magnitude better energy efficiency than digital approaches
– Virtuous cycle of neuroscience informing neuromorphic, and neuromorphic serving as a modeling platform to advance neuroscience

Neuromorphic
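
Aside on the Computational Error Correction slide: the deck names a proof-of-concept RRNS core but does not describe it. Assuming RRNS stands for redundant residue number system here, the general idea can be sketched as follows. Values are held as residues, arithmetic runs independently on each residue, and extra (redundant) residues let a decoder detect and repair a corrupted result. The moduli, function names, and decoding strategy below are illustrative assumptions, not the design of the core shown in the talk.

```python
from math import prod

# Hypothetical moduli, chosen only for illustration: pairwise coprime,
# with both redundant moduli larger than every information modulus.
INFO_MODULI = (3, 5, 7)
REDUNDANT_MODULI = (11, 13)
MODULI = INFO_MODULI + REDUNDANT_MODULI
LEGIT_RANGE = prod(INFO_MODULI)  # valid results lie in [0, 105)


def encode(x):
    """Represent x by its residue for every modulus."""
    return tuple(x % m for m in MODULI)


def add(a, b):
    """Add two encoded values; each residue is computed independently."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def mul(a, b):
    """Multiply two encoded values; each residue is computed independently."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def crt(residues, moduli):
    """Chinese remainder theorem: rebuild the integer from its residues."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return x % total


def decode(residues):
    """Return (value, index of the repaired residue or None).

    Assumes the true result is below LEGIT_RANGE and that at most one
    residue is wrong.
    """
    value = crt(residues, MODULI)
    if value < LEGIT_RANGE:
        return value, None  # consistent: no error detected
    # Error detected: drop one residue at a time and look for a
    # reconstruction that falls back inside the legitimate range.
    for skip in range(len(MODULI)):
        rs = [r for i, r in enumerate(residues) if i != skip]
        ms = [m for i, m in enumerate(MODULI) if i != skip]
        candidate = crt(rs, ms)
        if candidate < LEGIT_RANGE:
            return candidate, skip
    raise ValueError("more than one residue appears to be corrupted")


product = mul(encode(17), encode(4))
corrupted = list(product)
corrupted[1] = (corrupted[1] + 1) % MODULI[1]  # inject a fault in one residue
print(decode(product))    # (68, None)
print(decode(corrupted))  # (68, 1)
```

Because the check applies to the result of the computation rather than to stored data, a fault injected during the multiply is caught in the same way as a fault in memory, which is the distinction the slide draws against traditional coding.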
