Exploring Emerging Memory Technologies in Extreme Scale High Performance Computing

Jeffrey S. Vetter

Presented at the ISC Third International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale

Frankfurt

22 Jun 2017

ORNL is managed by UT-Battelle for the US Department of Energy | http://ft.ornl.gov | [email protected]

Oak Ridge National Laboratory is the DOE Office of Science's Largest Lab

ORNL is Characterized by a Diverse Scientific Portfolio

$1.6B budget; 4,400 employees; 3,000 research guests annually; $500M modernization investment; nation's largest materials research portfolio

Most powerful open scientific computing facility; world's most intense neutron source; world-class research reactor; nation's most diverse energy portfolio; managing the billion-dollar U.S. ITER project

What Does a 1000x Increase Provide? Mission Impact

Science areas: Cosmology, Combustion, Molecular Science, Superconducting Materials, Fusion

• Cosmology (Salman Habib, Argonne National Laboratory): Habib and collaborators used the HACC code on Titan's CPU–GPU system to conduct today's largest cosmological structure simulation at resolutions needed for modern-day galactic surveys. K. Heitmann, 2014. arXiv.org, 1411.3396.
• Combustion (Jacqueline Chen, Sandia National Laboratory): Chen and collaborators performed, for the first time, direct numerical simulation of a jet flame burning dimethyl ether (DME) at new turbulence scales over space and time. A. Bhagatwala, et al. 2014. Proc. Combust. Inst. 35.
• Molecular Science (Michael Klein, Temple University): Researchers at Procter & Gamble (P&G) and Temple University delivered a comprehensive picture, in full atomistic detail, of the molecular properties that drive skin barrier disruption. M. Paloncyova, et al. 2014. Langmuir 30; C.M. MacDermaid, et al. 2014. J. Chem. Phys. 141.
• Superconducting Materials (Paul Kent, ORNL): Kent and collaborators performed the first ab initio simulation of a cuprate and were the first team to validate quantum Monte Carlo simulations for high-temperature superconductors. K. Foyevtsova, et al. 2014. Phys. Rev. X 4.
• Fusion (C.S. Chang, PPPL): Chang and collaborators used the XGC1 code on Titan to obtain a fundamental understanding of the divertor heat-load width physics and its dependence on plasma current in present-day tokamak devices. C.S. Chang, et al. 2014. Proceedings of the 25th Fusion Energy Conference, IAEA, October 13–18, 2014.

Highlights

• Recent trends in extreme-scale HPC paint an ambiguous future
  – Contemporary systems provide evidence that power constraints are driving architectures to change rapidly (e.g., Dennard, Moore)
  – Multiple architectural dimensions are being (dramatically) redesigned: processors, node design, memory systems, I/O
• Memory systems are changing now!
  – New devices
  – New integration
  – New configurations
  – Vast (local) capacities
• Programming systems must provide performance portability (in addition to functional portability)!!
  – We need new programming systems to effectively use these architectures
  – NVL-C
  – Papyrus(KV)
• Changes in memory systems will alter communication and storage requirements dramatically

Major Trends in Computing: The Sixth Wave of Computing

http://www.kurzweilai.net/exponential-growth-of-computing

23 Contemporary devices are approaching fundamental limits

Economist, Mar 2016

Dennard scaling has already ended. Dennard observed that voltage and current should be proportional to the linear dimensions of a transistor, so 2x the transistor count implies roughly 40% faster and 50% more efficient transistors.

R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-State Circuits, 9(5):256-68, 1974.
I.L. Markov, "Limits on fundamental limits to computation," Nature, 512(7513):147-54, 2014, doi:10.1038/nature13570.
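A compact statement (standard textbook form, not taken from the slide) of the classical Dennard scaling rules behind those numbers:

\[
\begin{aligned}
&\text{Scale factor } \kappa > 1:\quad L,\,W,\,t_{ox},\,V,\,I,\,C \;\propto\; 1/\kappa,\qquad
\tau \;\propto\; \frac{CV}{I} \;\propto\; \frac{1}{\kappa},\qquad
P_{\text{transistor}} = VI \;\propto\; \frac{1}{\kappa^{2}},\\[2pt]
&\text{power density } \frac{P}{A} \;\propto\; \frac{1/\kappa^{2}}{1/\kappa^{2}} \;=\; \text{constant.}
\end{aligned}
\]

Doubling transistor density means \(\kappa=\sqrt{2}\approx 1.4\): roughly 40% higher frequency and half the power per transistor, which is the "40% faster and 50% more efficient" rule of thumb quoted above.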

Economist, Mar 2016

Semiconductors are taking longer and cost more to design and produce

The semiconductor business is highly process-oriented, highly optimized, and increasingly capital intensive. By 2020, current cost trends will lead to an average cost of between $15 billion and $20 billion for a leading-edge fab, according to the report. By 2016, the minimum capital expenditure budget needed to justify the building of a new fab will range from $8 billion to $10 billion for logic, $3.5 billion to $4.5 billion for DRAM, and $6 billion to $7 billion for NAND flash, according to the report.

http://www.anandtech.com/show/10183/intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization

Context: Intel Reports Full-Year Revenue of $55.4 Billion, Net Income of $11.4 Billion (Intel SEC Filing for FY2015)

Business climate reflects this uncertainty, cost, complexity, and consolidation (sources: EE Times; http://www.icinsights.com/)

27 End of Moore’s Law

• Device-level physics will prevent much smaller feature sizes for current transistor technologies
• Business trends indicate asymptotic limits of both manufacturing capability and economics

• What now? Economist, Mar 2016

"The number of people predicting the death of Moore's Law doubles every two years." – Peter Lee, Microsoft

Sixth Wave of Computing

6th wave

Transition Period

http://www.kurzweilai.net/exponential-growth-of-computing

31 Our Transition Period Predictions

Optimize Software and Expose New Hierarchical Parallelism
• Redesign software to boost performance on upcoming architectures
• Exploit new levels of parallelism and efficient data movement

Architectural Specialization and Integration
• Use CMOS more efficiently for our workloads
• Integrate components to boost performance and eliminate inefficiencies

Emerging Technologies
• Investigate new computational paradigms
  – Quantum
  – Neuromorphic
  – Advanced Digital
  – Emerging Memory Devices

32 Our Transition Period Predictions


34 Architectural specialization will accelerate

• Vendors, lacking Moore's Law, will need to continue to differentiate products (to stay in business)
• Granted that the advantage of a better CMOS process stalls
• Use the same transistors differently to enhance performance
• Architectural design will become extremely important, even critical
  – Dark Silicon
  – Address new parameters for the benefits/curse of Moore's Law

http://www.theinquirer.net/inquirer/news/2477796/intels-nervana-ai-platform-takes-aim-at-nvidias-gpu-techology
http://www.wired.com/2016/05/google-tpu-custom-chips/

Sources: https://fossbytes.com/nvidia-volta-gddr6-2018/; https://www.thebroadcastbridge.com/content/entry/1094/altera-announces-arria-10-2666mbps-ddr4-memory-fpga-interface; D.E. Shaw, M.M. Deneroff, R.O. Dror et al., "Anton, a special-purpose machine for molecular dynamics simulation," Communications of the ACM, 51(7):91-7, 2008.

Tighter integration and manufacturing of components will provide some benefits: components with different processes and functionality; local bandwidth

HotChips 27, Altera Stratix 10 http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification

36 Our Transition Period Predictions


37 Exploration of emerging technologies is different this time – I promise

• Three decades of alternative technologies have fallen victim to the 'curse of Moore's Law': general CPU performance improvements without any software changes
  – Weitek floating point accelerator (circa 1988)
  – Piles of other types of processors: ClearSpeed, ...
  – FPGAs
• Some of these technologies found a specific market to serve
  – But most failed
• Now, the context and parameters have changed!

https://micro.magnet.fsu.edu/optics/olympusmicd/galleries/chips/weitekmathmedium.html
http://www.clearspeed.com

38 Transition Period will be Disruptive

• New devices and architectures may not be hidden in traditional levels of abstraction
  – A new type of CNT transistor may be completely hidden from higher levels
  – A new paradigm like quantum may require new architectures, programming models, and algorithmic approaches
• Solutions need a co-design framework to evaluate and mature specific technologies

Adapted from the IEEE Rebooting Computing chart (rows: abstraction layers; columns: technologies):

Layer       | Switch, 3D | NVM | Approximate | Neuro | Quantum
Application |     1      |  1  |      2      |   2   |    3
Algorithm   |     1      |  1  |      2      |   3   |    3
Language    |     1      |  2  |      2      |   3   |    3
API         |     1      |  2  |      2      |   3   |    3
Arch        |     1      |  2  |      2      |   3   |    3
ISA         |     1      |  2  |      2      |   3   |    3
Microarch   |     2      |  3  |      2      |   3   |    3
FU          |     2      |  3  |      2      |   3   |    3
Logic       |     3      |  3  |      2      |   3   |    3
Device      |     3      |  3  |      2      |   3   |    3

Qubit design and fabrication have made recent progress (Science 354, 1091 (2016), 2 December)

• Technological progress
  – Demonstrated qubits
• IBM, Microsoft, Intel, D-Wave, and others are investing in quantum
  – IBM has a 17-qubit quantum computer on the cloud
    • https://www.research.ibm.com/ibm-q/
• DOE has released two RFPs for quantum computing

New Memory Devices and Systems

• HMC, HBM/2/3, LPDDR4, GDDR5X, WIDEIO2, etc.
• 2.5D, 3D stacking
• New devices (ReRAM, PCRAM, STT-MRAM, Xpoint)
• Configuration diversity
  – Fused, shared memory
  – Scratchpads
  – Write-through, write-back, etc.
  – Consistency and coherence protocols
  – Virtual vs. physical, paging strategies

Sources: http://gigglehd.com/zbxe/files/attach/images/1404665/988/406/011/788d3ba1967e2db3817d259d2e83c88e_1.jpg; https://www.micron.com/~/media/track-2-images/content-images/content_image_hmc.jpg?la=en; H.S.P. Wong, H.Y. Lee, S. Yu et al., "Metal-oxide RRAM," Proceedings of the IEEE, 100(6):1951-70, 2012.

J.S. Vetter and S. Mittal, "Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing," CiSE, 17(2):73-82, 2015.

NVRAM Technology Continues to Improve – Driven by Market Forces

http://www.eetasia.com/STATIC/ARTICLE_IMAGES/201212/EEOL_2012DEC28_STOR_MFG_NT_01.jpg

Comparison of Emerging Memory Technologies
Jeffrey Vetter (ORNL), Robert Schreiber (HP Labs), Trevor Mudge (University of Michigan), Yuan Xie (Penn State University)

(Columns were grouped on the slide into "Deployed" and "Experimental" technologies.)

                            | SRAM   | DRAM    | eDRAM   | 2D NAND Flash | 3D NAND Flash | PCRAM      | STTRAM | 2D ReRAM   | 3D ReRAM
Data retention              | N      | N       | N       | Y             | Y             | Y          | Y      | Y          | Y
Cell size (F^2)             | 50-200 | 4-6     | 19-26   | 2-5           | <1            | 4-10       | 8-40   | 4          | <1
Minimum F demonstrated (nm) | 14     | 25      | 22      | 16            | 64            | 20         | 28     | 27         | 24
Read time (ns)              | <1     | 30      | 5       | 10^4          | 10^4          | 10-50      | 3-10   | 10-50      | 10-50
Write time (ns)             | <1     | 50      | 5       | 10^5          | 10^5          | 100-300    | 3-10   | 10-50      | 10-50
Number of rewrites          | 10^16  | 10^16   | 10^16   | 10^4-10^5     | 10^4-10^5     | 10^8-10^10 | 10^15  | 10^8-10^12 | 10^8-10^12
Read power                  | Low    | Low     | Low     | High          | High          | Low        | Medium | Medium     | Medium
Write power                 | Low    | Low     | Low     | High          | High          | High       | Medium | Medium     | Medium
Power (other than R/W)      | Leakage| Refresh | Refresh | None          | None          | None       | None   | Sneak      | Sneak

Maturity notes: Intel/Micron Xpoint? Samsung Z-NAND?

http://ft.ornl.gov/trac/blackcomb

Aggressively-pursued NVM Technologies Continue to Improve

Source: Suzuki, 2015

Microcosm: DOE HPC Architectures are Reflecting these Trends

Projections: Exascale architecture targets circa 2009 (2009 Exascale Challenges Workshop)

Attendees envisioned two possible architectural swim lanes: 1. Homogeneous many-core thin-node system 2. Heterogeneous (accelerator + CPU) fat-node system

System attributes     | 2009     | "Pre-Exascale"      | "Exascale"
System peak           | 2 PF     | 100-200 PF/s        | 1 Exaflop/s
Power                 | 6 MW     | 15 MW               | 20 MW
System memory         | 0.3 PB   | 5 PB                | 32-64 PB
Storage               | 15 PB    | 150 PB              | 500 PB
Node performance      | 125 GF   | 0.5 TF / 7 TF       | 1 TF / 10 TF
Node memory BW        | 25 GB/s  | 0.1 TB/s / 1 TB/s   | 0.4 TB/s / 4 TB/s
Node concurrency      | 12       | O(100) / O(1,000)   | O(1,000) / O(10,000)
System size (nodes)   | 18,700   | 500,000 / 50,000    | 1,000,000 / 100,000
Node interconnect BW  | 1.5 GB/s | 150 GB/s / 1 TB/s   | 250 GB/s / 2 TB/s
IO bandwidth          | 0.2 TB/s | 10 TB/s             | 30-60 TB/s
MTTI                  | day      | O(1 day)            | O(0.1 day)

(Where two values are given, they correspond to the two swim lanes: thin-node / fat-node.)

Architectural specialization and integration will create very complex platforms – no two systems alike

• Architectural specialization and integration will create very complex platforms
  – Heterogeneous computing
  – Deep memory hierarchies, including NVM
  – Plateauing I/O forces application redesign
• Implications
  – DOE and companies will need to understand design tradeoffs
  – Programming systems will need to be portable
  – Performance portability will be a stretch goal

[Table: system attributes for current and planned DOE systems – NERSC Edison, OLCF Titan, ALCF Mira (now) and NERSC Cori, OLCF Summit, ALCF Theta and Aurora (upgrades, 2016-2019): system peak (2.6 / 27 / 10 PF now vs. >30 / 150 / >8.5 / 180 PF planned), peak power (2 / 9 / 4.8 MW now vs. <3.7 / 10 / 1.7 / 13 MW planned), total system memory (DDR4 plus high-bandwidth and persistent memory on the upgrades), node performance and processors (e.g., Intel Xeon Phi many-core, IBM Power9 plus NVIDIA Volta, Intel Knights Hill), system size (nodes), interconnect (Gemini, 5D Torus, Aries, dual-rail EDR-IB, Intel Omni-Path), and file system (Lustre or GPFS, 7.6 PB to 150 PB).]

Complexity ∝ time

Migration up the hierarchy

[Figure: the memory/storage hierarchy – caches, main memory, I/O device, HDD – illustrating NVM migrating up the hierarchy.]

Programming NVM on a Node: NVM Architectural Assumptions

https://www.kitguru.net/components/memory/anton-shilov/intel-first-3d-xpoint-ssds-will-feature-up-to-6gbs-of-bandwidth/

HPC Application Scenarios for NVM

• Burst buffers, checkpoint/restart (C/R) [Liu, et al., MSST 2012]
• In situ visualization (http://ft.ornl.gov/eavl)
• In-memory tables

J.S. Vetter and S. Mittal, "Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing," Computing in Science & Engineering, 17(2):73-82, 2015.

Observations: Numerous characteristics of applications are a good match for byte-addressable NVRAM

Empirical results show many reasons to bring large-scale NVM into applications:
• Lookup, index, and permutation tables
• Inverted and 'element-lagged' mass matrices
• Geometry arrays for grids
• Thermal conductivity for soils
• Strain and conductivity rates
• Boundary condition data
• Constants for transforms, interpolation
• MC tally tables, cross-section materials tables, ...
• Transactional protocol to preserve local and global consistency and redundancy
• Name indexing
• Object attributes allow local caching

60 NVL-C: extending C to support NVM

J. Denny, S. Lee, and J.S. Vetter, "NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems," in ACM High Performance Distributed Computing (HPDC). Kyoto: ACM, 2016.

Design Goals: Familiar programming interface

• Small set of C language extensions:
  – Header file
  – Type qualifiers
  – Library API
  – Pragmas
• Existing memory interfaces remain:
  – NVL-C is a superset of C
  – Unqualified types as specified by C
  – Local/global variables stored in volatile memory (DRAM or registers)
  – Use existing C standard libraries for HDD

Example from the slide (the header name <nvl.h> is reconstructed):

  #include <nvl.h>
  struct list {
    int value;
    nvl struct list *next;
  };
  void add(int k, nvl struct list *after) {
    nvl struct list *node = nvl_alloc_nv(heap, 1, struct list);
    node->value = k;
    node->next = after->next;
    after->next = node;
  }

78 Design Goals: Avoiding persistent data corruption

• New categories of pointer bugs:
  – Caused by multiple memory types:
    • E.g., a pointer from NVM to volatile memory will become a dangling pointer
  – Prevented at compile time or run time
• Automatic reference counting (see the sketch below):
  – No need to manually free
  – Avoids leaks and dangling pointers
• Transactions:
  – Avoids persistent data corruption across software and hardware failures
• High performance:
  – Performance penalty from memory management, pointer safety, and transactions
  – Compiler-based optimizations
  – Programmer-specified hints
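A minimal sketch (not from the slides) of what the automatic reference counting described above buys the programmer; it reuses the list type from the earlier NVL-C example and assumes nothing beyond what the slides state:

  #include <nvl.h>

  struct list {
    int value;
    nvl struct list *next;
  };

  /* Remove the node that follows 'prev'.  There is no free() call:
     once nothing references the unlinked node, NVL-C's automatic
     reference counting reclaims its NVM space, so leaks and dangling
     pointers are avoided by construction. */
  void remove_after(nvl struct list *prev) {
    if (prev->next)
      prev->next = prev->next->next;   /* old node becomes unreachable */
  }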

79 Design Goals: Modular implementation

• Core is a common compiler middle end
• Multiple compiler front ends for multiple high-level languages:
  – For now, just OpenARC for NVL-C
• Multiple runtime implementations:
  – For now, just Intel's pmem (pmemobj); see the sketch below
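For context, a minimal libpmemobj sketch of the kind of persistent-pool management such a runtime builds on. This is not the NVL-C runtime itself, just standard pmemobj calls; the pool file name and layout string are made up for illustration:

  /* Create or open a persistent pool, keep a counter in its root object,
     and flush the update to NVM.  Error handling is abbreviated; real
     code would wrap updates in a pmemobj transaction for crash atomicity. */
  #include <libpmemobj.h>
  #include <stdio.h>

  struct root { unsigned long runs; };

  int main(void) {
    const char *path = "counter.pool";               /* illustrative file name */
    PMEMobjpool *pop = pmemobj_open(path, "counter");
    if (!pop)
      pop = pmemobj_create(path, "counter", PMEMOBJ_MIN_POOL, 0666);
    if (!pop) return 1;

    PMEMoid root_oid = pmemobj_root(pop, sizeof(struct root));
    struct root *r = pmemobj_direct(root_oid);       /* NVM-backed pointer */

    r->runs += 1;                                    /* update persistent data */
    pmemobj_persist(pop, &r->runs, sizeof r->runs);  /* flush store to NVM */

    printf("pool has been updated %lu times\n", r->runs);
    pmemobj_close(pop);
    return 0;
  }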

80 Programming Model: NVM Pointers

• nvl type qualifier:
  – Indicates NVM storage
  – On a target type, declares an NVM pointer
  – No NVM-stored local or global variables
• Stricter type safety for NVM pointers:
  – Does not affect other C types
  – Avoids persistent data corruption
  – Facilitates compiler analysis
  – Needed for automatic reference counting
  – E.g., pointer conversions involving NVM pointers are strictly prohibited

Example from the slide (header name reconstructed):

  #include <nvl.h>
  struct list {
    int value;
    nvl struct list *next;
  };
  void add(int k, nvl struct list *after) {
    /* struct list *node = malloc(sizeof(struct list));
       -- compile-time error below; an explicit cast won't help */
    nvl struct list *node = nvl_alloc_nv(heap, 1, struct list);
    node->value = k;
    node->next = after->next;
    after->next = node;
  }

82 Programming Model: Pointer types (like Coburn et al.)

[Diagram: pointers between volatile memory (registers, stack, bss, heap) and NVM heaps A ("A.nvl") and B ("B.nvl").]
• V-to-NV (volatile memory into an NVM heap): allowed
• Intra-heap NV-to-NV (within one NVM heap): allowed
• Inter-heap NV-to-NV (from heap A into heap B): run-time error
• NV-to-V (NVM into volatile memory): compile-time error
These rules avoid dangling pointers when memory segments close. A sketch of the four cases follows.
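A minimal sketch of the four pointer categories above, written against the list-style node type used in the earlier examples; the assumption that the two parameters live in different NVM heaps is noted in the comments:

  #include <nvl.h>

  struct node {
    int value;
    nvl struct node *next;
  };

  /* 'a_root' is assumed to live in heap A and 'b_root' in heap B. */
  void pointer_rules(nvl struct node *a_root, nvl struct node *b_root) {
    struct node v = { 0, 0 };            /* volatile (stack) object */

    nvl struct node *p = a_root;         /* V-to-NV: allowed */
    a_root->next = a_root;               /* intra-heap NV-to-NV: allowed */

    /* a_root->next = b_root; */                  /* inter-heap NV-to-NV: run-time error */
    /* a_root->next = (nvl struct node *)&v; */   /* NV-to-V: compile-time error */

    (void)p; (void)v;
  }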

89 Programming Model: Transactions: Undo logs

• Before every NVM store, the transaction creates an undo log to back up the old data
• The undo log contains metadata plus the old data

Example from the slide (the loop body was cut off in extraction; a hedged completion sketch follows):

  #include <nvl.h>
  void matmul(nvl float a[I][J], nvl float b[I][K],
              nvl float c[K][J], nvl int *i) {
    while (*i /* ... remainder truncated in the original ... */
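A hedged sketch of how the truncated loop above might continue. The "#pragma nvl atomic heap(...)" spelling and the use of 'heap' (the open NVM heap handle, as in the earlier nvl_alloc_nv(heap, ...) call) are recalled from the NVL-C paper and should be treated as assumptions, not a verbatim copy of the slide:

  void matmul_sketch(nvl float a[I][J], nvl float b[I][K],
                     nvl float c[K][J], nvl int *i) {
    while (*i < I) {
      /* One output row per transaction; before each NVM store inside the
         block, the old data is backed up in the undo log.  Clauses such as
         backup()/clobber() (see the LULESH evaluation below) can reduce
         the logging; their exact syntax is not shown on the slide. */
      #pragma nvl atomic heap(heap)
      {
        for (int j = 0; j < J; ++j) {
          float sum = 0.0f;
          for (int k = 0; k < K; ++k)
            sum += b[*i][k] * c[k][j];
          a[*i][j] = sum;              /* NVM store: old value logged */
        }
        ++*i;                          /* progress counter survives a crash */
      }
    }
  }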

95 Evaluation: LULESH

• backup is important for performance
• clobber cannot be applied because the old data is needed

[Bar chart: normalized time (%), log scale, lower is better; bars for ExM and for T1/T2/T3 under BlockNVM and ByteNVM; reported values include 59626 (ExM), 1343, 1343, 902, 677, 677, and 211. The chart also carries the labels "ND" and "Hoisting".]
• ExM = use SSD as extended DRAM
• T1 = BSR + transactions
• T2 = T1 + backup clauses
• T3 = T1 + clobber clauses
• BlockNVM = msync included; ByteNVM = msync suppressed

Programming Scalable NVM with Papyrus(KV): Scalable NVM Architectural Assumptions

• NVM architectures will vary dramatically -> portability
  – Exploit persistence?
  – Where in the hierarchy?
    • Already in the external storage system
    • Rack-mounted appliance (Cori)
    • Chassis shared?
    • Node shared? (Summit)
• Design goals
  – Performance
    • Must provide performance better than or equal to alternatives for providing large shared storage for HPC apps
  – Scalability (to millions of threads)
  – Interoperability with existing HPC programming models
    • Can be incrementally introduced
    • Leverage features of other programming models
  – Application customizability
    • Usage scenarios vary dramatically
    • Tunable consistency model, protection attributes (e.g., RO)

109 PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures

• Leverage emerging NVM technologies
  – High performance
  – High capacity
  – Persistence property
• Designed for the next-generation DOE/NNSA systems
  – Portable across local NVM and dedicated NVM architectures
  – An embedded key-value store (no system-level daemons or servers)
• Designed for HPC applications
  – MPI/UPC-interoperable
  – Application customizability
    • Memory consistency models (sequential and relaxed)
    • Protection attributes (read-only, write-only, read-write)
    • Load balancing
  – Zero-copy workflow, asynchronous checkpoint/restart

PapyrusKV stores keys and values in arbitrary byte arrays across multiple NVM devices in a distributed system.
[Figure: local NVM architecture (OLCF Summit, LLNL Sierra, ALCF Theta) – NVM in each compute node, read/write locally, read-only remotely; dedicated NVM architecture (NERSC Cori, Sandia/LANL Trinity) – compute nodes share burst-buffer NVM nodes; SSTables are stored in NVM. PapyrusKV is portable across local NVM and dedicated NVM architectures.]

J. Kim, S. Lee, and J.S. Vetter, "PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures," in Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2017 (accepted).

PapyrusKV Application API
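The API table from this slide did not survive extraction. As a stand-in, here is a usage-pattern sketch for an embedded, MPI-interoperable key-value store of this kind; the pkv_* interface below is hypothetical (declared here only to show the calling pattern) and is not the actual PapyrusKV header:

  /* Hypothetical pkv_* interface, loosely modeled on an embedded KV store;
     link against a real implementation to run. */
  #include <mpi.h>
  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>

  int pkv_open(const char *name, int create, int *db);
  int pkv_put(int db, const void *key, size_t klen, const void *val, size_t vlen);
  int pkv_get(int db, const void *key, size_t klen, void **val, size_t *vlen);
  int pkv_barrier(int db);                /* make all ranks' puts visible */
  int pkv_close(int db);

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, db;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    pkv_open("mydb", 1, &db);             /* each rank opens the same store */

    char key[32], val[64];
    snprintf(key, sizeof key, "rank-%d", rank);
    snprintf(val, sizeof val, "hello from %d", rank);
    pkv_put(db, key, strlen(key), val, strlen(val) + 1);   /* lands in local NVM */

    pkv_barrier(db);                      /* global visibility before reads */

    void *remote = NULL;  size_t len = 0;
    snprintf(key, sizeof key, "rank-%d", (rank + 1) % size);
    pkv_get(db, key, strlen(key), &remote, &len);   /* may be served from remote NVM */
    printf("rank %d read: %s\n", rank, (char *)remote);
    free(remote);

    pkv_close(db);
    MPI_Finalize();
    return 0;
  }

The pattern mirrors the slide's description: puts go to the local NVM device, a barrier establishes global visibility, and gets may be served from a remote NVM device.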

111 PapyrusKV Example Get operations

Present design allows remote caching only for read-only (RO) data.

Evaluation
• Evaluation results on OLCF's SummitDev, TACC's Stampede (KNL), and NERSC's Cori

113 ECP Application Case Study 1: Meraculous (UPC)

• A parallel de Bruijn graph construction and traversal for de novo genome assembly (a k-mer/key-value sketch follows below)
  – ExaBiome, Exascale Solutions for Microbiome Analysis, LBNL

Graphic from ExaBiome: Exascale Solutions to Microbiome Analysis (LBNL, LANL, JGI), 2017
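To make the connection between the Meraculous workload above and a key-value store concrete, here is a hedged sketch (an illustration under stated assumptions, not ExaBiome's actual code) of inserting k-mers and their extension bases through the hypothetical pkv_* interface from the earlier sketch:

  #include <string.h>

  int pkv_put(int db, const void *key, size_t klen, const void *val, size_t vlen);

  /* Insert every k-mer of 'read' with its left/right extension bases,
     the way a de Bruijn graph construction phase populates its
     distributed hash table when that table lives in NVM. */
  void insert_kmers(int db, const char *read, size_t len, size_t k) {
    for (size_t i = 0; i + k <= len; ++i) {
      char ext[2];
      ext[0] = (i > 0)       ? read[i - 1] : 'X';   /* left extension  */
      ext[1] = (i + k < len) ? read[i + k] : 'X';   /* right extension */
      pkv_put(db, read + i, k, ext, sizeof ext);    /* key = k-mer bytes */
    }
  }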

114 ECP Application Case Study 2: HACC (MPI)

• An N-body cosmology code framework – ExaSky, Computing the Sky at Extreme Scales, ANL

Graphic from HACCing the Universe on the BG/Q (ANL), 2014

Implications for Interconnects and Storage

Predictions for Interconnects and Storage

1. Device and architecture trends will have major impacts on HPC in the coming decade
   a. NVM in HPC systems is real!
2. Performance trends of system components will create new opportunities
3. A sea of NVM allows applications to operate differently
   a. A sea of NVM will permit applications to run for weeks without doing I/O to the external storage system
   b. Applications will simply access local/remote NVM
   c. Longer-term productive I/O will be 'occasionally' written to Lustre, GPFS
   d. Checkpointing (as we know it) will disappear
4. Requirements for interconnection networks will change
   a. Increase in byte-addressable, memory-like message sizes and frequencies
   b. Reduced traditional I/O demands
   c. KV traffic could have considerable impact – need more application evidence

117 Summary

• Recent trends in extreme-scale HPC paint an ambiguous future
  – Contemporary systems provide evidence that power constraints are driving architectures to change rapidly (e.g., Dennard, Moore)
  – Multiple architectural dimensions are being (dramatically) redesigned: processors, node design, memory systems, I/O
• Memory systems are changing now!
  – New devices
  – New integration
  – New configurations
  – Vast (local) capacities
• Programming systems must provide performance portability (in addition to functional portability)!!
  – We need new programming systems to effectively use these architectures
  – NVL-C
  – Papyrus(KV)
• Changes in memory systems will alter communication and storage requirements dramatically

118 Acknowledgements

• Contributors and Sponsors
  – Future Technologies Group: http://ft.ornl.gov
  – US Department of Energy Office of Science
    • DOE Vancouver Project: https://ft.ornl.gov/trac/vancouver
    • DOE Blackcomb Project: https://ft.ornl.gov/trac/blackcomb
    • DOE ExMatEx Codesign Center: http://codesign.lanl.gov
    • DOE Cesar Codesign Center: http://cesar.mcs.anl.gov/
    • DOE Exascale Efforts: http://science.energy.gov/ascr/research/computer-science/
  – Scalable Heterogeneous Computing Benchmark team: http://bit.ly/shocmarx
  – US National Science Foundation Keeneland Project: http://keeneland.gatech.edu
  – US DARPA
  – NVIDIA CUDA Center of Excellence

124 Bonus Material