Efficient Methods and Hardware for Deep Learning

Total Pages: 16

File Type: PDF, Size: 1020 KB

Efficient Methods and Hardware for Deep Learning
Song Han, Stanford University, May 25, 2017

Intro: Song Han, PhD Candidate, Stanford; Bill Dally, Chief Scientist, NVIDIA, and Professor, Stanford.

Deep Learning is Changing Our Lives
Self-driving cars, machine translation, AlphaGo, and smart robots.

Models are Getting Larger
Image recognition: AlexNet (2012), 8 layers, 1.4 GFLOP, ~16% error; ResNet (2015), 152 layers, 22.6 GFLOP, ~3.5% error (a 16x increase in training operations).
Speech recognition: Deep Speech 1 (Baidu, 2014), 80 GFLOP, 7,000 hours of data, ~8% error; Deep Speech 2 (2015), 465 GFLOP, 12,000 hours of data, ~5% error (a 10x increase).
An important property of neural networks: results get better with more data, bigger models, and more computation. (Better algorithms, new insights, and improved techniques always help, too!)
Dally, NIPS 2016 Workshop on Efficient Methods for Deep Neural Networks.

The First Challenge: Model Size
Large models are hard to distribute through over-the-air updates.

The Second Challenge: Speed
Error rate and training time (benchmarked with fb.resnet.torch using four M40 GPUs):
ResNet18:  10.76% error, 2.5 days
ResNet50:   7.02% error, 5 days
ResNet101:  6.21% error, 1 week
ResNet152:  6.16% error, 1.5 weeks
Such long training times limit ML researchers' productivity.

The Third Challenge: Energy Efficiency
AlphaGo used 1,920 CPUs and 280 GPUs, with a roughly $3,000 electric bill per game. On mobile, this energy consumption drains the battery; in the data center, it increases TCO.

Where is the Energy Consumed?
A larger model means more memory references, which means more energy.

Operation              Energy [pJ]   Relative Cost
32-bit int ADD             0.1             1
32-bit float ADD           0.9             9
32-bit register file       1              10
32-bit int MULT            3.1            31
32-bit float MULT          3.7            37
32-bit SRAM cache          5              50
32-bit DRAM memory       640            6400

Figure 1: Energy table for a 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy-expensive than simple arithmetic.

So how can deep learning be made more efficient? One approach is to prune network connections in a manner that preserves the original accuracy. After an initial training phase, all connections whose weight is lower than a threshold are removed. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the network — learning which connections are important and removing the unimportant connections. The sparse network is then retrained so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights, much as in the mammalian brain [8][9], where synapses are created in the first few months of a child's development, followed by gradual pruning of little-used connections, falling to typical adult values.

Related Work
Neural networks are typically over-parameterized, and there is significant redundancy in deep learning models [10]. This results in a waste of both computation and memory. There have been various proposals to remove the redundancy: Vanhoucke et al. [11] explored a fixed-point implementation with 8-bit integer (vs. 32-bit floating point) activations. Denton et al. [12] exploited the linear structure of the neural network by finding an appropriate low-rank approximation of the parameters and keeping the accuracy within 1% of the original model. With similar accuracy loss, Gong et al. [13] compressed deep convnets using vector quantization. These approximation and quantization techniques are orthogonal to network pruning, and they can be used together to obtain further gains [14].

There have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling. The Network in Network architecture [15] and GoogLeNet [16] achieve state-of-the-art results on several benchmarks by adopting this idea. However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This problem is noted by Szegedy et al. [16] and motivates them to add a linear layer on top of their networks to enable transfer learning.

Network pruning has been used both to reduce network complexity and to reduce over-fitting. An early approach to pruning was biased weight decay [17]. Optimal Brain Damage [18] and Optimal Brain Surgeon [19] prune networks to reduce the number of connections based on the Hessian of the loss function and suggest that such pruning is more accurate than magnitude-based pruning such as weight decay. However, the second-order derivative needs additional computation.

HashedNets [20] is a recent technique to reduce model sizes by using a hash function to randomly group connection weights into hash buckets, so that all connections within the same hash bucket share a single parameter value. This technique may benefit from pruning. As pointed out in Shi et al. [21] and Weinberger et al. [22], sparsity will minimize hash collisions, making feature hashing even more effective. HashedNets may be used together with pruning to give even better parameter savings.
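To make the magnitude-based pruning described above concrete, here is a minimal NumPy sketch; the threshold value, layer size, and masked-gradient retraining comment are illustrative assumptions, not the exact procedure from the talk.

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out connections whose absolute weight is below the threshold.

    Returns the pruned weights and a boolean mask of surviving connections,
    so that retraining can keep the pruned connections frozen at zero.
    """
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Toy example: a dense 4x4 layer pruned with an arbitrary threshold of 0.5.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
w_pruned, mask = prune_by_magnitude(w, threshold=0.5)
print("kept", int(mask.sum()), "of", mask.size, "connections")

# During retraining, the gradient update would be masked so the sparse
# topology learned by pruning is preserved, e.g.:
#   w_pruned -= learning_rate * gradient * mask
```

Iterating this prune-and-retrain loop, rather than pruning once aggressively, is what lets the remaining connections compensate for those removed.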
Recommended publications
  • Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril
    Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. Michael Matheny, Sonoo Thadaney Israni, Mahnoor Ahmed, and Danielle Whicher, Editors. Washington, DC: NAM.EDU. National Academy of Medicine • 500 Fifth Street, NW • Washington, DC 20001. NOTICE: This publication has undergone peer review according to procedures established by the National Academy of Medicine (NAM). Publication by the NAM signifies that it is the product of a carefully considered process and is a contribution worthy of public attention, but does not constitute endorsement of conclusions and recommendations by the NAM. The views presented in this publication are those of individual contributors and do not represent formal consensus positions of the authors' organizations; the NAM; or the National Academies of Sciences, Engineering, and Medicine. Library of Congress Cataloging-in-Publication Data to Come. Copyright 2019 by the National Academy of Sciences. All rights reserved. Printed in the United States of America. Suggested citation: Matheny, M., S. Thadaney Israni, M. Ahmed, and D. Whicher, Editors. 2019. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. NAM Special Publication. Washington, DC: National Academy of Medicine. "Knowing is not enough; we must apply. Willing is not enough; we must do." --GOETHE. ABOUT THE NATIONAL ACADEMY OF MEDICINE: The National Academy of Medicine is one of three Academies constituting the National Academies of Sciences, Engineering, and Medicine (the National Academies). The National Academies provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions.
  • AI Computer Wraps Up 4-1 Victory Against Human Champion: Nature Reports from AlphaGo's Victory in Seoul
    The Go Files: AI computer wraps up 4-1 victory against human champion. Nature reports from AlphaGo's victory in Seoul. Tanguy Chouard, 15 March 2016, SEOUL, SOUTH KOREA. Photo: Lee Sedol, who has lost 4-1 to AlphaGo (credit: Google DeepMind). Tanguy Chouard, an editor with Nature, saw Google DeepMind's AI system AlphaGo defeat a human professional for the first time last year at the ancient board game Go. This week, he is watching top professional Lee Sedol take on AlphaGo, in Seoul, for a $1 million prize. It's all over at the Four Seasons Hotel in Seoul, where this morning AlphaGo wrapped up a 4-1 victory over Lee Sedol — incidentally, earning itself and its creators an honorary '9-dan professional' degree from the Korean Baduk Association. After winning the first three games, Google DeepMind's computer looked impregnable. But the last two games may have revealed some weaknesses in its makeup. Game four totally changed the Go world's view on AlphaGo's dominance because it made it clear that the computer can 'bug' — or at least play very poor moves when on the losing side. It was obvious that Lee felt under much less pressure than in game three. And he adopted a different style, one based on taking large amounts of territory early on rather than immediately going for 'street fighting' such as making threats to capture stones. This style – called 'amashi' – seems to have paid off, because on move 78, Lee produced a play that somehow slipped under AlphaGo's radar. David Silver, a scientist at DeepMind who's been leading the development of AlphaGo, said the program had estimated the probability of that move at 1 in 10,000.
  • Fault Tolerance and Re-Training Analysis on Neural Networks
    FAULT TOLERANCE AND RE-TRAINING ANALYSIS ON NEURAL NETWORKS, by ABHINAV KURIAN GEORGE, B.Tech Electronics and Communication Engineering, Amrita Vishwa Vidhyapeetham, Kerala, 2012. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, Computer Engineering, College of Engineering and Applied Science, University of Cincinnati, Ohio, 2019. Thesis Committee: Chair: Wen-Ben Jone, Ph.D.; Member: Carla Purdy, Ph.D.; Member: Ranganadha Vemuri, Ph.D. ABSTRACT: In the current age of big data, artificial intelligence and machine learning technologies have gained much popularity. Due to the increasing demand for such applications, neural networks are being targeted toward hardware solutions. Owing to the shrinking feature size, the number of physical defects is on the rise. This growing number of defects is preventing designers from realizing the full potential of on-chip designs. The challenge now is not only to find solutions that balance high performance and energy efficiency, but also to achieve fault tolerance of a computational model. Neural computing, due to its inherent fault-tolerant capabilities, can provide promising solutions to this issue. The primary focus of this thesis is to gain a deeper understanding of fault tolerance in neural network hardware. As a part of this work, we present a comprehensive analysis of fault tolerance by exploring the effects of faults on popular neural models: the multi-layer perceptron and the convolutional neural network. We built the models based on conventional 64-bit floating-point representation. In addition to this, we also explore the recent 8-bit integer quantized representation. A fault injector model is designed to inject stuck-at faults at random locations in the network; a minimal illustrative sketch of that idea follows below.
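As referenced above, here is a minimal NumPy sketch of stuck-at fault injection into an 8-bit quantized weight tensor, forcing randomly chosen bits of randomly chosen weights to 0 or 1. The tensor shape, fault count, and bit-level view are illustrative assumptions, not the thesis's actual fault injector.

```python
import numpy as np

def inject_stuck_at_faults(weights_int8, num_faults, rng):
    """Force randomly chosen bits of randomly chosen int8 weights to 0 or 1."""
    faulty = weights_int8.copy()
    flat = faulty.reshape(-1).view(np.uint8)            # reinterpret the same bytes bitwise
    idx = rng.integers(0, flat.size, size=num_faults)   # which weights to corrupt
    bits = rng.integers(0, 8, size=num_faults)          # which bit within each weight
    stuck = rng.integers(0, 2, size=num_faults)         # stuck-at-0 or stuck-at-1
    for i, b, v in zip(idx, bits, stuck):
        if v:
            flat[i] |= np.uint8(1 << b)                  # stuck-at-1
        else:
            flat[i] &= np.uint8(~(1 << b) & 0xFF)        # stuck-at-0
    return faulty

rng = np.random.default_rng(42)
w = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
w_faulty = inject_stuck_at_faults(w, num_faults=10, rng=rng)
print("weights changed:", int((w != w_faulty).sum()))
```

Evaluating the model's accuracy before and after such injections, and after retraining, is the kind of experiment the analysis above describes.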
  • In-Datacenter Performance Analysis of a Tensor Processing Unit
    In-Datacenter Performance Analysis of a Tensor Processing Unit. Presented by Josh Fried. Background: Machine Learning. Neural networks: ● multi-layer perceptrons ● recurrent neural networks (mostly LSTMs) ● convolutional neural networks. Synapse: each edge has a weight. Neuron: each node sums its weighted inputs and applies a non-linear activation function to the sum. Propagating inputs through a layer of the NN is a matrix multiplication followed by an activation. Two phases: ● Training (offline) ○ relaxed deadlines ○ large batches to amortize the cost of loading weights from DRAM ○ well suited to GPUs ○ usually uses floating point ● Inference (online) ○ strict deadlines: 7-10 ms at Google for some workloads ■ limited possibility for batching because of deadlines ○ Facebook uses CPUs for inference (last class) ○ can use lower-precision integers (faster/smaller/more efficient). ML workloads @ Google: 90% of ML workload time at Google is spent on MLPs and LSTMs, despite the broader focus on CNNs; examples include RankBrain (search), Inception (image classification), Google Translate, and AlphaGo (among others). Background: Hardware Trends. End of Moore's Law & Dennard scaling: ● Moore: transistor density doubles every two years ● Dennard: power stays proportional to chip area as transistors shrink. Machine learning is causing huge growth in demand for compute: ● 2006: excess CPU capacity in datacenters is enough ● 2013: projected 3 minutes per day per user of speech recognition ○ would require doubling datacenter compute capacity! Google's answer: a custom ASIC. Goal: build a chip that improves cost-performance for NN inference. What are the main costs? Capital costs and operational costs (the power bill!). TPU (v1) design goals: short design-deployment cycle (~15 months!); plugs into a PCIe slot on existing servers; accelerates matrix multiplication operations; uses 8-bit integer operations instead of floating point. How does the TPU work? CISC instructions, issued by the host. An illustrative int8 matrix-multiply sketch follows below.
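As referenced above, here is a minimal NumPy sketch of why 8-bit integer inference is attractive: weights and activations are mapped to int8 with per-tensor scales, the matrix multiply accumulates in int32, and the result is rescaled before the activation. The symmetric quantization scheme and scale choices are assumptions for illustration, not the TPU's exact arithmetic.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map a float tensor to signed integers using a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for int8
    scale = float(np.max(np.abs(x))) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def int8_dense_relu(x_fp32, w_fp32):
    """Dense layer: int8 matmul with int32 accumulation, then ReLU in float."""
    xq, sx = quantize_symmetric(x_fp32)
    wq, sw = quantize_symmetric(w_fp32)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)  # integer multiply-accumulate
    y = acc.astype(np.float32) * (sx * sw)           # rescale back to real units
    return np.maximum(y, 0.0)                        # activation

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 256)).astype(np.float32)
w = rng.normal(size=(256, 128)).astype(np.float32)
print("max abs error vs fp32:",
      float(np.max(np.abs(int8_dense_relu(x, w) - np.maximum(x @ w, 0.0)))))
```

The integer multiply-accumulate in the middle is the operation a systolic matrix unit accelerates; everything else is bookkeeping around the scales.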
  • Abstractions for Programming Graphics Processors in High-Level Programming Languages
    Abstracties voor het programmeren van grafische processoren in hoogniveau-programmeertalen (Abstractions for Programming Graphics Processors in High-Level Programming Languages), Tim Besard. Supervisor: prof. dr. ir. B. De Sutter. Dissertation submitted to obtain the degree of Doctor of Engineering: Computer Science. Department of Electronics and Information Systems; Chair: prof. dr. ir. K. De Bosschere. Faculty of Engineering and Architecture, Ghent University, academic year 2018-2019. ISBN 978-94-6355-244-8, NUR 980, legal deposit: D/2019/10.500/52. Examination Committee: Prof. Filip De Turck, chair, Department of Information Technology, Faculty of Engineering and Architecture, Ghent University; Prof. Koen De Bosschere, secretary, Department of Electronics and Information Systems, Faculty of Engineering and Architecture, Ghent University; Prof. Bjorn De Sutter, supervisor, Department of Electronics and Information Systems, Faculty of Engineering and Architecture, Ghent University; Prof. Jutho Haegeman, Department of Physics and Astronomy, Faculty of Sciences, Ghent University; Prof. Jan Lemeire, Department of Electronics and Informatics, Faculty of Engineering, Vrije Universiteit Brussel; Prof. Christophe Dubach, School of Informatics, College of Science & Engineering, The University of Edinburgh; Prof. Alan Edelman, Computer Science & Artificial Intelligence Laboratory, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Dankwoord (Acknowledgments, translated from Dutch): I did not really know what I was getting into when, in 2012, I went to the catacombs of the Technicum for an interview about a PhD. Whether I had ever worked with LLVM. Many years later, I now work at a desk that actually gets daylight, and the end of this journey is finally in sight. And about time too, so I am told, after 7 years.
  • P1360R0: Towards Machine Learning for C++: Study Group 19
    P1360R0: Towards Machine Learning for C++: Study Group 19. Date: 2018-11-26 (post-SAN mailing): 10 AM ET. Project: ISO JTC1/SC22/WG21: Programming Language C++. Audience: SG19, WG21. Authors: Michael Wong (Codeplay), Vincent Reverdy (University of Illinois at Urbana-Champaign, Paris Observatory), Robert Douglas (Epsilon), Emad Barsoum (Microsoft), Sarthak Pati (University of Pennsylvania), Peter Goldsborough (Facebook), Franke Seide (MS). Contributors Emails: [email protected]. Reply to: [email protected]. Contents: Introduction; Motivation; Scope; Meeting frequency and means; Outreach to ML/AI/Data Science community; Liaison with other groups; Future meetings; Conclusion; Acknowledgements; References. Introduction: This paper proposes a WG21 SG for Machine Learning with the goal of making Machine Learning a first-class citizen in ISO C++. It is the collaboration of a number of key industry, academic, and research groups, through several connections in the CPPCON BoF[reference], LLVM 2018 discussions, and the C++ San Diego meeting. The intention is to support such an SG and describe its scope, in terms of potential work resulting in papers submitted for future C++ Standards, or collaboration with other SGs. We will also propose ongoing teleconferences, meeting frequency and locations, as well as outreach to ML data scientists, conferences, and liaison with other Machine Learning groups such as at Khronos and ISO. As of the SAN meeting, this group has been officially created as SG19, and we will begin teleconferences immediately, after US Thanksgiving and after NIPS.
  • Understanding & Generalizing AlphaGo Zero
    Under review as a conference paper at ICLR 2019 UNDERSTANDING &GENERALIZING ALPHAGO ZERO Anonymous authors Paper under double-blind review ABSTRACT AlphaGo Zero (AGZ) (Silver et al., 2017b) introduced a new tabula rasa rein- forcement learning algorithm that has achieved superhuman performance in the games of Go, Chess, and Shogi with no prior knowledge other than the rules of the game. This success naturally begs the question whether it is possible to develop similar high-performance reinforcement learning algorithms for generic sequential decision-making problems (beyond two-player games), using only the constraints of the environment as the “rules.” To address this challenge, we start by taking steps towards developing a formal understanding of AGZ. AGZ includes two key innovations: (1) it learns a policy (represented as a neural network) using super- vised learning with cross-entropy loss from samples generated via Monte-Carlo Tree Search (MCTS); (2) it uses self-play to learn without training data. We argue that the self-play in AGZ corresponds to learning a Nash equilibrium for the two-player game; and the supervised learning with MCTS is attempting to learn the policy corresponding to the Nash equilibrium, by establishing a novel bound on the difference between the expected return achieved by two policies in terms of the expected KL divergence (cross-entropy) of their induced distributions. To extend AGZ to generic sequential decision-making problems, we introduce a robust MDP framework, in which the agent and nature effectively play a zero-sum game: the agent aims to take actions to maximize reward while nature seeks state transitions, subject to the constraints of that environment, that minimize the agent’s reward.
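For context on the supervised-learning step the abstract refers to, the training objective of the original AlphaGo Zero paper (Silver et al., 2017b) is sketched below; this is quoted from that earlier paper's setup, not from the submission under review. MCTS visit counts N(s,a) define the policy target pi, and the network's policy p and value v are fit with a combined cross-entropy and squared-error loss:

```latex
\pi(a \mid s) \;=\; \frac{N(s,a)^{1/\tau}}{\sum_{b} N(s,b)^{1/\tau}},
\qquad
\ell(\theta) \;=\; \bigl(z - v_\theta(s)\bigr)^{2} \;-\; \pi^{\top} \log p_\theta(s) \;+\; c\,\lVert \theta \rVert^{2}
```

Here tau is a temperature, z is the self-play game outcome, and c weights the L2 regularizer; the cross-entropy term is the quantity in which the abstract's bound on expected-return differences is expressed.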
  • AI for Broadcasters, Future Has Already Begun…
    AI for Broadcasters, the future has already begun… By Dr. Veysel Binbay, Specialist Engineer @ ABU Technology & Innovation Department. I have been working as a Specialist Engineer at the ABU Technology and Innovation Department for one year; before that I worked at TRT (Turkish Radio and Television Corporation) for more than 20 years as a broadcast engineer, and also as an IT Director. I have wide experience in radio and TV broadcasting technologies, including IT systems. My experience includes designing, setting up, and operating analogue/hybrid/digital radio and TV broadcast systems, and I am also experienced with IT networks. What is Artificial Intelligence? • Programs that behave externally like humans? • Programs that operate internally as humans do? • Computational systems that behave intelligently? Some attempted definitions of AI: "The exciting new effort to make computers think … machines with minds, in the full literal sense." (Haugeland, 1985) "The study of mental faculties through the use of computational models." (Charniak and McDermott, 1985) "A field of study that seeks to explain and emulate intelligent behavior in terms of computational processes." (Schalkoff, 1990) "The study of how to make computers do things at which, at the moment, people are better." (Rich & Knight, 1991) It's obviously hard to define (since we don't have a commonly agreed definition of intelligence itself yet), so let's focus on the benefits and solve the definition problem later. Brief history of AI: the history of AI begins with the following article:
  • Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning
    electronics Article: Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge. Yifan Gao 1,*,† and Lezhou Wu 2,†. 1 College of Medicine and Biological Information Engineering, Northeastern University, Liaoning 110819, China; 2 College of Information Science and Engineering, Northeastern University, Liaoning 110819, China; [email protected]. * Correspondence: [email protected]. † These authors contributed equally to this work. Abstract: Computer games have been regarded as an important field of artificial intelligence (AI) for a long time. The AlphaZero structure has been successful in the game of Go, beating the top professional human players and becoming the baseline method in computer games. However, the AlphaZero training process requires tremendous computing resources, imposing additional difficulties for AlphaZero-based AI. In this paper, we propose NoGoZero+ to improve the AlphaZero process and apply it to a game similar to Go, NoGo. NoGoZero+ employs several innovative features to improve training speed and performance, and most improvement strategies can be transferred to other nonspecific areas. This paper compares it with the original AlphaZero process, and results show that NoGoZero+ increases the training speed to about six times that of the original AlphaZero process. Moreover, in the experiment, our agent beat the original AlphaZero agent with a score of 81:19 after only being trained on 20,000 self-play games' data (small in quantity compared with the 120,000 self-play games' data consumed by the original AlphaZero). The NoGo game program based on NoGoZero+ was the runner-up in the 2020 China Computer Game Championship (CCGC) with limited resources, defeating many AlphaZero-based programs.
  • AI Chips: What They Are and Why They Matter
    APRIL 2020. AI Chips: What They Are and Why They Matter. An AI Chips Reference. Authors: Saif M. Khan, Alexander Mann. Table of Contents: Introduction and Summary; The Laws of Chip Innovation; Transistor Shrinkage: Moore's Law; Efficiency and Speed Improvements; Increasing Transistor Density Unlocks Improved Designs for Efficiency and Speed; Transistor Design is Reaching Fundamental Size Limits; The Slowing of Moore's Law and the Decline of General-Purpose Chips; The Economies of Scale of General-Purpose Chips; Costs are Increasing Faster than the Semiconductor Market; The Semiconductor Industry's Growth Rate is Unlikely to Increase; Chip Improvements as Moore's Law Slows; Transistor Improvements Continue, but are Slowing; Improved Transistor Density Enables Specialization; The AI Chip Zoo; AI Chip Types; AI Chip Benchmarks; The Value of State-of-the-Art AI Chips; The Efficiency of State-of-the-Art AI Chips Translates into Cost-Effectiveness; Compute-Intensive AI Algorithms are Bottlenecked by Chip Costs and Speed; U.S. and Chinese AI Chips and Implications for National Competitiveness; Appendix A: Basics of Semiconductors and Chips; Appendix B: How AI Chips Work; Parallel Computing; Low-Precision Computing; Memory Optimization; Domain-Specific Languages; Appendix C: AI Chip Benchmarking Studies; Appendix D: Chip Economics Model; Chip Transistor Density, Design Costs, and Energy Costs; Foundry, Assembly, Test and Packaging Costs; Acknowledgments. Introduction and Summary: Artificial intelligence will play an important role in national and international security in the years to come.
  • ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. Yuandong Tian 1, Jerry Ma * 1, Qucheng Gong * 1, Shubho Sengupta * 1, Zhuoyuan Chen 1, James Pinkerton 1, C. Lawrence Zitnick 1. Abstract: The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training … However, these advances in playing ability come at significant computational expense. A single training run requires millions of selfplay games and days of training on thousands of TPUs, which is an unattainable level of compute for the majority of the research community. When combined with the unavailability of code and models, the result is that the approach is very difficult, if not impossible, to reproduce, study, improve upon, and extend. In this paper, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero (Silver et al., 2018) algorithm for the game of Go. We then apply ELF OpenGo toward the following three additional contributions. First, we train a superhuman model for ELF OpenGo. After running our AlphaZero-style training software on 2,000 GPUs for 9 days, our 20-block model has achieved superhuman performance that is arguably comparable to the 20-block models described in Silver et al. (2017) and Silver et al. (2018).
  • Lightning Overview of Machine Learning (Shinjae Yoo, Computational Science Initiative)
    Lightning Overview of Machine Learning. Shinjae Yoo, Computational Science Initiative. Outline: Why is machine learning important? Machine learning concepts. Big data and machine learning. Potential research areas. Why is machine learning important? ML applications to physics; ML applications to biology. Machine Learning Concepts. What is Machine Learning (ML)? One definition: "How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?" (Tom Mitchell, 2006). Statistics asks what conclusions can be inferred from data; ML additionally asks what architectures and algorithms can be used to effectively handle data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability. Machine Learning Components: Algorithm, Data, HW+SW. Brief History of Machine Learning: a timeline figure spanning 1950-2016, from ENIAC, the perceptron, and k-means/KNN, through Deep Blue, n-gram-based MT, speech recognition, the web browser, Hadoop, soft-margin SVM, and GPGPU-powered data-driven learning, to IBM Watson, neural nets winning ImageNet by a large margin, neural nets outperforming humans on ImageNet, and AlphaGo. Supervised Learning Pipeline: training (labeling; preprocessing with data cleansing, feature engineering, and normalization; validation) and testing (preprocessing, then prediction); a minimal sketch follows below. Unsupervised Learning Pipeline: preprocessing (data cleansing, feature engineering, normalization), then prediction. Types of Learning: generative learning; discriminative learning; active learning (how to select training data?); multi-task learning; transfer learning; kernel learning; metric learning; dimensionality reduction; feature learning. • Lee, et al.
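As referenced above, here is a minimal sketch of such a supervised pipeline using scikit-learn; the dataset, imputation and scaling choices, and model are illustrative assumptions rather than anything prescribed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Labeled data (the "labeling" step is already done for this toy dataset).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Preprocessing (data cleansing + normalization) chained with the model,
# so the same transformations are reapplied at test/predict time.
pipeline = Pipeline([
    ("cleanse", SimpleImputer(strategy="median")),   # handle missing values
    ("normalize", StandardScaler()),                 # zero-mean, unit-variance features
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)                       # training (validation would go here)
print("held-out accuracy:", pipeline.score(X_test, y_test))  # testing: preprocess + predict
```

Wrapping preprocessing and the model in one Pipeline object is what keeps the training and testing branches of the diagram consistent with each other.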