TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

(Preliminary White Paper, November 9, 2015)

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Google Research*

*Corresponding authors: Jeffrey Dean and Rajat Monga: {jeff,monga}@google.com

Abstract

TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November 2015 and are available at www.tensorflow.org.

1 Introduction

The Google Brain project started in 2011 to explore the use of very-large-scale deep neural networks, both for research and for use in Google's products. As part of the early work in this project, we built DistBelief, our first-generation scalable distributed training and inference system [14], and this system has served us well. We and others at Google have performed a wide variety of research using DistBelief, including work on unsupervised learning [31], language representation [35, 52], models for image classification and object detection [16, 48], video classification [27], speech recognition [56, 21, 20], sequence prediction [47], move selection for Go [34], pedestrian detection [2], reinforcement learning [38], and other areas [17, 5]. In addition, often in close collaboration with the Google Brain team, more than 50 teams at Google and other Alphabet companies have deployed deep neural networks using DistBelief in a wide variety of products, including Google Search [11], our advertising products, our speech recognition systems [50, 6, 46], Google Photos [43], Google Maps and StreetView [19], Google Translate [18], YouTube, and many others.

Based on our experience with DistBelief and a more complete understanding of the desirable system properties and requirements for training and using neural networks, we have built TensorFlow, our second-generation system for the implementation and deployment of large-scale machine learning models. TensorFlow takes computations described using a dataflow-like model and maps them onto a wide variety of different hardware platforms, ranging from running inference on mobile device platforms such as Android and iOS, to modest-sized training and inference systems using single machines containing one or many GPU cards, to large-scale training systems running on hundreds of specialized machines with thousands of GPUs.
Having a single system that can span such a broad range of platforms significantly simplifies the real-world use of machine learning systems, as we have found that having separate systems for large-scale training and small-scale deployment leads to significant maintenance burdens and leaky abstractions. TensorFlow computations are expressed as stateful dataflow graphs (described in more detail in Section 2), and we have focused on making the system both flexible enough for quickly experimenting with new models for research purposes and sufficiently high-performance and robust for production training and deployment of machine learning models. For scaling neural network training to larger deployments, TensorFlow allows clients to easily express various kinds of parallelism through replication and parallel execution of a core model dataflow graph, with many different computational devices all collaborating to update a set of shared parameters or other state. Modest changes in the description of the computation allow a wide variety of different approaches to parallelism to be achieved and tried with low effort [14, 29, 42]. Some TensorFlow uses allow flexibility in terms of the consistency of parameter updates, and we can easily express and take advantage of these relaxed synchronization requirements in some of our larger deployments. Compared to DistBelief, TensorFlow's programming model is more flexible, its performance is significantly better, and it supports training and using a broader range of models on a wider variety of heterogeneous hardware platforms.

Dozens of our internal clients of DistBelief have already switched to TensorFlow. These clients rely on TensorFlow for research and production, with tasks ranging from running inference for computer vision models on mobile phones to large-scale training of deep neural networks with hundreds of billions of parameters on hundreds of billions of example records, using many hundreds of machines [11, 47, 48, 18, 53, 41]. Although these applications have concentrated on machine learning, and on deep neural networks in particular, we expect that TensorFlow's abstractions will be useful in a variety of other domains, including other kinds of machine learning algorithms and possibly other kinds of numerical computations. We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license in November 2015; both are available at www.tensorflow.org.

The rest of this paper describes TensorFlow in more detail. Section 2 describes the programming model and basic concepts of the TensorFlow interface, and Section 3 describes both our single-machine and distributed implementations. Section 4 describes several extensions to the basic programming model, and Section 5 describes several optimizations to the basic implementations. Section 6 describes some of our experiences in using TensorFlow, Section 7 describes several programming idioms we have found helpful when using TensorFlow, and Section 9 describes several auxiliary tools we have built around the core TensorFlow system. Sections 10 and 11 discuss future and related work, respectively, and Section 12 offers concluding thoughts.

2 Programming Model and Basic Concepts

A TensorFlow computation is described by a directed graph, which is composed of a set of nodes. The graph represents a dataflow computation, with extensions for allowing some kinds of nodes to maintain and update persistent state and for branching and looping control structures within the graph, in a manner similar to Naiad [36]. Clients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2.
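Figure 1 itself is not reproduced in this excerpt, so the fragment below is a minimal sketch in its spirit rather than the paper's exact listing: it builds a small graph (a single ReLU layer with an illustrative stand-in cost) and then executes it through a Session. It assumes the graph-mode Python API of the era; under TensorFlow 2.x the same calls are reachable via the tf.compat.v1 module used here.

    import numpy as np
    import tensorflow.compat.v1 as tf  # graph-mode API; plain `tensorflow` in the 2015 release
    tf.disable_eager_execution()

    # Graph construction: nodes are created, but nothing executes yet.
    b = tf.Variable(tf.zeros([100]))                           # bias vector (mutable state)
    W = tf.Variable(tf.random_uniform([784, 100], -1.0, 1.0))  # weight matrix
    x = tf.placeholder(tf.float32, shape=[None, 784])          # input supplied at run time
    relu = tf.nn.relu(tf.matmul(x, W) + b)                     # ReLU(xW + b)
    cost = tf.reduce_sum(relu)                                 # illustrative stand-in for a loss

    # Graph execution: a Session maps the graph onto available devices and runs it.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        batch = np.random.rand(8, 784).astype(np.float32)
        print(sess.run(cost, feed_dict={x: batch}))            # fetch `cost`, feeding x = batch

Note the two distinct phases: everything before tf.Session is pure graph construction, and only sess.run triggers execution of the subgraph needed to produce the fetched value.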
In a TensorFlow graph, each node has zero or more inputs and zero or more outputs, and represents the instantiation of an operation. Values that flow along normal edges in the graph (from outputs to inputs) are tensors: arrays of arbitrary dimensionality whose underlying element type is specified or inferred at graph-construction time. Special edges, called control dependencies, can also exist in the graph: no data flows along such edges, but they indicate that the source node of the control dependence must finish executing before the destination node starts executing. Since our model includes mutable state, control dependencies can be used directly by clients to enforce happens-before relationships (a concrete sketch appears at the end of this section). Our implementation also sometimes inserts control dependencies to enforce orderings between otherwise independent operations, as a way of, for example, controlling peak memory usage.

Operations and Kernels

An operation has a name and represents an abstract computation (e.g., "matrix multiply" or "add"). An operation can have attributes, and all attributes must be provided or inferred at graph-construction time in order to instantiate a node to perform the operation. One common use of attributes is to make operations polymorphic over different tensor element types (e.g., add of two tensors of type float versus add of two tensors of type int32). A kernel is a particular implementation of an operation that can be run on a particular type of device (e.g., CPU or GPU). A TensorFlow binary defines the sets of operations and kernels available via a registration mechanism, and this set can be extended by linking in additional operation and/or kernel definitions/registrations. Table 1 shows some of the kinds of operations built into the core TensorFlow library.

Sessions

Clients interact with the TensorFlow system by creating a Session. To create a computation graph, the Session interface supports an Extend method to augment the current graph managed by the session with additional nodes and edges.
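As a concrete illustration of the control-dependency mechanism described above, the sketch below (under the same tf.compat.v1 assumption as the earlier fragment; the variable and op names are invented for the example) forces an update to mutable state to happen before a read that has no data dependence on it.

    import tensorflow.compat.v1 as tf  # same assumed graph-mode API as the earlier sketch
    tf.disable_eager_execution()

    counter = tf.Variable(0, name="counter")  # mutable state in the graph
    increment = tf.assign_add(counter, 1)     # operation that mutates the variable

    # A control dependency (no data flows along it) enforces a happens-before
    # relationship: `read` may not start until `increment` has finished.
    with tf.control_dependencies([increment]):
        read = tf.identity(counter)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(read))  # prints 1: the increment ran before the read

Without the control_dependencies block, a run that fetches only read would be free to skip the increment entirely, since nothing in the dataflow would force it to execute.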