Machine Learning Parallelism Could Be Adaptive, Composable and Automated

Total Page:16

File Type:pdf, Size:1020Kb

Machine Learning Parallelism Could Be Adaptive, Composable and Automated Machine Learning Parallelism Could Be Adaptive, Composable and Automated Hao Zhang CMU-RI-TR-20-54 October 8, 2020 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Thesis Committee: Greg R. Ganger Jinyang Li (NYU) Deva Ramanan Christopher Re´ (Stanford) Eric P. Xing, Chair Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Copyright © 2020 Hao Zhang. The research in this thesis was jointly sponsored by: the National Science Foundation awards IIS-1447676 and CCF- 1629559, the Defense Advanced Research Project Agency award FA872105C0003, and Petuum Inc. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity. Keywords: Scalable Machine Learning, Parallelization, Machine Learning Parallelism, Dis- tributed Machine Learning, Machine Learning System, Compiler, Automatic Parallelization, SysML, AutoML, Parameter Server, Composability To my fiancee,´ Luna. iv Abstract In recent years, the pace of innovations in the fields of machine learning (ML) has accelerated, researchers in SysML have created algorithms and systems that par- allelize ML training over multiple devices or computational nodes. As ML models become more structurally complex, many systems have struggled to provide all- round performance on a variety of models. Particularly, ML scale-up is usually un- derestimated in terms of the amount of knowledge and time required to map from an appropriate distribution strategy to the model. Applying parallel training systems to complex models adds nontrivial development overheads in addition to model proto- typing, and often results in lower-than-expected performance. This thesis identifies and addresses research challenges in both usability and performance in parallel ML techniques and system implementations. The first part of this thesis presents a simple design principle, adaptive paral- lelism, that applies suitable parallelization techniques to model building blocks (e.g., layers) according to their specific ML properties. Following it, we derive a series of optimizations and implementations optimizing different aspects of ML paralleliza- tion. We examine them and show that they significantly boost the efficiency or scal- ability of ML training on clusters 2-10x in their applicable scenarios. Generalizing this methodology, this second part of this thesis formulates the ML parallelization as an end-to-end optimization problem, and seeks to solve it automat- ically, for two broad paradigms of ML parallelization tasks: single-node dynamic batching and distributed ML parallelisms. We present principled representations to express the two classes of ML parallelisms, along with composable system archi- tectures, Cavs and AutoDist, respectively. They enable rapid compositions of par- allelization strategies for unseen models, improve parallelization performance, and simplify parallel ML programming. On top of them, the third part of this thesis presents an automatic paralleliza- tion framework, AutoSync, to automatically optimize synchronization strategies in data-parallel distributed training. AutoSync achieves high performance “out-of-the- box” – it navigates the space spanned by the proposed representation, and auto- matically identifies synchronization strategies that report 1.2 - 1.6x speedups over existing hand-optimized systems, lowering the technical barrier of distributed ML and helping make it accessible to a larger community of users. Collectively, the set of techniques and systems developed in this thesis lead to the proof of the concept and the prototype implementation of an end-to-end compiler system for large-scale ML training on distributed environments. vi Acknowledgments First and foremost, I thank my advisor Eric Xing, for giving me a huge amount of freedom and trust during my graduate studies. In my early years, Eric was instru- mental in teaching me to identity the right problems to work on. Eric also offered me the chance to work at Petuum during my PhD, during which I have received tremen- dous support to address new challenges that I would’ve otherwise never been able to see in a blue-sky research environment. This academic entrepreneurship experience has helped me identify my lifelong goal. I am extremely grateful to my thesis committee members: Greg Ganger, Deva Ramanan, Jinyang Li, and Christopher Re,´ for their feedback to make this thesis hap- pen. As a member of the BigLearning group, Greg has always been a good mentor and collaborator – during the BigLearning meetings, Greg’s sharp questions always helped me refine the characterizations of the problems, solutions, and systems. Dur- ing my 1-day visit at NYU, Jinyang inspired me a lot to maintain an ML-system balanced perspective. While Deva is not a SysML insider, his scientific insight and intuition have been extremely important in shaping this thesis. Chris’ path of trans- forming ML research into useful technologies and values have been an example, and I hope to collaborate with him in the future. Research is a team effort and I felt super grateful to be a part of amazing teams at CMU and Petuum. I appreciate the helpful discussion and the camaraderie with my friends and collaborators at CMU: Henggang Cui, Wei Dai, Zhijie Deng, Qirong Ho, Zhiting Hu, Gunhee Kim, Jin Kyu Kim, Christy Li, Aurick Qiao, Jinliang Wei, Shizhen Xu, Zhicheng Yan, Zeyu Zheng, Xun Zheng, Yuntian Deng, Qi Guo, Willie Neiswanger, Hector Li, Xiaodan Liang, Bin Zhao. I was extremely lucky to lead a small team at Petuum but do BIG things during my PhD. I thank my colleagues Peng Wu, Hong Wu, Trevin Gandhi, Zeya Wang, Tairui Wang, Xin Gao, Sean Chen, Luke Lu, Jayesh Gada, Henry Guo, Liz Yi, Cathy Serventi for their helps in various cases ranging from debugging a line a code to standing at the ICML EXPO. During my graduate studies, I also had the opportunity to interact with many other faculty members at CMU: Phillip Gibbons, Graham Neubig, Garth Gibson, Kris Kitani, Ruslan Salakhutdinov, Min Xu, Yaser Sheikh, Abhinav Gupta, David Wettergreen, Martial Hebert. Their nice comments and feedback have been always helpful in shaping me as an independent researcher. I want to give a special ac- knowledgement to Graham Neubig, Kris Kitani, and Min Xu for helping me with my speaking skill requirements. Finally, I would like to thank my family for always supporting and believing in me. Especially, I am deeply indebted to my dear girlfriend Luna Yang for her unconditional love, understanding, and encouragement during my graduate years. I can clearly remember every moment in the past 5 years of her accompany during all my ups and downs – when we were skiing the Rockies, diving Cenotes, watching sea otters. This thesis would not be possible without your continual support. Thank you, viii Contents 1 Introduction1 1.1 Example: Distributed Pretraining of Language Models on Web-scale Text....3 1.2 Thesis Outline, Contributions, and Key Results..................7 2 Background 10 2.1 Machine Learning: A Computational Perspective................. 10 2.1.1 “The Master Equation”........................... 10 2.1.2 Notable ML Models and Applications................... 11 2.1.3 Parallel Machine Learning......................... 13 2.1.4 Trends in Machine Learning........................ 14 2.2 Hardware Environment for Parallel Machine Learning............... 18 2.2.1 Trends in ML Hardware Infrastructure................... 21 2.3 Machine Learning Frameworks........................... 22 2.3.1 Earlier ML systems............................. 22 2.3.2 Modern DL Frameworks.......................... 23 2.3.3 Trends in ML Frameworks......................... 24 2.4 Distributed Systems for Large-scale ML: A Review................ 26 2.4.1 Strategies for ML Parallelization...................... 26 2.4.2 Distributed ML Programming APIs.................... 33 2.4.3 Trends in Distributed ML Systems..................... 35 I Aspects of ML Parallelization 37 3 Scheduling and Communication 40 3.1 Background..................................... 40 3.1.1 Distributed Training of Neural Networks................. 40 3.1.2 Communication Architectures....................... 42 3.1.3 Communication Challenges on GPU Clusters: An Example....... 43 3.2 Scheduling..................................... 44 3.2.1 The Sequential Structure of DL Programs................. 44 3.2.2 Wait-free Backpropagation......................... 45 3.3 Adaptive Communication.............................. 46 ix 3.4 Poseidon: An Efficient Communication Architecture for Distributed DL on GPU Clusters....................................... 48 3.4.1 System Architecture............................ 49 3.4.2 Integrate Poseidon with DL Frameworks................. 51 3.5 Evaluation...................................... 53 3.5.1 Experiment Setup.............................. 53 3.5.2 Scalability................................. 54 3.5.3 Bandwidth Experiments.......................... 59 3.5.4 Comparisons to Other Methods...................... 61 3.5.5 Application: Scaling Up Image Classification on ImageNet 22K..... 61 3.6 Additional Related Work.............................. 63 4 Memory Management 66 4.1 Introduction..................................... 67 4.1.1 Deep Learning on GPUs: a Memory Management Perspective...... 67 4.1.2 Memory Challenges for Distributed DL on GPU Clusters......... 67 4.2 Memory Swapping................................
Recommended publications
  • Deep Learning Frameworks | NVIDIA Developer
    4/10/2017 Deep Learning Frameworks | NVIDIA Developer Deep Learning Frameworks The NVIDIA Deep Learning SDK accelerates widely­used deep learning frameworks such as Caffe, CNTK, TensorFlow, Theano and Torch as well as many other deep learning applications. Choose a deep learning framework from the list below, download the supported version of cuDNN and follow the instructions on the framework page to get started. Caffe is a deep learning framework made with expression, speed, and modularity in mind. Caffe is developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors and is popular for computer vision. Caffe supports cuDNN v5 for GPU acceleration. Supported interfaces: C, C++, Python, MATLAB, Command line interface Learning Resources Deep learning course: Getting Started with the Caffe Framework Blog: Deep Learning for Computer Vision with Caffe and cuDNN Download Caffe Download cuDNN The Microsoft Cognitive Toolkit —previously known as CNTK— is a unified deep­learning toolkit from Microsoft Research that makes it easy to train and combine popular model types across multiple GPUs and servers. Microsoft Cognitive Toolkit implements highly efficient CNN and RNN training for speech, image and text data. Microsoft Cognitive Toolkit supports cuDNN v5.1 for GPU acceleration. Supported interfaces: Python, C++, C# and Command line interface Download CNTK Download cuDNN TensorFlow is a software library for numerical computation using data flow graphs, developed by Google’s Machine Intelligence research organization. TensorFlow supports cuDNN v5.1 for GPU acceleration. Supported interfaces: C++, Python Download TensorFlow Download cuDNN https://developer.nvidia.com/deep­learning­frameworks 1/3 4/10/2017 Deep Learning Frameworks | NVIDIA Developer Theano is a math expression compiler that efficiently defines, optimizes, and evaluates mathematical expressions involving multi­dimensional arrays.
    [Show full text]
  • Theano: a Python Framework for Fast Computation of Mathematical Expressions (The Theano Development Team)∗
    Theano: A Python framework for fast computation of mathematical expressions (The Theano Development Team)∗ Rami Al-Rfou,6 Guillaume Alain,1 Amjad Almahairi,1 Christof Angermueller,7, 8 Dzmitry Bahdanau,1 Nicolas Ballas,1 Fred´ eric´ Bastien,1 Justin Bayer, Anatoly Belikov,9 Alexander Belopolsky,10 Yoshua Bengio,1, 3 Arnaud Bergeron,1 James Bergstra,1 Valentin Bisson,1 Josh Bleecher Snyder, Nicolas Bouchard,1 Nicolas Boulanger-Lewandowski,1 Xavier Bouthillier,1 Alexandre de Brebisson,´ 1 Olivier Breuleux,1 Pierre-Luc Carrier,1 Kyunghyun Cho,1, 11 Jan Chorowski,1, 12 Paul Christiano,13 Tim Cooijmans,1, 14 Marc-Alexandre Cotˆ e,´ 15 Myriam Cotˆ e,´ 1 Aaron Courville,1, 4 Yann N. Dauphin,1, 16 Olivier Delalleau,1 Julien Demouth,17 Guillaume Desjardins,1, 18 Sander Dieleman,19 Laurent Dinh,1 Melanie´ Ducoffe,1, 20 Vincent Dumoulin,1 Samira Ebrahimi Kahou,1, 2 Dumitru Erhan,1, 21 Ziye Fan,22 Orhan Firat,1, 23 Mathieu Germain,1 Xavier Glorot,1, 18 Ian Goodfellow,1, 24 Matt Graham,25 Caglar Gulcehre,1 Philippe Hamel,1 Iban Harlouchet,1 Jean-Philippe Heng,1, 26 Balazs´ Hidasi,27 Sina Honari,1 Arjun Jain,28 Sebastien´ Jean,1, 11 Kai Jia,29 Mikhail Korobov,30 Vivek Kulkarni,6 Alex Lamb,1 Pascal Lamblin,1 Eric Larsen,1, 31 Cesar´ Laurent,1 Sean Lee,17 Simon Lefrancois,1 Simon Lemieux,1 Nicholas Leonard,´ 1 Zhouhan Lin,1 Jesse A. Livezey,32 Cory Lorenz,33 Jeremiah Lowin, Qianli Ma,34 Pierre-Antoine Manzagol,1 Olivier Mastropietro,1 Robert T. McGibbon,35 Roland Memisevic,1, 4 Bart van Merrienboer,¨ 1 Vincent Michalski,1 Mehdi Mirza,1 Alberto Orlandi, Christopher Pal,1, 2 Razvan Pascanu,1, 18 Mohammad Pezeshki,1 Colin Raffel,36 Daniel Renshaw,25 Matthew Rocklin, Adriana Romero,1 Markus Roth, Peter Sadowski,37 John Salvatier,38 Franc¸ois Savard,1 Jan Schluter,¨ 39 John Schulman,24 Gabriel Schwartz,40 Iulian Vlad Serban,1 Dmitriy Serdyuk,1 Samira Shabanian,1 Etienne´ Simon,1, 41 Sigurd Spieckermann, S.
    [Show full text]
  • Practical Deep Learning
    Practical deep learning December 12-13, 2019 CSC – IT Center for Science Ltd., Espoo Markus Koskela Mats Sjöberg All original material (C) 2019 by CSC – IT Center for Science Ltd. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 Unported License, http://creativecommons.org/licenses/by-sa/4.0 All other material copyrighted by their respective authors. Course schedule Thursday Friday 9.00-10.30 Lecture 1: Introduction 9.00-9.45 Lecture 5: Deep to deep learning learning frameworks 10.30-10.45 Coffee break 9.45-10.15 Lecture 6: GPUs and batch jobs 10.45-11.00 Exercise 1: Introduction to Notebooks, Keras 10.15-10.30 Coffee break 11.00-11.30 Lecture 2: Multi-layer 10.30-12.00 Exercise 5: Image perceptron networks classification: dogs vs. cats; traffic signs 11.30-12.00 Exercise 2: Classifica- 12.00-13.00 Lunch break tion with MLPs 13.00-14.00 Exercise 6: Text catego- 12.00-13.00 Lunch break riZation: 20 newsgroups 13.00-14.00 Lecture 3: Images and convolutional neural 14.00-14.45 Lecture 7: Cloud, GPU networks utiliZation, using multiple GPU 14.00-14.30 Exercise 3: Image classification with CNNs 14.45-15.00 Coffee break 14.30-14.45 Coffee break 15.00-16.00 Exercise 7: Using multiple GPUs 14.45-15.30 Lecture 4: Text data, embeddings, 1D CNN, recurrent neural networks, attention 15.30-16.00 Exercise 4: Text sentiment classification with CNNs and RNNs Up-to-date agenda and lecture slides can be found at https://tinyurl.com/r3fd3st Exercise materials are at GitHub: https://github.com/csc-training/intro-to-dl/ Wireless accounts for CSC-guest network behind the badges.
    [Show full text]
  • Comparative Study of Deep Learning Software Frameworks
    Comparative Study of Deep Learning Software Frameworks Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah Research and Technology Center, Robert Bosch LLC {Soheil.Bahrampour, Naveen.Ramakrishnan, fixed-term.Lukas.Schott, Mohak.Shah}@us.bosch.com ABSTRACT such as dropout and weight decay [2]. As the popular- Deep learning methods have resulted in significant perfor- ity of the deep learning methods have increased over the mance improvements in several application domains and as last few years, several deep learning software frameworks such several software frameworks have been developed to have appeared to enable efficient development and imple- facilitate their implementation. This paper presents a com- mentation of these methods. The list of available frame- parative study of five deep learning frameworks, namely works includes, but is not limited to, Caffe, DeepLearning4J, Caffe, Neon, TensorFlow, Theano, and Torch, on three as- deepmat, Eblearn, Neon, PyLearn, TensorFlow, Theano, pects: extensibility, hardware utilization, and speed. The Torch, etc. Different frameworks try to optimize different as- study is performed on several types of deep learning ar- pects of training or deployment of a deep learning algorithm. chitectures and we evaluate the performance of the above For instance, Caffe emphasises ease of use where standard frameworks when employed on a single machine for both layers can be easily configured without hard-coding while (multi-threaded) CPU and GPU (Nvidia Titan X) settings. Theano provides automatic differentiation capabilities which The speed performance metrics used here include the gradi- facilitates flexibility to modify architecture for research and ent computation time, which is important during the train- development. Several of these frameworks have received ing phase of deep networks, and the forward time, which wide attention from the research community and are well- is important from the deployment perspective of trained developed allowing efficient training of deep networks with networks.
    [Show full text]
  • Comparative Study of Caffe, Neon, Theano, and Torch
    Workshop track - ICLR 2016 COMPARATIVE STUDY OF CAFFE,NEON,THEANO, AND TORCH FOR DEEP LEARNING Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah Bosch Research and Technology Center fSoheil.Bahrampour,Naveen.Ramakrishnan, fixed-term.Lukas.Schott,[email protected] ABSTRACT Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of four deep learning frameworks, namely Caffe, Neon, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is per- formed on several types of deep learning architectures and we evaluate the per- formance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important dur- ing the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. For convolutional networks, we also report how each of these frameworks support various convolutional algo- rithms and their corresponding performance. From our experiments, we observe that Theano and Torch are the most easily extensible frameworks. We observe that Torch is best suited for any deep architecture on CPU, followed by Theano. It also achieves the best performance on the GPU for large convolutional and fully connected networks, followed closely by Neon. Theano achieves the best perfor- mance on GPU for training and deployment of LSTM networks. Finally Caffe is the easiest for evaluating the performance of standard deep architectures.
    [Show full text]
  • Torch7: a Matlab-Like Environment for Machine Learning
    Torch7: A Matlab-like Environment for Machine Learning Ronan Collobert1 Koray Kavukcuoglu2 Clement´ Farabet3;4 1 Idiap Research Institute 2 NEC Laboratories America Martigny, Switzerland Princeton, NJ, USA 3 Courant Institute of Mathematical Sciences 4 Universite´ Paris-Est New York University, New York, NY, USA Equipe´ A3SI - ESIEE Paris, France Abstract Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be in- terfaced to third-party software thanks to Lua’s light interface. 1 Torch7 Overview With Torch7, we aim at providing a framework with three main advantages: (1) it should ease the development of numerical algorithms, (2) it should be easily extended (including the use of other libraries), and (3) it should be fast. We found that a scripting (interpreted) language with a good C API appears as a convenient solu- tion to “satisfy” the constraint (2). First, a high-level language makes the process of developing a program simpler and more understandable than a low-level language. Second, if the programming language is interpreted, it becomes also easier to quickly try various ideas in an interactive manner. And finally, assuming a good C API, the scripting language becomes the “glue” between hetero- geneous libraries: different structures of the same concept (coming from different libraries) can be hidden behind a unique structure in the scripting language, while keeping all the functionalities coming from all the different libraries.
    [Show full text]
  • Tensorflow, Theano, Keras, Torch, Caffe Vicky Kalogeiton, Stéphane Lathuilière, Pauline Luc, Thomas Lucas, Konstantin Shmelkov Introduction
    TensorFlow, Theano, Keras, Torch, Caffe Vicky Kalogeiton, Stéphane Lathuilière, Pauline Luc, Thomas Lucas, Konstantin Shmelkov Introduction TensorFlow Google Brain, 2015 (rewritten DistBelief) Theano University of Montréal, 2009 Keras François Chollet, 2015 (now at Google) Torch Facebook AI Research, Twitter, Google DeepMind Caffe Berkeley Vision and Learning Center (BVLC), 2013 Outline 1. Introduction of each framework a. TensorFlow b. Theano c. Keras d. Torch e. Caffe 2. Further comparison a. Code + models b. Community and documentation c. Performance d. Model deployment e. Extra features 3. Which framework to choose when ..? Introduction of each framework TensorFlow architecture 1) Low-level core (C++/CUDA) 2) Simple Python API to define the computational graph 3) High-level API (TF-Learn, TF-Slim, soon Keras…) TensorFlow computational graph - auto-differentiation! - easy multi-GPU/multi-node - native C++ multithreading - device-efficient implementation for most ops - whole pipeline in the graph: data loading, preprocessing, prefetching... TensorBoard TensorFlow development + bleeding edge (GitHub yay!) + division in core and contrib => very quick merging of new hotness + a lot of new related API: CRF, BayesFlow, SparseTensor, audio IO, CTC, seq2seq + so it can easily handle images, videos, audio, text... + if you really need a new native op, you can load a dynamic lib - sometimes contrib stuff disappears or moves - recently introduced bells and whistles are barely documented Presentation of Theano: - Maintained by Montréal University group. - Pioneered the use of a computational graph. - General machine learning tool -> Use of Lasagne and Keras. - Very popular in the research community, but not elsewhere. Falling behind. What is it like to start using Theano? - Read tutorials until you no longer can, then keep going.
    [Show full text]
  • Introduction to Deep Learning Framework 1. Introduction 1.1
    Introduction to Deep Learning Framework 1. Introduction 1.1. Commonly used frameworks The most commonly used frameworks for deep learning include Pytorch, Tensorflow, Keras, caffe, Apache MXnet, etc. PyTorch: open source machine learning library; developed by Facebook AI Rsearch Lab; based on the Torch library; supports Python and C++ interfaces. Tensorflow: open source software library dataflow and differentiable programming; developed by Google brain team; provides stable Python & C APIs. Keras: an open-source neural-network library written in Python; conceived to be an interface; capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Caffe: open source under BSD licence; developed at University of California, Berkeley; written in C++ with a Python interface. Apache MXnet: an open-source deep learning software framework; supports a flexible programming model and multiple programming languages (including C++, Python, Java, Julia, Matlab, JavaScript, Go, R, Scala, Perl, and Wolfram Language.) 1.2. Pytorch 1.2.1 Data Tensor: the major computation unit in PyTorch. Tensor could be viewed as the extension of vector (one-dimensional) and matrix (two-dimensional), which could be defined with any dimension. Variable: a wrapper of tensor, which includes creator, value of variable (tensor), and gradient. This is the core of the automatic derivation in Pytorch, as it has the information of both the value and the creator, which is very important for current backward process. Parameter: a subset of variable 1.2.2. Functions: NNModules: NNModules (torch.nn) is a combination of parameters and functions, and could be interpreted as layers. There some common modules such as convolution layers, linear layers, pooling layers, dropout layers, etc.
    [Show full text]
  • Fashionable Modelling with Flux
    Fashionable Modelling with Flux Michael J Innes Elliot Saba Keno Fischer Julia Computing, Inc. Julia Computing, Inc. Julia Computing, Inc. Edinburgh, UK Cambridge, MA, USA Cambridge, MA, USA [email protected] [email protected] [email protected] Dhairya Gandhi Marco Concetto Rudilosso Julia Computing, Inc. University College London Bangalore, India London, UK [email protected] [email protected] Neethu Mariya Joy Tejan Karmali Birla Institute of Technology and Science National Institute of Technology Pilani, India Goa, India [email protected] [email protected] Avik Pal Viral B. Shah Indian Institute of Technology Julia Computing, Inc. Kanpur, India Cambridge, MA, USA [email protected] [email protected] Abstract Machine learning as a discipline has seen an incredible surge of interest in recent years due in large part to a perfect storm of new theory, superior tooling, renewed interest in its capabilities. We present in this paper a framework named Flux that shows how further refinement of the core ideas of machine learning, built upon the foundation of the Julia programming language, can yield an environment that is simple, easily modifiable, and performant. We detail the fundamental principles of Flux as a framework for differentiable programming, give examples of models that are implemented within Flux to display many of the language and framework-level features that contribute to its ease of use and high productivity, display internal compiler techniques used to enable the acceleration and performance that lies at arXiv:1811.01457v3 [cs.PL] 10 Nov 2018 the heart of Flux, and finally give an overview of the larger ecosystem that Flux fits inside of.
    [Show full text]
  • Enhancing OMB with Python Benchmarks Presentation at MUG ‘21
    MVAPICH2 Meets Python: Enhancing OMB with Python Benchmarks Presentation at MUG ‘2 Nawras Alnaasan Network-Based Computing Laboratory Dept. of Computer Science and Engineering !he #hio State $ni%ersity alnaasan.&'osu.edu O$tline • #S$ (icro Benc"marks )#(B* • +"y ,yt"on- • (,. for ,yt"on • .ntroducing #(B-,y • .nitial /esults • Conclusion !etwork Base" Com#$ting %a&oratory MUG ‘2 2 O)U Micro Benchmarks (OMB+ • #(B is a widely used package to measure performance of (,. operations. • .t helps in optimi0ing (,I applications on different HPC systems. • #1ers a series of benchmarks including but not limited to point-to-point, blocking and non-blocking collecti%es, and one-sided tests. • A large set of options and flags for users to create customi0able tests. • Supports different platforms like /#Cm/CUDA to run on AR(4N5IDIA GPUs. • +ritten in C !etwork Base" Com#$ting %a&oratory MUG ‘2 ( -hy Python? • Second most popular programming language according to the !.#BE inde7 as of August 8021. Adopted from= "ttps=44www.tiobe.com4tiobe-inde74 • 6reat community and support around (achine Learning Big Data and Cloud Computing – (any deep learning frameworks like ,yTorch, !ensor3ow :eras etc. – #ne of the most important tools for data science and analytics. • 6reat ,otential to use (,. and distributed computing tec"niques. !etwork• (any Base" other Com#$ting ,yt"on libraries and frameworks like Sci,y Numpy D<ango etc. %a&oratory MUG ‘2 , -hy Python? 5ery flexible and simplified synta7. Adopted from= !etwork Base" Com#$ting "ttps:44www."ardikp.com489&>4&84?94pyt"on-cpp4 %a&oratory MUG ‘2 / MPI for Python • (,.
    [Show full text]
  • DLL: a Fast Deep Neural Network Library
    Published in Artificial Neural Networks in Pattern Recognition. ANNPR 2018, Lecture Notes in Computer Science, vol. 11081, which should be cited to refer to this work. DOI: https://doi.org/10.1007/978-3-319-99978-4_4 DLL: A Fast Deep Neural Network Library Baptiste Wicht, Andreas Fischer, and Jean Hennebert 1 HES-SO, University of Applied Science of Western Switzerland 2 University of Fribourg, Switzerland Abstract. Deep Learning Library (DLL) is a library for machine learn- ing with deep neural networks that focuses on speed. It supports feed- forward neural networks such as fully-connected Artificial Neural Net- works (ANNs) and Convolutional Neural Networks (CNNs). Our main motivation for this work was to propose and evaluate novel software engineering strategies with potential to accelerate runtime for training and inference. Such strategies are mostly independent of the underlying deep learning algorithms. On three different datasets and for four dif- ferent neural network models, we compared DLL to five popular deep learning libraries. Experimentally, it is shown that the proposed library is systematically and significantly faster on CPU and GPU. In terms of classification performance, similar accuracies as the other libraries are reported. 1 Introduction In recent years, neural networks have regained a large deal of attention with deep learning approaches. Such approaches rely on the use of bigger and deeper networks, typically by using larger input dimensions to incorporate more con- text and by increasing the number of layers to extract information at different levels of granularity. The success of deep learning can be attributed mainly to three factors. First, there is the advent of big data, meaning the availability of larger quantities of training data.
    [Show full text]
  • The Big Picture
    The Bigger Picture John Urbanic Parallel Computing Scientist Pittsburgh Supercomputing Center Copyright 2021 Building Blocks So far, we have used Fully Connected and Convolutional layers. These are ubiquitous, but there are many others: • Fully Connected (FC) • Convolutional (CNN) • Residual (ResNet) [Feed forward] • Recurrent (RNN), [Feedback, but has vanishing gradients so...] • Long Short Term Memory (LSTM) • Transformer (Attention based) • Bidirectional RNN • Restricted Boltzmann Machine • • Several of these are particularly common... Wikipedia Commons Residual Neural Nets We've mentioned that disappearing gradients can be an issue, and we know that deeper networks are more powerful. How do we reconcile these two phenomenae? One, very successful, method is to use some feedforward. Courtesy: Chris Olah• Helps preserve reasonable gradients for very deep networks • Very effective at imagery • Used by AlphaGo Zero (40 residual CNN layers) in place of previous complex dual network • 100s of layers common, Pushing 1000 #Example: input 3-channel 256x256 image x = Input(shape=(256, 256, 3)) y = Conv2D(3, (3, 3))(x) z = keras.layers.add([x, y]) Haven't all of our Keras networks been built as strict layers in a sequential method? Indeed, but Keras supports a functional API that provides the ability to define network that branch in other ways. It is easy and here (https://www.tensorflow.org/guide/keras/functional) is an MNIST example with a 3 dense layers. More to our current point, here (https://www.kaggle.com/yadavsarthak/residual-networks-and-mnist) is a neat experiment that uses 15(!) residual layers to do MNIST. Not the most effective approach, but it works and illustrates the concept beautifully.
    [Show full text]