Scaling DL Training Workloads with the Cray PE Plugin — Luiz DeRose


Scaling DL Training Workloads with the Cray PE Plugin
Luiz DeRose, Sr. Principal Engineer, Programming Environments Director
NERSC - June 14, 2018. © 2018 Cray Inc.

Autonomous Driving (toward L5, full autonomous driving)
[Figure: a backend training pipeline (labeled training data → CNN → supervised-learning results, no driver) feeds a frontend inference pipeline that handles real-time data, mission and trajectory planning, and the loop Sensors → Sensor fusing → Perception → Cognition → Action]

Deep Learning in Science - NERSC
● Opportunities to apply DL widely in support of classic HPC simulation and modelling

The Deep Learning Part in Autonomous Driving
● Model training is the most crucial and challenging aspect
● Many tasks to train for in autonomous driving
  ● Stereo
  ● Optical flow
  ● Visual odometry
  ● Structure-from-motion
  ● Object detection
  ● Recognition and tracking
  ● 3D scene understanding
● Many NNs to train for each task
● Retraining (or transfer learning) happens when new data is available
● A trained neural network can also be a powerful tool for
  ● Pattern recognition
  ● Classification
  ● Clustering
  ● Others…

The Deep Learning Software Landscape
[Bar chart, GitHub stars as of May 2017: TensorFlow 56… (truncated in source), Caffe 17,824, CNTK 10,681, MXNet 9,635, Torch 6,847, DeepLearning4j 6,640, Theano 6,271, PaddlePaddle 4,845, Caffe2 4,628, BigDL 1,731]
● TensorFlow is nearly as popular as all other frameworks combined

Scaling Deep Learning - Motivation
● Scaling Deep Learning training is a tool for
  ● Models that take a very long time to train and have a very large training dataset
  ● Increasing the frequency at which models can be retrained with new or improved data
    ● It is critical to be able to update the models (when new data arrives) in a matter of minutes
  ● Hyper-parameter optimization, for problems and datasets where baseline accuracy is not known
    ● Learning rate schedule, momentum, batch size
  ● Evolving topologies when a good architecture is unknown (common with novel datasets / mappings)
    ● Layer types, width, number of filters
    ● Activation functions, drop-out rates

HPC Attributes
● DL training is a classic high-performance computing problem which demands:
  ● Large compute capacity in terms of FLOPs, memory capacity, and bandwidth
  ● A performant interconnect for fast communication of gradients and model parameters
  ● Parallel I/O and storage with sufficient bandwidth to keep the compute fed at scale
The Cray PE DL Scalability Plugin - Project Overview
● Cray's primary goals were:
  ● Design a solution for scaling TensorFlow (specifically synchronous SGD) to significantly larger node counts than existing methods allowed
  ● Require minimal changes to user training scripts and provide a friendlier user experience
  ● Achieve the best possible TensorFlow performance on Cray systems
  ● Maintain accuracy for a given number of steps and hyper-parameter setup, allowing for significantly reduced time-to-accuracy through scaling
  ● Ideally, provide a portable solution that works with other deep learning frameworks

Data Parallelism - Collective-based Synchronous SGD
● Data parallel training divides a global mini-batch of examples across processes
● Each process computes gradients from its local mini-batch (compute intensive)
● Gradients are averaged across processes (communication intensive)
● All processes update their local model with the averaged gradients, so all processes have the same model (typically not much compute)
● Not shown is the I/O activity of reading training samples (and possible augmentation)

Data Parallel Synchronous SGD
● Operations in a "step" of SGD: read/prepare the local mini-batch → compute gradients → reduce gradients across all processes (local mini-batches) → update weights and biases with the gradients
● Parallel operations: preparing local mini-batches and computing gradients
● Communication: the gradient reduction, which maps to an allreduce
● Non-parallel work is the remainder
● For a fixed local mini-batch size per process, the fraction of time in parallel work is constant as more processes are added

Distributed TensorFlow
● TensorFlow has a native method for parallelism across nodes: the ClusterSpec API
● Uses the gRPC layer in TensorFlow, based on sockets
● Can be difficult to use and optimize; the user must specify
  ● hostnames and ports for all worker processes
  ● hostnames and ports for all parameter server processes (see next slide)
  ● # of workers
  ● # of parameter server processes
  ● the chief process of the workers

Distributed TensorFlow (continued)
[Diagram: devices 1..n each hold a model replica, compute ΔP from their input, and exchange parameter updates P with client processes; no parameter devices are needed in the Cray method, which dedicates all resources to gradient calculation via a scalable global add]
● The parameter device is a separate process, possibly requiring additional nodes
● The number of parameter device processes to use is not clear
● The communication is an allreduce, but it is not implemented that way
● Several gradient "aggregation" methods are possible in TensorFlow
● All implement an allreduce as a set of gather/bcast operations with point-to-point communication over sockets

Training Script Modifications
● The Cray PE DL Plugin requires the following modifications to a serial training script (a hedged sketch follows the list):
  1. Import the Python module
  2. Initialize the module (possibly configuring the thread team(s) for specific uses)
  3. Broadcast the initial model parameters
  4. Incorporate gradient aggregation between gradient computation and the model update
  5. Finalize the Python module
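The slides name these five steps but not the plugin's call signatures, so the sketch below uses a hypothetical module `dl_comm` (its functions are assumptions standing in for the real API) in a TF 1.x-style script, only to show where each modification lands:

```python
# Minimal sketch of the five modifications in a generic TF 1.x training
# script. "dl_comm" and its functions are HYPOTHETICAL stand-ins for the
# plugin's Python API; the tiny model and data are placeholders.
import numpy as np
import tensorflow as tf
import dl_comm as dc                      # (1) import the Python module

dc.init()                                 # (2) initialize (thread teams configurable)

x = tf.placeholder(tf.float32, [None, 8])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))

opt = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = opt.compute_gradients(loss)
# (4) aggregate gradients between gradient computation and the model update
grads_and_vars = [(dc.allreduce_average(g), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(grads_and_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # (3) broadcast initial parameters so every rank starts from one model
    sess.run(dc.broadcast(tf.trainable_variables(), root=0))
    for _ in range(100):
        xb = np.random.rand(32, 8).astype(np.float32)
        yb = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xb, y: yb})

dc.finalize()                             # (5) finalize the module
```

Because the aggregation step is a drop-in wrapper around the gradients, the rest of the serial script is untouched, which is the "minimal changes" property the slides emphasize.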
Cray PE DL Scalability Plugin Performance
[Charts: Inception v3 throughput (samples/sec) on CSCS Piz Daint P100 GPUs, batch size 64 per process, 8-512 nodes; ResNet-50 throughput on NERSC Cori KNL, batch size 32 per process, 8-1024 nodes; "Ideal" derived from single-node throughput with no communication software]
● 91% efficient at 512 nodes (Piz Daint); 89% efficient at 1024 nodes (Cori)
● 1.8X faster than gRPC at 128 nodes
● See Peter Mendygral's talk on Thursday at 1:30

Horovod / CPE DL Plugin – Throughput Scaling
[Charts: ResNet-50 performance on XC40 (Cori KNL at NERSC) for Horovod and the CPE DL Plugin at per-process mini-batch sizes (MBS) of 32 and 128, with a gRPC baseline; Inception v3 performance on XC50 (Piz Daint at CSCS) for the CPE DL Plugin only, MBS 4-64; aggregate samples/sec vs. node count]
● CPE DL Plugin is 1.8X faster than gRPC at 128 nodes
● CPE DL Plugin is 1.4X faster than Horovod at 128 nodes, and 3.2X at 1024 nodes

Cray PE DL Scalability Plugin
● Users can easily achieve ideal scaling performance across DL frameworks utilizing stochastic gradient descent:
  1. Load a module
  2. Plug a few simple lines into your serial Python or C-based training script
  3. Scale up your training workload to hundreds of nodes or more on Cray systems
● Delivers high performance across a variety of Cray node architectures
● Cray customers can leverage their existing compute nodes to scale DL training

Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA and YARCDATA. The following are trademarks of Cray Inc.: CHAPEL, CLUSTER CONNECT, CLUSTERSTOR, CRAYDOC, CRAYPAT, CRAYPORT, DATAWARP, ECOPHLEX, LIBSCI, NODEKARE, REVEAL.
The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used on this website are the property of their respective owners.
Recommended publications
  • Deep Learning Frameworks | NVIDIA Developer
    Deep Learning Frameworks | NVIDIA Developer. The NVIDIA Deep Learning SDK accelerates widely-used deep learning frameworks such as Caffe, CNTK, TensorFlow, Theano and Torch, as well as many other deep learning applications. Choose a deep learning framework from the list below, download the supported version of cuDNN and follow the instructions on the framework page to get started. Caffe is a deep learning framework made with expression, speed, and modularity in mind. Caffe is developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors, and is popular for computer vision. Caffe supports cuDNN v5 for GPU acceleration. Supported interfaces: C, C++, Python, MATLAB, command line interface. Learning resources: deep learning course "Getting Started with the Caffe Framework"; blog "Deep Learning for Computer Vision with Caffe and cuDNN". The Microsoft Cognitive Toolkit, previously known as CNTK, is a unified deep-learning toolkit from Microsoft Research that makes it easy to train and combine popular model types across multiple GPUs and servers. Microsoft Cognitive Toolkit implements highly efficient CNN and RNN training for speech, image and text data. Microsoft Cognitive Toolkit supports cuDNN v5.1 for GPU acceleration. Supported interfaces: Python, C++, C# and command line interface. TensorFlow is a software library for numerical computation using data flow graphs, developed by Google's Machine Intelligence research organization. TensorFlow supports cuDNN v5.1 for GPU acceleration. Supported interfaces: C++, Python. Theano is a math expression compiler that efficiently defines, optimizes, and evaluates mathematical expressions involving multi-dimensional arrays. (https://developer.nvidia.com/deep-learning-frameworks)
  • Comparative Study of Deep Learning Software Frameworks
    Comparative Study of Deep Learning Software Frameworks. Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah. Research and Technology Center, Robert Bosch LLC. {Soheil.Bahrampour, Naveen.Ramakrishnan, fixed-term.Lukas.Schott, Mohak.Shah}@us.bosch.com. ABSTRACT: Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of five deep learning frameworks, namely Caffe, Neon, TensorFlow, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed on several types of deep learning architectures and we evaluate the performance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important during the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. From the introduction: …such as dropout and weight decay [2]. As the popularity of the deep learning methods have increased over the last few years, several deep learning software frameworks have appeared to enable efficient development and implementation of these methods. The list of available frameworks includes, but is not limited to, Caffe, DeepLearning4J, deepmat, Eblearn, Neon, PyLearn, TensorFlow, Theano, Torch, etc. Different frameworks try to optimize different aspects of training or deployment of a deep learning algorithm. For instance, Caffe emphasises ease of use, where standard layers can be easily configured without hard-coding, while Theano provides automatic differentiation capabilities which facilitates flexibility to modify architecture for research and development. Several of these frameworks have received wide attention from the research community and are well-developed, allowing efficient training of deep networks with…
  • Reinforcement Learning in Videogames
    Reinforcement Learning in Videogames. Àlex Osés Laza. Final Project. Director: Javier Béjar Alonso. FIB • UPC. June 20, 2017. Acknowledgements: First I would like to appreciate the work and effort that my director, Javier Béjar, has put into me. Whether solving a ton of questions that I had or guiding me throughout the whole project, he has helped me a lot to make a good final project. I would also love to thank my family and friends, who supported me throughout the entirety of the degree and especially during this project. Abstract (English): While there are still a lot of projects and papers focused on: given a game, discover and measure which is the best algorithm for it, I decided to twist things around and focus on two algorithms and their parameters, to be able to tell which games will be best approachable with them. To do this, I will be implementing both algorithms, Q-Learning and SARSA, helping myself with neural networks to be able to represent the vast state space that the games have. The idea is to implement the algorithms as generally as possible. This way, in case someone wanted to use my algorithms for their game, it would take the least amount of time possible to adapt the game for the algorithm. I will be using some games that are used in Artificial Intelligence competitions so I have a base to work with, leaving more time to focus on the actual algorithm implementation and results comparison. Abstract (Català, translated): While many projects and studies already focus on: given a game, discovering and measuring which is the best algorithm for that game, I decided to turn it around and focus on: given two algorithms and their parameters, being able to find which kinds of games benefit most from the given configuration.
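For concreteness, here is a minimal tabular sketch of the two update rules the thesis compares (the project itself approximates Q with neural networks for large state spaces; every name below is illustrative, not the author's code):

```python
# Tabular Q-Learning vs. SARSA updates: the only difference is which
# next-state value each rule bootstraps from.
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)] -> estimated value
alpha, gamma, eps = 0.1, 0.99, 0.1

def eps_greedy(state, actions):
    # behavior policy used by both algorithms
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s2, actions):
    # off-policy: bootstrap from the BEST next action
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s2, a2):
    # on-policy: bootstrap from the action ACTUALLY taken next
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
```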
  • Distributed Negative Sampling for Word Embeddings (Stergios Stergiou, Zygimantas Straznickas, et al.)
    Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). Distributed Negative Sampling for Word Embeddings. Stergios Stergiou (Yahoo Research), Zygimantas Straznickas (MIT), Rolina Wu (University of Waterloo), Kostas Tsioutsiouliklis (Yahoo Research). Abstract: Word2Vec recently popularized dense vector word representations as fixed-length features for machine learning algorithms and is in widespread use today. In this paper we investigate one of its core components, Negative Sampling, and propose efficient distributed algorithms that allow us to scale to vocabulary sizes of more than 1 billion unique words and corpus sizes of more than 1 trillion words. Introduction: Recently, Mikolov et al. (Mikolov et al. 2013; Mikolov and Dean 2013) introduced Word2Vec, a suite of algorithms for unsupervised training of dense vector representations of… Training for such large vocabularies presents several challenges. Word2Vec needs to maintain two d-dimensional vectors of single-precision floating point numbers for each vocabulary word. All vectors need to be kept in main memory to achieve acceptable training latency, which is impractical for contemporary commodity servers. As a concrete example, Ordentlich et al. (Ordentlich et al. 2016) describe an application in which they use word embeddings to match search queries to online ads. Their dictionary comprises ≈ 200 million composite words, which implies a main memory requirement of ≈ 800 GB for d = 500. A complementary challenge is that corpora sizes themselves increase. The largest reported dataset that has been used was 100 billion words (Mikolov and Dean 2013), which required training time in the order of days.
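The ≈ 800 GB figure can be checked with quick arithmetic, assuming single-precision (4-byte) floats and the two vectors per word mentioned above:

```python
# Back-of-the-envelope check of the ~800 GB memory figure quoted above.
vocab = 200_000_000        # ~200 million composite words
d = 500                    # embedding dimension
bytes_per_float = 4        # single precision
vectors_per_word = 2       # Word2Vec keeps two d-dim vectors per word

total_bytes = vocab * d * bytes_per_float * vectors_per_word
print(total_bytes / 1e9)   # 800.0 -> ~800 GB, matching the paper
```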
  • Comparative Study of Caffe, Neon, Theano, and Torch
    Workshop track - ICLR 2016. Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning. Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah. Bosch Research and Technology Center. {Soheil.Bahrampour, Naveen.Ramakrishnan, fixed-term.Lukas.Schott, Mohak.Shah}@us.bosch.com. ABSTRACT: Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of four deep learning frameworks, namely Caffe, Neon, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed on several types of deep learning architectures and we evaluate the performance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important during the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. For convolutional networks, we also report how each of these frameworks supports various convolutional algorithms and their corresponding performance. From our experiments, we observe that Theano and Torch are the most easily extensible frameworks. We observe that Torch is best suited for any deep architecture on CPU, followed by Theano. It also achieves the best performance on the GPU for large convolutional and fully connected networks, followed closely by Neon. Theano achieves the best performance on GPU for training and deployment of LSTM networks. Finally, Caffe is the easiest for evaluating the performance of standard deep architectures.
  • Torch7: a Matlab-Like Environment for Machine Learning
    Torch7: A Matlab-like Environment for Machine Learning. Ronan Collobert (1), Koray Kavukcuoglu (2), Clément Farabet (3,4). (1) Idiap Research Institute, Martigny, Switzerland; (2) NEC Laboratories America, Princeton, NJ, USA; (3) Courant Institute of Mathematical Sciences, New York University, New York, NY, USA; (4) Université Paris-Est, Équipe A3SI - ESIEE Paris, France. Abstract: Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface. 1 Torch7 Overview: With Torch7, we aim at providing a framework with three main advantages: (1) it should ease the development of numerical algorithms, (2) it should be easily extended (including the use of other libraries), and (3) it should be fast. We found that a scripting (interpreted) language with a good C API appears as a convenient solution to "satisfy" constraint (2). First, a high-level language makes the process of developing a program simpler and more understandable than a low-level language. Second, if the programming language is interpreted, it becomes also easier to quickly try various ideas in an interactive manner. And finally, assuming a good C API, the scripting language becomes the "glue" between heterogeneous libraries: different structures of the same concept (coming from different libraries) can be hidden behind a unique structure in the scripting language, while keeping all the functionalities coming from all the different libraries.
  • BUSEM at SemEval-2017 Task 4: Sentiment Analysis with Word Embedding and Long Short Term Memory RNN Approaches
    BUSEM at SemEval-2017 Task 4: Sentiment Analysis with Word Embedding and Long Short Term Memory RNN Approaches. Deger Ayata (1), Murat Saraclar (1), Arzucan Ozgur (2). (1) Electrical & Electronics Engineering Department, Bogaziçi University; (2) Computer Engineering Department, Bogaziçi University. Istanbul, Turkey. {deger.ayata, murat.saraclar, arzucan.ozgur}@boun.edu.tr. Abstract: This paper describes our approach for SemEval-2017 Task 4: Sentiment Analysis in Twitter. We have participated in Subtask A: Message Polarity Classification and developed two systems. The first system uses word embeddings for feature representation and Support Vector Machine (SVM), Random Forest (RF) and Naive Bayes (NB) algorithms for the classification of Twitter messages into negative, neutral and positive polarity. The second system is based on Long Short Term Memory Recurrent Neural Networks (LSTM) and uses word indexes as a sequence of inputs for feature representation. 1 Introduction: Sentiment analysis is extracting subjective information from source materials, via natural language processing, computational linguistics… The remainder of this article is structured as follows: Section 2 contains information about the system description and Section 3 explains methods, models, tools and software packages used in this work. Test cases and datasets are explained in Section 4. Results are given in Section 5 with discussions. Finally, Section 6 summarizes the conclusions and future work. 2 System Description: We have developed two independent systems.
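As a rough illustration of the first system's pipeline (averaged word embeddings fed to a classical classifier), here is a hedged sketch; the toy embeddings, dimensions, and data are placeholder assumptions, not the authors' code:

```python
# Averaged-embedding features + SVM, the shape of the paper's first system.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in embeddings; the paper would use pretrained word vectors.
embeddings = {w: rng.normal(size=50) for w in ["good", "bad", "movie", "not"]}

def tweet_vector(tokens, dim=50):
    # Represent a tweet as the mean of its known word vectors.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.array([tweet_vector(t) for t in (["good", "movie"], ["bad", "movie"])])
y = np.array([1, 0])                      # positive / negative polarity
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([tweet_vector(["not", "good"])]))
```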
  • TensorFlow, Theano, Keras, Torch, Caffe (Vicky Kalogeiton, Stéphane Lathuilière, Pauline Luc, Thomas Lucas, Konstantin Shmelkov)
    TensorFlow, Theano, Keras, Torch, Caffe. Vicky Kalogeiton, Stéphane Lathuilière, Pauline Luc, Thomas Lucas, Konstantin Shmelkov. Introduction: TensorFlow (Google Brain, 2015; rewritten DistBelief); Theano (University of Montréal, 2009); Keras (François Chollet, 2015; now at Google); Torch (Facebook AI Research, Twitter, Google DeepMind); Caffe (Berkeley Vision and Learning Center (BVLC), 2013). Outline: 1. Introduction of each framework (TensorFlow, Theano, Keras, Torch, Caffe); 2. Further comparison (code + models; community and documentation; performance; model deployment; extra features); 3. Which framework to choose when? TensorFlow architecture: 1) low-level core (C++/CUDA); 2) simple Python API to define the computational graph; 3) high-level API (TF-Learn, TF-Slim, soon Keras…). TensorFlow computational graph: auto-differentiation; easy multi-GPU/multi-node; native C++ multithreading; device-efficient implementation for most ops; whole pipeline in the graph: data loading, preprocessing, prefetching… TensorBoard. TensorFlow development: + bleeding edge (GitHub yay!); + division into core and contrib, so very quick merging of new hotness; + a lot of new related APIs: CRF, BayesFlow, SparseTensor, audio IO, CTC, seq2seq; + so it can easily handle images, videos, audio, text…; + if you really need a new native op, you can load a dynamic lib; - sometimes contrib stuff disappears or moves; - recently introduced bells and whistles are barely documented. Presentation of Theano: maintained by the Montréal University group; pioneered the use of a computational graph; general machine learning tool, hence the use of Lasagne and Keras; very popular in the research community, but not elsewhere; falling behind. What is it like to start using Theano? Read tutorials until you no longer can, then keep going.
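The "define a graph in Python, let the core differentiate and execute it" pattern the slides describe looks like this in the TF 1.x API of that era (a minimal illustrative sketch, not from the presentation):

```python
# Build a tiny computational graph, auto-differentiate it, run it.
import tensorflow as tf   # TF 1.x-era API (tf.compat.v1 in TF 2)

x = tf.placeholder(tf.float32, shape=())
y = x * x + 3.0 * x                 # nodes added to the graph, not executed yet
dy_dx = tf.gradients(y, x)[0]       # auto-differentiation: dy/dx = 2x + 3

with tf.Session() as sess:          # execution is deferred to the session
    print(sess.run([y, dy_dx], feed_dict={x: 2.0}))   # [10.0, 7.0]
```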
  • Introduction to Deep Learning Framework
    Introduction to Deep Learning Framework. 1. Introduction. 1.1. Commonly used frameworks: The most commonly used frameworks for deep learning include PyTorch, TensorFlow, Keras, Caffe, Apache MXNet, etc. PyTorch: open-source machine learning library; developed by Facebook AI Research Lab; based on the Torch library; supports Python and C++ interfaces. TensorFlow: open-source software library for dataflow and differentiable programming; developed by the Google Brain team; provides stable Python & C APIs. Keras: an open-source neural-network library written in Python; conceived to be an interface; capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Caffe: open source under BSD licence; developed at the University of California, Berkeley; written in C++ with a Python interface. Apache MXNet: an open-source deep learning software framework; supports a flexible programming model and multiple programming languages (including C++, Python, Java, Julia, Matlab, JavaScript, Go, R, Scala, Perl, and Wolfram Language). 1.2. PyTorch. 1.2.1. Data: Tensor: the major computation unit in PyTorch. A tensor can be viewed as the extension of a vector (one-dimensional) and a matrix (two-dimensional), and can be defined with any number of dimensions. Variable: a wrapper of a tensor, which includes the creator, the value of the variable (tensor), and the gradient. This is the core of automatic differentiation in PyTorch, as it holds both the value and the creator, which is essential for the backward process. Parameter: a subset of Variable. 1.2.2. Functions: NNModules: NNModules (torch.nn) are a combination of parameters and functions, and can be interpreted as layers. There are some common modules such as convolution layers, linear layers, pooling layers, dropout layers, etc.
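To make the tensor/autograd/nn.Module vocabulary concrete, here is a minimal PyTorch sketch; the network name and all shapes are arbitrary illustrations, not from the original text:

```python
# Tensors flow through an nn.Module; autograd fills gradients on parameters.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 4, kernel_size=3, padding=1)  # convolution layer
        self.pool = nn.MaxPool2d(2)                            # pooling layer
        self.fc = nn.Linear(4 * 14 * 14, 10)                   # linear layer

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(start_dim=1))

net = TinyNet()
out = net(torch.randn(8, 1, 28, 28))   # a tensor is the computation unit
out.sum().backward()                   # automatic differentiation (backward pass)
print(net.conv.weight.grad.shape)      # parameters now carry gradients
```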
  • Enhancing OMB with Python Benchmarks (Presentation at MUG '21)
    MVAPICH2 Meets Python: Enhancing OMB with Python Benchmarks. Presentation at MUG '21. Nawras Alnaasan, Network-Based Computing Laboratory, Dept. of Computer Science and Engineering, The Ohio State University. alnaasan.1@osu.edu. Outline: OSU Micro Benchmarks (OMB); Why Python?; MPI for Python; Introducing OMB-Py; Initial Results; Conclusion. OSU Micro Benchmarks (OMB): OMB is a widely used package to measure performance of MPI operations. It helps in optimizing MPI applications on different HPC systems. Offers a series of benchmarks including but not limited to point-to-point, blocking and non-blocking collectives, and one-sided tests. A large set of options and flags for users to create customizable tests. Supports different platforms like ROCm/CUDA to run on AMD/NVIDIA GPUs. Written in C. Why Python? Second most popular programming language according to the TIOBE index as of August 2021 (adopted from: https://www.tiobe.com/tiobe-index/). Great community and support around Machine Learning, Big Data and Cloud Computing: many deep learning frameworks like PyTorch, TensorFlow, Keras, etc.; one of the most important tools for data science and analytics. Great potential to use MPI and distributed computing techniques. Many other Python libraries and frameworks like SciPy, NumPy, Django, etc. Very flexible and simplified syntax (adopted from: https://www.hardikp.com/…/python-cpp/). MPI for Python: MPI…
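For flavor, this is what an OMB-style measurement looks like when written with mpi4py (an illustrative sketch, not the actual OMB-Py code; the message size and iteration count are arbitrary):

```python
# Time a blocking MPI allreduce over NumPy buffers across all ranks.
# Run with e.g.: mpirun -n 4 python allreduce_bench.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_bytes = 1 << 20                       # 1 MiB message
send = np.ones(size_bytes // 8, dtype=np.float64)
recv = np.empty_like(send)
iters = 100

comm.Barrier()                             # synchronize before timing
t0 = MPI.Wtime()
for _ in range(iters):
    comm.Allreduce(send, recv, op=MPI.SUM)
comm.Barrier()
t1 = MPI.Wtime()

if comm.rank == 0:
    print(f"avg allreduce time: {(t1 - t0) / iters * 1e6:.1f} us")
```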
  • DLL: a Fast Deep Neural Network Library
    Published in Artificial Neural Networks in Pattern Recognition. ANNPR 2018, Lecture Notes in Computer Science, vol. 11081, which should be cited to refer to this work. DOI: https://doi.org/10.1007/978-3-319-99978-4_4. DLL: A Fast Deep Neural Network Library. Baptiste Wicht, Andreas Fischer, and Jean Hennebert. 1 HES-SO, University of Applied Science of Western Switzerland; 2 University of Fribourg, Switzerland. Abstract: Deep Learning Library (DLL) is a library for machine learning with deep neural networks that focuses on speed. It supports feed-forward neural networks such as fully-connected Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). Our main motivation for this work was to propose and evaluate novel software engineering strategies with potential to accelerate runtime for training and inference. Such strategies are mostly independent of the underlying deep learning algorithms. On three different datasets and for four different neural network models, we compared DLL to five popular deep learning libraries. Experimentally, it is shown that the proposed library is systematically and significantly faster on CPU and GPU. In terms of classification performance, similar accuracies as the other libraries are reported. 1 Introduction: In recent years, neural networks have regained a large deal of attention with deep learning approaches. Such approaches rely on the use of bigger and deeper networks, typically by using larger input dimensions to incorporate more context and by increasing the number of layers to extract information at different levels of granularity. The success of deep learning can be attributed mainly to three factors. First, there is the advent of big data, meaning the availability of larger quantities of training data.
  • Using Long Short-Term Memory Recurrent Neural Network in Land Cover Classification on Landsat and Cropland Data Layer Time Series
    Using Long Short-Term Memory Recurrent Neural Network in Land Cover Classification on Landsat and Cropland Data Layer time series. Ziheng Sun (a), Liping Di (a,*), Hui Fang. (a) Center for Spatial Information Science and Systems, George Mason University, Fairfax, United States. * [email protected]; Tel: +1-703-993-6114; Fax: +1-703-993-6127; mailing address: 4087 University Dr STE 3100, Fairfax, VA, 22030, United States. Dr. Ziheng Sun is a research assistant professor in the Center for Spatial Information Science and Systems, George Mason University. Dr. Liping Di is a professor of Geography and Geoinformation Science, George Mason University, and the director of the Center for Spatial Information Science and Systems, George Mason University. Hui Fang received her M.Sc. degree from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. This manuscript contains 6328 words. Abstract: Land cover maps are significant in assisting agricultural decision making. However, the existing workflow of producing land cover maps is very complicated and the result accuracy is ambiguous. This work builds a long short-term memory (LSTM) recurrent neural network (RNN) model to take advantage of the temporal pattern of crops across image time series to improve the accuracy and reduce the complexity. An end-to-end framework is proposed to train and test the model. Landsat scenes are used as Earth observations, and some field-measured data together with CDL (Cropland Data Layer) datasets are used as reference data. The network is thoroughly trained using state-of-the-art techniques of deep learning.
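A hedged sketch of the kind of model the abstract describes, an LSTM over a per-pixel time series of spectral bands classified into a land-cover label; every dimension and the synthetic data below are placeholder assumptions, not the paper's configuration:

```python
# LSTM time-series classifier in Keras: (timesteps, bands) -> class label.
import numpy as np
from tensorflow import keras

T, BANDS, CLASSES = 23, 6, 10      # scenes per season, spectral bands, labels
x = np.random.rand(256, T, BANDS).astype("float32")   # synthetic pixel series
y = np.random.randint(0, CLASSES, size=(256,))        # synthetic labels

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(T, BANDS)),    # temporal crop pattern
    keras.layers.Dense(CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, epochs=2, batch_size=32, verbose=0)
```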