Deep Neural Networks for Physics Analysis on Low-Level Whole-Detector Data at the LHC

Wahid Bhimji, Steve Farrell, Thorsten Kurth, Michela Paganini, Prabhat, Evan Racah
Lawrence Berkeley National Laboratory
ACAT 2017: 21st August 2017

Introduction / Aims
• Use deep neural networks (NNs) on 'raw' data directly for physics analysis:
  – Without reconstruction of physics objects like jets; without tuning of analysis variables; using data from the whole calorimeter/detector
  – Cutting-edge methods: performance and interpretation
• Run efficiently on NERSC supercomputers
  – Primarily Intel Knights Landing (KNL) Xeon Phi CPU based
  – Distributed training (up to 10k KNL nodes)
  – Timings, optimisations and recipes

Physics Use-Case
• Search for RPV SUSY gluino decays
  – Multi-jet final state
  – Analysis from ATLAS-CONF-2016-057 used as a benchmark
  – Classification problem: RPV SUSY vs. QCD
• Simulated samples
  – Pythia event generation (matching ATLAS config)
    • Cascade decay with m(g̃) = 1400 GeV, m(χ̃0) = 850 GeV by default; other masses also explored
  – Delphes detector simulation (ATLAS card)
    • Output calorimeter towers (and tracks) used in analysis
[Figure: decay diagram from ATLAS-CONF-2016-057]

Data processing
• Bin calorimeter tower energy in η/φ to form an 'image'
  – 64x64 bins (~0.1 η/φ towers) or 224x224
• Also try 3 'channels' (à la RGB images) [1]:
  – Energy in the electromagnetic and hadronic calorimeters, and number of tracks in each bin
• Reconstruct jets using the same algorithm as the physics analysis (AntiKt R=1.0 trimmed) for benchmark comparison and pre-selection
[1] Similar to Komiske, Metodiev, and Schwartz, arXiv:1612.01551

Convolutional (CNN) Architecture
• Popular architecture for natural images and now many HEP studies (so not explained here)
  – Learn non-linear 'filters' that slide across the image: a shared filter reduces the number of weights
  – Local structure / translational invariance
  – Stacked layers respond to different scales
• We use 3 alternating convolutional and pooling layers (or 4 for large images), with bias and/or batch normalization
• QCD generated in PT ranges:
  – Cross-section weight in training loss and evaluation
Layers: Input, Conv+Pool(1), Conv+Pool(2), Conv+Pool(3), (Conv+Pool(4)), Fully Connected (FC), FC, Output
Sizes shown on the slide: 1 (or 3)x64x64, 64x32x32, 128x8x8, 256x4x4, 4096, 512, 1
[Figure from Dumoulin, Vincent, and Francesco Visin, arXiv:1603.07285]

CNN Performance
• Need good signal efficiency (high True Positive Rate (TPR)) and high background rejection (low False Positive Rate (FPR))
  – Compare to physics selections (see backup)
  – ROC curve (relative to preselection)
• Increased signal efficiency at the same background rejection without using jet variables
• Also compare AMS (approximate median significance), accounting for initial pre-selection and luminosity
[Figure: ROC curve; plot annotations: TPR=0.77, AMS=4.2 and TPR=0.41, AMS=2.3]

Compare to shallow classifier
• Try a (gradient) boosted decision tree (GBDT) and a 1-hidden-layer NN (MLP)
  – Input: jet variables used in the physics analysis (sum of jet mass, number of jets, η between leading 2 jets) and 4-momenta of the first 5 jets
• These outperform the selections, but the CNN performs better

Weights
• Cross-section weights applied in training loss
  – Some QCD background events weighted 10^7 over RPV signal
• Try log of weights
  – More stable implementation
  – More focussed on signal performance

Channels
• Three-channel CNN
  – Separate energy in the electromagnetic and hadronic calorimeters
  – Number of tracks in the same η/φ bin
• Further improves performance

Further improving performance
• The implementation with full weights and the one with log weights focus differently on signal and background
• Can ensemble these by taking the mean of their predictions; this gives the best performance

Robustness to different signals
• Model trained on a specific cascade decay: gluino mass (MGlu) of 1400 GeV and neutralino mass (MNeu) of 850 GeV
• Apply this model to other signal samples without retraining
• Still good performance

Pileup
• Most studies here use Delphes without pileup
• Repeat with Delphes pileup card (mu=20)
• Physics selections have lower background rejection
• CNN still performs well
  – 1-channel CNN shown

Comparing CNN to jet variables
• Plot NN output (P(signal)) vs benchmark analysis variable
• Clear correlation
  – (Signal cuts: NJets >= 4/5, MJet >= 800/600 GeV)
• Add jet variable to CNN output in a 1-layer NN
  – Little/no increase in performance

Running at NERSC

NERSC and Cori
• NERSC at LBL, production HPC center for US Dept. of Energy
  – >7000 diverse users across science domains, including many outside HEP
• Cori: NERSC's newest supercomputer, a Cray XC40 (31.4 PF peak)
  – Phase 1: 2388 Intel Haswell nodes: dual 16-core (2.3 GHz), 128 GB DDR4 DRAM
  – Phase 2: 9668 Intel Knights Landing (KNL) nodes: Xeon Phi 68-core (1.4 GHz), 4 hardware threads; AVX-512 vector pipelines; 16 GB MCDRAM, 96 GB DDR4
• Cray Aries high-speed "dragonfly" topology interconnect
• Many popular deep learning frameworks available
  – Caffe; Keras; Lasagne; PyTorch; TensorFlow; Theano
  – Working with Intel to improve CPU (KNL) performance

Timing RPV SUSY CNN
• Implemented the CNN in different frameworks:
  – (Pure) TensorFlow, Keras (Theano and TF), Lasagne (Theano), Caffe
• Aim to drive multi-node Cori CPU performance to be comparable with GPU (for real use-cases):
  – Not aiming for an exact comparison: the implementations in the frameworks differ slightly and some have been optimised
• Compare training time (per batch, ignoring I/O) for:
  – GPU: Titan X (Pascal) (10.2 TF single-precision peak)
  – CPU: Haswell E5-2698 v3, 32 cores @ 2.3 GHz (2.4 TF)
  – KNL: Xeon Phi 7250, 68 cores @ 1.4 GHz (6 TF)

Timings and TensorFlow
• CPU performance of default TF 1.2 is poor
• Intel optimisations with the Intel Math Kernel Library (MKL), e.g. conv layers multi-threaded, vectorized over channels/filters, and cache blocking
  – Now in main TF repo
• Further optimisations (released soon): e.g. MKL element-wise operations (avoid MKL->Eigen conversions)
• (Intel) Caffe has similar optimizations, plus multi-node with the MLSL library: e.g. scaling to 8 nodes, time is 6x faster for this 64x64 network
[Figure: bar chart of time per batch (s), batch size 512, for Lasagne + Theano, Keras + Theano, Keras + TensorFlow, Keras + TF (Intel), Keras + TF (Latest) and Caffe, grouped by GPU, CPU - HSW, CPU - KNL and CPU - 8 Node KNL]

Scaling up
Thorsten Kurth, Jian Zhang, Nadathur Satish, Ioannis Mitliagkas, Evan Racah, Mostofa Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Prabhat, Pradeep Dubey, Deep Learning at 15PF (accepted for SC17), arXiv:1708.05256
• Train on 10 million 224x224 3-channel images (7.4 TB)
• Caffe implementation: multi-node, data parallel
  – Use Intel MLSL library (wraps comms; portable)
• Sync/Async and Hybrid strategies:
  – Sync: barriers so nodes iterate together
    • can have straggler nodes and limits batch size
  – Async: uses parameter servers to scale better
    • can have old gradients, so may not converge faster
  – Hybrid: sync within a group and async across groups
    • Dedicated parameter servers for each layer of the network
• Modify our CNN layers to reduce communication: remove batch norm. and replace big (~200MB) fully connected layers with a convolutional layer
[Figure: Hybrid Architecture: Groups 1 to G send model updates to, and receive new models from, per-layer parameter servers (Layer 1 PS to Layer N PS)]

Scaling up - results
T. Kurth et al., Deep Learning at 15PF (accepted for SC17), arXiv:1708.05256
• Single node: 1.9 TF (⅓ peak)
• Strong scaling (overall batch size fixed):
  – Hybrid approach reduces communication and straggler effects
• Weak scaling (const batch per node):
  – Good scaling, though affected by variability from communication after fast convolutional layers
• Scaled to 9600 KNL nodes: 11.73 PF (6170x 1-node) (single-precision)
• Time to solution (a target loss) also scales (1024-node time is 1/11 of 64-node time)
[Figures: strong scaling; weak scaling]

Conclusions
• Implemented a deep CNN on large whole-detector 'images' directly for physics analysis
  – Outperforms physics-variable based selections (and shallow classifiers) without jet reconstruction
  – Further improvements from adding 3 channels, modifying weights and ensembling models
  – Network is robust to pileup, can be applied to other signal masses, and appears to learn physics of interest
• Used to benchmark and improve popular deep learning libraries on CPU, including Xeon Phi/KNL at NERSC
  – Demonstrated distributed training up to 9600 KNL nodes

Thanks: Ben Nachman and Brian Amadio (LBL) for discussions and physics input. Mustafa Mustafa (LBL) for help with TensorFlow optimisations.
Code and sample datasets will be made available with proceedings.

Backups

Benchmark Analysis
Fat-jet object selection:
• AntiKt R=1.0 trimmed (Rtrim = 0.2, PTfrac = 0.05)
• PT > 200 GeV, |η| < 2.0
Preselection:
• Leading fat-jet PT > 440 GeV
• NFat-Jet > 2
Analysis selection:
• |∆η12| between leading 2 fat-jets < 1.4
• NFat-Jet >= 4 && Sum MFat-jet > 800 GeV
• Or NFat-Jet >= 5 && Sum MFat-jet > 600 GeV

Interpretation - feature maps
[Figures: feature maps for a background QCD event and a signal RPV event]

Scaling
• Single node performance per layer for 224x224 Caffe implementation
• Time to 0.05 loss (corresponds to fixed significance)
• At 1024 nodes: time is 1/11 of the 64-node time (scales as expected); Hybrid and Sync times differ by a factor of 1.66

Further work: Exploring Graph CNNs (with G. Rochette, J. Bruna, G. Louppe, K. Cranmer, NYU)
• Using a list of clusters rather than an image
• Hybrid between graph and CNN
  – Represent clusters as nodes of a graph with interactions/similarity as edge weights
• Model the interaction, and achieve precision without sparsity
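Sketch for the "Data processing" slide: the binning of calorimeter tower energies in η/φ into a 64x64 'image' could look roughly like the NumPy code below. The function name towers_to_image, the |η| < 3.2 acceptance (chosen here only because it gives ~0.1-wide bins) and the toy towers are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def towers_to_image(eta, phi, energy, n_bins=64, eta_max=3.2):
    """Sum tower energies into an n_bins x n_bins grid in (eta, phi).

    eta_max is an assumed acceptance; the talk does not state the eta range.
    """
    image, _, _ = np.histogram2d(
        eta, phi,
        bins=n_bins,
        range=[[-eta_max, eta_max], [-np.pi, np.pi]],
        weights=energy,  # each tower contributes its energy, not a count
    )
    return image

# Toy towers (hypothetical values): the first two fall in the same bin.
eta = np.array([0.05, 0.05, -1.3])
phi = np.array([0.05, 0.05, 2.0])
energy = np.array([10.0, 5.0, 20.0])

image = towers_to_image(eta, phi, energy)
```

A three-channel input would stack three such grids (electromagnetic energy, hadronic energy, track counts), for example with np.stack.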
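Sketch for the "CNN Performance" slide: the talk quotes AMS (approximate median significance) values but does not spell out the formula. The code below assumes the commonly used definition AMS = sqrt(2((s+b) ln(1+s/b) - s)), where s and b are the expected signal and background yields after selection; the yields in the example are made up for illustration.

```python
import math

def ams(s, b):
    """Approximate median significance for expected yields s (signal), b (background).

    Assumes the common form sqrt(2 * ((s + b) * ln(1 + s / b) - s));
    the talk itself does not state which variant was used.
    """
    if b <= 0:
        raise ValueError("background yield must be positive")
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

# Hypothetical yields, for illustration only.
example = ams(10.0, 100.0)
naive = 10.0 / math.sqrt(100.0)  # the simpler s / sqrt(b) estimate
```

For s much smaller than b the value approaches s/sqrt(b), which is why the two numbers above are close.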
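Sketch for the "Weights" slide: with cross-section weights spanning a factor of 10^7, a weighted loss is dominated by a few background events, and the talk reports trying the log of the weights instead. The exact transformation is not given; the form 1 + ln(w / w_min) below is an assumption, as are the function names and the two-event toy batch.

```python
import math

def log_compress(weights):
    """Map raw weights w to 1 + ln(w / min(w)), so the smallest weight stays 1.

    This particular form is an assumption; the talk only says 'log of weights'.
    """
    w_min = min(weights)
    return [1.0 + math.log(w / w_min) for w in weights]

def weighted_bce(labels, probs, weights):
    """Weighted binary cross-entropy, normalised by the sum of weights."""
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

# Toy batch: one signal event and one heavily weighted QCD event.
labels = [1, 0]
probs = [0.9, 0.2]            # predicted P(signal)
raw_weights = [1.0, 1.0e7]    # dynamic range quoted on the slide

log_weights = log_compress(raw_weights)
loss_raw = weighted_bce(labels, probs, raw_weights)
loss_log = weighted_bce(labels, probs, log_weights)
```

With raw weights the loss is essentially the background term alone; after compression the signal event contributes a visible share, which is consistent with the slide's remark that log weights focus more on signal performance.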
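Sketch for the "Benchmark Analysis" backup slide: the cut-based selection can be written as a few small predicates. The event format (a list of dicts with pt, eta and mass in GeV, already trimmed) is hypothetical, and reading the ∆η cut as applying to both signal regions is an interpretation of the slide rather than something it states.

```python
def select_fat_jets(jets):
    """Object selection: keep fat jets with PT > 200 GeV and |eta| < 2.0, PT-ordered."""
    kept = [j for j in jets if j["pt"] > 200.0 and abs(j["eta"]) < 2.0]
    return sorted(kept, key=lambda j: j["pt"], reverse=True)

def passes_preselection(fat_jets):
    """Preselection: more than 2 fat jets and leading fat-jet PT > 440 GeV."""
    return len(fat_jets) > 2 and fat_jets[0]["pt"] > 440.0

def passes_analysis_selection(fat_jets):
    """Analysis selection, assuming the delta-eta cut applies to both regions."""
    if not passes_preselection(fat_jets):
        return False
    if abs(fat_jets[0]["eta"] - fat_jets[1]["eta"]) >= 1.4:
        return False
    n_jets = len(fat_jets)
    sum_mass = sum(j["mass"] for j in fat_jets)
    return (n_jets >= 4 and sum_mass > 800.0) or (n_jets >= 5 and sum_mass > 600.0)

# Toy event (hypothetical values); the last jet fails the object selection.
event = [
    {"pt": 500.0, "eta": 0.1, "mass": 300.0},
    {"pt": 300.0, "eta": -0.5, "mass": 250.0},
    {"pt": 250.0, "eta": 1.0, "mass": 200.0},
    {"pt": 220.0, "eta": 0.3, "mass": 100.0},
    {"pt": 150.0, "eta": 0.0, "mass": 90.0},
]
fat_jets = select_fat_jets(event)
selected = passes_analysis_selection(fat_jets)
```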
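Worked arithmetic for the "Scaling up - results" slide: the quoted figures (1.9 TF on one node, 6 TF KNL peak, 11.73 PF on 9600 nodes, 11x faster time to solution going from 64 to 1024 nodes) can be cross-checked in a few lines. The efficiency numbers are derived here from those figures and are not quoted in the talk.

```python
# Figures quoted on the slides.
single_node_tf = 1.9     # sustained single-node rate, TF
knl_peak_tf = 6.0        # Xeon Phi 7250 peak, TF
full_system_pf = 11.73   # sustained rate on 9600 nodes, PF
n_nodes = 9600

# Fraction of peak on one node (the slide says about 1/3).
fraction_of_peak = single_node_tf / knl_peak_tf

# Speed-up over one node (the slide says 6170x) and the implied parallel efficiency.
speedup = full_system_pf * 1000.0 / single_node_tf
efficiency = speedup / n_nodes

# Time to solution: 16x more nodes (64 -> 1024) gave an 11x shorter time.
time_to_solution_efficiency = 11.0 / (1024 / 64)
```

The derived parallel efficiency is roughly 64% for sustained throughput and roughly 69% for time to solution.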