Recurrent Neural Networks and Long Short-Term Memory (LSTM)

Jeong Min Lee, CS3750, 11/19/2018

Outline
• RNN
  • Unfolding computational graph
  • Backpropagation and weight update
  • Exploding / vanishing gradient problem
• LSTM
• GRU
• Tasks with RNN
• Software packages

So far
• We have been modeling sequences (time series) and predicting future values with probabilistic models (AR, HMM, LDS, particle filtering, Hawkes processes, etc.)
• E.g. in an LDS, the observation $x_t$ is modeled through an emission matrix $C$ and a hidden state $z_t$ with Gaussian noise $w_t$:
  $x_t = C z_t + w_t$, with $w_t \sim N(w \mid 0, \Sigma)$
• The hidden state is also computed probabilistically, via a transition matrix $A$ and Gaussian noise $v_t$:
  $z_t = A z_{t-1} + v_t$, with $v_t \sim N(v \mid 0, \Gamma)$
(Figure: the LDS graphical model, a hidden chain $z_{t-1} \to z_t \to z_{t+1}$ emitting observations $x_{t-1}, x_t, x_{t+1}$.)

Paradigm shift to RNN
• We are moving into a new world where no probabilistic component exists in the model
• That is, you cannot do inference as in LDS and HMM
• In an RNN, the hidden states carry no probabilistic form or assumption
• Given fixed inputs and targets from the data, the RNN learns the intermediate association between them, as well as a real-valued vector representation

RNN
• An RNN's input, output, and internal representation (hidden states) are all real-valued vectors:
  $h_t = \tanh(U x_t + W h_{t-1})$
  $\hat{y} = \lambda(V h_t)$
• $x_t$: input vector (real-valued)
• $h_t$: hidden state (real-valued vector)
• $\hat{y}$: output vector (real-valued); $V h_t$ is a real-valued vector
• The RNN consists of three parameter matrices ($U$, $W$, $V$) together with activation functions:
  • $U$: input-hidden matrix
  • $W$: hidden-hidden matrix
  • $V$: hidden-output matrix
• $\tanh(\cdot)$ is the hyperbolic tangent function; it models the non-linearity
• $\lambda(\cdot)$ is the output transformation function
  • It can be any function, chosen for the task and the type of target in the data; it can even be another feed-forward neural network, which lets an RNN model outputs without restriction
  • Sigmoid: binary probability distribution
  • Softmax: categorical probability distribution
  • ReLU: positive real-valued output
  • Identity function: real-valued output

Making a prediction
• Let's see how the RNN makes a prediction
• In the beginning, the initial hidden state $h_0$ is filled with zeros or random values
• We also assume the model is already trained (training is covered below)
• Suppose we currently have the observation $x_1$ and want to predict $x_2$
• We compute the hidden state $h_1$ first:
  $h_1 = \tanh(U x_1 + W h_0)$
• Then we generate the prediction; $V h_1$ is a real-valued vector or a scalar, depending on the size of the output matrix $V$:
  $\hat{x}_2 = \hat{y} = \lambda(V h_1)$

Making predictions multiple steps ahead
• When predicting multiple steps ahead, the predicted value $\hat{x}_2$ from the previous step is used as the input $x_2$ at time step 2:
  $h_2 = \tanh(U \hat{x}_2 + W h_1)$
  $\hat{x}_3 = \hat{y} = \lambda(V h_2)$
(A code sketch of this forward pass follows below.)
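To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass and of multi-step prediction as just described. The matrix shapes, the identity output transformation, and the helper names (rnn_step, predict_ahead) are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, out_fn):
    """One RNN step: update the hidden state, then emit an output."""
    h_t = np.tanh(U @ x_t + W @ h_prev)   # h_t = tanh(U x_t + W h_{t-1})
    y_t = out_fn(V @ h_t)                 # y_hat = lambda(V h_t)
    return h_t, y_t

def predict_ahead(x_1, h_0, U, W, V, out_fn, n_steps):
    """Multi-step prediction: feed each prediction back in as the next input."""
    preds = []
    h, x = h_0, x_1
    for _ in range(n_steps):
        h, y = rnn_step(x, h, U, W, V, out_fn)
        preds.append(y)
        x = y                             # x_{t+1} := predicted x_hat_{t+1}
    return preds

# Toy usage with random (untrained) weights and an identity output transformation
d_in, d_h = 3, 5
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d_h, d_in))
W = rng.normal(scale=0.1, size=(d_h, d_h))
V = rng.normal(scale=0.1, size=(d_in, d_h))
h0 = np.zeros(d_h)
x1 = rng.normal(size=d_in)
print(predict_ahead(x1, h0, U, W, V, lambda z: z, n_steps=3))
```

Note that feeding predictions back as inputs only works when the output has the same dimensionality as the input, which is why $V$ maps the hidden state back to the input size here.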
• The same mechanism applies forward in time:
  $h_3 = \tanh(U \hat{x}_3 + W h_2)$
  $\hat{x}_4 = \hat{y} = \lambda(V h_3)$

RNN characteristics
• You might have observed that:
  • the parameters $U$, $V$, $W$ are shared across all time steps
  • no probabilistic component (random number generation) is involved
  • so everything is deterministic

Another way to see an RNN
• An RNN is a type of neural network

Neural network
• A neural network cascades several linear weight matrices with non-linear activation functions between them
  • $y$: output
  • $V$: hidden-output matrix
  • $h$: hidden units (states)
  • $U$: input-hidden matrix
  • $x$: input
• A traditional neural network assumes that every input is independent of the others
• But with sequential data, the input at the current time step is highly likely to depend on the input at the previous time step
• We need an additional structure that can model the dependencies of inputs over time

Recurrent neural network
• A type of neural network that has a recurrence structure
• The recurrence structure allows us to operate over a sequence of vectors

RNN as an unfolding computational graph
(Figure: the folded RNN on the left, with recurrent weight $W$ acting on $h$, is unfolded on the right into time steps $t-1$, $t$, $t+1$; each step takes $x_t$ through $U$, passes $h_{t-1}$ to $h_t$ through $W$, and produces $\hat{y}_t$ through $V$.)
• An RNN can be converted into a feed-forward neural network by unfolding it over time

How to train an RNN?
• Before training, we need to define:
  • $y_t$: true target
  • $\hat{y}_t$: output of the RNN (the prediction of the true target)
  • $E_t$: error (loss); the difference between the true target and the output
• Just as the output transformation function $\lambda$ is selected by the task and data, so is the loss, e.g.:
  • Binary classification: binary cross-entropy
  • Categorical classification: cross-entropy
  • Regression: mean squared error
• With the loss attached, each step of the unfolded RNN gains a loss node $E_t$ comparing the output $\hat{y}_t$ with the target $y_t$

Backpropagation Through Time (BPTT)
• An extension of standard backpropagation that performs gradient descent on the unfolded network
• The goal is to calculate the gradients of the error with respect to the parameters $U$, $V$, and $W$ and to learn the desired parameters using stochastic gradient descent
• To update on one training example (sequence), we sum up the gradients at each time step of the sequence:
  $\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$

Learning parameters
• Write the hidden-state update as
  $z_t = U x_t + W h_{t-1}$, $h_t = \tanh(z_t)$
• Let
  $\alpha_k = \frac{\partial h_k}{\partial z_k} = 1 - h_k^2$
  $\lambda_k = \frac{\partial h_k}{\partial W}$
  $\beta_k = \frac{\partial E_k}{\partial h_k} = (o_k - y_k)V$  (where $o_k$ denotes the output $\hat{y}_k$)
  $\psi_k = \frac{\partial h_k}{\partial U}$
• Then
  $\frac{\partial E_k}{\partial W} = \frac{\partial E_k}{\partial h_k}\frac{\partial h_k}{\partial W} = \beta_k \lambda_k$
  $\lambda_k = \frac{\partial h_k}{\partial W} = \frac{\partial h_k}{\partial z_k}\frac{\partial z_k}{\partial W} = \alpha_k (h_{k-1} + W \lambda_{k-1})$
  $\psi_k = \frac{\partial h_k}{\partial U} = \alpha_k \frac{\partial z_k}{\partial U} = \alpha_k (x_k + W \psi_{k-1})$
• Putting it together, for one training sequence (a minimal code sketch of this loop follows below):
  Initialization: $\alpha_0 = 1 - h_0^2$; $\lambda_0 = 0$; $\psi_0 = \alpha_0 x_0$; $\Delta w = 0$; $\Delta u = 0$; $\Delta v = 0$
  For $k = 1 \ldots T$ ($T$: length of the sequence):
    $\alpha_k = 1 - h_k^2$
    $\lambda_k = \alpha_k (h_{k-1} + W \lambda_{k-1})$
    $\beta_k = (o_k - y_k)V$
    $\psi_k = \alpha_k (x_k + W \psi_{k-1})$
    $\Delta w = \Delta w + \beta_k \lambda_k$
    $\Delta u = \Delta u + \beta_k \psi_k$
    $\Delta v = \Delta v + (o_k - y_k) \otimes h_k$
  After the sequence, update the parameters with learning rate $\alpha$ (not to be confused with $\alpha_k$):
    $V_{new} = V_{old} + \alpha \Delta v$
    $W_{new} = W_{old} + \alpha \Delta w$
    $U_{new} = U_{old} + \alpha \Delta u$
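Below is a minimal sketch of this BPTT loop for the scalar case (one-dimensional input, hidden state, and output), where the recursions above hold exactly as written. It assumes an identity output $o_k = V h_k$ and a squared-error loss $E_k = \tfrac{1}{2}(o_k - y_k)^2$ at every step, which is consistent with $\beta_k = (o_k - y_k)V$; all variable names and the toy task are illustrative assumptions.

```python
import numpy as np

def bptt_train_scalar(xs, ys, lr=0.01, epochs=200, seed=0):
    """Scalar RNN trained with BPTT: h_k = tanh(U*x_k + W*h_{k-1}), o_k = V*h_k,
    with squared-error loss E_k = 0.5*(o_k - y_k)**2 at every time step."""
    rng = np.random.default_rng(seed)
    U, W, V = rng.normal(scale=0.5, size=3)
    for _ in range(epochs):
        h_prev, lam, psi = 0.0, 0.0, 0.0   # h_0 = 0, so dh_0/dW = dh_0/dU = 0
        dW = dU = dV = 0.0
        for x_k, y_k in zip(xs, ys):
            h_k = np.tanh(U * x_k + W * h_prev)
            o_k = V * h_k
            alpha = 1.0 - h_k ** 2              # alpha_k = dh_k/dz_k
            lam = alpha * (h_prev + W * lam)    # lambda_k = dh_k/dW (recursion)
            psi = alpha * (x_k + W * psi)       # psi_k    = dh_k/dU (recursion)
            beta = (o_k - y_k) * V              # beta_k   = dE_k/dh_k
            dW += beta * lam                    # accumulate dE_k/dW over the sequence
            dU += beta * psi
            dV += (o_k - y_k) * h_k
            h_prev = h_k
        # Gradient descent step. The slides write the update with '+'; here the
        # deltas are plain gradients of the squared error, so descent subtracts.
        U -= lr * dU
        W -= lr * dW
        V -= lr * dV
    return U, W, V

# Toy usage: learn to echo the previous input of a short sequence
xs = [0.5, -0.3, 0.8, 0.1]
ys = [0.0, 0.5, -0.3, 0.8]
print(bptt_train_scalar(xs, ys))
```

For vector-valued hidden states, $\lambda_k$ and $\psi_k$ become higher-order tensors, which is why practical implementations backpropagate errors through the unrolled graph instead of carrying these forward sensitivities.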
Exploding and vanishing gradient problem
• In an RNN, we repeatedly multiply by $W$ as we move along an input sequence:
  $h_t = \tanh(U x_t + W h_{t-1})$
• This recurrent multiplication can cause difficulties known as the exploding and vanishing gradient problems
• For example, consider a simplified RNN that lacks the inputs $x$:
  $h_t = W h_{t-1}$
• It can be simplified to
  $h_t = W^t h_0$
• If $W$ has the eigendecomposition $W = V\,\mathrm{diag}(\lambda)\,V^{-1}$ (here $V$ is the matrix of eigenvectors, not the output matrix), then
  $W^t = (V\,\mathrm{diag}(\lambda)\,V^{-1})^t = V\,\mathrm{diag}(\lambda)^t\,V^{-1}$
• Any eigenvalue $\lambda_i$ that is not near an absolute value of 1 will either
  • explode if it is greater than 1 in magnitude, or
  • vanish if it is less than 1 in magnitude
• The gradients through such a graph are also scaled according to $\mathrm{diag}(\lambda)^t$ (a small numerical sketch follows below)
• Whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction
• That is, it is not impossible to learn long-term dependencies, but it might take a very long time, because the signal about these dependencies tends to be hidden by the smallest fluctuations arising from short-term dependencies
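The following small NumPy sketch illustrates this point numerically: a matrix is built with known eigenvalues, and repeated application scales each eigen-direction by $\lambda_i^t$, so the component with $|\lambda_i| > 1$ blows up while the others shrink toward zero. The eigenvalues, the name Q for the eigenvector matrix (to avoid clashing with the output matrix $V$), and the time steps printed are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.1, 0.9, 0.5])            # eigenvalues: one above 1, two below
Q = rng.normal(size=(3, 3))                # random eigenvector matrix (invertible w.p. 1)
W = Q @ np.diag(lam) @ np.linalg.inv(Q)    # W with exactly those eigenvalues

for t in [1, 10, 50, 100]:
    Wt = np.linalg.matrix_power(W, t)      # W^t = Q diag(lam**t) Q^{-1}
    print(f"t={t:3d}  lam**t={lam**t}  ||W^t||_2={np.linalg.norm(Wt, 2):.3g}")
```

As $t$ grows, the printed norm is dominated by $1.1^t$ (explosion), while the $0.9^t$ and $0.5^t$ components vanish; gradients flowing back through such a recurrence are scaled in the same way.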
Solution 1: Truncated BPTT
• Run the forward pass as before, but run the backward pass over chunks of the sequence instead of the whole sequence
(Figure: the network unfolded over steps 1–6, with the backward pass split into chunks.)

Solution 2: Gating mechanism (LSTM, GRU)
• Add gates that create paths along which gradients can flow more steadily over longer horizons, without vanishing or exploding
• We'll see how in the next chapter

Outline
• RNN
• LSTM
• GRU
• Tasks with RNN
• Software packages

Long Short-Term Memory (LSTM)
• Capable of modeling longer-term dependencies by having memory cells and gates that control the information flow along the memory cells
Images: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• The contents of the memory cell $C_t$ are regulated by various gates:
  • Forget gate $f_t$
  • Input gate $i_t$
  • Reset gate $r_t$
  • Output gate $o_t$
• Each gate is composed of an affine transformation followed by a sigmoid activation function

Forget gate
• It determines how much of the contents of the previous cell $C_{t-1}$ will be erased (we will see how it works in the next few slides)
• A linear transformation of the concatenated previous hidden state and input is followed by a sigmoid:
  $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
• The sigmoid produces values between 0 and 1:
  • 0: completely remove the information in that dimension
  • 1: completely keep the information in that dimension

New candidate cell and input gate
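The extracted slides are cut off at this point. As a hedged illustration of how the remaining pieces fit together, here is a minimal NumPy sketch of one full LSTM step: the forget gate follows the equation above, while the input gate $i_t$, candidate cell $\tilde{C}_t$, and output gate $o_t$ use the standard formulation from the colah post linked above. Treat those parts, and all names and sizes in the snippet, as assumptions rather than content from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM step. Only the forget gate appears explicitly in the excerpt above;
    the other gates follow the standard formulation from the referenced colah post."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)             # forget gate: how much of C_{t-1} to keep
    i_t = sigmoid(Wi @ z + bi)             # input gate: how much new content to write
    C_tilde = np.tanh(Wc @ z + bc)         # candidate cell contents
    C_t = f_t * C_prev + i_t * C_tilde     # new cell state
    o_t = sigmoid(Wo @ z + bo)             # output gate: how much of the cell to expose
    h_t = o_t * np.tanh(C_t)               # new hidden state
    return h_t, C_t

# Toy usage with random weights (illustrative sizes only)
d_in, d_h = 3, 4
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(d_h, d_h + d_in)) if i % 2 == 0
          else np.zeros(d_h) for i in range(8)]
h, C = np.zeros(d_h), np.zeros(d_h)
h, C = lstm_step(rng.normal(size=d_in), h, C, *params)
print(h, C)
```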