Recurrent Neural Networks and Long-Short Term Memory (LSTM)

11/19/2018 Recurrent neural networks and Long-short term memory (LSTM) Jeong Min Lee CS3750 Outline • RNN • RNN • Unfolding Computational Graph • Backpropagation and weight update • Explode / Vanishing gradient problem • LSTM • GRU • Tasks with RNN • Software Packages 1 11/19/2018 So far we are • Modeling sequence (time-series) and predicting future values by probabilistic models (AR, HMM, LDS, Particle Filtering, Hawkes Process, etc) • E.g. LDS • Observation 푥푡 is modeled as emission 푧 푧 푧 matrix 퐶, hidden state 푧푡 with Gaussian 푡−1 푡 푡+1 noise 푤푡 푥푡 = 퐶푧푡 + 푤푡 ; 푤푡~푁 푤 0, Σ • The hidden state is also probabilistically computed with transition matrix 퐴 and 푥푡−1 푥푡 푥푡+1 Gaussian noise 푣푡 푧푡 = 퐴푧푡−1 + 푣푡 ; 푣푡~푁(푤|0, Γ) Paradigm Shift to RNN • We are moving into a new world where no probabilistic component exists in a model • That is, you cannot do inference like in LDS and HMM • In RNN, hidden states bear no probabilistic form or assumption • Given fixed input and target from data, RNN is to learn intermediate association between them and also the real-valued vector representation 2 11/19/2018 RNN • RNN’s input, output, and internal representation (hidden states) are all real-valued vectors • ℎ푡: hidden states; real-valued vector ℎ푡 = tanh 푈푥푡 + 푊ℎ푡−1 • 푥푡: input vector (real-valued) • 푉ℎ푡: real-valued vector 푦ො = λ(푉ℎ푡) • 푦ො : output vector (real-valued) RNN • RNN consists of three parameter matrices (푈, 푊, 푉) with activation functions • 푈: input-hidden matrix ℎ = tanh 푈푥 + 푊ℎ 푡 푡 푡−1 • 푊: hidden-hidden matrix • 푉: hidden-output matrix 푦ො = λ(푉ℎ푡) 3 11/19/2018 RNN • tanh ∙ is a tangent hyperbolic function. It models non-linearity. ℎ푡 = tanh 푈푥푡 + 푊ℎ푡−1 (z) 푦ො = λ(푉ℎ푡) tanh z RNN • λ ∙ is output transformation function • It can be any function and selected for a task and type of target in data • It can be even another feed-forward neural network and it makes RNN to model anything without restriction ℎ = tanh 푈푥 + 푊ℎ • Sigmoid: binary probability distribution 푡 푡 푡−1 • Softmax: categorical probability distribution • ReLU: positive real-value output 푦ො = λ(푉ℎ푡) • Identity function: real-value output 4 11/19/2018 Make a prediction • Let’s see how it makes a prediction • In the beginning, initial hidden state ℎ0 is filled with zero or random value • Also we assume the model is already trained. (we will see how it is trained soon) ℎ0 푥1 Make a prediction • Assume we currently have observation 푥1 and want to predict 푥2 • We compute hidden states ℎ1 first ℎ1 = tanh 푈푥1 + 푊ℎ0 푊 ℎ0 ℎ1 푈 푥1 5 11/19/2018 Make a prediction • Then we generate prediction: • 푉ℎ1 is a real-valued vector or scalar value (depends on the size of output matrix 푉) ℎ1 = tanh 푈푥1 + 푊ℎ0 푥ො2 푉, λ() 푥ො2 = 푦ො = λ(푉ℎ1) 푊 ℎ0 ℎ1 푈 푥1 푥ො2 Make a prediction multiple steps • In prediction for multiple steps a head, predicted value 푥ො2 from previous step is considered as input 푥2 at time step 2 ℎ2 = tanh 푈푥ො2 + 푊ℎ1 푥ො2 푥ො3 푉, λ() 푉, λ() 푥ො3 = 푦ො = λ(푉ℎ2) 푊 푊 ℎ0 ℎ1 ℎ2 푈 푈 푥1 푥ො2 푥ො3 6 11/19/2018 Make a prediction multiple steps • Same mechanism applies forward in time.. ℎ3 = tanh 푈푥ො3 + 푊ℎ2 푥ො4 = 푦ො = λ(푉ℎ3) 푥ො2 푥ො3 푥ො4 푉, λ() 푉, λ() 푉, λ() 푊 푊 푊 ℎ0 ℎ1 ℎ2 ℎ3 푈 푈 푈 푥1 푥ො2 푥ො3 푥ො4 RNN Characteristic • You might observed that… • Parameters 푈, 푉, 푊 are shared across all time steps • No probabilistic component (random number generation) is involved • So, everything is deterministic 푥ො2 푥ො3 푥ො4 푉, λ() 푉, λ() 푉, λ() 푊 푊 푊 ℎ0 ℎ1 ℎ2 ℎ3 푈 푈 푈 푥1 푥ො2 푥ො3 푥ො4 7 11/19/2018 Another way to see RNN • RNN is a type of neural network Neural Network • Cascading several linear weights with nonlinear 푦 activation functions in between them 푉 • 푦: output ℎ • 푉: Hidden-Output matrix 푈 • ℎ: hidden units (states) 푥 • 푈: Input-Hidden matrix • 푥: input 8 11/19/2018 Neural Network • In traditional NN, it is assumed that every input is 푦 independent each other 푉 • But with sequential data, input in current time step is ℎ highly likely depends on input in previous time step 푈 푥 • We need some additional structure that can model dependencies of inputs over time Recurrent Neural Network • A type of a neural network that has a recurrence structure • The recurrence structure allows us to operate over a sequence of vectors 푦 푉 푊 ℎ 푈 푥 9 11/19/2018 RNN as an Unfolding Computational Graph 푦 푦ො푡−1 푦ො푡 푦ො푡+1 푉 푉 푉 푉 푊 푊 푊 푊 푊 … … ℎ Unfold ℎ ℎ푡−1 ℎ푡 ℎ푡+1 ℎ 푈 푈 푈 푈 푥 푥푡−1 푥푡 푥푡+1 RNN as an Unfolding Computational Graph 푦 푦ො푡−1 푦ො푡 푦ො푡+1 푉 푉 푉 푉 푊 푊 푊 푊 푊 … … ℎ Unfold ℎ ℎ푡−1 ℎ푡 ℎ푡+1 ℎ 푈 푈 푈 푈 푥 푥푡−1 푥푡 푥푡+1 RNN can be converted into a feed-forward neural network by unfolding over time 10 11/19/2018 How to train RNN? • Before make train happen, we need to define these: • 푦푡 : true target • 푦ො푡 : output of RNN (=prediction for true target) • 퐸푡: error (loss); difference between the true target and the output • As the output transformation function 휆 is selected by the task and data, so does the loss: (e.g.) • Binary Classification: Binary Cross Entropy • Categorical Classification: Cross Entropy • Regression: Mean Squared Error With the loss, the RNN will be like: 푦 푦푡−1 푦푡 푦푡+1 퐸 퐸푡−1 퐸푡 퐸푡+1 표 Unfold 푦ො푡−1 푦ො푡 푦ො푡+1 푉 푉 푉 푉 푊 푊 푊 푊 푊 … … ℎ ℎ ℎ푡−1 ℎ푡 ℎ푡+1 ℎ 푈 푈 푈 푈 푥 푥푡−1 푥푡 푥푡+1 11 11/19/2018 Back Propagation Through Time (BPTT) • Extension of standard backpropagation 푦1 푦2 푦3 that performs gradient descent on an unfolded network 퐸1 퐸2 퐸3 • Goal is to calculate gradients of the error with respect to parameters U, V, and W 푦ො1 푦ො2 푦ො3 and learn desired parameters using 푉 푉 푉 Stochastic Gradient Descent 푊 푊 ℎ1 ℎ2 ℎ3 푈 푈 푈 푥1 푥2 푥3 Back Propagation Through Time (BPTT) • To update in one training example 푦1 푦2 푦3 (sequence), we sum up the gradients at each time of the sequence: 퐸1 퐸2 퐸3 휕퐸 휕퐸 푦ො1 푦ො2 푦ො3 = ෍ 푡 휕푊 휕푊 푉 푉 푉 푡 푊 푊 ℎ1 ℎ2 ℎ3 푈 푈 푈 푥1 푥2 푥3 12 11/19/2018 Learning Parameters 푦1 푦2 푦3 ℎ푡 = tanh(푈푥푡 + 푊ℎ푡−1) 푧 = 푈푥 + 푊ℎ 푡 푡 푡−1 퐸1 퐸2 퐸3 ℎ푡 = tanh(푧푡) 푦ො1 푦ො2 푦ො3 휕ℎ 휕ℎ • Let 푘 훼 = 푘 = 1 − ℎ 2 푉 푉 푉 휆푘 = 푘 푘 푊 푊 휕푊 휕푧푘 ℎ1 ℎ2 ℎ3 휕퐸 푘 푈 푈 푈 훽푘 = = 표푘 − 푦푘 푉 휕ℎ푘 푥1 푥2 푥3 Learning Parameters 휕퐸푘 휕퐸푘 휕ℎ푘 푦1 푦2 푦3 = = 훽푘휆푘 휕푊 휕ℎ푘 휕푊 퐸1 퐸2 퐸3 휕ℎ 휕ℎ 휕푧 휆 = 푘 = 푘 푘 = 훼 (ℎ + 푊휆 ) 푘 휕푊 휕푧 휕푊 푘 푘−1 푘−1 푘 푦ො1 푦ො2 푦ො3 푉 푉 푉 휕ℎ 휕푧 푊 푊 휓 = 푘 = 훼 푘 = 훼 (푥 + 푊휓 ) 푘 휕푈 푘 휕푈 푘 푘 푘−1 ℎ1 ℎ2 ℎ3 푈 푈 푈 푥1 푥2 푥3 13 11/19/2018 Initialization: 2 훼0 = 1 − ℎ0 ; 휆0 = 0; 휓0 = 훼0 ∙ 푥0 Then, Δ푤 = 0 ; Δ푢 = 0 ; Δ푣 = 0 푉푛푒푤 = 푉표푙푑 + 훼Δ푣 For k= 1...T (T; length of a sequence): 푊푛푒푤 = 푊표푙푑 + 훼Δ푤 2 푈 = 푈표푙 + 훼Δ푢 훼푘 = 1 − ℎ푘 푛푒푤 푑 휆푘 = 훼푘(ℎ푘−1 + 푊휆푘−1) 훼: learning rate 훽푘 = 표푘 − 푦푘 푉 Δ푤 = Δ푤 + 훽푘휆푘 휓푘 = 훼푘(푥푘 + 푊휓푘−1) Δ푢 = Δ푢 + 훽푘휓푘 Δ푣 = Δ푣 + 표푘 − 푦푘 ⊗ ℎ푘 Exploding and Vanishing Gradient Problem • In RNN, we repeatedly multiply W along 푦1 푦2 푦3 with a input sequence 퐸 퐸 퐸 ℎ푡 = tanh(푈푥푡 + 푊ℎ푡−1) 1 2 3 푦ො 푦ො 푦ො • The recurrence multiplication can result in 1 2 3 푉 푉 푉 difficulties called exploding and vanishing 푊 푊 gradient problem ℎ1 ℎ2 ℎ3 푈 푈 푈 푥1 푥2 푥3 14 11/19/2018 Exploding and Vanishing Gradient Problem • For example, we can think of simple RNN with lacking inputs 풙 ℎ푡 = 푊ℎ푡−1 • It can be simplified to ℎ푡 = 푊푡ℎ0 • If 푊푡 has an Eigen decomposition 푊푡 = (푉 diag 휆 푉−1)푡= 푉 diag 휆푡 푉−1 Exploding and Vanishing Gradient Problem 푊푡 = (푉 diag 휆 푉−1)푡= 푉 diag 휆푡 푉−1 • Any eigenvalues 휆푖 that are not near an absolute value of 1 will either • explode if they are greater than 1 in magnitude • vanish if they are less than 1 in magnitude • The gradients through such a graph are also scaled according to diag 휆푡 15 11/19/2018 Exploding and Vanishing Gradient Problem 푊푡 = (푉 diag 휆 푉−1)푡= 푉 diag 휆푡 푉−1 • Whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction • That is, it is not impossible to learn, but that it might take a very long time to learn long-term dependencies: • Because the signal about these dependencies will tend to be hidden by the smallest ﬂuctuations arising from short-term dependencies Solution1: Truncated BPTT 푦 푦 푦 푦 푦 푦 • Run forward as it is, but run 1 2 3 4 5 6 the backward in the chunk of the sequence instead of the 퐸1 퐸2 퐸3 퐸4 퐸5 퐸6 whole sequence 푦ො1 푦ො2 푦ො3 푦ො4 푦ො5 푦ො6 푉 푉 푉 푉 푉 푉 푊 푊 푊 푊 푊 ℎ1 ℎ2 ℎ3 ℎ4 ℎ5 ℎ6 푈 푈 푈 푈 푈 푈 푥1 푥2 푥3 푥4 푥5 푥6 16 11/19/2018 Solution2: Gating mechanism (LSTM;GRU) • Add gates to produce paths where gradients can flow more constantly in longer-term without vanishing nor exploding • We’ll see in next chapter Outline • RNN • LSTM • GRU • Tasks with RNN • Software Packages 17 11/19/2018 Long Short-term Memory (LSTM) • Capable of modeling longer term dependencies by having memory cells and gates that controls the information flow along with the memory cells Long Short-term Memory (LSTM) • Capable of modeling longer term dependencies by having memory cells and gates that controls the information flow along with the memory cells Images: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 18 11/19/2018 Long Short-term Memory (LSTM) • The contents of the memory cells 퐶푡 are regulated by various gates: • Forget gate 푓푡 • Input gate 푡 • Reset gate 푟푡 • Output gate 표푡 • Each gates are composed of affine transformation with Sigmoid activation function Forget Gate • It determines how much contents from previous cell 퐶푡−1 will be erased (we will see how it works in next a few slides) • Linear transformation of concatenated previous hidden states and input are followed by Sigmoid function • The sigmoid generates values 0 and 1: • 0 : completely remove info in the 푓푡 = 휎(푊푓 ∙ ℎ푡−1, 푥푡 + 푏푓) dimension • 1 : completely keep info in the dimension 19 11/19/2018 New Candidate Cell and Input Gate •

Recurrent Neural Networks and Long-Short Term Memory (LSTM)

Intro to Tensorflow 2.0 MBL, August 2019

Ways to Use Machine Learning Approaches for Software Development

Keras2c: a Library for Converting Keras Neural Networks to Real-Time Compatible C

Tensorflow, Theano, Keras, Torch, Caffe Vicky Kalogeiton, Stéphane Lathuilière, Pauline Luc, Thomas Lucas, Konstantin Shmelkov Introduction

Deep Learning with Keras I

Intro to Keras

Auto-Keras: an Efficient Neural Architecture Search System

Wavefront Parallelization of Recurrent Neural Networks on Multi-Core

Introduction to Keras Tensorflow

Deep Learning Software Security and Fairness of Deep Learning SP18 Today

Approach Pre-Trained Deep Learning Models with Caution Pre-Trained Models Are Easy to Use, but Are You Glossing Over Details That Could Impact Your Model Performance?

Fake News Detection and Production Using Transformer-Based NLP Models