
Lecture 9: Recurrent Neural Networks @ UvA

Previous Lecture

o Word and Language Representations
o From n-grams to Neural Networks
o Skip-gram

Lecture Overview

o Recurrent Neural Networks (RNN) for sequences
o Backpropagation Through Time
o RNNs using Long Short-Term Memory (LSTM)
o Applications of Recurrent Neural Networks

Recurrent Neural Networks

[Figure: a recurrent NN cell with the input at the bottom, the output at the top, and recurrent state connections feeding back into the cell.]

Sequential vs. static data

o Abusing the terminology
◦ static data come in a mini-batch of size $D$
◦ dynamic data come in a mini-batch of size 1


What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Modelling sequences is …

o ... roughly equivalent to predicting what is going to happen next

$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$

o Easy to generalize to sequences of arbitrary length
o Considering small chunks $x_i$ → fewer parameters, easier modelling
o Often it is convenient to pick a "frame" $T$

$\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \dots, x_{i-1})$
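As a tiny worked example of this factorization (the sentence is only an illustration, reused from the slides that follow):

$\Pr(\text{I am Bond}) = \Pr(\text{I}) \cdot \Pr(\text{am} \mid \text{I}) \cdot \Pr(\text{Bond} \mid \text{I am})$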

Word representations

o One-hot vector
◦ After the one-hot vector, add an embedding (a small sketch follows the table below)
o Instead of a one-hot vector, directly use a word representation
◦ Word2vec
◦ GloVe

One-hot vectors for the words "I am Bond James" over the vocabulary:

Vocabulary   "I"   "am"   "Bond"   "James"
I             1     0      0        0
am            0     1      0        0
Bond          0     0      1        0
James         0     0      0        1
tired         0     0      0        0
,             0     0      0        0
McGuire       0     0      0        0
!             0     0      0        0
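A minimal NumPy sketch of the two options above. The vocabulary is the toy one from the table; the 3-dimensional embedding values are made up for the example:

```python
import numpy as np

# Toy vocabulary from the table above.
vocab = ["I", "am", "Bond", "James", "tired", ",", "McGuire", "!"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Option 1: one-hot vector followed by a learned embedding matrix E.
rng = np.random.default_rng(0)
E = rng.normal(size=(3, len(vocab)))      # a made-up 3-dimensional embedding
x = one_hot("Bond")
embedded = E @ x                          # equals selecting the "Bond" column of E

# Option 2: skip the one-hot product and look the column up directly
# (what word2vec / GloVe tables effectively give you).
embedded_direct = E[:, word_to_id["Bond"]]

assert np.allclose(embedded, embedded_direct)
print(one_hot("I"), embedded)
```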

What is a sequence, really?

o Data inside a sequence are not i.i.d.
◦ independent and identically distributed
o The next "word" depends on the previous "words"
◦ Ideally on all of them
o We need context, and we need memory!
o How do we model context and memory?

Example: "I am Bond , James" → the next word should be "Bond" (rather than "McGuire", "tired", "am", or "!").

Modelling-wise, what is memory?

o A representation (variable) of the past
o How do we adapt a simple Neural Network to include a memory variable?

[Figure: a network unrolled over time — inputs $x_t, x_{t+1}, x_{t+2}, x_{t+3}$ enter through the input parameters $U$; the memories $c_t, c_{t+1}, c_{t+2}, c_{t+3}$ are chained through the memory parameters $W$; outputs $y_t, y_{t+1}, y_{t+2}, y_{t+3}$ are produced through the output parameters $V$.]

Recurrent Neural Network (RNN)

[Figure: left, the unrolled/unfolded network — inputs $x_t, x_{t+1}, x_{t+2}$ feed memories $c_t, c_{t+1}, c_{t+2}$ through $U$, the memories are chained through $W$, and outputs $y_t, y_{t+1}, y_{t+2}$ are produced through $V$; right, the compact recurrent network — a single cell with input $x_t$ (through $U$), output $y_t$ (through $V$), and a recurrent connection $W$ from the previous state $(c_{t-1})$.]

Why unrolled?

o Imagine we care only for 3-grams
o Are the two networks that different?
◦ Steps instead of layers
◦ Step parameters are the same (shared parameters in a NN)
• Sometimes the intermediate outputs are not even needed
• Removing them, we almost end up with a standard Neural Network

[Figure: left, a 3-gram unrolled recurrent network — three "layers/steps" sharing the same parameters $U$, $W$, $V$ and producing outputs $y_1, y_2, y_3$ (often only the final output $y_3$ is needed); right, a 3-layer neural network with separate per-layer parameters $W_1, W_2, W_3$ and a single final output.]

RNN equations

o At time step $t$ (the input $x_t$ is a one-hot vector)

$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \text{softmax}(V c_t)$

o Example
◦ Vocabulary of 500 words
◦ An input projection of 50 dimensions (U: [50 × 500])
◦ A memory of 128 units ($c_t$: [128 × 1], W: [128 × 128])
◦ An output projection of 500 dimensions (V: [500 × 128])
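A minimal NumPy sketch of the two equations above. To keep the two terms of the update compatible, U here maps the 500-dimensional one-hot input directly to the 128-dimensional memory (so U is 128 × 500 rather than the slide's separate 50-dimensional input projection); treat that, and the random initialisation, as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

K, H = 500, 128                            # vocabulary size, memory size
U = rng.normal(scale=0.01, size=(H, K))    # input projection (assumption: one-hot -> memory)
W = rng.normal(scale=0.01, size=(H, H))    # memory-to-memory weights
V = rng.normal(scale=0.01, size=(K, H))    # output projection

def softmax(z):
    z = z - z.max()                        # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, c_prev):
    """One step: c_t = tanh(U x_t + W c_{t-1}), y_t = softmax(V c_t)."""
    c_t = np.tanh(U @ x_t + W @ c_prev)
    y_t = softmax(V @ c_t)
    return c_t, y_t

# Run the cell over a toy sequence of word indices.
c = np.zeros(H)
for word_id in [3, 17, 42]:
    x = np.zeros(K)
    x[word_id] = 1.0                       # one-hot input
    c, y = rnn_step(x, c)
print(y.shape)                             # (500,) -- a distribution over the vocabulary
```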


o Cross-entropy loss

$P = \prod_{t,k} y_{tk}^{l_{tk}} \;\Rightarrow\; \mathcal{L} = -\log P = -\frac{1}{T} \sum_t \sum_k l_{tk} \log y_{tk}$

◦ For non-target words $l_{tk} = 0$ ⇒ the loss is computed only for the target words
◦ Loss of a random prediction? ⇒ $\mathcal{L} = \log K$

o Usually the loss is reported as perplexity

$J = 2^{\mathcal{L}}$
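A small sketch of the loss and the perplexity above, assuming one-hot targets and base-2 logarithms so that $J = 2^{\mathcal{L}}$ matches the formula; the toy predictions are made up:

```python
import numpy as np

# Toy softmax outputs y[t] over K = 4 words, and one-hot targets l[t].
y = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.1, 0.6, 0.2]])
l = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]])

T, K = y.shape
loss = -(l * np.log2(y)).sum() / T        # only the target entries contribute
perplexity = 2.0 ** loss
print(loss, perplexity)

# Sanity check: a uniform ("random") prediction gives loss = log2(K).
uniform = np.full_like(y, 1.0 / K)
print(-(l * np.log2(uniform)).sum() / T, np.log2(K))
```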

Training an RNN

o Backpropagation Through Time (BPTT)

[Figure: an RNN unrolled over three steps with shared parameters $U$ (input), $W$ (memory), and $V$ (output), producing outputs $y_1, y_2, y_3$.]

Backpropagation Through Time

o We need $\dfrac{\partial \mathcal{L}}{\partial V}, \dfrac{\partial \mathcal{L}}{\partial W}, \dfrac{\partial \mathcal{L}}{\partial U}$
o To make it simpler, let's focus on step 3: $\dfrac{\partial \mathcal{L}_3}{\partial V}, \dfrac{\partial \mathcal{L}_3}{\partial W}, \dfrac{\partial \mathcal{L}_3}{\partial U}$

$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \text{softmax}(V c_t)$
$\mathcal{L} = -\sum_t l_t \log y_t = \sum_t \mathcal{L}_t$

[Figure: the three-step unrolled RNN with inputs $x_1, x_2, x_3$ and per-step outputs and losses $(y_1, \mathcal{L}_1), (y_2, \mathcal{L}_2), (y_3, \mathcal{L}_3)$.]


o $\dfrac{\partial \mathcal{L}_3}{\partial V} = \dfrac{\partial \mathcal{L}_3}{\partial y_3} \dfrac{\partial y_3}{\partial V} = (y_3 - l_3) \otimes c_3$ (an outer product)


o $\dfrac{\partial \mathcal{L}_3}{\partial W} = \dfrac{\partial \mathcal{L}_3}{\partial y_3} \dfrac{\partial y_3}{\partial c_3} \dfrac{\partial c_3}{\partial W}$
o What is the relation between $c_3$ and $W$?
◦ Two-fold, since $c_t = \tanh(U x_t + W c_{t-1})$: $W$ appears both directly and through $c_{t-1}$
o $\dfrac{\partial f(\varphi(x), \psi(x))}{\partial x} = \dfrac{\partial f}{\partial \varphi} \dfrac{\partial \varphi}{\partial x} + \dfrac{\partial f}{\partial \psi} \dfrac{\partial \psi}{\partial x}$
o $\dfrac{\partial c_3}{\partial W} \propto c_2 + \dfrac{\partial c_2}{\partial W}$ (using $\dfrac{\partial W}{\partial W} = 1$)

Recursively

o $\dfrac{\partial c_3}{\partial W} = c_2 + \dfrac{\partial c_2}{\partial W}$
o $\dfrac{\partial c_2}{\partial W} = c_1 + \dfrac{\partial c_1}{\partial W}$
o $\dfrac{\partial c_1}{\partial W} = c_0 + \dfrac{\partial c_0}{\partial W}$
o Putting it together: $\dfrac{\partial c_3}{\partial W} = \sum_{t=1}^{3} \dfrac{\partial c_3}{\partial c_t} \dfrac{\partial c_t}{\partial W} \;\Rightarrow\; \dfrac{\partial \mathcal{L}_3}{\partial W} = \sum_{t=1}^{3} \dfrac{\partial \mathcal{L}_3}{\partial y_3} \dfrac{\partial y_3}{\partial c_3} \dfrac{\partial c_3}{\partial c_t} \dfrac{\partial c_t}{\partial W}$
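A compact NumPy sketch of backpropagation through time for the memory weights $W$, using a small random RNN and a simple squared-error loss on the last state only (the softmax/cross-entropy part is left out to keep it short); the finite-difference check at the end is just a sanity test of the recursion above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, T = 4, 5, 6                          # input size, memory size, sequence length
U = rng.normal(scale=0.5, size=(H, D))
W = rng.normal(scale=0.5, size=(H, H))
xs = rng.normal(size=(T, D))
target = rng.normal(size=H)

def forward(W):
    cs = [np.zeros(H)]                     # c_0 = 0
    for t in range(T):
        cs.append(np.tanh(U @ xs[t] + W @ cs[-1]))
    loss = 0.5 * np.sum((cs[-1] - target) ** 2)
    return cs, loss

def bptt_grad_W(W):
    cs, _ = forward(W)
    dW = np.zeros_like(W)
    delta = cs[-1] - target                # dL/dc_T
    for t in range(T, 0, -1):              # walk back through time
        da = delta * (1.0 - cs[t] ** 2)    # through the tanh of step t
        dW += np.outer(da, cs[t - 1])      # the "c_{t-1}" term of the recursion
        delta = W.T @ da                   # pass the gradient on to c_{t-1}
    return dW

# Finite-difference check on one entry of W.
eps, (i, j) = 1e-6, (2, 3)
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)
print(bptt_grad_W(W)[i, j], numeric)       # the two numbers should agree closely
```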

Are RNNs perfect?

o No
◦ Although in theory, yes!
o Vanishing gradients
◦ After a few time steps the gradients become almost 0
o Exploding gradients
◦ After a few time steps the gradients become huge
o They can't really capture long-term dependencies

Exploding gradients

o $\dfrac{\partial \mathcal{L}_r}{\partial W} = \sum_{t=1}^{r} \dfrac{\partial \mathcal{L}_r}{\partial y_r} \dfrac{\partial y_r}{\partial c_r} \dfrac{\partial c_r}{\partial c_t} \dfrac{\partial c_t}{\partial W}$
o $\dfrac{\partial c_r}{\partial c_t}$ can be decomposed further with the chain rule
o $\dfrac{\partial c_r}{\partial c_t} = \dfrac{\partial c_r}{\partial c_{r-1}} \cdot \dfrac{\partial c_{r-1}}{\partial c_{r-2}} \cdot \ldots \cdot \dfrac{\partial c_{t+1}}{\partial c_t}$
◦ Factors with $t \ll r$ contribute long-term dependencies; the rest are short-term factors
o When there are many long-term factors for which $\dfrac{\partial c_i}{\partial c_{i-1}} \gg 1$
◦ then $\dfrac{\partial c_r}{\partial c_t} \gg 1$
◦ then $\dfrac{\partial \mathcal{L}_r}{\partial W} \gg 1$ ⇒ the gradient explodes

Remedy for exploding gradients

o Scale the gradient when its norm exceeds a threshold $\theta_0$
o Step 1. $g \leftarrow \dfrac{\partial \mathcal{L}}{\partial \theta}$
o Step 2. Is $\lVert g \rVert > \theta_0$?
◦ Step 3a. If yes, $g \leftarrow \dfrac{\theta_0}{\lVert g \rVert} g$
◦ Step 3b. If no, do nothing
o Simple, but it works! (A minimal sketch follows.)
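A minimal NumPy version of the rule above, using the L2 norm of the gradient (the usual choice, assumed here):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so that its L2 norm never exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g       # Step 3a: scale down to the threshold
    return g                             # Step 3b: otherwise leave it unchanged

g = np.array([3.0, 4.0])                 # norm 5
print(clip_gradient(g, 1.0))             # rescaled to norm 1
print(clip_gradient(g, 10.0))            # unchanged
```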

Vanishing gradients

o The same decomposition as before: $\dfrac{\partial \mathcal{L}_r}{\partial W} = \sum_{t=1}^{r} \dfrac{\partial \mathcal{L}_r}{\partial y_r} \dfrac{\partial y_r}{\partial c_r} \dfrac{\partial c_r}{\partial c_t} \dfrac{\partial c_t}{\partial W}$, with $\dfrac{\partial c_r}{\partial c_t} = \dfrac{\partial c_r}{\partial c_{r-1}} \cdot \ldots \cdot \dfrac{\partial c_{t+1}}{\partial c_t}$
o When many $\dfrac{\partial c_i}{\partial c_{i-1}} \to 0$ (e.g. with sigmoids)
◦ then $\dfrac{\partial c_r}{\partial c_t} \to 0$
◦ then $\dfrac{\partial \mathcal{L}_r}{\partial W} \to 0$ ⇒ the gradient vanishes
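A tiny numerical illustration of both failure modes: multiplying many per-step factors that sit consistently below (or above) 1 drives the product towards 0 (or towards very large values). The factor values are made up:

```python
import numpy as np

steps = 50
shrinking = np.full(steps, 0.9)          # e.g. saturated sigmoid derivatives
growing = np.full(steps, 1.1)            # e.g. a recurrent weight with large norm

print(np.prod(shrinking))                # ~0.005 -> the gradient vanishes
print(np.prod(growing))                  # ~117   -> the gradient explodes
```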

Advanced RNNs

[Figure: the LSTM cell — the cell state flows from $c_{t-1}$ to $c_t$ through a multiplication with the forget gate $f_t$ and an addition with the input-gated candidate $\tilde{c}_t$; the gates $f_t$, $i_t$, $o_t$ are sigmoids, the candidate uses a tanh, and the output memory $m_t$ is $\tanh(c_t)$ gated by $o_t$; the inputs are $x_t$ and $m_{t-1}$.]

A more realistic memory unit needs what?

[Figure: a more realistic memory unit takes the previous memory, the previous NN state, and the current input, and produces a new memory and a new NN state.]

Long Short-Term Memory (LSTM): a beefed-up RNN

$i = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
$f = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
$o = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$
$c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$
$m_t = \tanh(c_t) \odot o$

[Figure: the LSTM cell with inputs $x_t$ and $m_{t-1}$, gates $f_t$, $i_t$, $o_t$, candidate $\tilde{c}_t$, cell-state path $c_{t-1} \to c_t$, and output $m_t$.]
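A direct NumPy transcription of the equations above, keeping the slide's row-vector convention ($x_t U + m_{t-1} W$); the sizes and random weights are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                                   # input size, memory size

def init():
    return rng.normal(scale=0.1, size=(D, H)), rng.normal(scale=0.1, size=(H, H))

(U_i, W_i), (U_f, W_f), (U_o, W_o), (U_g, W_g) = init(), init(), init(), init()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev):
    """One LSTM step following the slide's equations."""
    i = sigmoid(x_t @ U_i + m_prev @ W_i)         # input gate
    f = sigmoid(x_t @ U_f + m_prev @ W_f)         # forget gate
    o = sigmoid(x_t @ U_o + m_prev @ W_o)         # output gate
    c_tilde = np.tanh(x_t @ U_g + m_prev @ W_g)   # candidate memory
    c_t = c_prev * f + c_tilde * i                # new cell state
    m_t = np.tanh(c_t) * o                        # new memory / output
    return m_t, c_t

m, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):                 # run over a toy sequence
    m, c = lstm_step(x, m, c)
print(m.shape, c.shape)                           # (8,) (8,)
```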

LSTM Step-by-Step: Step (1)

o E.g. model the sentence "Yesterday she slapped me. Today she loves me."
o Decide what to forget and what to remember for the new memory
◦ Sigmoid ≈ 1 → remember everything
◦ Sigmoid ≈ 0 → forget everything

LSTM Step-by-Step: Step (2)

o Decide what new information to add to the new memory
◦ Modulate the input $i_t$
◦ Generate candidate memories $\tilde{c}_t$

LSTM Step-by-Step: Step (3)

o Compute and update the current cell state $c_t$, which depends on
◦ the previous cell state
◦ what we decided to forget
◦ what inputs we allowed
◦ the candidate memories

LSTM Step-by-Step: Step (4)

o Modulate the output
◦ Does the cell state contain something relevant? → sigmoid ≈ 1
o Generate the new memory $m_t = \tanh(c_t) \odot o$

LSTM Unrolled Network

o Macroscopically very similar to standard RNNs
o The engine is a bit different (more complicated)
◦ Because of their gates, LSTMs capture long- and short-term dependencies

[Figure: three LSTM cells unrolled in time, each with its σ/σ/tanh/σ gates, connected through the additive cell-state path that runs across the steps.]

Even more beef to the RNNs

o LSTM with peephole connections
◦ The gates also have access to the previous cell state $c_{t-1}$ (not only to the memories)
◦ Coupled forget and input gates: $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$
◦ Bi-directional recurrent networks
o GRU
◦ A bit simpler than the LSTM (one common formulation is sketched below)
◦ Performance similar to the LSTM
o Deep LSTM

[Figure: a deep (stacked) LSTM — a first layer of LSTM (1) cells unrolled in time feeds a second layer of LSTM (2) cells.]
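For reference, a minimal sketch of one common GRU formulation (update gate $z$, reset gate $r$). This is the standard cell from the literature, not taken from these slides, so treat the exact form and the random parameters as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8
Wz, Uz = rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H))
Wr, Ur = rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H))
Wh, Uh = rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde           # one state, no separate cell

h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(x, h)
print(h.shape)
```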


Applications of Recurrent Networks

Text generation

o Generate text like Nietzsche, Shakespeare, or Wikipedia
o Generate Linux-kernel-like C code
o Or even generate a new website

$\Pr(x) = \prod_i \Pr(x_i \mid x_{\text{past}}; \theta)$, where $\theta$ are the LSTM parameters (a sampling-loop sketch follows the figure)

[Figure: a chain of LSTM cells trained on Shakespeare ("To be or …"); at prediction time the model continues a seed character by character ("Two are or … O Romeo").]
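A sketch of the generation loop: repeatedly feed the last prediction back in and sample the next symbol from the softmax output. It reuses a plain recurrent step like the one sketched earlier (any trained recurrent cell would do); the vocabulary and parameters here are random placeholders, so the generated "text" is meaningless:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abcd ")                     # a toy character vocabulary
K, H = len(vocab), 16
U = rng.normal(scale=0.1, size=(H, K))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(K, H))

def step(x, c):
    c = np.tanh(U @ x + W @ c)
    logits = V @ c
    p = np.exp(logits - logits.max())
    return c, p / p.sum()

def generate(seed_id, length=20):
    c, out = np.zeros(H), [seed_id]
    for _ in range(length):
        x = np.zeros(K)
        x[out[-1]] = 1.0                  # feed the previous symbol back in
        c, p = step(x, c)
        out.append(rng.choice(K, p=p))    # sample Pr(x_i | x_past; theta)
    return "".join(vocab[i] for i in out)

print(generate(seed_id=0))
```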

Handwriting generation

o Problem 1: model real coordinates instead of one-hot vectors
o Recurrent Mixture Density Networks
o Have the outputs follow a Gaussian distribution
◦ The output needs to be suitably squashed
o We don't just fit a Gaussian to the data
◦ We also condition on the previous outputs

$\Pr(o_t) = \sum_i w_i(x_{1:T}) \, \mathcal{N}(o_t \mid \mu_i(x_{1:T}), \Sigma_i(x_{1:T}))$
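A small sketch of how the mixture density above is evaluated for one output $o_t$, assuming diagonal covariances for simplicity; the mixture weights, means, and variances here are made up, whereas in practice they would come from the recurrent network:

```python
import numpy as np

def gaussian_diag(o, mu, var):
    """Density of a diagonal-covariance Gaussian at the point o."""
    return np.prod(np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))

def mixture_density(o, weights, mus, vars_):
    """Pr(o_t) = sum_i w_i * N(o_t | mu_i, Sigma_i), with diagonal Sigma_i."""
    return sum(w * gaussian_diag(o, mu, var)
               for w, mu, var in zip(weights, mus, vars_))

# Made-up 2-component mixture over 2-D pen coordinates.
weights = np.array([0.3, 0.7])                    # must sum to 1
mus = np.array([[0.0, 0.0], [1.0, 1.0]])
vars_ = np.array([[0.5, 0.5], [0.2, 0.2]])

print(mixture_density(np.array([0.8, 0.9]), weights, mus, vars_))
```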

Machine translation

o The phrase in the source language is one sequence
◦ "Today the weather is very good"
o The phrase in the target language is also a sequence
◦ "Vandaag de weer is heel goed"
o Problems
◦ No perfect word alignment; sentence lengths might differ
o Solution
◦ An encoder-decoder scheme (a toy sketch follows the figure)

[Figure: encoder-decoder machine translation — an encoder chain of LSTM cells reads "Today the weather is good", and a decoder chain of LSTM cells produces "Vandaag de weer is heel goed".]
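A toy, untrained sketch of the encoder-decoder idea: one recurrent cell consumes the whole source sequence, its final state initialises a second cell, and the decoder then emits target tokens one at a time (greedily here). Everything in it — the vocabularies, the random parameters, the greedy decoding, the reuse of "<eos>" as start token — is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab = ["Today", "the", "weather", "is", "good"]
tgt_vocab = ["Vandaag", "de", "weer", "is", "heel", "goed", "<eos>"]
H = 16

def rnn(in_size):
    return (rng.normal(scale=0.1, size=(H, in_size)),
            rng.normal(scale=0.1, size=(H, H)))

U_enc, W_enc = rnn(len(src_vocab))
U_dec, W_dec = rnn(len(tgt_vocab))
V_dec = rng.normal(scale=0.1, size=(len(tgt_vocab), H))

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def translate(src_ids, max_len=10):
    # Encoder: squash the whole source sentence into one state.
    c = np.zeros(H)
    for i in src_ids:
        c = np.tanh(U_enc @ one_hot(i, len(src_vocab)) + W_enc @ c)
    # Decoder: start from the encoder state and emit greedily.
    out, prev = [], len(tgt_vocab) - 1            # "<eos>" reused as start token
    for _ in range(max_len):
        c = np.tanh(U_dec @ one_hot(prev, len(tgt_vocab)) + W_dec @ c)
        prev = int(np.argmax(V_dec @ c))
        if tgt_vocab[prev] == "<eos>":
            break
        out.append(tgt_vocab[prev])
    return " ".join(out)

print(translate([0, 1, 2, 3, 4]))                 # untrained, so the output is arbitrary
```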

Better Machine Translation

o It might even pay off to reverse the source sentence
◦ The first target words will then be closer to their respective source words
o The encoder and decoder parts can be modelled with different LSTMs
o Deep LSTMs can be used

[Figure: the same encoder-decoder, now reading the reversed source "good is weather the Today" and producing "Vandaag de weer is heel goed".]

Image captioning

o An image is a thousand words, literally!
o Pretty much the same as machine translation
o Replace the encoder part with the output of a ConvNet
◦ E.g. use an AlexNet or a VGG16 network
o Keep the decoder part to operate as a translator

[Figure: a ConvNet encodes the image, and an LSTM decoder generates the caption "Today the weather is good" word by word.]

Question answering

o Bleeding-edge research, no real consensus
◦ Very interesting open research problems
o Again, pretty much like machine translation
o Encoder-decoder scheme
◦ Feed the question to the encoder part
◦ Model the answer with the decoder part
o You can also do question answering with images
◦ Again, bleeding-edge research
◦ How/where to add the image?
◦ What has worked so far is to add the image only at the beginning

Example (text): Q: "John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?" A: "John entered the living room."
Example (image): Q: "What are the people playing?" A: "They play beach football."

Summary

o Recurrent Neural Networks (RNN) for sequences
o Backpropagation Through Time
o RNNs using Long Short-Term Memory (LSTM)
o Applications of Recurrent Neural Networks

Next lecture

o Memory networks
o Recursive networks
