
Lecture 9: Recurrent Neural Networks
Deep Learning @ UvA (Efstratios Gavves)

Previous Lecture
o Word and language representations
o From n-grams to neural networks
o Word2vec
o Skip-gram

Lecture Overview
o Recurrent Neural Networks (RNNs) for sequences
o Backpropagation Through Time
o RNNs using Long Short-Term Memory (LSTM)
o Applications of Recurrent Neural Networks

Recurrent Neural Networks
[Figure: a neural network cell with an input, an output, and recurrent connections that feed the state back into the cell.]

Sequential vs. static data
o Abusing the terminology somewhat:
◦ static data come in a mini-batch of size D
◦ dynamic (sequential) data come in a mini-batch of size 1
o What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Modelling sequences is ...
o ... roughly equivalent to predicting what is going to happen next:
  Pr(x) = ∏_i Pr(x_i | x_1, …, x_{i-1})
o Easy to generalize to sequences of arbitrary length
o Considering small chunks x_i → fewer parameters, easier modelling
o It is often convenient to pick a fixed "frame" T:
  Pr(x) = ∏_i Pr(x_i | x_{i-T}, …, x_{i-1})

Word representations
o One-hot vector
◦ After the one-hot vector, add an embedding
o Or skip the one-hot vector and directly use a word representation
◦ Word2vec
◦ GloVe
o Example: a small vocabulary and the one-hot vectors of the words "I am Bond James"

  Vocabulary | I   am  Bond  James
  I          | 1   0    0     0
  am         | 0   1    0     0
  Bond       | 0   0    1     0
  James      | 0   0    0     1
  tired      | 0   0    0     0
  ,          | 0   0    0     0
  McGuire    | 0   0    0     0
  !          | 0   0    0     0

What is a sequence, really?
o Data inside a sequence are not i.i.d.
◦ (independently, identically distributed)
o The next "word" depends on the previous "words"
◦ Ideally on all of them
o We need context, and we need memory!
o How do we model context and memory?
[Figure: given the context "I am Bond , James", the model must pick the next word among candidates such as am, Bond, tired, McGuire, ! — the correct continuation is "Bond".]
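To make the one-hot representation and the fixed "frame" T concrete, here is a minimal sketch in Python/NumPy. It is illustrative only: the vocabulary ordering, the helper names (one_hot, word_to_idx), and the choice T = 2 are assumptions, not something defined on the slides.

```python
import numpy as np

# Toy vocabulary from the slides, one index per word (ordering is an assumption).
vocab = ["I", "am", "Bond", "James", "tired", ",", "McGuire", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a one-hot vector for a single word."""
    v = np.zeros(vocab_size)
    v[word_to_idx[word]] = 1.0
    return v

sentence = ["I", "am", "Bond", ",", "James", "Bond", "!"]

# One-hot encode the whole sentence: shape (sequence length, vocabulary size).
X = np.stack([one_hot(w) for w in sentence])
print(X.shape)   # (7, 8)
print(X[0])      # "I" -> [1. 0. 0. 0. 0. 0. 0. 0.]

# Fixed "frame" T: model Pr(x_i | x_{i-T}, ..., x_{i-1}) by turning the
# sequence into (context, next word) training pairs.
T = 2
pairs = [(sentence[i - T:i], sentence[i]) for i in range(T, len(sentence))]
for context, target in pairs:
    print(context, "->", target)   # e.g. ['I', 'am'] -> 'Bond'
```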
Modelling-wise, what is memory?
o A representation (variable) of the past
o How do we adapt a simple neural network to include a memory variable?
[Figure: a network unrolled over time with inputs x_t, x_{t+1}, x_{t+2}, x_{t+3}, memory states c_t, c_{t+1}, c_{t+2}, c_{t+3}, and outputs y_t, y_{t+1}, y_{t+2}, y_{t+3}; U are the input parameters, W the memory parameters, V the output parameters.]

Recurrent Neural Network (RNN)
[Figure: the recurrent network (x_t → c_t → y_t, with c_{t-1} fed back into c_t through W) and the equivalent unrolled/unfolded network over steps t, t+1, t+2, with U, W, V shared across all steps.]

Why unrolled?
o Imagine we care only about 3-grams
o Are the two networks really that different?
◦ Steps instead of layers
◦ Step parameters are the same (shared parameters within the network), whereas a 3-layer network has separate parameters W_1, W_2, W_3 per layer
o Sometimes the intermediate outputs y_1, y_2 are not even needed
o Removing them, we almost end up with a standard neural network
[Figure: a 3-gram unrolled recurrent network (shared U, W, V at every step) side by side with a 3-layer neural network (per-layer parameters W_1, W_2, W_3), both producing a single final output.]
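As a small sketch of the unrolled view (again illustrative; the sizes and function names are made up, not the course code), the same U, W, V are reused at every step and only the memory c_t changes. This is exactly what makes a 3-step unrolled RNN resemble a 3-layer network with tied weights, where the intermediate outputs are simply never computed:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 8, 16                              # toy sizes (assumed)
U = rng.normal(scale=0.1, size=(hidden, vocab_size))    # input parameters
W = rng.normal(scale=0.1, size=(hidden, hidden))        # memory parameters
V = rng.normal(scale=0.1, size=(vocab_size, hidden))    # output parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def unrolled_forward(xs, steps=3):
    """Run the same cell for a fixed number of steps, like a 3-gram unrolled RNN."""
    c = np.zeros(hidden)                       # c_0: empty memory
    for t in range(steps):
        c = np.tanh(U @ xs[t] + W @ c)         # shared U, W at every step
    return softmax(V @ c)                      # only the final output y_3 is kept

# Three one-hot inputs (a 3-gram); intermediate outputs y_1, y_2 are never formed.
xs = np.eye(vocab_size)[[0, 1, 2]]             # e.g. "I am Bond" in the toy vocabulary
y3 = unrolled_forward(xs)
print(y3.shape, y3.sum())                      # (8,) 1.0
```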
RNN equations
o At time step t (the input x_t is a one-hot vector):
  c_t = tanh(U x_t + W c_{t-1})
  y_t = softmax(V c_t)
o Example
◦ Vocabulary of 500 words
◦ An input projection of 50 dimensions (U: [50 × 500])
◦ A memory of 128 units (c_t: [128 × 1], W: [128 × 128])
◦ An output projection of 500 dimensions (V: [500 × 128])

Loss function
o Cross-entropy loss
  P = ∏_{t,k} y_{tk}^{l_{tk}}  ⇒  ℒ = −log P = −(1/T) ∑_t l_t log y_t
◦ Non-target words have l_t = 0 ⇒ the loss is computed only for the target words
◦ Loss of random guessing: ℒ = log K (with K the vocabulary size)
o The loss is usually reported as perplexity: J = 2^ℒ

Training an RNN
o Backpropagation Through Time (BPTT)
[Figure: the three-step unrolled network with parameters U, W, V, through which the gradients are propagated backwards.]

Backpropagation Through Time
o We need ∂ℒ/∂V, ∂ℒ/∂W, ∂ℒ/∂U
o To make it simpler, let's focus on step 3: ∂ℒ_3/∂V, ∂ℒ_3/∂W, ∂ℒ_3/∂U
o Recall: c_t = tanh(U x_t + W c_{t-1}), y_t = softmax(V c_t), ℒ = −∑_t l_t log y_t = ∑_t ℒ_t
o For V:
  ∂ℒ_3/∂V = (∂ℒ_3/∂y_3)(∂y_3/∂V) = (y_3 − l_3) ⊗ c_3
o For W:
  ∂ℒ_3/∂W = (∂ℒ_3/∂y_3)(∂y_3/∂c_3)(∂c_3/∂W)
◦ What is the relation between c_3 and W? It is two-fold: in c_t = tanh(U x_t + W c_{t-1}), W appears explicitly and also inside c_{t-1}
◦ Total-derivative rule: ∂f(φ(x), ψ(x))/∂x = (∂f/∂φ)(∂φ/∂x) + (∂f/∂ψ)(∂ψ/∂x)
◦ Hence ∂c_3/∂W ∝ c_2 + ∂c_2/∂W (using ∂W/∂W = 1)

Recursively
o ∂c_3/∂W = c_2 + ∂c_2/∂W
o ∂c_2/∂W = c_1 + ∂c_1/∂W
o ∂c_1/∂W = c_0 + ∂c_0/∂W
o Unrolling the recursion: ∂c_3/∂W = ∑_{t=1}^{3} (∂c_3/∂c_t)(∂c_t/∂W), so
  ∂ℒ_3/∂W = ∑_{t=1}^{3} (∂ℒ_3/∂y_3)(∂y_3/∂c_3)(∂c_3/∂c_t)(∂c_t/∂W)

Are RNNs perfect?
o No
◦ Although in theory, yes!
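Putting the RNN equations, the loss, and the BPTT recursion together, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the official course code: the sizes are toy values, the 50-dimensional input projection from the example is folded into a single matrix U of shape hidden × vocabulary, perplexity is computed with base-2 logs so that J = 2^ℒ, and the gradients are those of the summed natural-log cross-entropy ∑_t ℒ_t.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 8, 16                                  # toy vocabulary and memory sizes (assumed)
U = rng.normal(scale=0.1, size=(H, K))        # input parameters
W = rng.normal(scale=0.1, size=(H, H))        # memory parameters
V = rng.normal(scale=0.1, size=(K, H))        # output parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x_idx):
    """c_t = tanh(U x_t + W c_{t-1}),  y_t = softmax(V c_t)."""
    cs, ys = [np.zeros(H)], []                # cs[0] is c_0 (empty memory)
    for i in x_idx:
        c = np.tanh(U @ np.eye(K)[i] + W @ cs[-1])
        cs.append(c)
        ys.append(softmax(V @ c))
    return cs, ys

def loss_and_perplexity(ys, targets):
    """Average cross-entropy in base 2 and its perplexity J = 2^L."""
    L = -np.mean([np.log2(y[t]) for y, t in zip(ys, targets)])
    return L, 2.0 ** L

def bptt(x_idx, targets):
    """BPTT for the summed cross-entropy sum_t -log y_t[target_t]."""
    cs, ys = forward(x_idx)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dc_next = np.zeros(H)                     # gradient flowing in from step t+1
    for t in reversed(range(len(x_idx))):
        dy = ys[t].copy()
        dy[targets[t]] -= 1.0                 # dL_t/d(V c_t) = y_t - l_t
        dV += np.outer(dy, cs[t + 1])         # (y_t - l_t) outer c_t
        dc = V.T @ dy + dc_next               # direct path + path through c_{t+1}
        da = dc * (1.0 - cs[t + 1] ** 2)      # through the tanh
        dW += np.outer(da, cs[t])             # uses c_{t-1}, as in the recursion
        dU += np.outer(da, np.eye(K)[x_idx[t]])
        dc_next = W.T @ da                    # pass the gradient one step further back
    return dU, dW, dV

# "I am Bond" predicting "am Bond ," with the toy 8-word vocabulary.
x_idx, targets = [0, 1, 2], [1, 2, 5]
cs, ys = forward(x_idx)
L, J = loss_and_perplexity(ys, targets)
print(f"loss {L:.3f}, perplexity {J:.2f}")    # near-random weights give J close to K = 8
dU, dW, dV = bptt(x_idx, targets)
print(dU.shape, dW.shape, dV.shape)           # (16, 8) (16, 16) (8, 16)
```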