
Lecture 9: Recurrent Neural Networks @ UvA

Previous Lecture

o Word and Language Representations
o From n-grams to Neural Networks
o Skip-gram

Lecture Overview

o Recurrent Neural Networks (RNN) for sequences
o Backpropagation Through Time
o RNNs using Long Short-Term Memory (LSTM)
o Applications of Recurrent Neural Networks

Recurrent Neural Networks

[Figure: a recurrent NN cell with the input at the bottom, the output at the top, and recurrent state connections feeding back into the cell.]

Sequential vs. static data

o Abusing the terminology
◦ static data come in a mini-batch of size $D$
◦ dynamic data come in a mini-batch of size 1


What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Modelling sequences is …

o ... roughly equivalent to predicting what is going to happen next

$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$

o Easy to generalize to sequences of arbitrary length
o Considering small chunks $x_i$ → fewer parameters, easier modelling
o Often it is convenient to pick a "frame" $T$

$\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \dots, x_{i-1})$
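As a tiny worked example of this factorization (the sentence is only an illustration, reused from the slides that follow):

$\Pr(\text{I am Bond}) = \Pr(\text{I}) \cdot \Pr(\text{am} \mid \text{I}) \cdot \Pr(\text{Bond} \mid \text{I am})$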

Word representations

o One-hot vector
◦ After the one-hot vector, add an embedding (a small sketch follows the table below)
o Instead of a one-hot vector, directly use a word representation
◦ Word2vec
◦ GloVe

One-hot vectors for the words "I am Bond James" over the vocabulary:

Vocabulary   "I"   "am"   "Bond"   "James"
I             1     0      0        0
am            0     1      0        0
Bond          0     0      1        0
James         0     0      0        1
tired         0     0      0        0
,             0     0      0        0
McGuire       0     0      0        0
!             0     0      0        0
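A minimal NumPy sketch of the two options above. The vocabulary is the toy one from the table; the 3-dimensional embedding values are made up for the example:

```python
import numpy as np

# Toy vocabulary from the table above.
vocab = ["I", "am", "Bond", "James", "tired", ",", "McGuire", "!"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Option 1: one-hot vector followed by a learned embedding matrix E.
rng = np.random.default_rng(0)
E = rng.normal(size=(3, len(vocab)))      # a made-up 3-dimensional embedding
x = one_hot("Bond")
embedded = E @ x                          # equals selecting the "Bond" column of E

# Option 2: skip the one-hot product and look the column up directly
# (what word2vec / GloVe tables effectively give you).
embedded_direct = E[:, word_to_id["Bond"]]

assert np.allclose(embedded, embedded_direct)
print(one_hot("I"), embedded)
```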

What is a sequence, really?

o Data inside a sequence are not i.i.d.
◦ independent and identically distributed
o The next "word" depends on the previous "words"
◦ Ideally on all of them
o We need context, and we need memory!
o How do we model context and memory?

Example: "I am Bond , James" → the next word should be "Bond" (rather than "McGuire", "tired", "am", or "!").

Modelling-wise, what is memory?

o A representation (variable) of the past
o How do we adapt a simple Neural Network to include a memory variable?

[Figure: a network unrolled over time — inputs $x_t, x_{t+1}, x_{t+2}, x_{t+3}$ enter through the input parameters $U$; the memories $c_t, c_{t+1}, c_{t+2}, c_{t+3}$ are chained through the memory parameters $W$; outputs $y_t, y_{t+1}, y_{t+2}, y_{t+3}$ are produced through the output parameters $V$.]

Recurrent Neural Network (RNN)

[Figure: left, the unrolled/unfolded network — inputs $x_t, x_{t+1}, x_{t+2}$ feed memories $c_t, c_{t+1}, c_{t+2}$ through $U$, the memories are chained through $W$, and outputs $y_t, y_{t+1}, y_{t+2}$ are produced through $V$; right, the compact recurrent network — a single cell with input $x_t$ (through $U$), output $y_t$ (through $V$), and a recurrent connection $W$ from the previous state $(c_{t-1})$.]

Why unrolled?

o Imagine we care only for 3-grams
o Are the two networks that different?
◦ Steps instead of layers
◦ Step parameters are the same (shared parameters in a NN)
• Sometimes the intermediate outputs are not even needed
• Removing them, we almost end up with a standard Neural Network

[Figure: left, a 3-gram unrolled recurrent network — three "layers/steps" sharing the same parameters $U$, $W$, $V$ and producing outputs $y_1, y_2, y_3$ (often only the final output $y_3$ is needed); right, a 3-layer neural network with separate per-layer parameters $W_1, W_2, W_3$ and a single final output.]

RNN equations

o At time step $t$ (the input $x_t$ is a one-hot vector)

$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \text{softmax}(V c_t)$

o Example
◦ Vocabulary of 500 words
◦ An input projection of 50 dimensions (U: [50 × 500])
◦ A memory of 128 units ($c_t$: [128 × 1], W: [128 × 128])
◦ An output projection of 500 dimensions (V: [500 × 128])
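A minimal NumPy sketch of the two equations above. To keep the two terms of the update compatible, U here maps the 500-dimensional one-hot input directly to the 128-dimensional memory (so U is 128 × 500 rather than the slide's separate 50-dimensional input projection); treat that, and the random initialisation, as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

K, H = 500, 128                            # vocabulary size, memory size
U = rng.normal(scale=0.01, size=(H, K))    # input projection (assumption: one-hot -> memory)
W = rng.normal(scale=0.01, size=(H, H))    # memory-to-memory weights
V = rng.normal(scale=0.01, size=(K, H))    # output projection

def softmax(z):
    z = z - z.max()                        # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, c_prev):
    """One step: c_t = tanh(U x_t + W c_{t-1}), y_t = softmax(V c_t)."""
    c_t = np.tanh(U @ x_t + W @ c_prev)
    y_t = softmax(V @ c_t)
    return c_t, y_t

# Run the cell over a toy sequence of word indices.
c = np.zeros(H)
for word_id in [3, 17, 42]:
    x = np.zeros(K)
    x[word_id] = 1.0                       # one-hot input
    c, y = rnn_step(x, c)
print(y.shape)                             # (500,) -- a distribution over the vocabulary
```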


o Cross-entropy loss

$P = \prod_{t,k} y_{tk}^{l_{tk}} \;\Rightarrow\; \mathcal{L} = -\log P = -\frac{1}{T} \sum_t \sum_k l_{tk} \log y_{tk}$

◦ For non-target words $l_{tk} = 0$ ⇒ the loss is computed only for the target words
◦ Loss of a random prediction? ⇒ $\mathcal{L} = \log K$

o Usually the loss is reported as perplexity

$J = 2^{\mathcal{L}}$
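A small sketch of the loss and the perplexity above, assuming one-hot targets and base-2 logarithms so that $J = 2^{\mathcal{L}}$ matches the formula; the toy predictions are made up:

```python
import numpy as np

# Toy softmax outputs y[t] over K = 4 words, and one-hot targets l[t].
y = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.1, 0.6, 0.2]])
l = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]])

T, K = y.shape
loss = -(l * np.log2(y)).sum() / T        # only the target entries contribute
perplexity = 2.0 ** loss
print(loss, perplexity)

# Sanity check: a uniform ("random") prediction gives loss = log2(K).
uniform = np.full_like(y, 1.0 / K)
print(-(l * np.log2(uniform)).sum() / T, np.log2(K))
```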

Training an RNN

o Backpropagation Through Time (BPTT)

[Figure: an RNN unrolled over three steps with shared parameters $U$ (input), $W$ (memory), and $V$ (output), producing outputs $y_1, y_2, y_3$.]

Backpropagation Through Time

o We need $\dfrac{\partial \mathcal{L}}{\partial V}, \dfrac{\partial \mathcal{L}}{\partial W}, \dfrac{\partial \mathcal{L}}{\partial U}$
o To make it simpler, let's focus on step 3: $\dfrac{\partial \mathcal{L}_3}{\partial V}, \dfrac{\partial \mathcal{L}_3}{\partial W}, \dfrac{\partial \mathcal{L}_3}{\partial U}$

$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \text{softmax}(V c_t)$
$\mathcal{L} = -\sum_t l_t \log y_t = \sum_t \mathcal{L}_t$

[Figure: the three-step unrolled RNN with inputs $x_1, x_2, x_3$ and per-step outputs and losses $(y_1, \mathcal{L}_1), (y_2, \mathcal{L}_2), (y_3, \mathcal{L}_3)$.]


o $\dfrac{\partial \mathcal{L}_3}{\partial V} = \dfrac{\partial \mathcal{L}_3}{\partial y_3} \dfrac{\partial y_3}{\partial V} = (y_3 - l_3) \otimes c_3$ (an outer product)


o $\dfrac{\partial \mathcal{L}_3}{\partial W} = \dfrac{\partial \mathcal{L}_3}{\partial y_3} \dfrac{\partial y_3}{\partial c_3} \dfrac{\partial c_3}{\partial W}$
o What is the relation between $c_3$ and $W$?
◦ Two-fold, since $c_t = \tanh(U x_t + W c_{t-1})$: $W$ appears both directly and through $c_{t-1}$
o $\dfrac{\partial f(\varphi(x), \psi(x))}{\partial x} = \dfrac{\partial f}{\partial \varphi} \dfrac{\partial \varphi}{\partial x} + \dfrac{\partial f}{\partial \psi} \dfrac{\partial \psi}{\partial x}$
o $\dfrac{\partial c_3}{\partial W} \propto c_2 + \dfrac{\partial c_2}{\partial W}$ (using $\dfrac{\partial W}{\partial W} = 1$)

Recursively

o $\dfrac{\partial c_3}{\partial W} = c_2 + \dfrac{\partial c_2}{\partial W}$
o $\dfrac{\partial c_2}{\partial W} = c_1 + \dfrac{\partial c_1}{\partial W}$
o $\dfrac{\partial c_1}{\partial W} = c_0 + \dfrac{\partial c_0}{\partial W}$
o Putting it together: $\dfrac{\partial c_3}{\partial W} = \sum_{t=1}^{3} \dfrac{\partial c_3}{\partial c_t} \dfrac{\partial c_t}{\partial W} \;\Rightarrow\; \dfrac{\partial \mathcal{L}_3}{\partial W} = \sum_{t=1}^{3} \dfrac{\partial \mathcal{L}_3}{\partial y_3} \dfrac{\partial y_3}{\partial c_3} \dfrac{\partial c_3}{\partial c_t} \dfrac{\partial c_t}{\partial W}$
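A compact NumPy sketch of backpropagation through time for the memory weights $W$, using a small random RNN and a simple squared-error loss on the last state only (the softmax/cross-entropy part is left out to keep it short); the finite-difference check at the end is just a sanity test of the recursion above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, T = 4, 5, 6                          # input size, memory size, sequence length
U = rng.normal(scale=0.5, size=(H, D))
W = rng.normal(scale=0.5, size=(H, H))
xs = rng.normal(size=(T, D))
target = rng.normal(size=H)

def forward(W):
    cs = [np.zeros(H)]                     # c_0 = 0
    for t in range(T):
        cs.append(np.tanh(U @ xs[t] + W @ cs[-1]))
    loss = 0.5 * np.sum((cs[-1] - target) ** 2)
    return cs, loss

def bptt_grad_W(W):
    cs, _ = forward(W)
    dW = np.zeros_like(W)
    delta = cs[-1] - target                # dL/dc_T
    for t in range(T, 0, -1):              # walk back through time
        da = delta * (1.0 - cs[t] ** 2)    # through the tanh of step t
        dW += np.outer(da, cs[t - 1])      # the "c_{t-1}" term of the recursion
        delta = W.T @ da                   # pass the gradient on to c_{t-1}
    return dW

# Finite-difference check on one entry of W.
eps, (i, j) = 1e-6, (2, 3)
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)
print(bptt_grad_W(W)[i, j], numeric)       # the two numbers should agree closely
```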

Are RNNs perfect?

o No
◦ Although in theory, yes!
o Vanishing gradients
◦ After a few time steps the gradients become almost 0
o Exploding gradients
◦ After a few time steps the gradients become huge
o They can't really capture long-term dependencies

Exploding gradients

o $\dfrac{\partial \mathcal{L}_r}{\partial W} = \sum_{t=1}^{r} \dfrac{\partial \mathcal{L}_r}{\partial y_r} \dfrac{\partial y_r}{\partial c_r} \dfrac{\partial c_r}{\partial c_t} \dfrac{\partial c_t}{\partial W}$
o $\dfrac{\partial c_r}{\partial c_t}$ can be decomposed further with the chain rule
o $\dfrac{\partial c_r}{\partial c_t} = \dfrac{\partial c_r}{\partial c_{r-1}} \cdot \dfrac{\partial c_{r-1}}{\partial c_{r-2}} \cdot \ldots \cdot \dfrac{\partial c_{t+1}}{\partial c_t}$
◦ Factors with $t \ll r$ contribute long-term dependencies; the rest are short-term factors
o When there are many long-term factors for which $\dfrac{\partial c_i}{\partial c_{i-1}} \gg 1$
◦ then $\dfrac{\partial c_r}{\partial c_t} \gg 1$
◦ then $\dfrac{\partial \mathcal{L}_r}{\partial W} \gg 1$ ⇒ the gradient explodes

Remedy for exploding gradients

o Scale the gradient when its norm exceeds a threshold $\theta_0$
o Step 1. $g \leftarrow \dfrac{\partial \mathcal{L}}{\partial \theta}$
o Step 2. Is $\lVert g \rVert > \theta_0$?
◦ Step 3a. If yes, $g \leftarrow \dfrac{\theta_0}{\lVert g \rVert} g$
◦ Step 3b. If no, do nothing
o Simple, but it works! (A minimal sketch follows.)
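A minimal NumPy version of the rule above, using the L2 norm of the gradient (the usual choice, assumed here):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so that its L2 norm never exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g       # Step 3a: scale down to the threshold
    return g                             # Step 3b: otherwise leave it unchanged

g = np.array([3.0, 4.0])                 # norm 5
print(clip_gradient(g, 1.0))             # rescaled to norm 1
print(clip_gradient(g, 10.0))            # unchanged
```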

Vanishing gradients

o The same decomposition as before: $\dfrac{\partial \mathcal{L}_r}{\partial W} = \sum_{t=1}^{r} \dfrac{\partial \mathcal{L}_r}{\partial y_r} \dfrac{\partial y_r}{\partial c_r} \dfrac{\partial c_r}{\partial c_t} \dfrac{\partial c_t}{\partial W}$, with $\dfrac{\partial c_r}{\partial c_t} = \dfrac{\partial c_r}{\partial c_{r-1}} \cdot \ldots \cdot \dfrac{\partial c_{t+1}}{\partial c_t}$
o When many $\dfrac{\partial c_i}{\partial c_{i-1}} \to 0$ (e.g. with sigmoids)
◦ then $\dfrac{\partial c_r}{\partial c_t} \to 0$
◦ then $\dfrac{\partial \mathcal{L}_r}{\partial W} \to 0$ ⇒ the gradient vanishes
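A tiny numerical illustration of both failure modes: multiplying many per-step factors that sit consistently below (or above) 1 drives the product towards 0 (or towards very large values). The factor values are made up:

```python
import numpy as np

steps = 50
shrinking = np.full(steps, 0.9)          # e.g. saturated sigmoid derivatives
growing = np.full(steps, 1.1)            # e.g. a recurrent weight with large norm

print(np.prod(shrinking))                # ~0.005 -> the gradient vanishes
print(np.prod(growing))                  # ~117   -> the gradient explodes
```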

Advanced RNNs

[Figure: the LSTM cell — the cell state flows from $c_{t-1}$ to $c_t$ through a multiplication with the forget gate $f_t$ and an addition with the input-gated candidate $\tilde{c}_t$; the gates $f_t$, $i_t$, $o_t$ are sigmoids, the candidate uses a tanh, and the output memory $m_t$ is $\tanh(c_t)$ gated by $o_t$; the inputs are $x_t$ and $m_{t-1}$.]

A more realistic memory unit needs what?

[Figure: a more realistic memory unit takes the previous memory, the previous NN state, and the current input, and produces a new memory and a new NN state.]

Long Short-Term Memory (LSTM): a beefed-up RNN

$i = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
$f = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
$o = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$
$c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$
$m_t = \tanh(c_t) \odot o$

[Figure: the LSTM cell with inputs $x_t$ and $m_{t-1}$, gates $f_t$, $i_t$, $o_t$, candidate $\tilde{c}_t$, cell-state path $c_{t-1} \to c_t$, and output $m_t$.]
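A direct NumPy transcription of the equations above, keeping the slide's row-vector convention ($x_t U + m_{t-1} W$); the sizes and random weights are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                                   # input size, memory size

def init():
    return rng.normal(scale=0.1, size=(D, H)), rng.normal(scale=0.1, size=(H, H))

(U_i, W_i), (U_f, W_f), (U_o, W_o), (U_g, W_g) = init(), init(), init(), init()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev):
    """One LSTM step following the slide's equations."""
    i = sigmoid(x_t @ U_i + m_prev @ W_i)         # input gate
    f = sigmoid(x_t @ U_f + m_prev @ W_f)         # forget gate
    o = sigmoid(x_t @ U_o + m_prev @ W_o)         # output gate
    c_tilde = np.tanh(x_t @ U_g + m_prev @ W_g)   # candidate memory
    c_t = c_prev * f + c_tilde * i                # new cell state
    m_t = np.tanh(c_t) * o                        # new memory / output
    return m_t, c_t

m, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):                 # run over a toy sequence
    m, c = lstm_step(x, m, c)
print(m.shape, c.shape)                           # (8,) (8,)
```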

LSTM Step-by-Step: Step (1)

o E.g. model the sentence "Yesterday she slapped me. Today she loves me."
o Decide what to forget and what to remember for the new memory
◦ Sigmoid ≈ 1 → remember everything
◦ Sigmoid ≈ 0 → forget everything

LSTM Step-by-Step: Step (2)

o Decide what new information to add to the new memory
◦ Modulate the input $i_t$
◦ Generate candidate memories $\tilde{c}_t$

LSTM Step-by-Step: Step (3)

o Compute and update the current cell state $c_t$, which depends on
◦ the previous cell state
◦ what we decided to forget
◦ what inputs we allowed
◦ the candidate memories

LSTM Step-by-Step: Step (4)

o Modulate the output
◦ Does the cell state contain something relevant? → sigmoid ≈ 1
o Generate the new memory $m_t = \tanh(c_t) \odot o$

LSTM Unrolled Network

o Macroscopically very similar to standard RNNs
o The engine is a bit different (more complicated)
◦ Because of their gates, LSTMs capture long- and short-term dependencies

[Figure: three LSTM cells unrolled in time, each with its σ/σ/tanh/σ gates, connected through the additive cell-state path that runs across the steps.]

Even more beef to the RNNs

o LSTM with peephole connections
◦ The gates also have access to the previous cell state $c_{t-1}$ (not only to the memories)
◦ Coupled forget and input gates: $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$
◦ Bi-directional recurrent networks
o GRU
◦ A bit simpler than the LSTM (one common formulation is sketched below)
◦ Performance similar to the LSTM
o Deep LSTM

[Figure: a deep (stacked) LSTM — a first layer of LSTM (1) cells unrolled in time feeds a second layer of LSTM (2) cells.]
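For reference, a minimal sketch of one common GRU formulation (update gate $z$, reset gate $r$). This is the standard cell from the literature, not taken from these slides, so treat the exact form and the random parameters as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8
Wz, Uz = rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H))
Wr, Ur = rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H))
Wh, Uh = rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde           # one state, no separate cell

h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(x, h)
print(h.shape)
```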


Applications of Recurrent Networks

Text generation

o Generate text like Nietzsche, Shakespeare, or Wikipedia
o Generate Linux-kernel-like C code
o Or even generate a new website

$\Pr(x) = \prod_i \Pr(x_i \mid x_{\text{past}}; \theta)$, where $\theta$ are the LSTM parameters (a sampling-loop sketch follows the figure)

[Figure: a chain of LSTM cells trained on Shakespeare ("To be or …"); at prediction time the model continues a seed character by character ("Two are or … O Romeo").]
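A sketch of the generation loop: repeatedly feed the last prediction back in and sample the next symbol from the softmax output. It reuses a plain recurrent step like the one sketched earlier (any trained recurrent cell would do); the vocabulary and parameters here are random placeholders, so the generated "text" is meaningless:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abcd ")                     # a toy character vocabulary
K, H = len(vocab), 16
U = rng.normal(scale=0.1, size=(H, K))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(K, H))

def step(x, c):
    c = np.tanh(U @ x + W @ c)
    logits = V @ c
    p = np.exp(logits - logits.max())
    return c, p / p.sum()

def generate(seed_id, length=20):
    c, out = np.zeros(H), [seed_id]
    for _ in range(length):
        x = np.zeros(K)
        x[out[-1]] = 1.0                  # feed the previous symbol back in
        c, p = step(x, c)
        out.append(rng.choice(K, p=p))    # sample Pr(x_i | x_past; theta)
    return "".join(vocab[i] for i in out)

print(generate(seed_id=0))
```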

Handwriting generation

o Problem 1: model real coordinates instead of one-hot vectors
o Recurrent Mixture Density Networks
o Have the outputs follow a Gaussian distribution
◦ The output needs to be suitably squashed
o We don't just fit a Gaussian to the data
◦ We also condition on the previous outputs

$\Pr(o_t) = \sum_i w_i(x_{1:T}) \, \mathcal{N}(o_t \mid \mu_i(x_{1:T}), \Sigma_i(x_{1:T}))$
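A small sketch of how the mixture density above is evaluated for one output $o_t$, assuming diagonal covariances for simplicity; the mixture weights, means, and variances here are made up, whereas in practice they would come from the recurrent network:

```python
import numpy as np

def gaussian_diag(o, mu, var):
    """Density of a diagonal-covariance Gaussian at the point o."""
    return np.prod(np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))

def mixture_density(o, weights, mus, vars_):
    """Pr(o_t) = sum_i w_i * N(o_t | mu_i, Sigma_i), with diagonal Sigma_i."""
    return sum(w * gaussian_diag(o, mu, var)
               for w, mu, var in zip(weights, mus, vars_))

# Made-up 2-component mixture over 2-D pen coordinates.
weights = np.array([0.3, 0.7])                    # must sum to 1
mus = np.array([[0.0, 0.0], [1.0, 1.0]])
vars_ = np.array([[0.5, 0.5], [0.2, 0.2]])

print(mixture_density(np.array([0.8, 0.9]), weights, mus, vars_))
```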

Machine translation

o The phrase in the source language is one sequence
◦ "Today the weather is very good"
o The phrase in the target language is also a sequence
◦ "Vandaag de weer is heel goed"
o Problems
◦ No perfect word alignment; sentence lengths might differ
o Solution
◦ An encoder-decoder scheme (a toy sketch follows the figure)

[Figure: encoder-decoder machine translation — an encoder chain of LSTM cells reads "Today the weather is good", and a decoder chain of LSTM cells produces "Vandaag de weer is heel goed".]
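A toy, untrained sketch of the encoder-decoder idea: one recurrent cell consumes the whole source sequence, its final state initialises a second cell, and the decoder then emits target tokens one at a time (greedily here). Everything in it — the vocabularies, the random parameters, the greedy decoding, the reuse of "<eos>" as start token — is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab = ["Today", "the", "weather", "is", "good"]
tgt_vocab = ["Vandaag", "de", "weer", "is", "heel", "goed", "<eos>"]
H = 16

def rnn(in_size):
    return (rng.normal(scale=0.1, size=(H, in_size)),
            rng.normal(scale=0.1, size=(H, H)))

U_enc, W_enc = rnn(len(src_vocab))
U_dec, W_dec = rnn(len(tgt_vocab))
V_dec = rng.normal(scale=0.1, size=(len(tgt_vocab), H))

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def translate(src_ids, max_len=10):
    # Encoder: squash the whole source sentence into one state.
    c = np.zeros(H)
    for i in src_ids:
        c = np.tanh(U_enc @ one_hot(i, len(src_vocab)) + W_enc @ c)
    # Decoder: start from the encoder state and emit greedily.
    out, prev = [], len(tgt_vocab) - 1            # "<eos>" reused as start token
    for _ in range(max_len):
        c = np.tanh(U_dec @ one_hot(prev, len(tgt_vocab)) + W_dec @ c)
        prev = int(np.argmax(V_dec @ c))
        if tgt_vocab[prev] == "<eos>":
            break
        out.append(tgt_vocab[prev])
    return " ".join(out)

print(translate([0, 1, 2, 3, 4]))                 # untrained, so the output is arbitrary
```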

Better Machine Translation

o It might even pay off to reverse the source sentence
◦ The first target words will then be closer to their respective source words
o The encoder and decoder parts can be modelled with different LSTMs
o Deep LSTMs can be used

[Figure: the same encoder-decoder, now reading the reversed source "good is weather the Today" and producing "Vandaag de weer is heel goed".]

Image captioning

o An image is a thousand words, literally!
o Pretty much the same as machine translation
o Replace the encoder part with the output of a ConvNet
◦ E.g. use an AlexNet or a VGG16 network
o Keep the decoder part to operate as a translator

[Figure: a ConvNet encodes the image, and an LSTM decoder generates the caption "Today the weather is good" word by word.]

Question answering

o Bleeding-edge research, no real consensus
◦ Very interesting open research problems
o Again, pretty much like machine translation
o Encoder-decoder scheme
◦ Feed the question to the encoder part
◦ Model the answer with the decoder part
o You can also do question answering with images
◦ Again, bleeding-edge research
◦ How/where to add the image?
◦ What has worked so far is to add the image only at the beginning

Example (text): Q: "John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?" A: "John entered the living room."
Example (image): Q: "What are the people playing?" A: "They play beach football."

Summary

o Recurrent Neural Networks (RNN) for sequences
o Backpropagation Through Time
o RNNs using Long Short-Term Memory (LSTM)
o Applications of Recurrent Neural Networks

Next lecture

o Memory networks
o Recursive networks
