Lecture 10: Recurrent Neural Networks
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 10 - 1 May 7, 2020 Administrative: Midterm
- Midterm next Tue 5/12, take home: 1h 20min (+20min buffer) within a 24-hour time period
- Will be released on Gradescope
- See Piazza for detailed information
- Midterm review session: Fri 5/8 discussion section
- Midterm covers material up to this lecture (Lecture 10)

Administrative
- Project proposal feedback has been released
- Project milestone due Mon 5/18; see Piazza for requirements. You need to have some baseline / initial results by then, so start implementing soon if you haven't yet!
- A3 will be released Wed 5/13, due Wed 5/27
Last Time: CNN Architectures
GoogLeNet, AlexNet, ResNet, SENet
Comparing complexity...
An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
Efficient networks... MobileNets: Efficient Convolutional Neural Networks for Mobile Applications [Howard et al. 2017]
- Depthwise separable convolutions replace standard convolutions by factorizing them into a depthwise convolution and a 1x1 (pointwise) convolution
- Standard network: Conv (3x3, C->C), total compute 9C^2HW. MobileNets: depthwise Conv (3x3, C->C, groups=C) plus pointwise Conv (1x1, C->C), total compute 9CHW + C^2HW
- Much more efficient, with little loss in accuracy
- Follow-up MobileNetV2 work in 2018 (Sandler et al.)
- ShuffleNet: Zhang et al, CVPR 2018

Meta-learning: Learning to learn network architectures...

Neural Architecture Search with Reinforcement Learning (NAS) [Zoph et al. 2016]
- "Controller" network that learns to design a good network architecture (outputs a string corresponding to a network design)
- Iterate: 1) Sample an architecture from the search space. 2) Train the architecture to get a "reward" R corresponding to accuracy. 3) Compute the gradient of the sample probability and scale it by R to update the controller parameters (i.e. increase the likelihood of good architectures being sampled, decrease the likelihood of bad ones)
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al. 2017]
- Applying neural architecture search (NAS) to a large dataset like ImageNet is expensive
- Design a search space of building blocks ("cells") that can be flexibly stacked
- NASNet: Use NAS to find the best cell structure on the smaller CIFAR-10 dataset, then transfer the architecture to ImageNet
- Many follow-up works in this space, e.g. AmoebaNet (Real et al. 2019) and ENAS (Pham, Guan et al. 2018)
Today: Recurrent Neural Networks
"Vanilla" Neural Networks
Recurrent Neural Networks: Process Sequences
- e.g. image captioning: image -> sequence of words (one to many)
- e.g. action prediction: sequence of video frames -> action class (many to one)
- e.g. video captioning: sequence of video frames -> caption (many to many)
- e.g. video classification at the frame level (many to many)
Sequential Processing of Non-Sequence Data
Classify images by taking a series of “glimpses”
Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015. Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015 Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Generate images one piece at a time!
Gregor et al, "DRAW: A Recurrent Neural Network For Image Generation", ICML 2015. Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.

Recurrent Neural Network
[diagram: input x -> RNN -> output y]
Key idea: RNNs have an "internal state" that is updated as a sequence is processed.

[diagram: unrolled RNN, inputs x1, x2, x3, ..., xt each feed the RNN, producing outputs y1, y2, y3, ..., yt]
We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} the old state, x_t the input vector at time step t, and f_W some function with parameters W.

[diagram: h0 -> RNN -> h1 -> RNN -> h2 -> ... with inputs x1, x2, x3, ..., xt and outputs y1, y2, y3, ..., yt]
Notice: the same function and the same set of parameters are used at every time step.

(Simple) Recurrent Neural Network

The state consists of a single "hidden" vector h:

h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t
Sometimes called a “Vanilla RNN” or an “Elman RNN” after Prof. Jeffrey Elman
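The recurrence above can be sketched in a few lines of NumPy; the dimensions and random weights here are hypothetical toy values:

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, b):
    """One step of a vanilla (Elman) RNN: h_t = tanh(Wxh x_t + Whh h_{t-1} + b)."""
    return np.tanh(Wxh @ x + Whh @ h_prev + b)

# Toy dimensions: input dim 3, hidden dim 4.
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((4, 3)) * 0.1
Whh = rng.standard_normal((4, 4)) * 0.1
b = np.zeros(4)

h = np.zeros(4)                      # initial hidden state h_0
for x in [rng.standard_normal(3) for _ in range(5)]:
    h = rnn_step(x, h, Wxh, Whh, b)  # the same f_W and W at every step
```

Note that the loop calls the same function with the same parameters at every time step, which is exactly the weight sharing the slides emphasize.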
RNN: Computational Graph
[diagram: computational graph, h0 and x1 feed fW to produce h1; h1 and x2 feed fW to produce h2; and so on up to hT]

Re-use the same weight matrix W at every time-step.
RNN: Computational Graph: Many to Many
[diagram: each hidden state ht produces an output yt; each yt has a per-step loss Lt, and the total loss L is the sum of the Lt]
RNN: Computational Graph: Many to One
[diagram: the final hidden state hT produces a single output y]
RNN: Computational Graph: One to Many
[diagram: a single input x at the first step; each subsequent step produces an output y1, y2, y3, ..., yT]

What feeds the later steps? One option: feed in a fixed (e.g. zero) input at each step. Another: feed the previous output y_{t-1} back in as the next input.
Sequence to Sequence: Many-to-one + one-to-many
Many to one: Encode input sequence in a single vector
[diagram: encoder RNN consumes x1, x2, x3 and produces a final hidden state hT]
Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014
One to many: Produce output sequence from the single input vector
[diagram: decoder RNN unrolls from the encoder's final state, producing y1, y2, ...]
Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014
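The two halves compose naturally; a minimal NumPy sketch with hypothetical dimensions and random weights, where the decoder feeds its own outputs back in as its next inputs:

```python
import numpy as np

def step(x, h, Wxh, Whh):
    # One vanilla RNN step shared by encoder and decoder.
    return np.tanh(Wxh @ x + Whh @ h)

rng = np.random.default_rng(1)
H, D = 4, 3  # hidden dim, input dim (toy values)
enc_Wxh = rng.standard_normal((H, D)) * 0.1
enc_Whh = rng.standard_normal((H, H)) * 0.1
dec_Wyh = rng.standard_normal((H, H)) * 0.1
dec_Whh = rng.standard_normal((H, H)) * 0.1
Why = rng.standard_normal((H, H)) * 0.1  # hidden -> output (kept H-dim for brevity)

# Many to one: encode the input sequence into a single vector.
h = np.zeros(H)
for x in [rng.standard_normal(D) for _ in range(6)]:
    h = step(x, h, enc_Wxh, enc_Whh)

# One to many: unroll the decoder from that vector, feeding outputs back in.
y, outputs = np.zeros(H), []
for _ in range(4):
    h = step(y, h, dec_Wyh, dec_Whh)
    y = Why @ h
    outputs.append(y)
```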
Example: Character-level Language Model
Vocabulary: [h, e, l, o]
Example training sequence: "hello"

[diagram: input characters "h", "e", "l", "l" are one-hot encoded and fed through the hidden layer; at each step the output layer is trained to predict the next character]
Example: Character-level Language Model Sampling

Vocabulary: [h, e, l, o]

At test-time sample characters one at a time, feed back to model.

[diagram: at each step the model's scores pass through a softmax, giving a distribution over the vocabulary (e.g. p = [.03, .84, .00, .13] after "h"); a character ("e", then "l", "l", "o") is sampled from it and fed back as the next input]
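This sampling loop can be sketched directly; the weights here are untrained random values, so the sampled characters are arbitrary, but the mechanics (one-hot input, softmax, sample, feed back) match:

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']

def softmax(z):
    z = z - z.max()           # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
H, V = 8, len(vocab)
Wxh, Whh, Why = (rng.standard_normal(s) * 0.1 for s in [(H, V), (H, H), (V, H)])

h, idx, sampled = np.zeros(H), 0, []   # start from 'h' (index 0)
for _ in range(10):
    x = np.zeros(V)
    x[idx] = 1                         # one-hot encode the previous character
    h = np.tanh(Wxh @ x + Whh @ h)
    p = softmax(Why @ h)               # distribution over the vocabulary
    idx = rng.choice(V, p=p)           # sample, then feed back at the next step
    sampled.append(vocab[idx])
```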
Backpropagation through time: forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
Truncated Backpropagation through time
Run forward and backward through chunks of the sequence instead of whole sequence
Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
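Structurally, truncated BPTT looks like the following sketch: the sequence is cut into fixed-length chunks, the hidden state crosses chunk boundaries, but gradients would not (the sequence and chunk length are hypothetical):

```python
import numpy as np

seq = np.arange(100)     # a long (hypothetical) token sequence
chunk_len = 25           # forward/backward runs over 25 steps at a time

h = np.zeros(4)          # hidden state, carried forward across chunks
chunks = []
for start in range(0, len(seq), chunk_len):
    chunk = seq[start:start + chunk_len]
    # Forward and backward passes would run over this chunk only.
    # The incoming h is treated as a constant ("detached"), so no
    # gradient flows back past the chunk boundary.
    h = h.copy()
    chunks.append(chunk)
```

In a framework with autograd, the `h.copy()` line corresponds to detaching the hidden state from the graph at each chunk boundary.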
min-char-rnn.py gist: 112 lines of Python
(https://gist.github.com/karpathy/d4dee566867f8291f086)
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 10 - 52 May 7, 2020 at first: train more
train more
train more
The Stacks Project: open source algebraic geometry textbook
LaTeX source: http://stacks.math.columbia.edu/ (The Stacks Project is licensed under the GNU Free Documentation License)
Generated C code
OpenAI GPT-2 generated text (source)
Input: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Output: The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
Searching for interpretable cells
Examples of interpretable cells found: a quote detection cell, a line length tracking cell, an if statement cell, a quote/comment cell, and a code depth cell.

Karpathy, Johnson, and Fei-Fei, "Visualizing and Understanding Recurrent Networks", ICLR Workshop 2016. Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission.
RNN tradeoffs
RNN Advantages:
- Can process any length input
- Computation for step t can (in theory) use information from many steps back
- Model size doesn't increase for longer input
- The same weights are applied at every timestep, so there is symmetry in how inputs are processed

RNN Disadvantages:
- Recurrent computation is slow
- In practice, it is difficult to access information from many steps back
Image Captioning
Figure from Karpathy et al, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes.
Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
Show and Tell: A Neural Image Caption Generator, Vinyals et al.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick
[diagram: a Convolutional Neural Network encodes the image; a Recurrent Neural Network generates the caption]

test image
This image is CC0 public domain
[diagram: the test image is run through the CNN; the final classifier layer is removed, the image feature is handed to the RNN, and generation starts from a start token x0]
before: h = tanh(Wxh * x + Whh * h)
now: h = tanh(Wxh * x + Whh * h + Wih * v)

where v is the image feature vector from the CNN, brought in through the new weight matrix Wih.
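The modified recurrence can be sketched directly (all dimensions and weights are hypothetical toy values); note that the image term Wih * v enters at every step:

```python
import numpy as np

rng = np.random.default_rng(3)
H, D, F = 4, 3, 5                       # hidden, word-embedding, image-feature dims
Wxh = rng.standard_normal((H, D)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1
Wih = rng.standard_normal((H, F)) * 0.1

v = rng.standard_normal(F)              # image feature vector from the CNN
h = np.zeros(H)
for x in [rng.standard_normal(D) for _ in range(3)]:  # word vectors
    h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)          # image term added each step
```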
[diagram: at each step, sample a word from the output distribution y and feed it back in as the next input: "straw", then "hat", then a special end token, which finishes the caption]
Image Captioning: Example Results

Captions generated using neuraltalk2. All images are CC0 public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle.
A cat sitting on a suitcase on the floor. A cat is sitting on a tree branch. A dog is running in the grass with a frisbee. A white teddy bear sitting in the grass.
Two people walking on the beach with surfboards. A tennis player in action on the court. Two giraffes standing in a grassy field. A man riding a dirt bike on a dirt track.
Image Captioning: Failure Cases

Captions generated using neuraltalk2. All images are CC0 public domain: fur coat, handstand, spider web, baseball.
A bird is perched on a tree branch
A woman is holding a cat in her hand
A man in a baseball uniform throwing a ball
A woman standing on a beach holding a surfboard
A person holding a computer mouse on a desk
Image Captioning with Attention
RNN focuses its attention at a different spatial location when generating each word
Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
[diagram: the CNN maps the H x W x 3 image to a grid of features of shape L x D, where L = W x H; these features are used to initialize h0]
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
[diagram: from h0 the model computes a distribution a1 over the L locations; a weighted combination of the features with weights a1 gives a context vector z1 of dimension D]
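The weighted combination can be sketched as follows. Note that scoring the locations with a single matrix `Watt` is a simplifying assumption for illustration; the paper computes attention scores with a small network:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(4)
L, D, H = 6, 5, 4                       # L = W x H grid locations, D feature channels
feats = rng.standard_normal((L, D))     # CNN feature grid, flattened to L x D
h = rng.standard_normal(H)              # current hidden state
Watt = rng.standard_normal((L, H)) * 0.1  # hypothetical scoring matrix

a = softmax(Watt @ h)                   # distribution over the L locations
z = a @ feats                           # weighted combination of features: shape (D,)
```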
[diagram: the context vector z1 and the first word y1 feed into h1, which outputs a distribution d1 over the vocabulary and a new attention distribution a2 over the L locations; this repeats: z2 and y2 feed h2, producing d2 and a3, and so on]
Soft attention
Hard attention
Visual Question Answering (VQA)
Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015 Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016 Figure from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.
Visual Question Answering: RNNs with Attention
Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016 Figures from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.
Visual Dialog: Conversations about images
Das et al, “Visual Dialog”, CVPR 2017 Figures from Das et al, copyright IEEE 2017. Reproduced with permission.
Visual Language Navigation: Go to the living room
Agent encodes instructions in language and uses an RNN to generate a series of movements as the visual input changes after each move.
Wang et al, “Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation”, CVPR 2018 Figures from Wang et al, copyright IEEE 2017. Reproduced with permission.
Image Captioning: Gender Bias
All images are CC0 public domain: dog
Burns et al. “Women also Snowboard: Overcoming Bias in Captioning Models” ECCV 2018 Figures from Burns et al, copyright 2018. Reproduced with permission.
Visual Question Answering: Dataset Bias
All images are CC0 public domain: dog
[diagram: an Image and a Question ("What is the dog playing with?") feed the Model, which outputs an Answer ("Frisbee"); questions may be yes/no or open-ended]
Jabri et al. "Revisiting Visual Question Answering Baselines" ECCV 2016

Multilayer RNNs
[diagram: hidden states stacked in depth and unrolled in time; the hidden states of one RNN layer are the inputs to the layer above]
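A stacked RNN can be sketched by feeding each layer's hidden state to the layer above; toy dimensions, and the input is assumed already H-dimensional for brevity:

```python
import numpy as np

rng = np.random.default_rng(5)
H, depth, T = 4, 3, 5
Wx = [rng.standard_normal((H, H)) * 0.1 for _ in range(depth)]
Wh = [rng.standard_normal((H, H)) * 0.1 for _ in range(depth)]

hs = [np.zeros(H) for _ in range(depth)]      # one hidden state per layer
for t in range(T):
    inp = rng.standard_normal(H)              # x_t (hypothetical input)
    for l in range(depth):
        hs[l] = np.tanh(Wx[l] @ inp + Wh[l] @ hs[l])
        inp = hs[l]                           # layer l's state feeds layer l+1
```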
Long Short Term Memory (LSTM)
[figure: the vanilla RNN update formula side by side with the LSTM update formulas]
Hochreiter and Schmidhuber, "Long Short Term Memory", Neural Computation 1997

Vanilla RNN Gradient Flow

Bengio et al, "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, 1994. Pascanu et al, "On the difficulty of training recurrent neural networks", ICML 2013.
[diagram: one vanilla RNN cell: ht-1 and xt are stacked, multiplied by W, and passed through tanh to produce ht and the output yt]
Backpropagation from ht to ht-1 multiplies by W (actually Whh^T).
Gradients over multiple time steps:

[diagram: RNN unrolled over four time steps (x1..x4, h0..h4, y1..y4); backpropagating from h4 to h0 repeatedly multiplies by Whh^T and the tanh derivative at every step]
The tanh gradient factors are almost always < 1: vanishing gradients.
What if we assumed no non-linearity? Then the backward pass multiplies by the same matrix at every time step.
Largest singular value > 1: Exploding gradients
Largest singular value < 1: Vanishing gradients
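This can be checked numerically. Ignoring the nonlinearity, the backpropagated factor is a repeated product of the same matrix, and its norm collapses or blows up with the largest singular value; a toy sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((4, 4))
W = W / np.linalg.svd(W, compute_uv=False)[0]   # largest singular value = 1

norms = {}
for scale in (0.5, 1.5):          # largest singular value < 1 vs > 1
    g = np.eye(4)                 # stand-in for the backpropagated factor
    for _ in range(50):           # 50 time steps of repeated multiplication
        g = g @ (scale * W)
    norms[scale] = np.linalg.norm(g)
# norms[0.5] collapses toward zero (vanishing); norms[1.5] is exactly
# 3**50 times larger, since (1.5 W)^50 = 3^50 (0.5 W)^50.
```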
Exploding gradients (largest singular value > 1): use gradient clipping, i.e. scale the gradient if its norm is too big.
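Gradient clipping is only a few lines; the max-norm threshold here is a hypothetical value:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad to have L2 norm max_norm if it currently exceeds it."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50, too big
g = clip_gradient(g)         # rescaled to norm 5, direction preserved
```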
Vanishing gradients (largest singular value < 1): change the RNN architecture.
Long Short Term Memory (LSTM)
Long Short Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997]

i: Input gate, whether to write to cell
f: Forget gate, whether to erase cell
o: Output gate, how much to reveal cell
g: Gate gate (?), how much to write to cell
The stacked vector [ht-1; xt] (size 2h) is multiplied by W (size 4h x 2h); the four h-sized chunks pass through sigmoid (for i, f, o) and tanh (for g):

ct = f ☉ ct-1 + i ☉ g
ht = o ☉ tanh(ct)
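The four gates and the cell update can be sketched in NumPy; the dimensions and the single stacked weight matrix W are toy assumptions (biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step. W maps [h_prev; x] to the stacked i, f, o, g pre-activations."""
    a = W @ np.concatenate([h_prev, x])   # shape (4H,)
    H = h_prev.shape[0]
    i = sigmoid(a[0*H:1*H])               # input gate: whether to write to cell
    f = sigmoid(a[1*H:2*H])               # forget gate: whether to erase cell
    o = sigmoid(a[2*H:3*H])               # output gate: how much to reveal cell
    g = np.tanh(a[3*H:4*H])               # candidate values to write
    c = f * c_prev + i * g                # elementwise cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(7)
H, D = 4, 3
W = rng.standard_normal((4 * H, H + D)) * 0.1
h, c = np.zeros(H), np.zeros(H)
for x in [rng.standard_normal(D) for _ in range(5)]:
    h, c = lstm_step(x, h, c, W)
```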
[diagram: LSTM cell: ct-1 is multiplied elementwise by f and added to i ☉ g to give ct; ht = o ☉ tanh(ct)]
Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]
Backpropagation from ct to ct-1 is only an elementwise multiplication by f; there is no matrix multiply by W.
Uninterrupted gradient flow! [diagram: the cell states c0 -> c1 -> c2 -> c3 form an additive path that gradients flow through]
Notice that the gradient contains the f gate's vector of activations, which allows better control of gradient values via suitable parameter updates of the forget gate. Also notice that gradients are added through the f, i, g, and o gates, giving better balancing of gradient values.
Do LSTMs solve the vanishing gradient problem?
The LSTM architecture makes it easier for the RNN to preserve information over many timesteps:
- e.g. if f = 1 and i = 0, then the information in that cell is preserved indefinitely.
- By contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix Wh that preserves information in the hidden state.
LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
Uninterrupted gradient flow through the cell state, similar to ResNet! [figure: the ResNet architecture, whose additive skip connections likewise give uninterrupted gradient flow]
In between: Highway Networks (Srivastava et al, "Highway Networks", ICML DL Workshop 2015).
Neural Architecture Search for RNN architectures
[figure: the standard LSTM cell vs the cell found by NAS]
Zoph and Le, "Neural Architecture Search with Reinforcement Learning", ICLR 2017. Figures copyright Zoph et al, 2017. Reproduced with permission.

Other RNN Variants

GRU [Learning phrase representations using rnn encoder-decoder for statistical machine translation, Cho et al. 2014]
[An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015]
[LSTM: A Search Space Odyssey, Greff et al., 2015]
Recently in Natural Language Processing… New paradigms for reasoning over sequences ["Attention is all you need", Vaswani et al., 2017]
- New “Transformer” architecture no longer processes inputs sequentially; instead it can operate over inputs in a sequence in parallel through an attention mechanism
- Has led to many state-of-the-art results and to pre-training in NLP; for more, see e.g.:
  - "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", Devlin et al., 2018
  - OpenAI GPT-2, Radford et al., 2019
Transformers for Vision
- LSTM is a good default choice
- Use variants like GRU if you want faster compute and fewer parameters
- Use transformers (not covered in this lecture) as they are dominating NLP models
- We need more work studying vision models in tandem with transformers
Su et al. "VL-BERT: Pre-training of generic visual-linguistic representations." ICLR 2020
Lu et al. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." NeurIPS 2019
Li et al. "VisualBERT: A simple and performant baseline for vision and language." arXiv 2019
Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don't work very well
- Common to use LSTM or GRU: their additive interactions improve gradient flow
- Backward flow of gradients in an RNN can explode or vanish: exploding is controlled with gradient clipping, vanishing with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research, as are new paradigms for reasoning over sequences
- Better understanding (both theoretical and empirical) is needed
Next time: Midterm!