Lecture 4: Neural Networks and Backpropagation
Fei-Fei Li & Justin Johnson & Serena Yeung — Lecture 4, April 11, 2019

Administrative: Assignment 1
Assignment 1 due Wednesday April 17, 11:59pm
If using Google Cloud, you don’t need GPUs for this assignment!
We will distribute Google Cloud coupons by this weekend
Administrative: Alternate Midterm Time
If you need to request an alternate midterm time:
See Piazza for form, fill it out by 4/25 (two weeks from today)
Administrative: Project Proposal
Project proposal due 4/24
Administrative: Discussion Section
Discussion section tomorrow (1:20pm in Gates B03):
How to pick a project / How to read a paper
Where we are...
Linear score function
SVM loss (or softmax)
data loss + regularization
How to find the best W?
Finding the best W: Optimize with Gradient Descent
Landscape image is CC0 1.0 public domain Walking man image is CC0 1.0 public domain
Gradient descent
Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(
In practice: Derive analytic gradient, check your implementation with numerical gradient
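Such a gradient check can be sketched in a few lines of numpy (a minimal sketch; the `num_grad` helper and the toy function are illustrative, not code from the lecture):

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    """Centered-difference numerical gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    for _ in it:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fp = f(x)   # f(x + h)
        x[i] = old - h; fm = f(x)   # f(x - h)
        x[i] = old                  # restore
        grad[i] = (fp - fm) / (2 * h)
    return grad

# Check a hand-derived analytic gradient against the numerical one
f = lambda x: np.sum(x ** 2)        # toy "loss"
x = np.random.randn(3, 4)
analytic = 2 * x                    # derived on paper
assert np.allclose(analytic, num_grad(f, x), atol=1e-4)
```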
Problem: Linear Classifiers are not very powerful
Visual viewpoint: linear classifiers learn only one template per class.
Geometric viewpoint: linear classifiers can only draw linear decision boundaries.
One Solution: Feature Transformation
f(x, y) = (r(x, y), θ(x, y))
Transform data with a cleverly chosen feature transform f, then apply linear classifier
Color Histogram Histogram of Oriented Gradients (HoG)
Image features vs ConvNets
[Figure: a feature extractor f producing features, followed by a trained classifier giving 10 numbers as scores for classes, vs. a ConvNet trained end-to-end to produce the 10 class scores]
Krizhevsky, Sutskever, and Hinton, “Imagenet classification with deep convolutional neural networks”, NIPS 2012. Figure copyright Krizhevsky, Sutskever, and Hinton, 2012. Reproduced with permission.
Today: Neural Networks
Neural networks: without the brain stuff

(Before) Linear score function: f = W x
(Now) 2-layer Neural Network: f = W2 max(0, W1 x), or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

“Neural Network” is a very broad term; these are more accurately called “fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)

(In practice we will usually add a learnable bias at each layer as well)
For the 2-layer network, the shapes are: x (3072 entries) → W1 → hidden layer h (100 entries) → W2 → scores s (10 entries).
The function max(0, ·) here is called the activation function. Q: What if we try to build a neural network without one?
A: We end up with a linear classifier again!
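A quick numerical check of this collapse (a sketch, using the 3072 → 100 → 10 shapes from above):

```python
import numpy as np

np.random.seed(0)
W1 = np.random.randn(100, 3072)   # first layer
W2 = np.random.randn(10, 100)     # second layer
x = np.random.randn(3072)

# With no activation between them, the two layers are a single linear map:
W = W2.dot(W1)                    # shape [10 x 3072]
assert np.allclose(W2.dot(W1.dot(x)), W.dot(x))
```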
Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU. ReLU is a good default choice for most problems.
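The elementwise activations can be written in a few lines of numpy (a sketch; Maxout is omitted since it takes the max over several affine maps rather than acting elementwise):

```python
import numpy as np

sigmoid    = lambda x: 1 / (1 + np.exp(-x))               # squashes to (0, 1)
tanh       = np.tanh                                      # squashes to (-1, 1)
relu       = lambda x: np.maximum(0, x)                   # good default choice
leaky_relu = lambda x, a=0.01: np.where(x > 0, x, a * x)  # small negative slope
elu        = lambda x, a=1.0: np.where(x > 0, x, a * (np.exp(x) - 1))
```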
Neural networks: Architectures
Left: a “2-layer Neural Net”, or “1-hidden-layer Neural Net”. Right: a “3-layer Neural Net”, or “2-hidden-layer Neural Net”. The layers are “fully-connected” layers.
Example feed-forward computation of a neural network
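Such a forward pass can be sketched for a 3-layer net with sigmoid activations (the layer sizes here are made up for illustration):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid)
x = np.random.randn(3, 1)                # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(W1.dot(x) + b1)    # first hidden layer (4x1)
h2 = f(W2.dot(h1) + b2)   # second hidden layer (4x1)
out = W3.dot(h2) + b3     # output neuron (1x1)
```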
Full implementation of training a 2-layer Neural Network needs ~20 lines:
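A sketch of what those ~20 lines can look like, with a sigmoid hidden layer and an L2 loss on random toy data (all sizes and the learning rate are illustrative choices, not the lecture's exact listing):

```python
import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10          # batch, input, hidden, output sizes
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1, w2 = np.random.randn(D_in, H), np.random.randn(H, D_out)

losses = []
for t in range(300):
    h = 1 / (1 + np.exp(-x.dot(w1)))           # forward: hidden layer (sigmoid)
    y_pred = h.dot(w2)                         # forward: predictions
    losses.append(np.square(y_pred - y).sum()) # L2 loss
    grad_y_pred = 2.0 * (y_pred - y)           # backward: loss
    grad_w2 = h.T.dot(grad_y_pred)             # backward: second layer
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))    # backward: sigmoid + first layer
    w1 -= 1e-4 * grad_w1                       # gradient descent step
    w2 -= 1e-4 * grad_w2
```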
Setting the number of layers and their sizes
More neurons = more capacity. Do not use the size of the neural network as a regularizer. Use stronger regularization instead:
(Web demo with ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)
This image by Fotis Bobolas is licensed under CC-BY 2.0
[Figure: a biological neuron — impulses are carried toward the cell body by the dendrites, and carried away from the cell body along the axon to the presynaptic terminals]
This image by Felipe Perucho is licensed under CC-BY 3.0
In the crude analogy, the cell body sums its inputs, and the neuron's firing rate is modeled by a sigmoid activation function applied to that sum.
Biological neurons: complex connectivity patterns. Neurons in a neural network: organized into regular layers for computational efficiency.
But neural networks with random connections can work too! Xie et al, “Exploring Randomly Wired Neural Networks for Image Recognition”, arXiv 2019
This image is CC0 Public Domain
Be very careful with your brain analogies!
Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system
● Rate code may not be adequate
[Dendritic Computation. London and Hausser]
Problem: How to compute gradients?
Nonlinear score function
SVM loss on predictions
Regularization
Total loss: data loss + regularization
If we can compute dL/dW1 and dL/dW2, then we can learn W1 and W2
(Bad) Idea: Derive the gradients on paper
Problem: Very tedious: lots of matrix calculus, need lots of paper
Problem: What if we want to change the loss? E.g. use softmax instead of SVM? Need to re-derive from scratch =(
Problem: Not feasible for very complex models!
Better Idea: Computational graphs + Backpropagation
[Computational graph: x and W feed a multiply (*) node producing the scores s; s feeds a hinge-loss node, W also feeds a regularization node R, and the two are added (+) into the total loss L]
Convolutional network (AlexNet): a much larger computational graph from the input image, through the weights, to the loss.
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Neural Turing Machine: an even more complex computational graph from input image to loss.
Figure reproduced with permission from a Twitter post by Andrej Karpathy.
Backpropagation: a simple example
f(x, y, z) = (x + y) z
e.g. x = -2, y = 5, z = -4
Want: df/dx, df/dy, df/dz

Introduce an intermediate variable q = x + y, so that f = q z. Forward pass: q = 3, f = -12.
Local gradients: df/dq = z = -4, df/dz = q = 3, dq/dx = 1, dq/dy = 1.
Chain rule: df/dx = (df/dq)(dq/dx) = -4 and df/dy = (df/dq)(dq/dy) = -4; in each case the result is [upstream gradient] × [local gradient].
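This can be checked in a few lines of Python (a sketch, assuming the function f(x, y, z) = (x + y)z with the values x = -2, y = 5, z = -4):

```python
# forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y              # q = 3
f = q * z              # f = -12

# backward pass: apply the chain rule in reverse order
df_dz = q              # local gradient of f = q*z w.r.t. z  -> 3
df_dq = z              # local gradient w.r.t. q             -> -4
df_dx = df_dq * 1.0    # dq/dx = 1, so upstream * local      -> -4
df_dy = df_dq * 1.0    # dq/dy = 1                           -> -4
```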
During the backward pass, each node f in the graph knows its “local gradients”: the derivatives of its output with respect to its inputs. When the “upstream gradient” (the gradient of the loss with respect to the node's output) arrives from the output side, the node multiplies it by each local gradient to produce the “downstream gradients”, which it passes back to its inputs.
Another example: f(w, x) = 1 / (1 + e^-(w0·x0 + w1·x1 + w2))
Walking backward through the graph, each step is [upstream gradient] × [local gradient]. At an add gate: [0.2] × [1] = 0.2 (for both inputs!). At the multiply gates: x0: [0.2] × [2] = 0.4, w0: [0.2] × [-1] = -0.2.
Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed! Here, the entire sigmoid function can be a single node.
Sigmoid local gradient: dσ(x)/dx = (1 − σ(x)) σ(x)
So at the sigmoid node: [upstream gradient] × [local gradient] = [1.00] × [(1 - 0.73) (0.73)] = 0.2
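The local gradient formula is easy to verify numerically (a sketch; x = 1 is an assumed input chosen so that the forward value is σ(1) ≈ 0.73):

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = 1.0
s = sigmoid(x)                 # ~0.73
analytic = (1 - s) * s         # sigmoid local gradient, ~0.2

h = 1e-6                       # centered numerical derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert abs(analytic - numeric) < 1e-8
```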
Patterns in gradient flow
add gate: gradient distributor. Inputs 3 and 4 give output 7; an upstream gradient of 2 flows unchanged to both inputs (2 and 2).
mul gate: “swap multiplier”. Inputs 2 and 3 give output 6; an upstream gradient of 5 becomes 5*3=15 on the first input and 2*5=10 on the second — each input receives the upstream gradient times the other input.
copy gate: gradient adder. An input of 7 is copied to two outputs (7 and 7); upstream gradients 4 and 2 sum to 4+2=6 on the input.
max gate: gradient router. Inputs 4 and 5 give output 5 (the max); an upstream gradient of 9 is routed entirely to the larger input (9 to the 5, 0 to the 4).
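These four rules can be written directly as backward functions (a sketch; the numbers in the comments are the worked values above):

```python
def add_backward(upstream):
    # add gate distributes the gradient: upstream 2 -> (2, 2)
    return upstream, upstream

def mul_backward(x, y, upstream):
    # mul gate swaps multipliers: inputs 2, 3 with upstream 5 -> (15, 10)
    return y * upstream, x * upstream

def copy_backward(up1, up2):
    # copy gate adds the gradients from its two uses: 4 + 2 -> 6
    return up1 + up2

def max_backward(x, y, upstream):
    # max gate routes the gradient to the larger input: inputs 4, 5, upstream 9 -> (0, 9)
    return (upstream, 0.0) if x > y else (0.0, upstream)
```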
Backprop Implementation: “Flat” code
Forward pass: compute the output, keeping every intermediate value.
Backward pass: compute the grads by walking the forward code in reverse — start from the base case (the gradient of the output with respect to itself is 1), then apply the sigmoid, add-gate, and multiply-gate rules line by line.
“Flat” Backprop: Do this for assignment 1! Stage your forward/backward computation! E.g. for the SVM: compute the scores, then the margins, then the loss in the forward pass, and backprop through those stages in reverse. Same idea for a two-layer neural net.
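As a sketch of the flat style on the sigmoid example from earlier (values chosen so the forward output matches the 0.73 and the gradients 0.4 / -0.2 above; w1, x1, w2 are assumed, and all variable names here are illustrative):

```python
import numpy as np

# assumed inputs for the sigmoid example
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# forward pass: compute the output, keeping every intermediate
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
L = 1.0 / (1.0 + np.exp(-s3))    # sigmoid output, ~0.73

# backward pass: walk the forward code in reverse
grad_L = 1.0                     # base case
grad_s3 = grad_L * (1 - L) * L   # sigmoid gate, ~0.2
grad_s2 = grad_s3                # add gate
grad_w2 = grad_s3
grad_s0 = grad_s2                # add gate
grad_s1 = grad_s2
grad_w0 = grad_s0 * x0           # multiply gate: ~0.2 * -1 = -0.2
grad_x0 = grad_s0 * w0           # ~0.2 *  2 =  0.4
grad_w1 = grad_s1 * x1
grad_x1 = grad_s1 * w1
```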
Backprop Implementation: Modularized API
Graph (or Net) object (rough pseudo code)
Modularized implementation: forward / backward API
Gate / Node / Function object (actual PyTorch code follows this pattern): for a multiply gate z = x * y (x, y, z are scalars), the forward needs to stash some values for use in backward; the backward multiplies the upstream gradient by the local gradients.
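A minimal runnable version of this API might look like (a sketch; the forward/backward method names follow the slide's pseudocode, everything else is illustrative):

```python
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y    # stash values for use in backward
        return x * y

    def backward(self, dz):      # dz: upstream gradient dL/dz
        dx = self.y * dz         # [local dz/dx] * [upstream]
        dy = self.x * dz         # [local dz/dy] * [upstream]
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)      # z = -12.0
dx, dy = gate.backward(2.0)      # dx = -8.0, dy = 6.0
```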
Example: PyTorch operators
PyTorch sigmoid layer
Forward: compute the sigmoid of the input (the forward is actually defined elsewhere in the source).
Backward: multiply the upstream gradient by the sigmoid local gradient, reusing the output saved in the forward pass.
Source
So far: backprop with scalars
What about vector-valued functions?
Recap: Vector derivatives
Scalar to Scalar — regular derivative: if x changes by a small amount, how much will y change?
Vector to Scalar — derivative is the Gradient: for each element of x, if it changes by a small amount then how much will y change?
Vector to Vector — derivative is the Jacobian: for each element of x, if it changes by a small amount then how much will each element of y change?
Backprop with Vectors
Loss L is still a scalar! A node f takes input vectors x (dimension Dx) and y (dimension Dy) and produces an output vector z (dimension Dz). The “upstream gradient” dL/dz has dimension Dz: for each element of z, how much does it influence L? The “local gradients” are Jacobian matrices: dz/dx is [Dx × Dz] and dz/dy is [Dy × Dz]. The “downstream gradients” are matrix-vector multiplies of a local Jacobian with the upstream gradient, giving dL/dx (dimension Dx) and dL/dy (dimension Dy).
4D input x: 4D output y: [ 1 ] [ 1 ] [ -2 ] f(x) = max(0,x) [ 0 ] [ 3 ] (elementwise) [ 3 ] [ -1 ] [ 0 ]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 107 April 11, 2019 Backprop with Vectors
4D input x: 4D output y: [ 1 ] [ 1 ] [ -2 ] f(x) = max(0,x) [ 0 ] [ 3 ] (elementwise) [ 3 ] [ -1 ] [ 0 ]
4D dL/dy: [ 4 ] [ -1 ] Upstream [ 5 ] gradient [ 9 ]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 108 April 11, 2019 Backprop with Vectors
4D input x: 4D output y: [ 1 ] [ 1 ] [ -2 ] f(x) = max(0,x) [ 0 ] [ 3 ] (elementwise) [ 3 ] [ -1 ] [ 0 ]
Jacobian dy/dx 4D dL/dy: [ 1 0 0 0 ] [ 4 ] [ 0 0 0 0 ] [ -1 ] Upstream [ 0 0 1 0 ] [ 5 ] gradient [ 0 0 0 0 ] [ 9 ]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 109 April 11, 2019 Backprop with Vectors
4D input x: 4D output y: [ 1 ] [ 1 ] [ -2 ] f(x) = max(0,x) [ 0 ] [ 3 ] (elementwise) [ 3 ] [ -1 ] [ 0 ]
[dy/dx] [dL/dy] 4D dL/dy: [ 1 0 0 0 ] [ 4 ] [ 4 ] [ 0 0 0 0 ] [ -1 ] [ -1 ] Upstream [ 0 0 1 0 ] [ 5 ] [ 5 ] gradient [ 0 0 0 0 ] [ 9 ] [ 9 ]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 110 April 11, 2019 Backprop with Vectors
4D input x: 4D output y: [ 1 ] [ 1 ] [ -2 ] f(x) = max(0,x) [ 0 ] [ 3 ] (elementwise) [ 3 ] [ -1 ] [ 0 ]
4D dL/dx: [dy/dx] [dL/dy] 4D dL/dy: [ 4 ] [ 1 0 0 0 ] [ 4 ] [ 4 ] [ 0 ] [ 0 0 0 0 ] [ -1 ] [ -1 ] Upstream [ 5 ] [ 0 0 1 0 ] [ 5 ] [ 5 ] gradient [ 0 ] [ 0 0 0 0 ] [ 9 ] [ 9 ]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 111 April 11, 2019 Backprop with Vectors
4D input x: 4D output y: [ 1 ] [ 1 ] [ -2 ] f(x) = max(0,x) [ 0 ] Jacobian is sparse: [ 3 ] [ 3 ] off-diagonal entries (elementwise) always zero! Never [ -1 ] [ 0 ] explicitly form Jacobian -- instead use implicit 4D dL/dx: [dy/dx] [dL/dy] 4D dL/dy: multiplication [ 4 ] [ 1 0 0 0 ] [ 4 ] [ 4 ] [ 0 ] [ 0 0 0 0 ] [ -1 ] [ -1 ] Upstream [ 5 ] [ 0 0 1 0 ] [ 5 ] [ 5 ] gradient [ 0 ] [ 0 0 0 0 ] [ 9 ] [ 9 ]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - 112 April 11, 2019 Backprop with Vectors
4D input x: 4D output y: [ 1 ] [ 1 ] [ -2 ] f(x) = max(0,x) [ 0 ] Jacobian is sparse: [ 3 ] [ 3 ] off-diagonal entries (elementwise) always zero! Never [ -1 ] [ 0 ] explicitly form Jacobian -- instead use implicit 4D dL/dx: [dy/dx] [dL/dy] 4D dL/dy: multiplication [ 4 ] [ 4 ] [ 0 ] [ -1 ] Upstream [ 5 ] [ 5 ] gradient [ 0 ] [ 9 ]
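The implicit multiplication for this ReLU example is a one-liner, and agrees with explicitly forming the Jacobian (a numpy sketch with the numbers above):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
upstream = np.array([4.0, -1.0, 5.0, 9.0])    # dL/dy

# explicit (never do this): build the sparse Jacobian and multiply
J = np.diag((x > 0).astype(float))
explicit = J.dot(upstream)

# implicit: just zero the upstream gradient where x <= 0
implicit = upstream * (x > 0)

assert np.allclose(explicit, implicit)         # both give [4, 0, 5, 0]
```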
Backprop with Matrices (or Tensors): Loss L still a scalar!
dL/dx always has the same shape as x! With inputs x of shape [Dx×Mx] and y of shape [Dy×My], and an output z of shape [Dz×Mz], the upstream gradient dL/dz is [Dz×Mz] (for each element of z, how much does it influence L?). The “local gradients” are now generalized Jacobians of shape [(Dx×Mx)×(Dz×Mz)] and [(Dy×My)×(Dz×Mz)] (for each element of y, how much does it influence each element of z?), and the downstream gradients are again a (generalized) matrix-vector multiply of the local Jacobian with the upstream gradient.
Backprop with Matrices: Matrix Multiply y = x w
x: [N×D] =
[ 2 1 -3 ]
[ -3 4 2 ]
w: [D×M] =
[ 3 2 1 -1 ]
[ 2 1 3 2 ]
[ 3 2 1 -2 ]
y: [N×M] =
[ 13 9 -2 -6 ]
[ 5 2 17 1 ]
dL/dy: [N×M] =
[ 2 3 -3 9 ]
[ -8 1 4 6 ]
Jacobians: dy/dx is [(N×D)×(N×M)] and dy/dw is [(D×M)×(N×M)]. For a neural net we may have N=64, D=M=4096: each Jacobian takes 256 GB of memory! Must work with them implicitly!
Q: What parts of y are affected by one element of x? A: x[n,d] affects the whole row y[n,:].
Q: How much does x[n,d] affect y[n,m]? A: by w[d,m].
So dL/dx = (dL/dy) wᵀ, with shapes [N×D] = [N×M][M×D]. By similar logic, dL/dw = xᵀ (dL/dy), with shapes [D×M] = [D×N][N×M]. These formulas are easy to remember: they are the only way to make the shapes match up!
Also see derivation in the course notes: http://cs231n.stanford.edu/handouts/linear-backprop.pdf
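The two formulas can be verified numerically (a sketch; the sizes and the toy loss L = Σ c·y are illustrative, chosen so that dL/dy = c is known exactly):

```python
import numpy as np

np.random.seed(0)
N, D, M = 4, 3, 5
x = np.random.randn(N, D)
w = np.random.randn(D, M)
c = np.random.randn(N, M)      # toy loss L = sum(c * y), so dL/dy = c

dL_dy = c
dL_dx = dL_dy.dot(w.T)         # [N x D] = [N x M][M x D]
dL_dw = x.T.dot(dL_dy)         # [D x M] = [D x N][N x M]

# numerical check on one entry of x
h = 1e-6
xp = x.copy(); xp[0, 0] += h
num = ((c * xp.dot(w)).sum() - (c * x.dot(w)).sum()) / h
assert abs(num - dL_dx[0, 0]) < 1e-4
```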
Summary for today:
● (Fully-connected) Neural Networks are stacks of linear functions and nonlinear activation functions; they have much more representational power than linear classifiers
● backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement the forward() / backward() API
● forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
Next Time: Convolutional Networks!
A vectorized example (worked through on the slides):
Always check: the gradient with respect to a variable should have the same shape as the variable.
In discussion section: A matrix example...