Lecture 4: Backpropagation and Neural Networks
Administrative

Assignment 1 due Thursday April 20, 11:59pm on Canvas.

Project: TA specialties and some project ideas are posted on Piazza.

Google Cloud: All registered students will receive an email this week with instructions on how to redeem $100 in credits.
Where we are...

scores function: $s = f(x; W) = W x$
SVM loss: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
data loss + regularization: $L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)$
want: $\nabla_W L$
Optimization

[Figure: iteratively walking downhill on a loss landscape. Landscape image is CC0 1.0 public domain; walking man image is CC0 1.0 public domain.]
Gradient descent

Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(
In practice: Derive analytic gradient, check your implementation with numerical gradient
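A minimal sketch of such a gradient check in numpy, assuming a function f that maps a parameter array to a scalar loss; the helper name and the centered-difference step h are our choices, not from the lecture:

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        """Centered-difference estimate of df/dx, one coordinate at a time."""
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = x[ix]
            x[ix] = old + h
            fxph = f(x)          # f(x + h)
            x[ix] = old - h
            fxmh = f(x)          # f(x - h)
            x[ix] = old          # restore
            grad[ix] = (fxph - fxmh) / (2 * h)
            it.iternext()
        return grad

    # usage: compare against your analytic gradient with a relative error
    # |ga - gn| / max(|ga|, |gn|); around 1e-7 or smaller is a pass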
Computational graphs

[Graph: inputs x and W feed a * node producing the scores s; s feeds a hinge loss node; the hinge loss and a regularization node R(W) feed a + node producing the total loss L.]
Convolutional network (AlexNet)

[Figure: the AlexNet computational graph, from the input image and weights to the loss.]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Neural Turing Machine

[Figure: the computational graph of a Neural Turing Machine, from input to loss; the graph becomes very large.]
Figure reproduced with permission from a Twitter post by Andrej Karpathy.
Backpropagation: a simple example

$f(x, y, z) = (x + y) z$
e.g. x = -2, y = 5, z = -4

Introduce the intermediate variable q = x + y, so f = qz. The local derivatives are $\frac{\partial q}{\partial x} = 1$, $\frac{\partial q}{\partial y} = 1$, $\frac{\partial f}{\partial q} = z$, $\frac{\partial f}{\partial z} = q$.

Want: $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$

Forward pass: q = x + y = 3, f = qz = -12.

Backward pass, starting from the output and working back:
- $\frac{\partial f}{\partial f} = 1$
- $\frac{\partial f}{\partial z} = q = 3$
- $\frac{\partial f}{\partial q} = z = -4$
- Chain rule: $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} = (-4)(1) = -4$
- Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} = (-4)(1) = -4$
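For concreteness, a small sketch of the same forward and backward passes in plain Python (variable names like dfdq are ours):

    # forward pass
    x, y, z = -2, 5, -4
    q = x + y        # q = 3
    f = q * z        # f = -12

    # backward pass in reverse order, applying the chain rule at each node
    dfdz = q                 # df/dz = q = 3
    dfdq = z                 # df/dq = z = -4
    dfdx = dfdq * 1.0        # dq/dx = 1  ->  df/dx = -4
    dfdy = dfdq * 1.0        # dq/dy = 1  ->  df/dy = -4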
“local gradient”

Each gate f in the graph sees only its local inputs x, y and its output z = f(x, y). Already during the forward pass it can compute its local gradients $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$. During the backward pass, the upstream gradient $\frac{\partial L}{\partial z}$ arrives from the loss end of the graph; the gate multiplies it by each local gradient (chain rule) and sends the resulting gradients $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial x}$ and $\frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial y}$ back along its input wires.
Another example:

$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
e.g. w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3

Forward pass: w0*x0 = -2 and w1*x1 = 6, so w0*x0 + w1*x1 + w2 = -2 + 6 - 3 = 1; negating gives -1; exp gives 0.37; adding 1 gives 1.37; taking 1/x gives 0.73.

Backward pass, each step computing [local gradient] x [upstream gradient]:
- 1/x gate: d(1/x)/dx = -1/x^2, so (-1/1.37^2) x (1) = -0.53
- +1 gate: local gradient 1, so (1) x (-0.53) = -0.53
- exp gate: d(e^x)/dx = e^x, so (e^-1) x (-0.53) = -0.20
- *(-1) gate: (-1) * (-0.20) = 0.20
- + gate: local gradient 1 toward each input, so [1] x [0.2] = 0.2 on each branch (both inputs!)
- * gates: each input's local gradient is the other input's value, so x0: [2] x [0.2] = 0.4 and w0: [-1] x [0.2] = -0.2
sigmoid function

$\sigma(x) = \frac{1}{1 + e^{-x}}$, whose derivative has the convenient form
$\frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\frac{1}{1 + e^{-x}}\right) = (1 - \sigma(x))\,\sigma(x)$

sigmoid gate: the last four gates in the example above compose into a single sigmoid gate, so the whole backward step through it collapses into one multiplication:
(0.73) * (1 - 0.73) = 0.2
Patterns in backward flow

- add gate: gradient distributor (passes the upstream gradient unchanged to every input)
- max gate: gradient router (routes the full upstream gradient to the input that was the max; the other inputs get 0)
- mul gate: gradient switcher (each input receives the upstream gradient scaled by the other input's value)

See the sketch after this list.
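A minimal sketch of the three backward rules, assuming scalar inputs x, y and an upstream gradient dz (the function names are ours, not from the slides):

    def add_backward(dz):
        return dz, dz                                 # distributor: both inputs get dz

    def max_backward(x, y, dz):
        return (dz, 0.0) if x >= y else (0.0, dz)     # router: the winner takes the gradient

    def mul_backward(x, y, dz):
        return y * dz, x * dz                         # switcher: each side gets the other's value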
Gradients add at branches

When a variable branches out to multiple nodes of the graph, the gradients flowing back along those branches are summed at the branch point (this is the multivariable chain rule).
Gradients for vectorized code (x, y, z are now vectors)

The “local gradient” is now the Jacobian matrix: the derivative of each element of z w.r.t. each element of x.
Vectorized operations

Example: f(x) = max(0, x), applied elementwise to a 4096-d input vector, producing a 4096-d output vector.

Q: What is the size of the Jacobian matrix? [4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\

Q2: What does it look like? Each output element depends only on the corresponding input element, so the Jacobian is diagonal: entry (i, i) is 1 where x_i > 0 and 0 elsewhere. We never form it explicitly; we just use this structure.
A vectorized example:

$f(x, W) = \|W \cdot x\|^2 = \sum_{i=1}^{n} (W \cdot x)_i^2$, with $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{n \times n}$.

Let q = W·x, so $f(q) = \|q\|^2 = q_1^2 + \dots + q_n^2$.

- $\frac{\partial f}{\partial q_i} = 2 q_i$, i.e. $\nabla_q f = 2q$.
- $\frac{\partial q_k}{\partial W_{i,j}} = \mathbf{1}_{k=i}\, x_j$, so $\frac{\partial f}{\partial W_{i,j}} = \sum_k 2 q_k \mathbf{1}_{k=i} x_j = 2 q_i x_j$, i.e. $\nabla_W f = 2 q \cdot x^T$.
- $\frac{\partial q_k}{\partial x_i} = W_{k,i}$, so $\frac{\partial f}{\partial x_i} = \sum_k 2 q_k W_{k,i}$, i.e. $\nabla_x f = 2 W^T \cdot q$.

Always check: The gradient with respect to a variable should have the same shape as the variable.
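A minimal numpy sketch of this example, with a one-entry numerical check (random data; all names are ours):

    import numpy as np

    n = 4
    W = np.random.randn(n, n)
    x = np.random.randn(n)

    q = W.dot(x)                # forward: q = W x
    f = np.sum(q ** 2)          # f = ||q||^2

    dq = 2.0 * q                # grad_q f = 2q
    dW = np.outer(dq, x)        # grad_W f = 2 q x^T, same shape as W
    dx = W.T.dot(dq)            # grad_x f = 2 W^T q, same shape as x

    # numerical check on one entry of W
    h = 1e-5
    Wp = W.copy(); Wp[0, 1] += h
    num = (np.sum(Wp.dot(x) ** 2) - f) / h
    assert abs(num - dW[0, 1]) < 1e-3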
Modularized implementation: forward / backward API

Graph (or Net) object (rough pseudo code): the graph runs the forward pass by calling forward() on every gate in topological order, and the backward pass by calling backward() on every gate in reverse order.
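The slide's pseudocode was shown as an image; the following Python-style sketch reconstructs its structure (method names such as nodes_topologically_sorted are assumptions, and loss / inputs_gradients stand for values produced by the gates):

    class ComputationalGraph(object):
        def forward(self, inputs):
            # 1. pass the inputs to the input gates
            # 2. run every gate forward, in topological order
            for gate in self.graph.nodes_topologically_sorted():
                gate.forward()
            return loss  # the final gate in the graph outputs the loss

        def backward(self):
            # run the gates in reverse order; each applies its
            # little piece of backprop (the chain rule)
            for gate in reversed(self.graph.nodes_topologically_sorted()):
                gate.backward()
            return inputs_gradients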
Example gate: z = x * y (x, y, z are scalars). The forward pass computes z and must cache its inputs; the backward pass uses the cached values: dx = y * dz and dy = x * dz.
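A sketch of such a multiply gate as a Python class, following the forward/backward API above (exact field names are ours):

    class MultiplyGate(object):
        def forward(self, x, y):
            z = x * y
            self.x = x  # must keep these around
            self.y = y  # to use them in backward!
            return z

        def backward(self, dz):
            dx = self.y * dz  # [dz/dx] * [dL/dz]
            dy = self.x * dz  # [dz/dy] * [dL/dz]
            return [dx, dy]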
Example: Caffe layers
Caffe is licensed under BSD 2-Clause
Caffe Sigmoid Layer

The forward pass computes the sigmoid elementwise; in the backward pass the local sigmoid derivative is multiplied by top_diff, the upstream gradient (chain rule).
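For illustration, a Python analogue of the same layer (Caffe itself implements this in C++; the bottom / top_diff naming mirrors Caffe's convention, the rest is our sketch):

    import numpy as np

    class SigmoidLayer(object):
        def forward(self, bottom):
            # sigmoid(x) = 1 / (1 + exp(-x)), computed elementwise
            self.top = 1.0 / (1.0 + np.exp(-bottom))
            return self.top

        def backward(self, top_diff):
            # local gradient (1 - s) * s, times the upstream gradient (chain rule)
            s = self.top
            return (1.0 - s) * s * top_diff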
In Assignment 1: Writing SVM / Softmax

Stage your forward/backward computation! E.g. for the SVM: compute the scores, then the margins, then the loss in the forward pass, and backprop through each stage in reverse.
Summary so far...

● neural nets will be very large: impractical to write down gradient formula by hand for all parameters
● backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement the forward() / backward() API
● forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
Next: Neural Networks
Neural networks: without the brain stuff

(Before) Linear score function: $f = W x$
(Now) 2-layer Neural Network: $f = W_2 \max(0, W_1 x)$

e.g. $x \in \mathbb{R}^{3072}$ is the input, $h = \max(0, W_1 x) \in \mathbb{R}^{100}$ is the hidden layer, and $s = W_2 h \in \mathbb{R}^{10}$ are the class scores: x (3072) → W1 → h (100) → W2 → s (10).

or 3-layer Neural Network: $f = W_3 \max(0, W_2 \max(0, W_1 x))$
Full implementation of training a 2-layer Neural Network needs ~20 lines:
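The code itself was shown as an image; the following is a reconstruction in the spirit of the slide, assuming a sigmoid hidden layer, an L2 loss, and random data (the exact sizes and learning rate are our choices):

    import numpy as np
    from numpy.random import randn

    # random training data and weights
    N, D_in, H, D_out = 64, 1000, 100, 10
    x, y = randn(N, D_in), randn(N, D_out)
    w1, w2 = randn(D_in, H), randn(H, D_out)

    for t in range(2000):
        # forward pass: sigmoid hidden layer, linear output, L2 loss
        h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
        y_pred = h.dot(w2)
        loss = np.square(y_pred - y).sum()  # optionally print to monitor training

        # backward pass: backprop through each stage in reverse
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h.T.dot(grad_y_pred)
        grad_h = grad_y_pred.dot(w2.T)
        grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid local gradient h(1-h)

        # gradient descent step
        w1 -= 1e-4 * grad_w1
        w2 -= 1e-4 * grad_w2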
In Assignment 2: Writing a 2-layer net
[Figure: photograph of biological neurons. This image by Fotis Bobolas is licensed under CC-BY 2.0.]
[Figure: diagram of a biological neuron. Impulses are carried toward the cell body by the dendrites; the cell body integrates them; impulses are carried away from the cell body along the axon to the presynaptic terminals. This image by Felipe Perucho is licensed under CC-BY 3.0.]

The artificial neuron mirrors this structure: each input $x_i$ arrives along an "axon" and is scaled by a synaptic weight $w_i$; the cell body sums $\sum_i w_i x_i + b$ and applies an activation function, e.g. the sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$; the result $f\left(\sum_i w_i x_i + b\right)$ travels along the output axon.
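On the slide this abstraction was also summarized as code; a reconstruction under the same assumptions (a single neuron with a sigmoid activation; self.weights and self.bias are assumed to be set elsewhere):

    import numpy as np

    class Neuron(object):
        # ... (weights and bias initialized elsewhere)
        def forward(self, inputs):
            """Assume inputs and self.weights are 1-D arrays, self.bias a number."""
            cell_body_sum = np.sum(inputs * self.weights) + self.bias
            firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))  # sigmoid activation
            return firing_rate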
Be very careful with your brain analogies!

Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system
● Rate code may not be adequate
[Dendritic Computation. London and Hausser]
Activation functions

- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- tanh: $\tanh(x)$
- ReLU: $\max(0, x)$
- Leaky ReLU: $\max(0.1x, x)$
- Maxout: $\max(w_1^T x + b_1,\; w_2^T x + b_2)$
- ELU: $x$ if $x > 0$, $\alpha(e^x - 1)$ if $x \le 0$
Neural networks: Architectures

A network with one hidden layer is called a “2-layer Neural Net” or “1-hidden-layer Neural Net”; with two hidden layers, a “3-layer Neural Net” or “2-hidden-layer Neural Net”. The layers are “fully-connected”: every neuron in one layer connects to every neuron in the next.
Example feed-forward computation of a neural network

We can efficiently evaluate an entire layer of neurons with a single matrix multiply, as in the sketch below.
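The slide's code (shown as an image) computed a full 3-layer forward pass; a reconstruction, with the weight shapes as assumptions:

    import numpy as np

    # sigmoid activation function
    f = lambda x: 1.0 / (1.0 + np.exp(-x))

    # assumed shapes: 3 inputs, two hidden layers of 4 units, 1 output
    x = np.random.randn(3, 1)                               # random input vector (3x1)
    W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
    W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
    W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

    h1 = f(np.dot(W1, x) + b1)    # first hidden layer activations (4x1)
    h2 = f(np.dot(W2, h1) + b2)   # second hidden layer activations (4x1)
    out = np.dot(W3, h2) + b3     # output neuron (1x1)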
Summary

- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks