
Lecture 4: Neural Networks and Backpropagation

Administrative: Assignment 1

Assignment 1 due Wednesday April 17, 11:59pm

If using Google Cloud, you don’t need GPUs for this assignment!

We will distribute Google Cloud coupons by this weekend

Administrative: Alternate Midterm Time

If you need to request an alternate midterm time:

See Piazza for form, fill it out by 4/25 (two weeks from today)

Administrative: Project Proposal

Project proposal due 4/24

Administrative: Discussion Section

Discussion section tomorrow (1:20pm in Gates B03):

How to pick a project / How to read a paper

Where we are...

Linear score function

SVM loss (or softmax)

data loss + regularization

How to find the best W?

Finding the best W: Optimize with Gradient Descent

Landscape image is CC0 1.0 public domain. Walking man image is CC0 1.0 public domain.


Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive analytic gradient, check your implementation with numerical gradient
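A minimal numpy sketch of this gradient-check pattern (a hypothetical helper, not the assignment's starter code): compare the analytic gradient to a centered-difference numerical gradient at a few random coordinates.

```python
import numpy as np

def grad_check(f, x, analytic_grad, num_checks=10, h=1e-5):
    # Compare the analytic gradient to a centered-difference numerical
    # gradient at a few randomly chosen coordinates of x.
    for _ in range(num_checks):
        ix = tuple(np.random.randint(d) for d in x.shape)
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)            # f(x + h)
        x[ix] = old - h
        fxmh = f(x)            # f(x - h)
        x[ix] = old            # restore
        grad_numerical = (fxph - fxmh) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numerical - grad_analytic) / \
            max(abs(grad_numerical) + abs(grad_analytic), 1e-12)
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numerical, grad_analytic, rel_error))
```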

Problem: Linear Classifiers are not very powerful

Visual viewpoint: linear classifiers learn one template per class.
Geometric viewpoint: linear classifiers can only draw linear decision boundaries.

One Solution: Feature Transformation

f(x, y) = (r(x, y), θ(x, y))

Transform data with a cleverly chosen feature transform f, then apply a linear classifier

Examples: Color Histogram; Histogram of Oriented Gradients (HoG)
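For the polar transform f(x, y) = (r(x, y), θ(x, y)) above, a minimal numpy sketch (hypothetical helper): points that are not linearly separable in (x, y) can become separable in (r, θ).

```python
import numpy as np

def polar_features(points):
    # points: array of shape [N, 2] with Cartesian (x, y) coordinates.
    x, y = points[:, 0], points[:, 1]
    r = np.sqrt(x ** 2 + y ** 2)         # r(x, y)
    theta = np.arctan2(y, x)             # theta(x, y)
    return np.stack([r, theta], axis=1)  # transformed features (r, theta)
```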

Image features vs ConvNets
[Figure: feature extraction f followed by a classifier producing 10 numbers giving scores for classes, compared with a ConvNet whose whole pipeline is trained, also producing 10 numbers giving scores for classes]

Krizhevsky, Sutskever, and Hinton, “Imagenet classification with deep convolutional neural networks”, NIPS 2012. Figure copyright Krizhevsky, Sutskever, and Hinton, 2012. Reproduced with permission.

Today: Neural Networks

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx

(Now) 2-layer Neural Network: f = W2 max(0, W1x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1x))

(In practice we will usually add a learnable bias at each layer as well)

“Neural Network” is a very broad term; these are more accurately called “fully-connected networks” or sometimes “multi-layer perceptrons” (MLP).

x → W1 → h → W2 → s
3072 → 100 → 10
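A minimal numpy sketch of the forward pass with these dimensions (the weight scale and the flattened 32×32×3 image interpretation of 3072 are assumptions; biases are omitted here, although in practice we usually add them):

```python
import numpy as np

D, H, C = 3072, 100, 10                 # input dim, hidden dim, num classes
x = np.random.randn(D)                  # e.g. a flattened 32x32x3 image
W1 = 0.01 * np.random.randn(H, D)
W2 = 0.01 * np.random.randn(C, H)

h = np.maximum(0, W1 @ x)               # hidden layer: max(0, W1 x)
s = W2 @ h                              # class scores: W2 max(0, W1 x)
```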

The function max(0, z) is called the activation function.
Q: What if we try to build a neural network without one?

A: We end up with a linear classifier again!

Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

ReLU is a good default choice for most problems
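For concreteness, the elementwise activations as short numpy functions (the Leaky ReLU and ELU slopes shown are typical defaults, assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# Maxout is not elementwise: it takes the max over several linear
# functions of the input, e.g. max(w1.x + b1, w2.x + b2).
```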

Neural networks: Architectures

“2-layer Neural Net”, or “1-hidden-layer Neural Net”; “3-layer Neural Net”, or “2-hidden-layer Neural Net”. These use “fully-connected” layers.

Example feed-forward computation of a neural network
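The slide's code is not reproduced in this transcript; the sketch below shows what a feed-forward pass for a small 3-layer fully-connected network looks like (the layer sizes, random data, and sigmoid activation are assumptions for illustration):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))          # activation function (sigmoid)
x = np.random.randn(3, 1)                       # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(W1 @ x + b1)                             # first hidden layer (4x1)
h2 = f(W2 @ h1 + b2)                            # second hidden layer (4x1)
out = W3 @ h2 + b3                              # output (1x1)
```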

Full implementation of training a 2-layer Neural Network needs ~20 lines:
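The ~20 lines themselves are not reproduced in the transcript; the sketch below follows the usual pattern (random toy data, a sigmoid hidden layer, L2 loss, vanilla gradient descent; the particular sizes and learning rate are assumptions):

```python
import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)          # random toy data
w1, w2 = randn(D_in, H), randn(H, D_out)        # weight init

for t in range(2000):
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))        # hidden layer (sigmoid)
    y_pred = h.dot(w2)                          # predictions
    loss = np.square(y_pred - y).sum()          # L2 loss

    # backward pass: backprop the loss to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))     # sigmoid local gradient

    # gradient descent step
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```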

Setting the number of layers and their sizes

more neurons = more capacity

Do not use size of neural network as a regularizer. Use stronger regularization instead:

(Web demo with ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

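Concretely, “stronger regularization” above means a larger L2 regularization strength λ rather than a smaller network; a minimal sketch (hypothetical helper; W1 and W2 are the network weights):

```python
import numpy as np

def total_loss(data_loss, W1, W2, reg):
    # reg is the L2 regularization strength (lambda)
    reg_loss = reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
    return data_loss + reg_loss     # data loss + regularization
```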

This image by Fotis Bobolas is licensed under CC-BY 2.0

[Figure: a biological neuron — dendrites, cell body, axon, presynaptic terminal. Impulses are carried toward the cell body by the dendrites and carried away from the cell body along the axon.]
This image by Felipe Perucho is licensed under CC-BY 3.0


In the corresponding mathematical model, the neuron's firing rate is modeled with a sigmoid activation function.


Biological neurons: complex connectivity patterns. Neurons in a neural network: organized into regular layers for computational efficiency.

This image is CC0 Public Domain

Biological neurons have complex connectivity patterns, but neural networks with random connections can work too!

This image is CC0 Public Domain Xie et al, “Exploring Randomly Wired Neural Networks for Image Recognition”, arXiv 2019

Be very careful with your brain analogies!

Biological Neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system
● Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Problem: How to compute gradients?

Nonlinear score function: s = W2 max(0, W1x)
SVM loss on predictions: L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + 1)
Regularization: R(W) = Σ_k W_k²
Total loss: data loss + regularization: L = (1/N) Σ_i L_i + λR(W1) + λR(W2)

If we can compute ∂L/∂W1 and ∂L/∂W2 then we can learn W1 and W2

(Bad) Idea: Derive ∂L/∂W1 and ∂L/∂W2 on paper
Problem: Very tedious: lots of matrix calculus, need lots of paper
Problem: What if we want to change the loss? E.g. use softmax instead of SVM? Need to re-derive from scratch =(
Problem: Not feasible for very complex models!

Better Idea: Computational graphs + Backpropagation

[Computational graph: inputs x and W feed a multiply (*) node producing s (scores); s feeds the hinge loss; W also feeds the regularization term R; hinge loss and R are summed (+) into the total loss L]

Convolutional network (AlexNet)

[Figure: computational graph of a convolutional network, from the input image and weights to the loss]

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.


Neural Turing Machine
[Figure: computational graph of a Neural Turing Machine, from input image to loss]
Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Backpropagation: a simple example

f(x, y, z) = (x + y)z
e.g. x = -2, y = 5, z = -4

Want: ∂f/∂x, ∂f/∂y, ∂f/∂z

Strategy: work backwards through the graph, applying the chain rule at each node:
[downstream gradient] = [upstream gradient] × [local gradient]
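The values below fill in the arithmetic for this example, assuming the function f(x, y, z) = (x + y)z stated above:

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y                # q = 3
f = q * z                # f = -12

# backward pass: chain rule, from the output back to the inputs
df_df = 1.0              # base case
df_dz = df_df * q        # local gradient of f = q*z w.r.t. z is q ->  3
df_dq = df_df * z        # local gradient w.r.t. q is z            -> -4
df_dx = df_dq * 1.0      # q = x + y: local gradient is 1          -> -4
df_dy = df_dq * 1.0      #                                         -> -4
```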

During backprop, each node f takes inputs (say x, y) and produces an output z. It receives an “upstream gradient” ∂L/∂z telling it how z affects the final loss, multiplies it by its “local gradients” ∂z/∂x and ∂z/∂y, and passes the resulting “downstream gradients” ∂L/∂x and ∂L/∂y back to its inputs.

Another example: f(w, x) = 1 / (1 + e^(−(w0·x0 + w1·x1 + w2)))

At each node, [downstream gradient] = [upstream gradient] × [local gradient]. At the + node the local gradient is 1 for both inputs, so an upstream gradient of 0.2 gives [0.2] × [1] = 0.2 for both inputs. At the * node for w0·x0, x0 gets [0.2] × [2] = 0.4 and w0 gets [0.2] × [−1] = −0.2.

Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!

Sigmoid function: σ(x) = 1 / (1 + e^(−x))
Sigmoid local gradient: dσ(x)/dx = (1 − σ(x)) σ(x)

Collapsing all of these gates into a single sigmoid gate: [upstream gradient] × [local gradient] = [1.00] × [(1 − 0.73)(0.73)] = 0.2

Patterns in gradient flow

add gate: gradient distributor. E.g. forward: 3 + 4 = 7; an upstream gradient of 2 flows unchanged to both inputs (2 and 2).
mul gate: “swap multiplier”. E.g. forward: 2 × 3 = 6; with upstream gradient 5, the input 2 receives 5·3 = 15 and the input 3 receives 2·5 = 10.
copy gate: gradient adder. One input (7) is copied to two outputs; the upstream gradients 4 and 2 add up: 4 + 2 = 6.
max gate: gradient router. E.g. forward: max(4, 5) = 5; an upstream gradient of 9 is routed entirely to the larger input (5), and the other input (4) gets gradient 0.
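As a compact sketch, these four backward rules can be written as plain Python functions (hypothetical helpers, checked against the numbers from the slide):

```python
def add_backward(upstream):
    return upstream, upstream                    # gradient distributor

def mul_backward(upstream, x, y):
    return upstream * y, upstream * x            # "swap multiplier"

def copy_backward(upstream1, upstream2):
    return upstream1 + upstream2                 # gradient adder

def max_backward(upstream, x, y):
    return (upstream, 0.0) if x > y else (0.0, upstream)   # gradient router

# The mul gate from the slide: inputs 2 and 3, upstream gradient 5
print(mul_backward(5.0, 2.0, 3.0))               # (15.0, 10.0)
# The max gate from the slide: inputs 4 and 5, upstream gradient 9
print(max_backward(9.0, 4.0, 5.0))               # (0.0, 9.0)
```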

Backprop Implementation: “Flat” code

Forward pass: compute the output, one intermediate value at a time.
Backward pass: compute the grads by revisiting the same steps in reverse order — start from the base case (the gradient of the output with respect to itself is 1), then backprop through the sigmoid, the add gates, and the multiply gates.
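A minimal runnable sketch of such flat code for the sigmoid example above (the concrete input values are assumptions taken from the running example):

```python
import numpy as np

# inputs (values assumed from the running example)
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# forward pass: compute output
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
L = 1.0 / (1.0 + np.exp(-s3))        # sigmoid, ~0.73

# backward pass: compute grads
grad_L = 1.0                         # base case
grad_s3 = grad_L * (1 - L) * L       # sigmoid local gradient, ~0.2
grad_w2 = grad_s3                    # add gate distributes
grad_s2 = grad_s3
grad_s0 = grad_s2                    # add gate distributes
grad_s1 = grad_s2
grad_w0 = grad_s0 * x0               # multiply gate swaps: ~ -0.2
grad_x0 = grad_s0 * w0               #                      ~  0.4
grad_w1 = grad_s1 * x1
grad_x1 = grad_s1 * w1
```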

“Flat” Backprop: Do this for assignment 1! Stage your forward/backward computation: e.g. for the SVM, compute the scores and then the margins in the forward pass, and backprop through the margins and then the scores in the backward pass.

E.g. for a two-layer neural net, stage the forward and backward computation in the same way.

Backprop Implementation: Modularized API

Graph (or Net) object (rough pseudo code)
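The pseudo code itself is not reproduced in this transcript; below is a small runnable sketch of the idea (hypothetical gate and graph classes: run the gates forward in topological order, then backward in reverse order):

```python
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y                     # stash for backward
        return x * y
    def backward(self, dz):
        return dz * self.y, dz * self.x           # upstream * local

class AddGate:
    def forward(self, x, y):
        return x + y
    def backward(self, dz):
        return dz, dz                             # add gate distributes

class Graph:
    """Tiny fixed graph computing (x * y) + z."""
    def __init__(self):
        self.mul, self.add = MultiplyGate(), AddGate()
    def forward(self, x, y, z):
        q = self.mul.forward(x, y)                # gates in topological order
        return self.add.forward(q, z)
    def backward(self, dout=1.0):
        dq, dz = self.add.backward(dout)          # gates in reverse order
        dx, dy = self.mul.backward(dq)
        return dx, dy, dz

g = Graph()
out = g.forward(3.0, -4.0, 5.0)                   # (3 * -4) + 5 = -7
dx, dy, dz = g.backward()                         # (-4.0, 3.0, 1.0)
```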

Modularized implementation: forward / backward API. Gate / Node / Function object (actual PyTorch code):

[Diagram: a multiply gate with inputs x, y and output z; x, y, z are scalars]
Need to stash some values for use in backward.
Backward: multiply upstream and local gradients.
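A sketch of such a multiply gate written with torch.autograd.Function (it mirrors the pattern described above; it is not the exact code from the slide):

```python
import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)     # stash values needed in backward
        return x * y

    @staticmethod
    def backward(ctx, grad_z):
        x, y = ctx.saved_tensors
        grad_x = grad_z * y             # upstream gradient * local gradient
        grad_y = grad_z * x
        return grad_x, grad_y

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)
z = Multiply.apply(x, y)
z.backward()                            # x.grad == -4.0, y.grad == 3.0
```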

Example: PyTorch operators

PyTorch sigmoid layer (see the PyTorch source): a Forward function (the forward is actually defined elsewhere...) and a Backward function that multiplies the upstream gradient by the sigmoid local gradient (1 − σ(x))σ(x).
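For reference, a minimal sigmoid layer following the same forward/backward API could look like this (a numpy sketch, not the actual PyTorch source):

```python
import numpy as np

class Sigmoid:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))   # stash output for backward
        return self.out

    def backward(self, grad_output):
        # sigmoid local gradient is (1 - out) * out
        return grad_output * (1.0 - self.out) * self.out
```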

So far: backprop with scalars

What about vector-valued functions?

Recap: Vector derivatives

Scalar to Scalar: regular derivative. If x changes by a small amount, how much will y change?
Vector to Scalar: derivative is the gradient. For each element of x, if it changes by a small amount then how much will y change?
Vector to Vector: derivative is the Jacobian. For each element of x, if it changes by a small amount then how much will each element of y change?

Backprop with Vectors

Loss L is still a scalar! A node f takes vector inputs x (dimension Dx) and y (dimension Dy) and produces a vector output z (dimension Dz).
“Upstream gradient” dL/dz (dimension Dz): for each element of z, how much does it influence L?
“Local gradients” dz/dx and dz/dy are Jacobian matrices of shape [Dx × Dz] and [Dy × Dz].
“Downstream gradients” dL/dx (dimension Dx) and dL/dy (dimension Dy) are obtained by matrix-vector multiply of each local Jacobian with the upstream gradient.

Backprop with Vectors: example, f(x) = max(0, x) (elementwise)

4D input x: [ 1, -2, 3, -1 ]
4D output y: [ 1, 0, 3, 0 ]
Upstream gradient, 4D dL/dy: [ 4, -1, 5, 9 ]

Jacobian dy/dx:
[ 1 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 0 ]

4D dL/dx = [dy/dx] [dL/dy] = [ 4, 0, 5, 0 ]

Jacobian is sparse: off-diagonal entries always zero! Never explicitly form the Jacobian -- instead use implicit multiplication.
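In code, the implicit multiplication for this elementwise ReLU is just a mask (using the numbers from the slide):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -1.0])
dLdy = np.array([4.0, -1.0, 5.0, 9.0])    # upstream gradient

y = np.maximum(0, x)                      # forward: [1, 0, 3, 0]
dLdx = dLdy * (x > 0)                     # backward: [4, 0, 5, 0]
```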

Backprop with Matrices (or Tensors)

Loss L is still a scalar! dL/dx always has the same shape as x!
Now x has shape [Dx×Mx], y has shape [Dy×My], and the output z has shape [Dz×Mz].
“Upstream gradient” dL/dz ([Dz×Mz]): for each element of z, how much does it influence L?
“Local gradients” dz/dx and dz/dy are (generalized) Jacobian matrices of shape [(Dx×Mx)×(Dz×Mz)] and [(Dy×My)×(Dz×Mz)]: for each element of y, how much does it influence each element of z?
“Downstream gradients” dL/dx ([Dx×Mx]) and dL/dy ([Dy×My]) come from the (generalized) matrix-vector multiply of each local Jacobian with the upstream gradient.

Backprop with Matrices: Matrix Multiply, y = xw with x: [N×D], w: [D×M], y: [N×M], and upstream gradient dL/dy: [N×M].
[Slide shows small example matrices for x, w, y, and dL/dy.]

Jacobians: dy/dx: [(N×D)×(N×M)], dy/dw: [(D×M)×(N×M)]
For a neural net we may have N=64, D=M=4096: each Jacobian takes 256 GB of memory! Must work with them implicitly!

Q: What parts of y are affected by one element of x? A: x_{n,d} affects the whole row y_{n,·}.
Q: How much does x_{n,d} affect y_{n,m}? A: w_{d,m}.

dL/dx = (dL/dy) wᵀ    ([N×D] = [N×M] [M×D])
By similar logic: dL/dw = xᵀ (dL/dy)    ([D×M] = [D×N] [N×M])
These formulas are easy to remember: they are the only way to make shapes match up!

Also see derivation in the course notes: http://cs231n.stanford.edu/handouts/linear-backprop.pdf
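A short numpy sketch of these two formulas, with a shape check (random data, hypothetical helper):

```python
import numpy as np

def matmul_backward(x, w, dLdy):
    # y = x @ w with x: [N, D], w: [D, M], upstream dLdy: [N, M]
    dLdx = dLdy @ w.T      # [N, D] = [N, M] @ [M, D]
    dLdw = x.T @ dLdy      # [D, M] = [D, N] @ [N, M]
    return dLdx, dLdw

N, D, M = 2, 3, 4
x, w = np.random.randn(N, D), np.random.randn(D, M)
dLdy = np.random.randn(N, M)              # pretend upstream gradient
dLdx, dLdw = matmul_backward(x, w, dLdy)
assert dLdx.shape == x.shape and dLdw.shape == w.shape
```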

Summary for today:
● (Fully-connected) Neural Networks are stacks of linear functions and nonlinear activation functions; they have much more representational power than linear classifiers
● backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement the forward() / backward() API
● forward: compute result of an operation and save any intermediates needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

Next Time: Convolutional Networks!

A vectorized example:

Always check: The gradient with respect to a variable should have the same shape as the variable

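The worked example itself is not reproduced in the transcript; a minimal sketch, assuming the function f(x, W) = ‖Wx‖² used for this example, which also illustrates the shape check above:

```python
import numpy as np

W = np.random.randn(2, 2)
x = np.random.randn(2, 1)

q = W @ x                        # intermediate vector, [2x1]
f = float(np.sum(q ** 2))        # scalar output

df_dq = 2 * q                    # [2x1]
df_dW = df_dq @ x.T              # [2x2] -- same shape as W
df_dx = W.T @ df_dq              # [2x1] -- same shape as x
```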

In discussion section: A matrix example...
