
Backpropagation

Why backpropagation

• Neural networks are sequences of parametrized functions $h(x; \theta)$.

[Figure: a convolutional network, $x \rightarrow$ conv $\rightarrow$ subsample $\rightarrow$ conv $\rightarrow$ subsample $\rightarrow$ linear; the parameters $\theta$ are the filters of the conv layers and the weights of the linear layer.]

• Parameters need to be set by minimizing some loss function:

$$\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} L(h(x_i; \theta), y_i)$$

• Minimization through gradient descent requires computing the gradient:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \, \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L(h(x_i; \theta), y_i)$$

$$\nabla_\theta L(z, y) = \frac{\partial L(z, y)}{\partial z} \frac{\partial z}{\partial \theta}, \qquad z = h(x; \theta)$$

• Backpropagation: a way to compute the gradient $\frac{\partial z}{\partial \theta}$ (a worked sketch follows below)
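To make the chain-rule expression above concrete, here is my own minimal sketch (not from the slides) of gradient descent for a linear model $h(x;\theta) = \theta^\top x$ with squared-error loss; the dataset, learning rate, and iteration count are illustrative values.

```python
import numpy as np

# Minimal sketch: gradient descent for a linear model h(x; theta) = theta . x
# with squared-error loss L(z, y) = 0.5 * (z - y)^2, where the gradient is
# assembled as (dL/dz) * (dz/dtheta), exactly as in the chain rule above.

rng = np.random.default_rng(0)
N, D = 100, 5                          # toy dataset: N examples, D features
X = rng.normal(size=(N, D))
true_theta = rng.normal(size=D)
y = X @ true_theta                     # noiseless targets, just for illustration

theta = np.zeros(D)                    # initial parameters
eta = 0.1                              # learning rate

for t in range(200):
    z = X @ theta                      # forward: z_i = h(x_i; theta)
    dL_dz = z - y                      # dL/dz for the squared-error loss
    grad = X.T @ dL_dz / N             # (1/N) sum_i (dL/dz_i) * (dz_i/dtheta), dz_i/dtheta = x_i
    theta = theta - eta * grad         # gradient-descent step

print("final loss:", 0.5 * np.mean((X @ theta - y) ** 2))
```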

z z z z z5 = z 1 2 3 4 x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5 The gradient of convnets

z1 z2 z3 z4 z5 = z

x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5

@z @z @z = 3 @w3 @z3 @w3 The gradient of convnets

z1 z2 z3 z4 z5 = z

x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5

@z @z @z = 3 @w3 @z3 @w3 The gradient of convnets

z1 z2 z3 z4 z5 = z

x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5

@z @z @z = 3 @w3 @z3 @w3 The gradient of convnets

z1 z2 z3 z4 z5 = z

x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5 @z @z @z = 4 @z3 @z4 @z3 @z @z @z = 3 @w3 @z3 @w3 The gradient of convnets

z1 z2 z3 z4 z5 = z

x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5 @z @z @z = 4 @z3 @z4 @z3 @z @z @z = 3 @w3 @z3 @w3 The gradient of convnets

z1 z2 z3 z4 z5 = z

x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5

@z @z @z3 Recurrence = going @z2 @z3 @z2 backward!! @z @z @z = 2 @w2 @z2 @w2 The gradient of convnets

z1 z2 z3 z4 z5 = z

x f1 f2 f3 f4 f5

w 1 w2 w3 w4 w5 Backpropagation for a sequence of functions zi = fi(zi 1,wi) z0 = x z = zn • Assume we can compute partial derivatives of each function

@zi @fi(zi 1,wi) @zi @fi(zi 1,wi) = = @zi 1 @zi 1 @wi @wi • Use g(zi) to store gradient of z w.r.t zi, g(wi) for wi • Calculate g(zi ) by iterating backwards @z @z @zi @zi g(zn)= =1 g(zi 1)= = g(zi) @zn @zi @zi 1 @zi 1 • Use g(zi) to compute gradient of parameters @z @zi @zi g(wi)= = g(zi) @zi @wi @wi Loss as a function
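As a minimal illustration of this recurrence (my own sketch, not from the slides), the following runs the backward pass over a chain of scalar functions $z_i = \tanh(w_i z_{i-1})$, whose partials are easy to write down, and checks one entry against a finite difference.

```python
import numpy as np

# Backpropagation through a chain z_i = f_i(z_{i-1}, w_i),
# here with scalar f_i(z, w) = tanh(w * z).

def forward(x, w):
    """Forward pass: return all intermediate values z_0, ..., z_n."""
    zs = [x]
    for wi in w:
        zs.append(np.tanh(wi * zs[-1]))
    return zs

def backward(zs, w):
    """Backward pass: start from g(z_n) = 1 and apply the recurrence."""
    g_z = 1.0                                   # g(z_n) = dz/dz_n = 1
    g_w = np.zeros_like(w)
    for i in range(len(w), 0, -1):
        dz_dzprev = (1 - zs[i] ** 2) * w[i - 1]   # d tanh(w z)/dz
        dz_dw = (1 - zs[i] ** 2) * zs[i - 1]      # d tanh(w z)/dw
        g_w[i - 1] = g_z * dz_dw                # g(w_i) = g(z_i) * dz_i/dw_i
        g_z = g_z * dz_dzprev                   # g(z_{i-1}) = g(z_i) * dz_i/dz_{i-1}
    return g_w

w = np.array([0.5, -1.2, 0.8, 2.0, -0.3])       # five layers, as in the figure
x = 0.7
zs = forward(x, w)
g_w = backward(zs, w)

# Check one entry against a finite difference.
eps = 1e-6
w_pert = w.copy(); w_pert[2] += eps
approx = (forward(x, w_pert)[-1] - zs[-1]) / eps
print(g_w[2], approx)                           # the two numbers should match closely
```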

Loss as a function

[Figure: the loss is appended as the last node of the network: image $\rightarrow$ conv $\rightarrow$ subsample $\rightarrow$ conv $\rightarrow$ subsample $\rightarrow$ linear $\rightarrow$ loss, with the label feeding into the loss node; the parameters are the filters and the linear weights.]

Putting it all together: SGD training of ConvNets

1. Sample image and label
2. Pass image through network to get loss (forward)
3. Backpropagate to get gradients (backward)
4. Take step along negative gradients to update weights
5. Repeat! (see the training-loop sketch below)
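The following is my own runnable sketch of the five steps above. A linear classifier with a softmax loss stands in for the convnet so the example stays short; the data, minibatch size, and learning rate are illustrative, but the loop structure (sample, forward, backward, update, repeat) is the one described in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 1000, 20, 10                      # toy dataset, 10 classes
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, C))
y = np.argmax(X @ W_true, axis=1)           # synthetic labels

W = np.zeros((D, C))                        # parameters ("weights")
eta, K = 0.1, 32                            # learning rate, minibatch size

for step in range(2000):
    # 1. Sample a minibatch of images and labels
    idx = rng.integers(0, N, size=K)
    xb, yb = X[idx], y[idx]

    # 2. Forward: scores -> softmax -> average loss over the minibatch
    s = xb @ W
    s -= s.max(axis=1, keepdims=True)       # subtract max for numerical stability
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(p[np.arange(K), yb]))

    # 3. Backward: gradient of the loss w.r.t. the weights
    dscores = p.copy()
    dscores[np.arange(K), yb] -= 1          # dL/ds for softmax + log loss
    grad_W = xb.T @ dscores / K

    # 4. Take a step along the negative gradient
    W -= eta * grad_W
    # 5. Repeat!

print("final minibatch loss:", loss)
```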

Beyond sequences: computation graphs

• Arbitrary graphs of functions
• No distinction between intermediate outputs and parameters

[Figure: a computation graph with variables $x$, $y$, $w$, $u$, $z$ and function nodes $f$, $g$, $h$, $k$, $l$ feeding into a single output $z$.]

Computation graph - Functions

• Each node implements two functions (see the sketch below)
  • A "forward": computes the output given the input
  • A "backward": computes the derivative of $z$ w.r.t. the input, given the derivative of $z$ w.r.t. the output

Computation graphs

[Figure: in the forward pass, a node $f_i$ maps inputs $a$, $b$, $c$ to an output $d$; in the backward pass, given $\frac{\partial z}{\partial d}$, the same node produces $\frac{\partial z}{\partial a}$, $\frac{\partial z}{\partial b}$, $\frac{\partial z}{\partial c}$.]
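Here is my own minimal sketch of that forward/backward contract. The node classes, the tiny graph $z = xw + u$, and the variable names are illustrative, not part of any particular framework's API.

```python
# Each node computes its output in forward() and, given dz/d(output),
# returns dz/d(each input) in backward().

class Multiply:
    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b

    def backward(self, dz_dout):
        # d(a*b)/da = b, d(a*b)/db = a
        return dz_dout * self.b, dz_dout * self.a

class Add:
    def forward(self, a, b):
        return a + b

    def backward(self, dz_dout):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return dz_dout, dz_dout

# Tiny graph: z = (x * w) + u
mul, add = Multiply(), Add()
x, w, u = 2.0, -3.0, 0.5
z = add.forward(mul.forward(x, w), u)

# Backward: start from dz/dz = 1 and push derivatives toward the inputs.
dz_dmul, dz_du = add.backward(1.0)
dz_dx, dz_dw = mul.backward(dz_dmul)
print(z, dz_dx, dz_dw, dz_du)          # -5.5, -3.0 (= w), 2.0 (= x), 1.0
```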

Neural network frameworks

Stochastic gradient descent

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \, \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta L(h(x_{i_k}; \theta^{(t)}), y_{i_k})$$

Noisy!

Momentum

• Average multiple gradient steps (see the sketch below)
• Use exponential averaging

$$g^{(t)} \leftarrow \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta L(h(x_{i_k}; \theta^{(t)}), y_{i_k})$$

$$p^{(t)} \leftarrow \mu \, g^{(t)} + (1 - \mu) \, p^{(t-1)}$$

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \, p^{(t)}$$
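A minimal sketch of this update, with my own function name and toy values: $p$ is an exponential average of the minibatch gradients, and the step is taken along $-p$. Note that in this convention $\mu$ weights the *new* gradient, so a small $\mu$ means heavy averaging.

```python
import numpy as np

def momentum_step(theta, p, grad, eta=0.01, mu=0.1):
    """One update: p <- mu*g + (1-mu)*p, then theta <- theta - eta*p."""
    p = mu * grad + (1 - mu) * p
    theta = theta - eta * p
    return theta, p

theta = np.zeros(3)
p = np.zeros(3)                          # running average of gradients
for t in range(5):
    grad = np.array([1.0, -2.0, 0.5])    # stand-in for a minibatch gradient
    theta, p = momentum_step(theta, p, grad)
print(theta, p)
```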

Weight decay

• Add $-\lambda\,\theta$ to the gradient step (see the sketch below)
• Prevents $\theta$ from growing to infinity
• Equivalent to L2 regularization of the weights
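A minimal sketch of the update with weight decay, written in the equivalent L2-regularization form (the decay coefficient $\lambda$ is added to the data gradient as $\lambda\theta$); the function name and the values of `eta` and `lam` are illustrative.

```python
import numpy as np

def sgd_step_with_decay(theta, grad, eta=0.1, lam=1e-2):
    """theta <- theta - eta * (grad + lam * theta): decay shrinks theta toward zero."""
    return theta - eta * (grad + lam * theta)

theta = np.array([5.0, -3.0, 1.0])
for t in range(100):
    grad = np.zeros_like(theta)          # with zero data gradient, only decay acts
    theta = sgd_step_with_decay(theta, grad)
print(theta)                             # steadily shrinking toward zero
```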

Learning rate decay

• Large step size / learning rate
  • Faster convergence initially
  • Bouncing around at the end because of noisy gradients
• Learning rate must be decreased over time
• Usually done in steps (see the sketch below)
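A sketch of a step schedule: the learning rate is cut by a fixed factor every few epochs. The base rate, decay factor, and step length here are illustrative values, not taken from the slides.

```python
def step_decay(epoch, base_lr=0.1, drop=0.1, epochs_per_step=30):
    """Multiply the base learning rate by `drop` every `epochs_per_step` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_step))

for epoch in [0, 29, 30, 59, 60, 90]:
    print(epoch, step_decay(epoch))      # 0.1, 0.1, 0.01, 0.01, 0.001, 0.0001
```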

Convolutional network training

• Initialize network
• Sample minibatch of images
• Forward pass to compute loss
• Backpropagate loss to compute gradient
• Combine gradient with momentum and weight decay
• Take step according to current learning rate

Setting hyperparameters

• How do we find a hyperparameter setting that works? Try it! (see the sketch after the figure below)
  • Train on the training set
  • Test on the validation set, not the test set
• Picking hyperparameters that work well on the test set = overfitting on the test set

[Figure: hyperparameter loop. Split the data into train / validation / test; over the training iterations, train on the training set and test on the validation set, then pick new hyperparameters and repeat; test on the test set only once at the end (ideally).]
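The following is my own sketch of that loop under a toy linear-regression setup: several candidate learning rates are trained on the training split, scored on the validation split, and the test split is touched only once at the end. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10)); w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=300)
train, val, test = (X[:200], y[:200]), (X[200:250], y[200:250]), (X[250:], y[250:])

def train_model(Xtr, ytr, lr, steps=200):
    """Plain gradient descent on squared error."""
    w = np.zeros(Xtr.shape[1])
    for _ in range(steps):
        w -= lr * Xtr.T @ (Xtr @ w - ytr) / len(ytr)
    return w

def mse(w, Xs, ys):
    return np.mean((Xs @ w - ys) ** 2)

best_lr, best_err, best_w = None, np.inf, None
for lr in [0.3, 0.1, 0.03, 0.01]:            # candidate hyperparameters
    w = train_model(*train, lr)
    err = mse(w, *val)                        # score on validation, not test
    if err < best_err:
        best_lr, best_err, best_w = lr, err, w

print("chosen lr:", best_lr, "test error:", mse(best_w, *test))  # test used once
```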

Vagaries of optimization

• Non-convex
• Local optima
• Sensitivity to initialization
• Vanishing / exploding gradients:

$$\frac{\partial z}{\partial z_i} = \frac{\partial z}{\partial z_{n-1}} \frac{\partial z_{n-1}}{\partial z_{n-2}} \cdots \frac{\partial z_{i+1}}{\partial z_i}$$

• If each term is (much) greater than 1 → explosion of gradients
• If each term is (much) less than 1 → vanishing gradients (see the sketch below)
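A toy numeric illustration of the product above (my own, with illustrative per-layer factors): the gradient reaching layer $i$ is a product of per-layer terms, so it shrinks or grows exponentially with depth.

```python
import numpy as np

depth = 50
for factor in [0.9, 1.0, 1.1]:               # magnitude of each per-layer derivative
    grads = np.cumprod(np.full(depth, factor))
    print(f"factor {factor}: gradient after {depth} layers = {grads[-1]:.3e}")
# factor 0.9 -> ~5e-3 (vanishing); factor 1.1 -> ~1e+2 (exploding)
```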

Image Classification

How to do machine learning

• Create training / validation sets
• Identify loss functions
• Choose hypothesis class
• Find best hypothesis by minimizing training loss

• For multiclass classification, the network outputs a score vector $s = h(x)$, converted to probabilities with a softmax:

$$\hat{p}(y = k \mid x) \propto e^{s_k}, \qquad \hat{p}(y = k \mid x) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

• The loss is the negative log-likelihood of the true label (see the sketch below):

$$L(h(x), y) = -\log \hat{p}(y \mid x)$$
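A minimal sketch of the softmax probabilities and the log loss above, for a single example; the scores and label are illustrative.

```python
import numpy as np

def softmax(s):
    s = s - np.max(s)                      # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum()

def log_loss(s, y):
    """Negative log-probability of the true class y given scores s."""
    return -np.log(softmax(s)[y])

s = np.array([2.0, -1.0, 0.5])             # scores s = h(x) for 3 classes
y = 0                                       # true label
print(softmax(s), log_loss(s, y))
```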

Building a convolutional network

[Figure: the network is a stack of conv + relu + subsample, conv + relu + subsample, conv + relu + subsample, followed by average pool and a linear layer producing scores for 10 classes.]

MNIST Classification

Method                          Error rate (%)
Linear classifier over pixels   12
Kernel SVM over HOG             0.56
Convolutional Network           0.8

ImageNet

• 1000 categories
• ~1000 instances per category

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei (* = equal contribution). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.

ImageNet

• Top-5 error: the algorithm makes 5 predictions, and the true label must be among the top 5 (see the sketch below)
• Useful for incomplete labelings

[Figure: challenge winner's accuracy by year (2010, 2011, 2012), with convolutional networks annotated.]
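A small sketch of the top-5 error metric as defined above; the function name, the random scores, and the labels are illustrative.

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (N, C) array of class scores; labels: (N,) true class indices."""
    top5 = np.argsort(-scores, axis=1)[:, :5]          # 5 highest-scoring classes per example
    correct = (top5 == labels[:, None]).any(axis=1)    # is the true label among them?
    return 1.0 - correct.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 1000))                    # 8 examples, 1000 classes
labels = rng.integers(0, 1000, size=8)
print(top5_error(scores, labels))                      # ~1.0 for random scores
```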