Optimization and Backpropagation
I2DL: Prof. Niessner, Prof. Leal-Taixé

Lecture 3 Recap
Neural Network

• Linear score function $f = Wx$
(Figures: on CIFAR-10, on ImageNet. Credit: Li/Karpathy/Johnson)

Neural Network

• Linear score function $f = Wx$
• A neural network is a nesting of 'functions'
  – 2 layers: $f = W_2 \max(0, W_1 x)$
  – 3 layers: $f = W_3 \max(0, W_2 \max(0, W_1 x))$
  – 4 layers: $f = W_4 \tanh(W_3 \max(0, W_2 \max(0, W_1 x)))$
  – 5 layers: $f = W_5 \sigma(W_4 \tanh(W_3 \max(0, W_2 \max(0, W_1 x))))$
  – … up to hundreds of layers
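As a quick sketch of this nesting (shapes and names are my own, not from the slides), the 2- and 3-layer score functions in NumPy:

```python
import numpy as np

def two_layer_score(x, W1, W2):
    # f = W2 * max(0, W1 * x): linear map, ReLU, linear map
    return W2 @ np.maximum(0, W1 @ x)

def three_layer_score(x, W1, W2, W3):
    # f = W3 * max(0, W2 * max(0, W1 * x))
    return W3 @ np.maximum(0, W2 @ np.maximum(0, W1 @ x))

# toy shapes: 4 inputs -> 5 hidden -> 3 scores
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
scores = two_layer_score(x, W1, W2)
print(scores.shape)  # (3,)
```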
Neural Network

(Figure: input layer, hidden layer, output layer. Credit: Li/Karpathy/Johnson)

Neural Network

(Figure: a deeper network — input layer, hidden layers 1–3, output layer. The number of neurons per layer is the width; the number of layers is the depth.)

Activation Functions

• Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$
• tanh: $\tanh(x)$
• ReLU: $\max(0, x)$
• Leaky ReLU: $\max(0.1x, x)$
• Parametric ReLU: $\max(\alpha x, x)$
• Maxout: $\max(w_1^T x + b_1,\; w_2^T x + b_2)$
• ELU: $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \le 0 \end{cases}$
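A minimal NumPy sketch of these activations (the test values are my own):

```python
import numpy as np

def sigmoid(x):        return 1.0 / (1.0 + np.exp(-x))
def relu(x):           return np.maximum(0.0, x)
def leaky_relu(x):     return np.maximum(0.1 * x, x)
def prelu(x, alpha):   return np.maximum(alpha * x, x)
def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (e^x - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # [-0.2  0.   3. ]
```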
Loss Functions

• Measure the goodness of the predictions (or equivalently, the network's performance)
• Regression loss
  – L1 loss: $L(\mathbf{y}, \hat{\mathbf{y}}; \boldsymbol{\theta}) = \frac{1}{n}\sum_i^n \|y_i - \hat{y}_i\|_1$
  – MSE loss: $L(\mathbf{y}, \hat{\mathbf{y}}; \boldsymbol{\theta}) = \frac{1}{n}\sum_i^n \|y_i - \hat{y}_i\|_2^2$
• Classification loss (for multi-class classification)
  – Cross-entropy loss: $E(\mathbf{y}, \hat{\mathbf{y}}; \boldsymbol{\theta}) = -\sum_{i=1}^{n}\sum_{k=1}^{K}(y_{ik} \cdot \log \hat{y}_{ik})$
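These losses are one-liners in NumPy (a sketch; the toy vectors are my own):

```python
import numpy as np

def l1_loss(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    # y: one-hot targets (n, K); y_hat: predicted probabilities (n, K)
    return -np.sum(y * np.log(y_hat))

y     = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.5, 2.0])
print(l1_loss(y, y_hat))   # 0.5
```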
Computational Graphs

• A neural network is a computational graph
– It has compute nodes
– It has edges that connect nodes
– It is directional
– It is organized in ‘layers’
Backprop

The Importance of Gradients
• Our optimization schemes are based on computing gradients $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta})$
• One can compute gradients analytically, but what if our function is too complex?
• Break down the gradient computation: Backpropagation
Rumelhart et al. 1986

Backprop: Forward Pass

• $f(x, y, z) = (x + y) \cdot z$
• Initialization: $x = 1$, $y = -3$, $z = 4$
• Forward pass: the sum node gives $d = x + y = -2$, the mult node gives $f = d \cdot z = -8$
Backprop: Backward Pass

$f(x, y, z) = (x + y) \cdot z$, with $x = 1$, $y = -3$, $z = 4$, so $d = x + y = -2$ and $f = d \cdot z = -8$.

• $d = x + y \;\Rightarrow\; \frac{\partial d}{\partial x} = 1, \;\frac{\partial d}{\partial y} = 1$
• $f = d \cdot z \;\Rightarrow\; \frac{\partial f}{\partial d} = z, \;\frac{\partial f}{\partial z} = d$
• What is $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$?

Starting from the output with $\frac{\partial f}{\partial f} = 1$ and walking backward through the graph:
• $\frac{\partial f}{\partial z} = d = -2$
• $\frac{\partial f}{\partial d} = z = 4$
• Chain rule: $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial d} \cdot \frac{\partial d}{\partial y} = 4 \cdot 1 = 4$
• Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial d} \cdot \frac{\partial d}{\partial x} = 4 \cdot 1 = 4$
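The whole forward/backward pass of the toy graph $f(x,y,z) = (x+y) \cdot z$ fits in a few lines (a sketch mirroring the derivation above):

```python
# forward pass
x, y, z = 1.0, -3.0, 4.0
d = x + y          # sum node:  d = -2
f = d * z          # mult node: f = -8

# backward pass (chain rule, starting from df/df = 1)
df_dd = z          # local derivative of the mult node w.r.t. d
df_dz = d          # local derivative of the mult node w.r.t. z
df_dx = df_dd * 1  # dd/dx = 1  ->  4
df_dy = df_dd * 1  # dd/dy = 1  ->  4

print(f, df_dx, df_dy, df_dz)  # -8.0 4.0 4.0 -2.0
```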
Compute Graphs -> Neural Networks

• $x_k$: input variables
• $w_{l,m,n}$: network weights (note 3 indices)
  – $l$: which layer
  – $m$: which neuron in the layer
  – $n$: which weight in the neuron
• $\hat{y}_i$: computed output ($i$: output dimension, up to $n_{out}$)
• $y_i$: ground-truth targets
• $L$: loss function
(Figure: input layer to output layer — the inputs $x_0, x_1$ are multiplied by the weights $w_0, w_1$ (unknowns!), summed, and compared to the target $y_0$ (e.g., a class label or regression target) in an L2 loss/cost node.)

(Figure: the same graph with a ReLU activation $\max(0, x)$ inserted before the loss — btw. I'm not arguing this is the right choice here.)

We want to compute gradients w.r.t. all weights $W$.

(Figure: with several outputs $\hat{y}_0, \hat{y}_1, \hat{y}_2$, each output has its own weights and its own loss node.)

Goal: we want to compute gradients of the loss function $L$ w.r.t. all weights $W$ AND all biases $b$:
• $L = \sum_i L_i$ — sum over the loss per sample; e.g., L2 loss ⟶ simply sum up squares: $L_i = (\hat{y}_i - y_i)^2$
• $\hat{y}_i = A(b_i + \sum_k x_k w_{i,k})$, with activation function $A$ and bias $b_i$
• ⟶ use the chain rule to compute partials:
  $\frac{\partial L_i}{\partial w_{i,k}} = \frac{\partial L_i}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{i,k}}$

NNs as Computational Graphs
• We can express any kind of function in a computational graph, e.g.
  $f(\mathbf{w}, \mathbf{x}) = \frac{1}{1 + e^{-(b + w_0 x_0 + w_1 x_1)}}$
• This is the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ applied to a linear score: the graph multiplies $w_0 x_0$ and $w_1 x_1$, sums them with $b$, then applies the nodes $\cdot(-1)$, $\exp(\cdot)$, $+1$, and $1/x$ in sequence.

Forward pass with $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $b = -3$:
$w_0 x_0 = -2$, $w_1 x_1 = 6$, the sum is $-2 + 6 - 3 = 1$; negating gives $-1$, $e^{-1} = 0.37$, adding $1$ gives $1.37$, and $1/1.37 = 0.73$.

Backward pass, using the local derivatives
$g(x) = \frac{1}{x} \Rightarrow \frac{\partial g}{\partial x} = -\frac{1}{x^2}$, $\quad g_\alpha(x) = \alpha + x \Rightarrow \frac{\partial g}{\partial x} = 1$, $\quad g(x) = e^x \Rightarrow \frac{\partial g}{\partial x} = e^x$, $\quad g_\alpha(x) = \alpha x \Rightarrow \frac{\partial g}{\partial x} = \alpha$:

• $1/x$ node: $1 \cdot \left(-\frac{1}{1.37^2}\right) = -0.53$
• $+1$ node: $-0.53 \cdot 1 = -0.53$
• $\exp$ node: $-0.53 \cdot e^{-1} = -0.2$
• $\cdot(-1)$ node: $-0.2 \cdot (-1) = 0.2$
• The sum node distributes $0.2$ unchanged to $b$ and to both products; the mult nodes then give
  $\frac{\partial f}{\partial w_0} = x_0 \cdot 0.2 = -0.2$, $\;\frac{\partial f}{\partial x_0} = w_0 \cdot 0.2 = 0.4$, $\;\frac{\partial f}{\partial w_1} = x_1 \cdot 0.2 = -0.4$, $\;\frac{\partial f}{\partial x_1} = w_1 \cdot 0.2 = -0.6$, $\;\frac{\partial f}{\partial b} = 0.2$
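The same node-by-node forward and backward pass can be checked in code (a sketch; variable names are my own, and the printed gradients match the rounded values on the slides):

```python
import math

# forward pass through the graph, node by node
w0, x0, w1, x1, b = 2.0, -1.0, -3.0, -2.0, -3.0
a = w0 * x0            # -2
c = w1 * x1            #  6
s = a + c + b          #  1
n = -s                 # -1
e = math.exp(n)        # ~0.37
t = 1.0 + e            # ~1.37
f = 1.0 / t            # ~0.73

# backward pass: multiply local derivatives along the graph
dt = -1.0 / t**2       # ~ -0.53
de = dt * 1.0          # +1 node passes the gradient through
dn = de * math.exp(n)  # ~ -0.20
ds = dn * -1.0         # ~  0.20
db, da, dc = ds, ds, ds       # sum node distributes the gradient
dw0, dx0 = da * x0, da * w0   # ~ -0.20, 0.40
dw1, dx1 = dc * x1, dc * w1   # ~ -0.40, -0.60
```

Note that `ds` equals $\sigma(s)(1 - \sigma(s)) = f(1-f)$ — the well-known shortcut for the sigmoid gradient.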
Gradient Descent
$\mathbf{x}^* = \arg\min f(\mathbf{x})$

(Figure: an optimization landscape, from the initialization down to the optimum.)

Gradient Descent
• From derivative to gradient: the derivative $\frac{df(x)}{dx}$ generalizes to the gradient $\nabla_x f(x)$, the direction of greatest increase of the function
• Gradient descent steps in the direction of the negative gradient:
  $x' = x - \alpha \nabla_x f(x)$
  where $\alpha$ is the learning rate
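A minimal sketch of this update rule on a toy function $f(x) = (x-3)^2$ (the function, learning rate, and initialization are my own choices):

```python
# gradient descent on f(x) = (x - 3)^2, with gradient 2(x - 3)
alpha = 0.1   # learning rate
x = 0.0       # initialization
for _ in range(100):
    grad = 2.0 * (x - 3.0)
    x = x - alpha * grad   # x' = x - alpha * grad_f(x)
print(x)  # converges to the optimum x* = 3
```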
Gradient Descent for Neural Networks

(Figure: input layer, hidden layers 1–3, output layer; neurons indexed by $m$, layers by $l$.)

For a given training pair $\{\mathbf{x}, \mathbf{y}\}$, we want to update all weights, i.e., we need to compute the derivatives w.r.t. all weights:

$\nabla_{\mathbf{W}} f_{\{\mathbf{x},\mathbf{y}\}}(\mathbf{W}) = \begin{pmatrix} \frac{\partial f}{\partial w_{0,0,0}} \\ \vdots \\ \frac{\partial f}{\partial w_{l,m,n}} \end{pmatrix}$

Gradient step: $\mathbf{W}' = \mathbf{W} - \alpha \nabla_{\mathbf{W}} f_{\{\mathbf{x},\mathbf{y}\}}(\mathbf{W})$
NNs can Become Quite Complex…
• These graphs can be huge!
[Szegedy et al., CVPR'15] Going Deeper with Convolutions
The Flow of the Gradients

• Many, many of these nodes (NEURONS) form a neural network
• Each one has its own work to do: a FORWARD AND BACKWARD PASS

At each node $z = f(x, y)$ (with activations $x, y$ flowing in and activation function $f$), the upstream gradient $\frac{\partial L}{\partial z}$ is combined with the local derivatives:
$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}, \qquad \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial y}$

Gradient Descent for Neural Networks

Loss function: $L_i = (\hat{y}_i - y_i)^2$
(Figure: inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $\hat{y}_0, \hat{y}_1$ with targets $y_0, y_1$.)

$\hat{y}_i = A(b_{1,i} + \sum_j h_j w_{1,i,j})$
$h_j = A(b_{0,j} + \sum_k x_k w_{0,j,k})$
Just simple: $A(x) = \max(0, x)$

Gradient Descent for Neural Networks
Backpropagation:

$\frac{\partial L_i}{\partial w_{1,i,j}} = \frac{\partial L_i}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{1,i,j}}$

$\frac{\partial L_i}{\partial \hat{y}_i} = 2(\hat{y}_i - y_i)$

$\frac{\partial \hat{y}_i}{\partial w_{1,i,j}} = h_j$ if $b_{1,i} + \sum_j h_j w_{1,i,j} > 0$, else $0$

For the first layer, just go layer by layer through the network:
$\frac{\partial L_i}{\partial w_{0,j,k}} = \frac{\partial L_i}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial h_j} \cdot \frac{\partial h_j}{\partial w_{0,j,k}}$

with $h_j = A(b_{0,j} + \sum_k x_k w_{0,j,k})$, $\hat{y}_i = A(b_{1,i} + \sum_j h_j w_{1,i,j})$, $L_i = (\hat{y}_i - y_i)^2$

Gradient Descent for Neural Networks
(Figure: the same network — inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $\hat{y}_0, \hat{y}_1$.)

How many unknown weights?
• Output layer: $2 \cdot 4 + 2$
• Hidden layer: $4 \cdot 3 + 4$
(#neurons ∙ #input channels + #biases)

Note that some activations also have weights.
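The counting rule can be written as a small helper (a sketch; the function name is my own):

```python
def num_params(layer_sizes):
    # #neurons * #input channels + #biases, summed over consecutive layers
    return sum(n_out * n_in + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# the network above: 3 inputs -> 4 hidden -> 2 outputs
print(num_params([3, 4, 2]))  # (4*3 + 4) + (2*4 + 2) = 26
```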
Derivatives of Cross Entropy Loss

Gradients of the weights of the last layer:

$\frac{\partial L_i}{\partial w_{ji}} = \frac{\partial L_i}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial s_i} \cdot \frac{\partial s_i}{\partial w_{ji}}$

Binary cross-entropy loss: $L = -\sum_{i=1}^{n_{out}} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right)$, with $\hat{y}_i = \frac{1}{1+e^{-s_i}}$ and output scores $s_i = \sum_j h_j w_{ji}$.

$\frac{\partial L_i}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}, \qquad \frac{\partial \hat{y}_i}{\partial s_i} = \hat{y}_i (1 - \hat{y}_i), \qquad \frac{\partial s_i}{\partial w_{ji}} = h_j$

$\Longrightarrow \frac{\partial L_i}{\partial w_{ji}} = (\hat{y}_i - y_i) \, h_j, \qquad \frac{\partial L_i}{\partial s_i} = \hat{y}_i - y_i$

Derivatives of Cross Entropy Loss

Gradients of the weights of the first layer:

$\frac{\partial L}{\partial h_j} = \sum_{i=1}^{n_{out}} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial s_i} \frac{\partial s_i}{\partial h_j} = \sum_{i=1}^{n_{out}} \frac{\partial L}{\partial \hat{y}_i} \, \hat{y}_i (1 - \hat{y}_i) \, w_{ji} = \sum_{i=1}^{n_{out}} (\hat{y}_i - y_i) \, w_{ji}$

$\frac{\partial L}{\partial s_j^1} = \frac{\partial L}{\partial h_j} \frac{\partial h_j}{\partial s_j^1} = \sum_{i=1}^{n_{out}} (\hat{y}_i - y_i) \, w_{ji} \, h_j (1 - h_j)$

$\frac{\partial L}{\partial w_{kj}^1} = \frac{\partial L}{\partial s_j^1} \frac{\partial s_j^1}{\partial w_{kj}^1} = \sum_{i=1}^{n_{out}} (\hat{y}_i - y_i) \, w_{ji} \, h_j (1 - h_j) \, x_k$
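The closed-form gradient $(\hat{y}_i - y_i)\,h_j$ can be sanity-checked against finite differences (a sketch; the toy values are my own):

```python
import math

def forward(h, w, y):
    s = sum(hj * wj for hj, wj in zip(h, w))   # output score
    y_hat = 1.0 / (1.0 + math.exp(-s))         # sigmoid
    L = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return L, y_hat

h, w, y = [0.5, -1.0, 2.0], [0.3, 0.8, -0.2], 1.0
L, y_hat = forward(h, w, y)

# analytic gradient: dL/dw_j = (y_hat - y) * h_j
grad = [(y_hat - y) * hj for hj in h]

# numeric finite-difference check
eps = 1e-6
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    L_plus, _ = forward(h, w_plus, y)
    assert abs((L_plus - L) / eps - grad[j]) < 1e-4
```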
Back to Compute Graphs & NNs

• Inputs $\mathbf{x}$ and targets $\mathbf{y}$
• A two-layer NN for regression with ReLU activation: $z = w_1 x$, $\sigma = \max(0, z)$, $\hat{y} = w_2 \sigma$
• Function we want to optimize:
  $\sum_{i=1}^{n} \| w_2 \max(0, w_1 x_i) - y_i \|_2^2$
Gradient Descent for Neural Networks

Initialize $x = 1$, $y = 0$, $w_1 = \frac{1}{3}$, $w_2 = 2$, with the loss $L(\mathbf{y}, \hat{\mathbf{y}}; \boldsymbol{\theta}) = \frac{1}{n}\sum_i^n \|\hat{y}_i - y_i\|_2^2$. In our case $n, d = 1$.

Forward pass: $z = w_1 x = \frac{1}{3}$, $\sigma = \max(0, z) = \frac{1}{3}$, $\hat{y} = w_2 \cdot \sigma = \frac{2}{3}$, $L = (\hat{y} - y)^2 = \frac{4}{9}$.

Backpropagation:
• $L = (\hat{y} - y)^2 \;\Rightarrow\; \frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)$
• $\hat{y} = w_2 \cdot \sigma \;\Rightarrow\; \frac{\partial \hat{y}}{\partial w_2} = \sigma, \;\frac{\partial \hat{y}}{\partial \sigma} = w_2$
• $\sigma = \max(0, z) \;\Rightarrow\; \frac{\partial \sigma}{\partial z} = 1$ if $z > 0$, else $0$
• $z = x \cdot w_1 \;\Rightarrow\; \frac{\partial z}{\partial w_1} = x$

$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2} = 2 \cdot \frac{2}{3} \cdot \frac{1}{3} = \frac{4}{9}$

$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial z} \cdot \frac{\partial z}{\partial w_1} = 2 \cdot \frac{2}{3} \cdot 2 \cdot 1 \cdot 1 = \frac{8}{3}$
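The same numbers can be reproduced exactly with Python's `fractions` (a sketch of the computation above):

```python
from fractions import Fraction as F

x, y = F(1), F(0)
w1, w2 = F(1, 3), F(2)

# forward pass
z = w1 * x                 # 1/3
sigma = max(F(0), z)       # ReLU: 1/3
y_hat = w2 * sigma         # 2/3
L = (y_hat - y) ** 2       # 4/9

# backward pass
dL_dyhat = 2 * (y_hat - y)               # 4/3
dL_dw2 = dL_dyhat * sigma                # 4/9
dsigma_dz = F(1) if z > 0 else F(0)      # ReLU passes the gradient
dL_dw1 = dL_dyhat * w2 * dsigma_dz * x   # 8/3

print(L, dL_dw1, dL_dw2)  # 4/9 8/3 4/9
```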
Gradient Descent for Neural Networks

• Function we want to optimize: $f(x, \mathbf{w}) = \sum_{i=1}^{n} \| w_2 \max(0, w_1 x_i) - y_i \|_2^2$
• We computed the gradients w.r.t. the weights $w_1$ and $w_2$
• Now: update the weights:
  $\mathbf{w}' = \mathbf{w} - \alpha \cdot \nabla_{\mathbf{w}} f = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} - \alpha \cdot \begin{pmatrix} \nabla_{w_1} f \\ \nabla_{w_2} f \end{pmatrix} = \begin{pmatrix} 1/3 \\ 2 \end{pmatrix} - \alpha \cdot \begin{pmatrix} 8/3 \\ 4/9 \end{pmatrix}$
• But: how to choose a good learning rate $\alpha$?

Gradient Descent
• How to pick a good learning rate?
• How to compute the gradient for a single training pair?
• How to compute the gradient for a large training set?
• How to speed things up?

More to see in the next lectures…
Regularization

Recap: Basic Recipe for ML
• Split your data into train (60%), validation (20%), and test (20%) sets
• Find your hyperparameters on the validation split
• Other splits are also possible (e.g., 80%/10%/10%)

Over- and Underfitting
(Figure: underfitted, appropriate, and overfitted model fits. Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017)

Training a Neural Network
• Training/validation curve

(Figure: training error too high ⟶ underfitting; generalization gap too big ⟶ overfitting. Credits: Deep Learning, Goodfellow et al.)

How can we prevent our model from overfitting? Regularization.

Regularization
• Loss function: $L(\mathbf{y}, \hat{\mathbf{y}}, \boldsymbol{\theta}) = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 + \lambda R(\boldsymbol{\theta})$
• Regularization techniques (add a regularization term to the loss function):
  – L2 regularization
  – L1 regularization
  – Max norm regularization
  – Dropout
  – Early stopping
  – … (more details later)
Regularization: Example

• Input: 3 features $\mathbf{x} = [1, 2, 1]$
• Two linear classifiers that give the same result ($\mathbf{x}^T \theta_1 = \mathbf{x}^T \theta_2 = 1.5$):
  – $\theta_1 = [0, 0.75, 0]$ — ignores 2 features
  – $\theta_2 = [0.25, 0.5, 0.25]$ — takes information from all features
Regularization: Example

• Loss: $L(\mathbf{y}, \hat{\mathbf{y}}, \boldsymbol{\theta}) = \sum_{i=1}^{n} (x_i \theta_{ji} - y_i)^2 + \lambda R(\boldsymbol{\theta})$
• L2 regularization: $R(\boldsymbol{\theta}) = \sum_{i=1}^{n} \theta_i^2$
  – $\theta_1$: $0^2 + 0.75^2 + 0^2 = 0.5625$
  – $\theta_2$: $0.25^2 + 0.5^2 + 0.25^2 = 0.375$ ⟵ minimization prefers $\theta_2$

Regularization: Example

• L1 regularization: $R(\boldsymbol{\theta}) = \sum_{i=1}^{n} |\theta_i|$
  – $\theta_1$: $0 + 0.75 + 0 = 0.75$ ⟵ minimization prefers $\theta_1$
  – $\theta_2$: $0.25 + 0.5 + 0.25 = 1$

with $x = [1, 2, 1]$, $\theta_1 = [0, 0.75, 0]$, $\theta_2 = [0.25, 0.5, 0.25]$
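Both regularizer values can be verified in a few lines (a sketch):

```python
theta1 = [0.0, 0.75, 0.0]
theta2 = [0.25, 0.5, 0.25]
x = [1.0, 2.0, 1.0]

# both classifiers give the same score on x
score1 = sum(xi * ti for xi, ti in zip(x, theta1))
score2 = sum(xi * ti for xi, ti in zip(x, theta2))

R_l2 = lambda theta: sum(t ** 2 for t in theta)
R_l1 = lambda theta: sum(abs(t) for t in theta)

print(score1, score2)              # 1.5 1.5
print(R_l2(theta1), R_l2(theta2))  # 0.5625 0.375  -> L2 prefers theta2
print(R_l1(theta1), R_l1(theta2))  # 0.75 1.0      -> L1 prefers theta1
```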
Regularization: Example

• Input: 3 features $\mathbf{x} = [1, 2, 1]$
• Two linear classifiers that give the same result:
  – $\theta_1 = [0, 0.75, 0]$ ⟵ L1 regularization enforces sparsity
  – $\theta_2 = [0.25, 0.5, 0.25]$ ⟵ L2 regularization enforces that the weights have similar values
Regularization: Effect

• A dog classifier takes different inputs: furry, has two eyes, has a tail, has paws, has two ears
• L1 regularization will focus all the attention on a few key features
• L2 regularization will take all information into account to make decisions

Regularization for Neural Networks
(Compute graph: $x \rightarrow [\,\cdot\, w_1] \rightarrow \max(0, \cdot) \rightarrow [\,\cdot\, w_2] \rightarrow$ L2 loss against $y$, plus a regularization node $R(w_1, w_2) \cdot \lambda$ added to the loss.)

Combining nodes — network output + L2 loss + regularization:
$\sum_{i=1}^{n} \| w_2 \max(0, w_1 x_i) - y_i \|_2^2 + \lambda R(w_1, w_2)$

With L2 regularization, $R(w_1, w_2) = \|w_1\|_2^2 + \|w_2\|_2^2$, this becomes:
$\sum_{i=1}^{n} \| w_2 \max(0, w_1 x_i) - y_i \|_2^2 + \lambda (w_1^2 + w_2^2)$
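A sketch of the combined objective for a single sample (the value of $\lambda$ is my own choice, and the weights come from the earlier running example):

```python
def regularized_loss(xs, ys, w1, w2, lam):
    # network output + L2 loss + L2 regularization
    data = sum((w2 * max(0.0, w1 * xi) - yi) ** 2 for xi, yi in zip(xs, ys))
    return data + lam * (w1 ** 2 + w2 ** 2)

# one sample: x = 1, y = 0, w1 = 1/3, w2 = 2
loss = regularized_loss([1.0], [0.0], 1 / 3, 2.0, lam=0.1)
print(loss)  # 4/9 + 0.1 * (1/9 + 4) ≈ 0.8556
```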
Regularization

(Figure: decision boundaries for increasing regularization strength $\lambda = 0$, $\lambda = 0.000001$, $\lambda = 0.001$, $\lambda = 1$, $\lambda = 10$. Credits: University of Washington)

What is the goal of regularization? What happens to the training error?
• Any strategy that aims to lower the validation error, possibly at the cost of increasing the training error
Next Lecture
• This week:
  – Check exercises
  – Check office hours ☺
• Next lecture:
  – Optimization of Neural Networks
  – In particular, introduction to SGD (our main method!)
See you next week ☺

Further Reading

• Backpropagation
  – Chapter 6.5 (6.5.1–6.5.3) in http://www.deeplearningbook.org/contents/mlp.html
  – Chapter 5.3 in Bishop, Pattern Recognition and Machine Learning
  – http://cs231n.github.io/optimization-2/
• Regularization
  – Chapter 7.1 (esp. 7.1.1 & 7.1.2) in http://www.deeplearningbook.org/contents/regularization.html
  – Chapter 5.5 in Bishop, Pattern Recognition and Machine Learning