Learning From Data, Lecture 22: Neural Networks and Overfitting

Approximation vs. Generalization · Regularization and Early Stopping · Minimizing Ein More Efficiently

M. Magdon-Ismail, CSCI 4100/6100

Recap: Neural Networks and Fitting the Data
Forward propagation:

    x = x^(0) → s^(1) → x^(1) → s^(2) → ··· → s^(L) → x^(L) = h(x)

    s^(ℓ) = (W^(ℓ))ᵗ x^(ℓ−1),    x^(ℓ) = θ(s^(ℓ))        (compute h and E_in)

Choose W = {W^(1), W^(2), ..., W^(L)} to minimize E_in.

Gradient descent:

    W(t+1) ← W(t) − η ∇E_in(W(t))

To compute the gradient ∂e/∂W^(ℓ), we need the sensitivities δ^(ℓ) = ∂e/∂s^(ℓ), because

    ∂e/∂W^(ℓ) = x^(ℓ−1) (δ^(ℓ))ᵗ

Backpropagation computes the δ's backwards, δ^(L) → δ^(L−1) → ··· → δ^(1):

    δ^(ℓ) = θ′(s^(ℓ)) ⊗ [W^(ℓ+1) δ^(ℓ+1)]

[Figure: log10(error) vs. log10(iteration) for SGD on the digits data (Average Intensity, Symmetry features).]
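The forward and backward passes can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions — θ = tanh throughout (so θ′ = 1 − θ²) and squared error e = (x^(L) − y)²; the function names `forward` and `backprop` are mine, not from the lecture.

```python
import numpy as np

def forward(W, x):
    """Forward propagation: x^(0)=x, s^(l)=(W^(l))^T x^(l-1), x^(l)=tanh(s^(l)).
    Every layer input carries a leading bias coordinate of 1 (except the output)."""
    xs, ss = [np.concatenate(([1.0], x))], []
    for l, Wl in enumerate(W):
        s = Wl.T @ xs[-1]
        ss.append(s)
        x_next = np.tanh(s)
        if l < len(W) - 1:                      # hidden layers get a bias node
            x_next = np.concatenate(([1.0], x_next))
        xs.append(x_next)
    return xs, ss

def backprop(W, xs, ss, y):
    """Backpropagate the sensitivities delta^(l) = d e / d s^(l)
    for e = (x^(L) - y)^2 and theta = tanh (theta' = 1 - theta^2)."""
    L = len(W)
    deltas = [None] * L
    deltas[L - 1] = 2 * (xs[-1] - y) * (1 - np.tanh(ss[-1]) ** 2)
    for l in range(L - 2, -1, -1):
        # drop the bias row of W^(l+1): the bias node receives no backward signal
        deltas[l] = (1 - np.tanh(ss[l]) ** 2) * (W[l + 1][1:] @ deltas[l + 1])
    # gradient: d e / d W^(l) = x^(l-1) (delta^(l))^T
    return [np.outer(xs[l], deltas[l]) for l in range(L)]
```

Each x^(ℓ) carries a leading bias coordinate, which is why the bias row of W^(ℓ+1) is dropped when propagating δ backwards.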
Creator: Malik Magdon-Ismail

2-Layer Neural Network
    h(x) = θ( w0 + Σ_{j=1}^{m} wj θ(vjᵗ x) )

[Figure: network diagram with inputs x, hidden units v1, ..., vm, and output weights w0, w1, ..., wm feeding h(x).]
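As a quick numerical companion to the formula, a one-line NumPy version of the 2-layer hypothesis (assuming θ = tanh and storing the vⱼ as the columns of a matrix V; the names are mine):

```python
import numpy as np

def two_layer(x, w0, w, V):
    """h(x) = theta(w0 + sum_j w_j * theta(v_j^T x)), with theta = tanh.
    V holds the v_j as columns; w is the vector of output weights."""
    return np.tanh(w0 + w @ np.tanh(V.T @ x))
```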
The Neural Network has a Tunable Transform
Neural Network:        h(x) = θ( w0 + Σ_{j=1}^{m} wj θ(vjᵗ x) )

Nonlinear Transform:   h(x) = θ( w0 + Σ_{j=1}^{d̃} wj Φj(x) )

k-RBF-Network:         h(x) = θ( w0 + Σ_{j=1}^{k} wj φ(||x − μj||) )

    E_in = O(1/m)        (approximation improves as m grows)
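For comparison with the 2-layer network, here is a sketch of the k-RBF-network hypothesis, assuming the Gaussian bump φ(z) = e^(−z²) (the lecture leaves φ generic; the function name is mine):

```python
import numpy as np

def rbf_network(x, w0, w, mus, phi=lambda z: np.exp(-z ** 2)):
    """h(x) = theta(w0 + sum_j w_j * phi(||x - mu_j||)); Gaussian bump phi assumed."""
    z = np.linalg.norm(x - mus, axis=1)   # distances to the k centers mu_j (rows of mus)
    return np.tanh(w0 + w @ phi(z))
```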
Generalization
MLP:  d_vc = O(md log(md))
tanh: d_vc = O(md(m + d))

Choose m = √N (convergence to optimal for MLP, just like k-NN).

Semi-parametric, because you still have to learn the parameters.
Regularization – Weight Decay
    E_aug(w) = (1/N) Σ_{n=1}^{N} (h(xn; w) − yn)² + (λ/N) Σ_{ℓ,i,j} (w_ij^(ℓ))²

    ∂E_aug(w)/∂W^(ℓ) = ∂E_in(w)/∂W^(ℓ) + (2λ/N) W^(ℓ)

The first term comes from backpropagation.
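The augmented gradient is just the backpropagation gradient plus a per-layer shrinkage term; a minimal sketch (the names `augmented_gradient` and `grad_in` are mine, with `grad_in` standing for the backpropagation gradient of E_in):

```python
import numpy as np

def augmented_gradient(grad_in, W, lam, N):
    """Gradient of the augmented error: the backpropagation gradient of E_in
    plus the weight-decay term (2*lam/N) * W^(l), layer by layer."""
    return [g + (2 * lam / N) * Wl for g, Wl in zip(grad_in, W)]
```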
Weight Decay with Digits Data
[Figure: digits data in the (Average Intensity, Symmetry) plane; left panel: no weight decay; right panel: weight decay with λ = 0.01.]
Early Stopping
Gradient descent with step size η explores a nested sequence of hypothesis sets. Starting from w0 with gradient g0:

    w1 = w0 − η g0/||g0||,    H1 = {w : ||w − w0|| ≤ η}
                              H2 = H1 ∪ {w : ||w − w1|| ≤ η}
                              H3 = H2 ∪ {w : ||w − w2|| ≤ η}

Each iteration explores a larger H:  H1 ⊂ H2 ⊂ H3 ⊂ H4 ⊂ ···

[Figure: left, error vs. iteration t — E_in(w_t) decreases while the penalty Ω(d_vc(H_t)) grows, so E_out(w_t) is minimized at an intermediate t*; right, the gradient descent path w0, w1, w2, w3 across a contour of constant E_in, with the expanding sets H1, H2, H3.]
Early Stopping on Digits Data
[Figure: log10(error) vs. iteration t (10² to 10⁶) on the digits data: E_in keeps decreasing, while E_val reaches its minimum at t* and then rises.]
Use a validation set to determine t*. Output w* = w(t*); do not retrain with all the data up to t*.
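A minimal sketch of validation-based early stopping, assuming generic callables `grad_fn` (gradient of E_in on the training set) and `val_error` (error on the validation set); both names are hypothetical:

```python
def early_stopping(w0, grad_fn, val_error, eta=0.1, max_iter=10 ** 4):
    """Run gradient descent, track the validation error, and return the weights
    w(t*) at the iteration t* with the smallest validation error seen."""
    w, best_w, best_err, best_t = w0, w0, val_error(w0), 0
    for t in range(1, max_iter + 1):
        w = w - eta * grad_fn(w)      # ordinary gradient descent step
        err = val_error(w)
        if err < best_err:            # remember the best iterate so far
            best_w, best_err, best_t = w, err, t
    return best_w, best_t
```

Note that the output is w(t*) itself — no retraining on all the data.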
Minimizing Ein
1. Use regression for classification.
2. Use better algorithms than gradient descent.
[Figure: log10(error) vs. optimization time (0.1 to 10⁴ sec): conjugate gradients reduces the error far faster than gradient descent.]
Beefing Up Gradient Descent
Determine the gradient g; the right learning rate depends on the error surface.

[Figure: two sketches of in-sample error E_in(w) vs. weights w — one shallow, one deep.]

Shallow: use large η.    Deep: use small η.
Variable Learning Rate Gradient Descent
 1: Initialize w(0) and η0 at t = 0. Set α > 1 and β < 1.
 2: while stopping criterion has not been met do
 3:     Let g(t) = ∇E_in(w(t)), and set v(t) = −g(t).
 4:     if E_in(w(t) + ηt v(t)) < E_in(w(t)) then
 5:         accept: w(t+1) = w(t) + ηt v(t); increase η: η_{t+1} = α ηt.
 6:     else
 7:         reject: w(t+1) = w(t); decrease η: η_{t+1} = β ηt.        [β ∈ [0.7, 0.8]]
 8:     end if
 9:     Iterate to the next step, t ← t + 1.
10: end while

Steepest Descent – Line Search

 1: Initialize w(0) and set t = 0.
 2: while stopping criterion has not been met do
 3:     Let g(t) = ∇E_in(w(t)), and set v(t) = −g(t).
 4:     Let η* = argmin_η E_in(w(t) + η v(t)).
 5:     w(t+1) = w(t) + η* v(t).
 6:     Iterate to the next step, t ← t + 1.
 7: end while

How to accomplish the line search (step 4)? Simple bisection (binary search) suffices in practice.

[Figure: left, a step from w(t) along v(t) to w(t+1) on a contour of constant E_in, where the new gradient g(t+1) is orthogonal to v(t); right, bisection on E(η) with bracketing points η1, η2, η3.]

Comparison of Optimization Heuristics

[Figure: log10(error) vs. optimization time (0.1 to 10⁴ sec) for gradient descent, variable η, and steepest descent.]

                                       Optimization Time
    Method                         10 sec     1,000 sec    50,000 sec
    Gradient Descent               0.122      0.0214       0.0113
    Stochastic Gradient Descent    0.0203     0.000447     1.63 × 10⁻⁵
    Variable Learning Rate         0.0432     0.0180       0.000197
    Steepest Descent               0.0497     0.0194       0.000140

Conjugate Gradients

1. Line search, just like steepest descent.
2. Choose a better direction than −g.

[Figure: left, log10(error) vs. optimization time — conjugate gradients beats steepest descent by several orders of magnitude; right, successive directions v(t), v(t+1) on a contour of constant E_in.]

                                       Optimization Time
    Method                         10 sec     1,000 sec    50,000 sec
    Stochastic Gradient Descent    0.0203     0.000447     1.63 × 10⁻⁵
    Steepest Descent               0.0497     0.0194       0.000140
    Conjugate Gradients            0.0200     1.13 × 10⁻⁶  2.73 × 10⁻⁹

There are better algorithms (e.g. Levenberg-Marquardt), but we will stop here.
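The variable learning rate heuristic is easy to make concrete. A toy sketch on a quadratic E (all names are mine; α > 1 and β in the [0.7, 0.8] range as above):

```python
import numpy as np

def variable_eta_gd(w0, E, grad, eta0=0.1, alpha=1.05, beta=0.75, iters=100):
    """Variable learning rate gradient descent: accept the step and grow eta
    when E decreases; otherwise reject the step and shrink eta."""
    w, eta = w0, eta0
    for _ in range(iters):
        v = -grad(w)
        if E(w + eta * v) < E(w):
            w = w + eta * v      # accept: be more aggressive next time
            eta *= alpha
        else:
            eta *= beta          # reject: be more conservative next time
    return w, eta
```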