
Learning From Data, Lecture 22: Neural Networks and Overfitting

Approximation vs. Generalization; Regularization; Minimizing Ein More Efficiently

M. Magdon-Ismail, CSCI 4100/6100

Recap: Neural Networks and Fitting the Data

Forward Propagation:

$$x = x^{(0)} \xrightarrow{\;W^{(1)}\;} s^{(1)} \xrightarrow{\;\theta\;} x^{(1)} \xrightarrow{\;W^{(2)}\;} s^{(2)} \longrightarrow \cdots \longrightarrow s^{(L)} \xrightarrow{\;\theta\;} x^{(L)} = h(x)$$

$$s^{(\ell)} = (W^{(\ell)})^{\rm t} x^{(\ell-1)}, \qquad x^{(\ell)} = \theta(s^{(\ell)})$$

(Compute $h$ and $E_{\rm in}$.)

Choose $W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\}$ to minimize $E_{\rm in}$.

[Figure: SGD minimizing $E_{\rm in}$; $\log_{10}$(error) versus $\log_{10}$(iteration).]

Gradient descent:

$$W(t+1) \leftarrow W(t) - \eta\, \nabla E_{\rm in}(W(t))$$

Compute the gradient: need $\dfrac{\partial e}{\partial W^{(\ell)}}$ $\longrightarrow$ need $\delta^{(\ell)} = \dfrac{\partial e}{\partial s^{(\ell)}}$

$$\frac{\partial e}{\partial W^{(\ell)}} = x^{(\ell-1)} \big(\delta^{(\ell)}\big)^{\rm t}$$

Backpropagation: $\delta^{(1)} \longleftarrow \delta^{(2)} \longleftarrow \cdots \longleftarrow \delta^{(L-1)} \longleftarrow \delta^{(L)}$

$$\delta^{(\ell)} = \theta'(s^{(\ell)}) \otimes \left[ W^{(\ell+1)} \delta^{(\ell+1)} \right]_{1}^{d^{(\ell)}}$$
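To make the recap concrete, here is a minimal NumPy sketch of one forward and one backward pass for a tanh network. It assumes each $W^{(\ell)}$ has shape $(d^{(\ell-1)}+1) \times d^{(\ell)}$ and that every $x^{(\ell)}$ except the output carries a leading bias entry of 1; the function names and layer handling are illustrative, not from the lecture.

```python
import numpy as np

# Minimal sketch: forward propagation and backpropagation for a tanh network,
# for a single input x and a single target y (squared error on one example).

def forward(W, x):
    """Return the lists of s^{(l)} and x^{(l)} for one input x of shape (d,)."""
    xs = [np.concatenate(([1.0], x))]               # x^{(0)} with bias entry 1
    ss = []
    for l, Wl in enumerate(W):
        s = Wl.T @ xs[-1]                           # s^{(l)} = (W^{(l)})^t x^{(l-1)}
        x_next = np.tanh(s)                         # x^{(l)} = theta(s^{(l)})
        ss.append(s)
        if l < len(W) - 1:
            x_next = np.concatenate(([1.0], x_next))  # add bias for the next layer
        xs.append(x_next)
    return ss, xs

def backward(W, ss, xs, y):
    """Return gradients de/dW^{(l)} for e = (x^{(L)} - y)^2."""
    grads = [None] * len(W)
    # delta^{(L)} = de/ds^{(L)} = 2 (x^{(L)} - y) theta'(s^{(L)})
    delta = 2 * (xs[-1] - y) * (1 - np.tanh(ss[-1]) ** 2)
    for l in reversed(range(len(W))):
        grads[l] = np.outer(xs[l], delta)           # de/dW^{(l)} = x^{(l-1)} (delta^{(l)})^t
        if l > 0:
            # delta^{(l)} = theta'(s^{(l)}) (x) [W^{(l+1)} delta^{(l+1)}]_1^{d^{(l)}}
            delta = (1 - np.tanh(ss[l - 1]) ** 2) * (W[l] @ delta)[1:]
    return grads
```

Averaging these per-example gradients over the data gives $\nabla E_{\rm in}$ for the gradient descent update above.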

2-Layer Neural Network

[Figure: a 2-layer network: input $x$, hidden units with weight vectors $v_1, \ldots, v_m$, and output weights $w_0, w_1, \ldots, w_m$.]

$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{m} w_j\, \theta(v_j^{\rm t} x) \right)$$

The Neural Network has a Tunable Transform

Neural Network:
$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{m} w_j\, \theta(v_j^{\rm t} x) \right)$$

Nonlinear Transform:
$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{\tilde d} w_j\, \Phi_j(x) \right)$$

k-RBF-Network:
$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{k} w_j\, \phi\big(\|x - \mu_j\|\big) \right)$$

Approximation: $E_{\rm in} = O\!\left(\dfrac{1}{m}\right)$ (approximation improves as $m$ grows).
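As an illustration of the difference between a fixed and a tunable transform, here is a small sketch of the three hypothesis forms above. The function names, the choice $\theta = \tanh$, and the Gaussian basis function $\phi$ are illustrative assumptions, not specified in the lecture.

```python
import numpy as np

# Three hypothesis forms: fixed nonlinear transform, k-RBF network with fixed
# centers, and a 2-layer neural network whose features theta(v_j^t x) are tunable.

def transform_model(x, w, Phi):
    """h(x) = theta(w0 + sum_j w_j Phi_j(x)) with a fixed transform Phi."""
    z = np.concatenate(([1.0], Phi(x)))
    return np.tanh(w @ z)

def rbf_network(x, w, centers, phi=lambda r: np.exp(-r**2)):
    """h(x) = theta(w0 + sum_j w_j phi(||x - mu_j||)); the centers mu_j are fixed."""
    z = np.concatenate(([1.0], [phi(np.linalg.norm(x - mu)) for mu in centers]))
    return np.tanh(w @ z)

def neural_network(x, w, V):
    """h(x) = theta(w0 + sum_j w_j theta(v_j^t x)); the v_j are learned too."""
    z = np.concatenate(([1.0], np.tanh(V @ x)))   # tunable features theta(v_j^t x)
    return np.tanh(w @ z)
```

Only the neural network learns its features: the $v_j$ enter the optimization along with the $w_j$, which is what makes the transform tunable.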

Generalization

The VC dimension grows with the number of hidden units $m$ and the input dimension $d$:

MLP: $d_{\rm vc} = O\big(md\,\log(md)\big)$

tanh: $d_{\rm vc} = O\big(md\,(m+d)\big)$

Choose $m = \sqrt{N}$ (convergence to optimal for the MLP, just like k-NN).

Semi-parametric, because you still have to learn the parameters.
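A rough sketch of why $m = \sqrt{N}$ gives convergence to optimal, combining the approximation rate $E_{\rm in} = O(1/m)$ from the previous slide with the usual VC bound (an illustration, not a derivation from the lecture):

$$E_{\rm out} \;\le\; E_{\rm in} + O\!\left(\sqrt{\frac{d_{\rm vc}\,\log N}{N}}\right), \qquad E_{\rm in} = O\!\left(\frac{1}{m}\right), \qquad d_{\rm vc} = O\big(md\,\log(md)\big).$$

With $m = \sqrt{N}$, the approximation term is $O(N^{-1/2})$ and $\dfrac{d_{\rm vc}\,\log N}{N} = O\!\left(\dfrac{d\,\log(d\sqrt{N})\,\log N}{\sqrt{N}}\right)$, so both terms vanish as $N \to \infty$.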

Regularization – Weight Decay

$$E_{\rm aug}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(h(x_n; w) - y_n\big)^2 \;+\; \frac{\lambda}{N} \sum_{\ell, i, j} \big(w_{ij}^{(\ell)}\big)^2$$

$$\frac{\partial E_{\rm aug}(w)}{\partial W^{(\ell)}} = \frac{\partial E_{\rm in}(w)}{\partial W^{(\ell)}} + \frac{2\lambda}{N}\, W^{(\ell)}$$

(The first term, $\partial E_{\rm in}/\partial W^{(\ell)}$, is computed by backpropagation.)
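A minimal sketch of how weight decay changes the update, assuming grad_Ein(W) returns the list of backpropagation gradients $\partial E_{\rm in}/\partial W^{(\ell)}$ (for example, built from the backward pass sketched earlier); lam and N play the roles of $\lambda$ and $N$ in $E_{\rm aug}$, and the weights are NumPy arrays.

```python
# Weight decay: add (2*lam/N) * W^{(l)} to each backpropagation gradient.

def grad_Eaug(W, grad_Ein, lam, N):
    """dE_aug/dW^{(l)} = dE_in/dW^{(l)} + (2*lam/N) * W^{(l)}."""
    return [g + (2.0 * lam / N) * Wl for g, Wl in zip(grad_Ein(W), W)]

def weight_decay_step(W, grad_Ein, lam, N, eta):
    """One gradient descent step on the augmented error E_aug."""
    return [Wl - eta * g for Wl, g in zip(W, grad_Eaug(W, grad_Ein, lam, N))]
```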

Weight Decay with Digits Data

[Figure: classifiers on the digits data (average intensity vs. symmetry). Left: no weight decay. Right: weight decay with $\lambda = 0.01$.]

Early Stopping

Gradient descent:

$$w_1 = w_0 - \eta\, \frac{g_0}{\|g_0\|}, \qquad \mathcal{H}_1 = \{\, w : \|w - w_0\| \le \eta \,\}$$

$$w_2 = w_1 - \eta\, \frac{g_1}{\|g_1\|}, \qquad \mathcal{H}_2 = \mathcal{H}_1 \cup \{\, w : \|w - w_1\| \le \eta \,\}$$

$$w_3 = w_2 - \eta\, \frac{g_2}{\|g_2\|}, \qquad \mathcal{H}_3 = \mathcal{H}_2 \cup \{\, w : \|w - w_2\| \le \eta \,\}$$

Each iteration explores a larger $\mathcal{H}$: $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \mathcal{H}_4 \subset \cdots$

[Figure: $E_{\rm in}(w_t)$ and $E_{\rm out}(w_t)$ versus iteration $t$; the gap behaves like $\Omega(d_{\rm vc}(\mathcal{H}_t))$, so $E_{\rm out}$ is minimized at an intermediate $t^*$.]

[Figure: the iterates $w(0), w_1, w_2, w_3, \ldots, w(t^*)$ over contours of constant $E_{\rm in}$.]

Early Stopping on Digits Data

[Figure: $E_{\rm in}$ and $E_{\rm val}$ ($\log_{10}$ error) versus iteration $t$, from $10^2$ to $10^6$, on the digits data; the validation error is minimized at $t^*$.]

Use a validation set to determine $t^*$. Output $w^*$; do not retrain with all the data until $t^*$.
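A minimal sketch of early stopping with a validation set, in the spirit of the plot above. The error function E(w, data), its gradient grad(w, data), the train/validation split, the fixed learning rate, and all names are assumed for illustration rather than taken from the lecture.

```python
import numpy as np

# Early stopping: run gradient descent on the training set, track the
# validation error, and keep the weights with the smallest validation error.

def early_stopping(w0, E, grad, train, val, eta=0.01, max_iter=10**5):
    """Return the weights w(t*) that minimized the validation error."""
    w = np.asarray(w0, dtype=float)
    best_w, best_Eval = w.copy(), E(w, val)
    for t in range(1, max_iter + 1):
        w = w - eta * grad(w, train)           # one gradient descent step
        Eval = E(w, val)                       # validation error at iteration t
        if Eval < best_Eval:                   # keep the best weights seen so far
            best_w, best_Eval = w.copy(), Eval
    return best_w                              # output w(t*); do not retrain to t*
```

Returning best_w is what implements "output $w^*$, do not retrain with all the data until $t^*$".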

Minimizing Ein

1. Use regression for classification.
2. Use better algorithms than gradient descent.

[Figure: $\log_{10}$(error) versus optimization time (sec); conjugate gradients reaches far lower error than gradient descent.]

Beefing Up Gradient Descent

Determine the gradient $g$.

[Figure: two sketches of the in-sample error $E_{\rm in}(w)$ versus the weights $w$: one shallow, one deep.]

Shallow: use a large $\eta$. Deep: use a small $\eta$.

Variable Learning Rate Gradient Descent

1: Initialize w(0) and $\eta_0$ at $t = 0$. Set $\alpha > 1$ and $\beta < 1$.
2: while stopping criterion has not been met do
3:     Let $g(t) = \nabla E_{\rm in}(w(t))$, and set $v(t) = -g(t)$.
4:     if $E_{\rm in}(w(t) + \eta_t v(t)) < E_{\rm in}(w(t))$ then
5:         accept: $w(t+1) = w(t) + \eta_t v(t)$; increase $\eta$: $\eta_{t+1} = \alpha \eta_t$.
6:     else
7:         reject: $w(t+1) = w(t)$; decrease $\eta$: $\eta_{t+1} = \beta \eta_t$.      [$\beta \in [0.7, 0.8]$]
8:     end if
9:     Iterate to the next step, $t \leftarrow t + 1$.
10: end while
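A minimal sketch of the algorithm above, assuming Ein(w) and grad_Ein(w) are supplied (for example, built from the backpropagation sketch); the specific alpha and beta values are illustrative choices consistent with $\alpha > 1$ and $\beta \in [0.7, 0.8]$.

```python
# Variable learning rate gradient descent (steps 1-10 above).

def variable_eta_gd(w, Ein, grad_Ein, eta=0.1, alpha=1.1, beta=0.7, max_iter=10000):
    E = Ein(w)
    for t in range(max_iter):
        v = -grad_Ein(w)                              # step 3: descent direction
        w_new = w + eta * v
        E_new = Ein(w_new)
        if E_new < E:                                 # step 4: did the error decrease?
            w, E, eta = w_new, E_new, alpha * eta     # step 5: accept, increase eta
        else:
            eta = beta * eta                          # step 7: reject, decrease eta
    return w
```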

Steepest Descent - Line Search

1: Initialize w(0) and set $t = 0$.
2: while stopping criterion has not been met do
3:     Let $g(t) = \nabla E_{\rm in}(w(t))$, and set $v(t) = -g(t)$.
4:     Let $\eta^* = \operatorname{argmin}_\eta E_{\rm in}(w(t) + \eta\, v(t))$.
5:     $w(t+1) = w(t) + \eta^* v(t)$.
6:     Iterate to the next step, $t \leftarrow t + 1$.
7: end while

[Figure: one steepest descent step along $v(t)$ from $w(t)$ to $w(t+1)$, over contours of constant $E_{\rm in}$.]

How to accomplish the line search (step 4)? Simple bisection (binary search) suffices in practice.

[Figure: $E(\eta)$ along the search direction, with a bracket $\eta_1 < \eta_2 < \eta_3$ around the minimum $\bar\eta$, where $E(\eta_2) < E(\eta_1), E(\eta_3)$.]
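A minimal sketch of the bisection line search for step 4, assuming we already have a bracket $\eta_1 < \eta_2 < \eta_3$ with $E(\eta_2) < E(\eta_1), E(\eta_3)$ as in the figure, where $E(\eta) = E_{\rm in}(w(t) + \eta\, v(t))$ is unimodal on the bracket; how the initial bracket is found is left out.

```python
# Bisection line search: repeatedly shrink a bracket around the minimizer.

def bisection_line_search(E, eta1, eta2, eta3, tol=1e-8):
    """Invariant: eta1 < eta2 < eta3 and E(eta2) < E(eta1), E(eta3).
    Returns an approximation to eta* = argmin_eta E(eta)."""
    while eta3 - eta1 > tol:
        # probe the midpoint of the larger half of the bracket
        eta_new = (0.5 * (eta1 + eta2) if (eta2 - eta1) > (eta3 - eta2)
                   else 0.5 * (eta2 + eta3))
        lo, hi = sorted((eta2, eta_new))
        if E(lo) < E(hi):
            eta2, eta3 = lo, hi      # the minimum lies in [eta1, hi]
        else:
            eta1, eta2 = lo, hi      # the minimum lies in [lo, eta3]
    return eta2
```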

Comparison of Optimization Heuristics

[Figure: $\log_{10}$(error) versus optimization time (sec) for gradient descent, variable $\eta$, and steepest descent.]

Optimization time:

Method                          10 sec     1,000 sec     50,000 sec
Gradient Descent                0.122      0.0214        0.0113
Stochastic Gradient Descent     0.0203     0.000447      1.63 × 10⁻⁵
Variable Learning Rate          0.0432     0.0180        0.000197
Steepest Descent                0.0497     0.0194        0.000140

Conjugate Gradients

1. Line search, just like steepest descent.
2. Choose a better direction than $-g$.

[Figure: $\log_{10}$(error) versus optimization time (sec); conjugate gradients versus steepest descent.]

[Figure: a conjugate gradient step over contours of constant $E_{\rm in}$: from $w(t)$ along $v(t)$ to $w(t+1)$, then along $v(t+1)$.]

Optimization time:

Method                          10 sec     1,000 sec       50,000 sec
Stochastic Gradient Descent     0.0203     0.000447        1.63 × 10⁻⁵
Steepest Descent                0.0497     0.0194          0.000140
Conjugate Gradients             0.0200     1.13 × 10⁻⁶     2.73 × 10⁻⁹
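One standard way to "choose a better direction than $-g$" is the Polak-Ribiere nonlinear conjugate gradient update; the lecture does not commit to a particular variant, so the following is an illustrative sketch. It assumes Ein(w), grad_Ein(w), NumPy-array weights, and a line_search(E) routine that returns $\operatorname{argmin}_\eta E(\eta)$, for example a wrapper around the bisection sketch above.

```python
# Nonlinear conjugate gradients (Polak-Ribiere): line search along a direction
# that mixes the new gradient with the previous direction.

def conjugate_gradients(w, Ein, grad_Ein, line_search, max_iter=1000):
    g = grad_Ein(w)
    v = -g                                            # first direction: steepest descent
    for t in range(max_iter):
        # line search along v, exactly as in steepest descent
        eta = line_search(lambda a: Ein(w + a * v))
        w = w + eta * v
        g_new = grad_Ein(w)
        # Polak-Ribiere coefficient (with restart when it goes negative)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        v = -g_new + beta * v                         # conjugate direction
        g = g_new
    return w
```

Setting beta to 0 at every step recovers steepest descent.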

There are better algorithms (e.g. Levenberg-Marquardt), but we will stop here.
