
Learning From Data, Lecture 22: Neural Networks and Overfitting

Approximation vs. Generalization; Regularization; Minimizing Ein More Efficiently

M. Magdon-Ismail, CSCI 4100/6100

Recap: Neural Networks and Fitting the Data

Forward Propagation:

$$x = x^{(0)} \xrightarrow{\;W^{(1)}\;} s^{(1)} \xrightarrow{\;\theta\;} x^{(1)} \xrightarrow{\;W^{(2)}\;} s^{(2)} \longrightarrow \cdots \longrightarrow s^{(L)} \xrightarrow{\;\theta\;} x^{(L)} = h(x)$$

$$s^{(\ell)} = (W^{(\ell)})^{\rm t} x^{(\ell-1)}, \qquad x^{(\ell)} = \theta(s^{(\ell)})$$

(Compute $h$ and $E_{\rm in}$.)

Choose $W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\}$ to minimize $E_{\rm in}$.

[Figure: SGD minimizing $E_{\rm in}$; $\log_{10}$(error) versus $\log_{10}$(iteration).]

Gradient descent:

$$W(t+1) \leftarrow W(t) - \eta\, \nabla E_{\rm in}(W(t))$$

Compute the gradient: need $\dfrac{\partial e}{\partial W^{(\ell)}}$ $\longrightarrow$ need $\delta^{(\ell)} = \dfrac{\partial e}{\partial s^{(\ell)}}$

$$\frac{\partial e}{\partial W^{(\ell)}} = x^{(\ell-1)} \big(\delta^{(\ell)}\big)^{\rm t}$$

Backpropagation: $\delta^{(1)} \longleftarrow \delta^{(2)} \longleftarrow \cdots \longleftarrow \delta^{(L-1)} \longleftarrow \delta^{(L)}$

$$\delta^{(\ell)} = \theta'(s^{(\ell)}) \otimes \left[ W^{(\ell+1)} \delta^{(\ell+1)} \right]_{1}^{d^{(\ell)}}$$
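To make the recap concrete, here is a minimal NumPy sketch of one forward and one backward pass for a tanh network. It assumes each $W^{(\ell)}$ has shape $(d^{(\ell-1)}+1) \times d^{(\ell)}$ and that every $x^{(\ell)}$ except the output carries a leading bias entry of 1; the function names and layer handling are illustrative, not from the lecture.

```python
import numpy as np

# Minimal sketch: forward propagation and backpropagation for a tanh network,
# for a single input x and a single target y (squared error on one example).

def forward(W, x):
    """Return the lists of s^{(l)} and x^{(l)} for one input x of shape (d,)."""
    xs = [np.concatenate(([1.0], x))]               # x^{(0)} with bias entry 1
    ss = []
    for l, Wl in enumerate(W):
        s = Wl.T @ xs[-1]                           # s^{(l)} = (W^{(l)})^t x^{(l-1)}
        x_next = np.tanh(s)                         # x^{(l)} = theta(s^{(l)})
        ss.append(s)
        if l < len(W) - 1:
            x_next = np.concatenate(([1.0], x_next))  # add bias for the next layer
        xs.append(x_next)
    return ss, xs

def backward(W, ss, xs, y):
    """Return gradients de/dW^{(l)} for e = (x^{(L)} - y)^2."""
    grads = [None] * len(W)
    # delta^{(L)} = de/ds^{(L)} = 2 (x^{(L)} - y) theta'(s^{(L)})
    delta = 2 * (xs[-1] - y) * (1 - np.tanh(ss[-1]) ** 2)
    for l in reversed(range(len(W))):
        grads[l] = np.outer(xs[l], delta)           # de/dW^{(l)} = x^{(l-1)} (delta^{(l)})^t
        if l > 0:
            # delta^{(l)} = theta'(s^{(l)}) (x) [W^{(l+1)} delta^{(l+1)}]_1^{d^{(l)}}
            delta = (1 - np.tanh(ss[l - 1]) ** 2) * (W[l] @ delta)[1:]
    return grads
```

Averaging these per-example gradients over the data gives $\nabla E_{\rm in}$ for the gradient descent update above.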

2-Layer Neural Network

[Figure: a 2-layer network: input $x$, hidden units with weight vectors $v_1, \ldots, v_m$, and output weights $w_0, w_1, \ldots, w_m$.]

$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{m} w_j\, \theta(v_j^{\rm t} x) \right)$$

The Neural Network has a Tunable Transform

Neural Network:
$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{m} w_j\, \theta(v_j^{\rm t} x) \right)$$

Nonlinear Transform:
$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{\tilde d} w_j\, \Phi_j(x) \right)$$

k-RBF-Network:
$$h(x) = \theta\!\left( w_0 + \sum_{j=1}^{k} w_j\, \phi\big(\|x - \mu_j\|\big) \right)$$

Approximation: $E_{\rm in} = O\!\left(\dfrac{1}{m}\right)$ (approximation improves as $m$ grows).
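As an illustration of the difference between a fixed and a tunable transform, here is a small sketch of the three hypothesis forms above. The function names, the choice $\theta = \tanh$, and the Gaussian basis function $\phi$ are illustrative assumptions, not specified in the lecture.

```python
import numpy as np

# Three hypothesis forms: fixed nonlinear transform, k-RBF network with fixed
# centers, and a 2-layer neural network whose features theta(v_j^t x) are tunable.

def transform_model(x, w, Phi):
    """h(x) = theta(w0 + sum_j w_j Phi_j(x)) with a fixed transform Phi."""
    z = np.concatenate(([1.0], Phi(x)))
    return np.tanh(w @ z)

def rbf_network(x, w, centers, phi=lambda r: np.exp(-r**2)):
    """h(x) = theta(w0 + sum_j w_j phi(||x - mu_j||)); the centers mu_j are fixed."""
    z = np.concatenate(([1.0], [phi(np.linalg.norm(x - mu)) for mu in centers]))
    return np.tanh(w @ z)

def neural_network(x, w, V):
    """h(x) = theta(w0 + sum_j w_j theta(v_j^t x)); the v_j are learned too."""
    z = np.concatenate(([1.0], np.tanh(V @ x)))   # tunable features theta(v_j^t x)
    return np.tanh(w @ z)
```

Only the neural network learns its features: the $v_j$ enter the optimization along with the $w_j$, which is what makes the transform tunable.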

Generalization

The VC dimension grows with the number of hidden units $m$ and the input dimension $d$:

MLP: $d_{\rm vc} = O\big(md\,\log(md)\big)$

tanh: $d_{\rm vc} = O\big(md\,(m+d)\big)$

Choose $m = \sqrt{N}$ (convergence to optimal for the MLP, just like k-NN).

Semi-parametric, because you still have to learn the parameters.
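A rough sketch of why $m = \sqrt{N}$ gives convergence to optimal, combining the approximation rate $E_{\rm in} = O(1/m)$ from the previous slide with the usual VC bound (an illustration, not a derivation from the lecture):

$$E_{\rm out} \;\le\; E_{\rm in} + O\!\left(\sqrt{\frac{d_{\rm vc}\,\log N}{N}}\right), \qquad E_{\rm in} = O\!\left(\frac{1}{m}\right), \qquad d_{\rm vc} = O\big(md\,\log(md)\big).$$

With $m = \sqrt{N}$, the approximation term is $O(N^{-1/2})$ and $\dfrac{d_{\rm vc}\,\log N}{N} = O\!\left(\dfrac{d\,\log(d\sqrt{N})\,\log N}{\sqrt{N}}\right)$, so both terms vanish as $N \to \infty$.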

Regularization – Weight Decay

$$E_{\rm aug}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(h(x_n; w) - y_n\big)^2 \;+\; \frac{\lambda}{N} \sum_{\ell, i, j} \big(w_{ij}^{(\ell)}\big)^2$$

$$\frac{\partial E_{\rm aug}(w)}{\partial W^{(\ell)}} = \frac{\partial E_{\rm in}(w)}{\partial W^{(\ell)}} + \frac{2\lambda}{N}\, W^{(\ell)}$$

(The first term, $\partial E_{\rm in}/\partial W^{(\ell)}$, is computed by backpropagation.)
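A minimal sketch of how weight decay changes the update, assuming grad_Ein(W) returns the list of backpropagation gradients $\partial E_{\rm in}/\partial W^{(\ell)}$ (for example, built from the backward pass sketched earlier); lam and N play the roles of $\lambda$ and $N$ in $E_{\rm aug}$, and the weights are NumPy arrays.

```python
# Weight decay: add (2*lam/N) * W^{(l)} to each backpropagation gradient.

def grad_Eaug(W, grad_Ein, lam, N):
    """dE_aug/dW^{(l)} = dE_in/dW^{(l)} + (2*lam/N) * W^{(l)}."""
    return [g + (2.0 * lam / N) * Wl for g, Wl in zip(grad_Ein(W), W)]

def weight_decay_step(W, grad_Ein, lam, N, eta):
    """One gradient descent step on the augmented error E_aug."""
    return [Wl - eta * g for Wl, g in zip(W, grad_Eaug(W, grad_Ein, lam, N))]
```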

Weight Decay with Digits Data

[Figure: classifiers on the digits data (average intensity vs. symmetry). Left: no weight decay. Right: weight decay with $\lambda = 0.01$.]

Early Stopping

Gradient descent:

$$w_1 = w_0 - \eta\, \frac{g_0}{\|g_0\|}, \qquad \mathcal{H}_1 = \{\, w : \|w - w_0\| \le \eta \,\}$$

$$w_2 = w_1 - \eta\, \frac{g_1}{\|g_1\|}, \qquad \mathcal{H}_2 = \mathcal{H}_1 \cup \{\, w : \|w - w_1\| \le \eta \,\}$$

$$w_3 = w_2 - \eta\, \frac{g_2}{\|g_2\|}, \qquad \mathcal{H}_3 = \mathcal{H}_2 \cup \{\, w : \|w - w_2\| \le \eta \,\}$$

Each iteration explores a larger $\mathcal{H}$: $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \mathcal{H}_4 \subset \cdots$

[Figure: $E_{\rm in}(w_t)$ and $E_{\rm out}(w_t)$ versus iteration $t$; the gap behaves like $\Omega(d_{\rm vc}(\mathcal{H}_t))$, so $E_{\rm out}$ is minimized at an intermediate $t^*$.]

[Figure: the iterates $w(0), w_1, w_2, w_3, \ldots, w(t^*)$ over contours of constant $E_{\rm in}$.]

Early Stopping on Digits Data

[Figure: $E_{\rm in}$ and $E_{\rm val}$ ($\log_{10}$ error) versus iteration $t$, from $10^2$ to $10^6$, on the digits data; the validation error is minimized at $t^*$.]

Use a validation set to determine $t^*$. Output $w^*$; do not retrain with all the data until $t^*$.
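A minimal sketch of early stopping with a validation set, in the spirit of the plot above. The error function E(w, data), its gradient grad(w, data), the train/validation split, the fixed learning rate, and all names are assumed for illustration rather than taken from the lecture.

```python
import numpy as np

# Early stopping: run gradient descent on the training set, track the
# validation error, and keep the weights with the smallest validation error.

def early_stopping(w0, E, grad, train, val, eta=0.01, max_iter=10**5):
    """Return the weights w(t*) that minimized the validation error."""
    w = np.asarray(w0, dtype=float)
    best_w, best_Eval = w.copy(), E(w, val)
    for t in range(1, max_iter + 1):
        w = w - eta * grad(w, train)           # one gradient descent step
        Eval = E(w, val)                       # validation error at iteration t
        if Eval < best_Eval:                   # keep the best weights seen so far
            best_w, best_Eval = w.copy(), Eval
    return best_w                              # output w(t*); do not retrain to t*
```

Returning best_w is what implements "output $w^*$, do not retrain with all the data until $t^*$".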

Minimizing Ein

1. Use regression for classification.
2. Use better algorithms than gradient descent.

[Figure: $\log_{10}$(error) versus optimization time (sec); conjugate gradients reaches far lower error than gradient descent.]

Beefing Up Gradient Descent

Determine the gradient $g$.

[Figure: two sketches of the in-sample error $E_{\rm in}(w)$ versus the weights $w$: one shallow, one deep.]

Shallow: use a large $\eta$. Deep: use a small $\eta$.

Variable Learning Rate Gradient Descent

1: Initialize w(0) and $\eta_0$ at $t = 0$. Set $\alpha > 1$ and $\beta < 1$.
2: while stopping criterion has not been met do
3:     Let $g(t) = \nabla E_{\rm in}(w(t))$, and set $v(t) = -g(t)$.
4:     if $E_{\rm in}(w(t) + \eta_t v(t)) < E_{\rm in}(w(t))$ then
5:         accept: $w(t+1) = w(t) + \eta_t v(t)$; increase $\eta$: $\eta_{t+1} = \alpha \eta_t$.
6:     else
7:         reject: $w(t+1) = w(t)$; decrease $\eta$: $\eta_{t+1} = \beta \eta_t$.      [$\beta \in [0.7, 0.8]$]
8:     end if
9:     Iterate to the next step, $t \leftarrow t + 1$.
10: end while
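A minimal sketch of the algorithm above, assuming Ein(w) and grad_Ein(w) are supplied (for example, built from the backpropagation sketch); the specific alpha and beta values are illustrative choices consistent with $\alpha > 1$ and $\beta \in [0.7, 0.8]$.

```python
# Variable learning rate gradient descent (steps 1-10 above).

def variable_eta_gd(w, Ein, grad_Ein, eta=0.1, alpha=1.1, beta=0.7, max_iter=10000):
    E = Ein(w)
    for t in range(max_iter):
        v = -grad_Ein(w)                              # step 3: descent direction
        w_new = w + eta * v
        E_new = Ein(w_new)
        if E_new < E:                                 # step 4: did the error decrease?
            w, E, eta = w_new, E_new, alpha * eta     # step 5: accept, increase eta
        else:
            eta = beta * eta                          # step 7: reject, decrease eta
    return w
```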

Steepest Descent - Line Search

1: Initialize w(0) and set $t = 0$.
2: while stopping criterion has not been met do
3:     Let $g(t) = \nabla E_{\rm in}(w(t))$, and set $v(t) = -g(t)$.
4:     Let $\eta^* = \operatorname{argmin}_\eta E_{\rm in}(w(t) + \eta\, v(t))$.
5:     $w(t+1) = w(t) + \eta^* v(t)$.
6:     Iterate to the next step, $t \leftarrow t + 1$.
7: end while

[Figure: one steepest descent step along $v(t)$ from $w(t)$ to $w(t+1)$, over contours of constant $E_{\rm in}$.]

How to accomplish the line search (step 4)? Simple bisection (binary search) suffices in practice.

[Figure: $E(\eta)$ along the search direction, with a bracket $\eta_1 < \eta_2 < \eta_3$ around the minimum $\bar\eta$, where $E(\eta_2) < E(\eta_1), E(\eta_3)$.]
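A minimal sketch of the bisection line search for step 4, assuming we already have a bracket $\eta_1 < \eta_2 < \eta_3$ with $E(\eta_2) < E(\eta_1), E(\eta_3)$ as in the figure, where $E(\eta) = E_{\rm in}(w(t) + \eta\, v(t))$ is unimodal on the bracket; how the initial bracket is found is left out.

```python
# Bisection line search: repeatedly shrink a bracket around the minimizer.

def bisection_line_search(E, eta1, eta2, eta3, tol=1e-8):
    """Invariant: eta1 < eta2 < eta3 and E(eta2) < E(eta1), E(eta3).
    Returns an approximation to eta* = argmin_eta E(eta)."""
    while eta3 - eta1 > tol:
        # probe the midpoint of the larger half of the bracket
        eta_new = (0.5 * (eta1 + eta2) if (eta2 - eta1) > (eta3 - eta2)
                   else 0.5 * (eta2 + eta3))
        lo, hi = sorted((eta2, eta_new))
        if E(lo) < E(hi):
            eta2, eta3 = lo, hi      # the minimum lies in [eta1, hi]
        else:
            eta1, eta2 = lo, hi      # the minimum lies in [lo, eta3]
    return eta2
```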

Comparison of Optimization Heuristics

[Figure: $\log_{10}$(error) versus optimization time (sec) for gradient descent, variable $\eta$, and steepest descent.]

Optimization time:

Method                          10 sec     1,000 sec     50,000 sec
Gradient Descent                0.122      0.0214        0.0113
Stochastic Gradient Descent     0.0203     0.000447      1.63 × 10⁻⁵
Variable Learning Rate          0.0432     0.0180        0.000197
Steepest Descent                0.0497     0.0194        0.000140

Conjugate Gradients

1. Line search, just like steepest descent.
2. Choose a better direction than $-g$.

[Figure: $\log_{10}$(error) versus optimization time (sec); conjugate gradients versus steepest descent.]

[Figure: a conjugate gradient step over contours of constant $E_{\rm in}$: from $w(t)$ along $v(t)$ to $w(t+1)$, then along $v(t+1)$.]

Optimization time:

Method                          10 sec     1,000 sec       50,000 sec
Stochastic Gradient Descent     0.0203     0.000447        1.63 × 10⁻⁵
Steepest Descent                0.0497     0.0194          0.000140
Conjugate Gradients             0.0200     1.13 × 10⁻⁶     2.73 × 10⁻⁹
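One standard way to "choose a better direction than $-g$" is the Polak-Ribiere nonlinear conjugate gradient update; the lecture does not commit to a particular variant, so the following is an illustrative sketch. It assumes Ein(w), grad_Ein(w), NumPy-array weights, and a line_search(E) routine that returns $\operatorname{argmin}_\eta E(\eta)$, for example a wrapper around the bisection sketch above.

```python
# Nonlinear conjugate gradients (Polak-Ribiere): line search along a direction
# that mixes the new gradient with the previous direction.

def conjugate_gradients(w, Ein, grad_Ein, line_search, max_iter=1000):
    g = grad_Ein(w)
    v = -g                                            # first direction: steepest descent
    for t in range(max_iter):
        # line search along v, exactly as in steepest descent
        eta = line_search(lambda a: Ein(w + a * v))
        w = w + eta * v
        g_new = grad_Ein(w)
        # Polak-Ribiere coefficient (with restart when it goes negative)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        v = -g_new + beta * v                         # conjugate direction
        g = g_new
    return w
```

Setting beta to 0 at every step recovers steepest descent.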

There are better algorithms (e.g. Levenberg-Marquardt), but we will stop here.
