
Simple early stopping rules in machine learning
Sina Baghal, University of Waterloo
RWTH Aachen University, Chair for Mathematics of Information Processing

Part I: Optimization in machine learning

Learning via optimization over data
* Empirical risk minimization:
  $$\min_{\theta \in D} f(\theta) := \frac{1}{m}\sum_{i=1}^{m} \ell(x_i, y_i; \theta)$$
  Here $(x_i, y_i)$ are feature-label pairs, $x_i \in \mathbb{R}^d$, $m$ is the number of samples, and $\ell$ is a loss function.
* Neural networks:
  $$\min f(\theta) = \frac{1}{m}\sum_{i=1}^{m} \big(y_i - \ell(x_i; \theta)\big)^2$$
  - $\theta = (W_1, \cdots, W_L)$, $\theta \in D := \mathbb{R}^{N_1 \times N_0} \times \cdots \times \mathbb{R}^{N_L \times N_{L-1}}$, with $N_0 = d$ and $N_L = 1$.
  - $\ell(\cdot\,; \theta) = W_L \circ \sigma \circ \cdots \circ \sigma \circ W_1$
  - $\sigma(\cdot) = \max(0, \cdot)$, etc.

Visualization of neural networks
Figure: Visualization of a neural network (courtesy of infowatch.com)

Example: Binary classification
- Given samples $\{(x_i, y_i) : i = 1, \cdots, m\}$, find a hyperplane $H$ with normal vector $h$ such that
  $$h^* = \operatorname{argmin}_h \#\big\{i : \operatorname{sign}(x_i^T h) \neq y_i\big\}.$$
- Here
  $$\ell(x_i, y_i; h) = \begin{cases} 1 & \operatorname{sign}(x_i^T h) \neq y_i \\ 0 & \operatorname{sign}(x_i^T h) = y_i. \end{cases}$$
- Equivalently,
  $$\min_{h \in \mathbb{S}^{d-1}} \frac{1}{m}\sum_{i=1}^{m} \ell(x_i, y_i; h).$$

Convex relaxation (from errors to losses)
* Replace the non-convex loss function with a convex surrogate. Common loss functions for binary classification:
  - Linear regression: $\ell(x^T h, y) = (x^T h - y)^2$
  - Hinge: $\ell(x^T h, y) = \max\{0, 1 - y\,x^T h\}$
  - Logistic: $\ell(x^T h, y) = \log(1 + \exp(-y\,x^T h))$

Theorem (Bartlett et al., 2006). Excess error $\leq$ excess loss.

Part II: First order methods

What is a first order method?
- First order methods for solving
  $$\min_{x \in D} f(x)$$
  use only $f(x)$ and $\nabla f(x)$ to perform updates.
- Low iteration cost and low memory footprint are the main advantages that make first-order methods suitable for the large-scale optimization problems appearing ubiquitously in machine learning and data science applications.

Gradient descent & stochastic gradient descent
* Gradient descent
  - Initialize at $x_0 \in D$.
  - $x_{k+1} = x_k - \alpha \nabla f(x_k)$
  - Terminate when $\|\nabla f(x_k)\| \leq \epsilon$.
* Stochastic gradient descent
  $$\min f(\theta) := \frac{1}{m}\sum_{i=1}^{m} f_i(\theta)$$
  - Evaluating the full gradient costs $m$ times more.
  - At each iteration, choose an index $i$ at random and compute $\nabla f_i(\theta)$ to obtain the descent direction(?).
  - Obvious case: $f_1 = \cdots = f_m$.
  - Less obvious case: $\mathbb{E}\,\|\nabla f(\theta) - \nabla f_i(\theta)\|^2 \leq \sigma^2$.

Why SGD?
Figure: $O\!\big(R\sigma\sqrt{2/T} + R^2/T\big)$ vs. $O\!\big(mR^2/T\big)$ (courtesy of F. Bach)
- Bounds for SGD hold in expectation.

Convergence analysis
- To achieve these bounds, we need to use specific step-size values (a major drawback).
- There is a dichotomy between tuning step sizes in theory and in practice (line search).
- Other approaches to proving convergence: Lee et al. 2016, Soltanolkotabi 2017, Bah et al. 2019, ...

More first order methods
- Frank-Wolfe: projection free; useful when specific structure, e.g. sparsity, is desired.
- Variance reduction (mini-batching, SVRG, ...): use batches of size $b > 1$ to reduce the variance, tweak the algorithm, ...
- Adaptive algorithms (Adagrad, Adam, ...): update each coordinate of the gradient with a separate rate.

Why not second order?
- One must bypass Hessian creation and inversion, as well as speed up Hessian-vector products.
- See Agarwal, Bullins, Hazan 2017.

Optimization is not all!
Optimization: Why can we fit?
Generalization: Why can we predict?

Part III: Early stopping

Why early stopping?
- A form of regularization to avoid over-fitting when iterative methods such as gradient descent are used for learning. Up to a point, training improves performance on future data, i.e. yields a low generalization error. Beyond that point, further reducing the training error comes at the expense of a worse generalization error (see the sketch below).
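To make this concrete, here is a minimal sketch of validation-based early stopping for gradient descent on a least squares problem. This is a generic illustration of the idea, not the specific stopping rules analyzed below; the step size, the `patience` window, and the train/validation split are arbitrary illustrative choices.

```python
import numpy as np

def gd_with_early_stopping(X_tr, y_tr, X_val, y_val, lr=0.01, max_iters=5000, patience=20):
    """Plain gradient descent on least squares, stopped when the
    validation loss has not improved for `patience` iterations."""
    n, d = X_tr.shape
    theta = np.zeros(d)
    best_val, best_theta, since_best = np.inf, theta.copy(), 0
    for k in range(max_iters):
        grad = X_tr.T @ (X_tr @ theta - y_tr) / n        # gradient of (1/2n)||X theta - y||^2
        theta -= lr * grad
        val_loss = 0.5 * np.mean((X_val @ theta - y_val) ** 2)
        if val_loss < best_val:
            best_val, best_theta, since_best = val_loss, theta.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:                   # validation error stopped improving
                break
    return best_theta, k

# Toy usage: noisy linear data, split into train / validation halves.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X @ rng.standard_normal(50) + 0.5 * rng.standard_normal(200)
theta_hat, stopped_at = gd_with_early_stopping(X[:100], y[:100], X[100:], y[100:])
print("stopped at iteration", stopped_at)
```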
Figure: Black curve (early stopping is used), green curve (early stopping is not used)

Learning from noisy samples using early stopping
- Recap (least squares problem). Let $A \in \mathbb{R}^{m \times n}$ and set
  $$b = Ax^* + \xi, \qquad \xi \sim N(0, \sigma^2 I_d).$$
- Least squares problem: $\min \frac{1}{2}\|Ax - b\|^2$.
- Expected value of the error term:
  $$\mathbb{E}\,\|x_k - x^*\|^2 = (x^*)^T U \operatorname{diag}\big(\beta_1^{2k}, \cdots, \beta_n^{2k}\big) U^T x^* + \sigma^2 \sum_{i=1}^{n} \frac{\big(1 - \beta_i^k\big)^2}{\lambda_i}.$$
- Here $\beta_i = 1 - \alpha\lambda_i$, where $\lambda_1 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A^T A$. In particular, $A^T A = U \operatorname{diag}(\lambda_1, \cdots, \lambda_n) U^T$.

Preliminary results (joint work with S. Vavasis)
- Consider the case where $A$ is a de-blurring matrix and $x^*$ is a smooth signal. Also let $x_i^* := (Ux^*)_i$.
- Assumption: $\sum_{i > r} \lambda_i (x_i^*)^2 \leq \frac{1-\delta}{8} \cdot \sum_{i \leq r} \lambda_i (x_i^*)^2$.
- Denote the stopping time
  $$T := \inf\big\{k : \|Ax_k\|^2 \geq \delta\|b\|^2\big\}.$$
- Consider the GD iterates applied to the least squares problem, with the bias-variance decomposition $x_k - x^* = b_k + v_k$.
- Recall that $\mathbb{E}\,\|v_{k+1}\|^2 = \sigma^2 \sum_{i=1}^{n} \frac{(1-\beta_i^{k+1})^2}{\lambda_i} \leq \sigma^2 \sum_{i=1}^{n} \frac{1}{\lambda_i}$.

Figure: Eigenvalues of $A^T A$ (blue) and corresponding $x_i^*$ (green)

Cont.
- For $\sigma \leq O\big(\sqrt{\tfrac{1-\delta}{\delta}}\,\|Ax^*\|\big)$, the following is then true:
  1. $$\|b_T\|^2 \leq \sum_{i=1}^{n} \exp\Big(2\log\big(2(1-\delta)\big)\cdot\frac{\lambda_i}{\lambda_1}\Big)\,(x_i^*)^2.$$
  2. $$\|v_T\|^2 \leq \sigma^2 \sum_{i=1}^{n} \min\Big\{\frac{1}{\lambda_i},\ \Big(\log\frac{1-\delta}{8}\Big)^2 \cdot \frac{\lambda_i}{\lambda_r^2}\Big\} \quad \text{w.h.p.}$$
* Proof sketch:
  1. We define deterministic stopping times $\tau_1$ and $\tau_2$.
  2. Using Hanson-Wright type concentration inequalities, we show that w.h.p. $T$ lies in $[\tau_1, \tau_2]$.
  3. We use the assumption above to show that
     $$[\tau_1, \tau_2] \subseteq \Big\{k : -\log\big(2(1-\delta)\big)\cdot\frac{1}{\lambda_1} \leq k\alpha \leq -\log\Big(\frac{1-\delta}{8}\Big)\cdot\frac{1}{\lambda_r}\Big\}.$$
  4. We then obtain bounds for
     $$C_b := \sup_{k \geq \tau_1}\|b_k\| \quad \text{and} \quad C_v := \sup_{k \leq \tau_2}\|v_k\|.$$

Double descent
Figure: Double descent curve
- See Belkin et al. 2018.
- See Xie et al. 2020: implicit bias towards smooth interpolations leads to low generalization error.
- See Baratin et al. 2020.
- See Heckel et al. 2020.

Robustness to label noise via early stopping
Theorem (Oymak, Soltanolkotabi 2019). In over-parametrized neural networks, even with a corruption level $\rho$ smaller than $\frac{1}{16}$, gradient descent finds a model with perfect accuracy if early stopping is used.

Part IV: A termination criterion for SGD for binary classification
Collaborators: Stephen Vavasis, Courtney Paquette

Binary classification via expected loss
- The data comes from a mixture model with two means $\mu_0$ and $\mu_1$.
- Without much loss of generality, assume $\mu_0 + \mu_1 = 0$.
Figure: Re-centring

Binary classification via expected loss (cont.)
- We minimize the expected loss function
  $$f(\theta) = \mathbb{E}_{(\zeta, y)\sim P}\,\ell(\zeta^T\theta, y), \qquad \ell(x, y) = \begin{cases} \log(1 + \exp(x)) & y = 0 \\ \log(1 + \exp(-x)) & y = 1. \end{cases}$$
- We fold the data set in half and minimize
  $$\hat f(\theta) := \mathbb{E}_{\xi\sim\hat P}\,\hat\ell(\xi^T\theta), \qquad \hat\ell(x) := \log(1 + \exp(-x)).$$
  Denote $\theta^* := \operatorname{argmin} \hat f(\theta)$.

Model assumption
- We assume that the data is isotropically Gaussian distributed, i.e.
  $$\hat P \equiv N(\mu, \sigma^2 I_d).$$

Lemma (Classification of the optimal linear classifiers). The following is true:
$$\operatorname{argmax}_\theta\, P\big(\xi^T\theta > 0\big) = \mathbb{R}_{++}\cdot\theta^*.$$
More importantly, $\theta^* = \rho^*\mu$. In the case of the logistic loss, $\rho^*\sigma^2 = 2$.

SGD with termination
- Initialize $\theta_0 = 0$, $\alpha > 0$.
- Update as follows:
  $$\theta_{k+1} = \theta_k + \frac{\alpha\,\xi_{k+1}}{1 + \exp\big(\theta_k^T \xi_{k+1}\big)}.$$
- Terminate when
  $$\theta_k^T \xi_{k+1} \geq 1 \quad \text{(free!)} \tag{1}$$
* Criterion (1) is used in the experiments, but in the analysis we need the $\sigma$-algebras $\mathcal{F}(\theta_1, \theta_2, \cdots)$ and $\mathcal{F}(\xi_1, \xi_2, \cdots)$ to be independent. We therefore consider the following test instead:
  $$\theta_k^T \hat\xi_k \geq 1.$$
  Here $\hat\xi_1, \hat\xi_2, \cdots \sim \hat P$ are generated independently. Formally, set
  $$T := \inf\big\{k : \hat\xi_k^T \theta_k \geq 1\big\}.$$
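A minimal sketch of the update and the termination test above, assuming samples can be drawn from $\hat P = N(\mu, \sigma^2 I_d)$; the function name and the particular values of $\mu$, $\sigma$, and $\alpha$ are illustrative, not prescribed by the talk.

```python
import numpy as np

def sgd_with_termination(mu, sigma, alpha=0.1, max_iters=100000, rng=None):
    """SGD on the folded logistic loss E[log(1 + exp(-xi^T theta))],
    terminated the first time a fresh sample satisfies xi_hat^T theta_k >= 1."""
    rng = rng or np.random.default_rng()
    d = mu.shape[0]
    theta = np.zeros(d)                                   # theta_0 = 0
    for k in range(1, max_iters + 1):
        xi = mu + sigma * rng.standard_normal(d)          # training sample xi_k ~ N(mu, sigma^2 I_d)
        s = np.clip(theta @ xi, -50.0, 50.0)              # clip for numerical stability
        theta = theta + alpha * xi / (1.0 + np.exp(s))    # SGD step on log(1 + e^{-x})
        xi_hat = mu + sigma * rng.standard_normal(d)      # independent test sample xi_hat_k
        if xi_hat @ theta >= 1.0:                         # termination criterion
            return theta, k
    return theta, max_iters

# Toy usage in the low-noise regime sigma <= 0.33 * ||mu||.
mu = np.ones(20)
theta_T, T = sgd_with_termination(mu, sigma=0.3 * np.linalg.norm(mu), alpha=0.1)
print("terminated at iteration", T)
```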
Results
Theorem (Bound for the expected value of T).
1. (Low noise) Suppose that $\sigma \leq 0.33\|\mu\|$ and $\alpha > 0$ is arbitrary. The following bound is then true:
   $$\mathbb{E}[T] \leq 2 + \frac{2(c_1 + c_2 M)^2}{M}\cdot\Phi^c\Big(\frac{\|\mu\|}{\sigma}\Big) + \frac{\alpha\sigma^3}{\|\mu\|}\cdot\exp\Big(\frac{-\|\mu\|^2}{2\sigma^2}\Big) + 1.$$
   Here $M = \alpha\|\mu\|^2$.
2. (High noise) Suppose that $\sigma \geq 0.33\|\mu\|$ and $\alpha$ satisfies $\alpha \leq A\cdot\frac{\|\mu\|^2}{\sigma^2(\|\mu\|^2 + d\sigma^2)}$. It then holds that $\mathbb{E}[T] < +\infty$.

Proof sketch
- We develop Lyapunov functions in the following sense:

Theorem (Meyn and Tweedie 2012).
Assume that there exist a set $C \subseteq \mathbb{R}^d$ and a non-negative function $V : \mathbb{R}^d \to \mathbb{R}_+$ such that
$$\mathbb{E}\big[V(\theta_k)\,\big|\,\text{past}\big] \leq V(\theta_{k-1}) - 1 \quad \text{whenever } \theta_{k-1}\notin C. \tag{2}$$
The following is then true:
$$\mathbb{E}\big[\tau_1^C \,\big|\, \theta_0 = \theta\big] \leq V(\theta). \tag{3}$$
Here $\tau_1^C$ denotes the stopping time at which we first hit the set $C$. In particular, if we denote by $\tau_m^C$ the $m$-th time we hit $C$, it then holds that
$$\mathbb{E}\big[\tau_m^C\big] \leq O(m)$$
in the case where $C$ is compact.

Proof sketch (cont.)
- Further suppose that, for some $\delta > 0$, the test is activated with probability at least $\delta$ for any $\theta \in C$.

Lemma. The following is true:
$$\mathbb{E}[T] \leq \mathbb{E}[T_C] \leq \sum_{m=1}^{+\infty} \mathbb{E}\big[\tau_m^C\big]\,(1-\delta)^m.$$
Here $T_C$ denotes the first time the termination criterion triggers while, simultaneously, the iterate lies in $C$.

Proof sketch (cont.)
- We define the Lyapunov function and the target set $C$ in two different regimes.
- (High noise) If $\sigma \geq c\|\mu\|$, then
  $$C := \Big\{\theta : |\rho - \rho^*| < \tfrac{1}{2}\rho^*,\ \sigma\|\tilde\theta\| \leq c_0\Big\}, \qquad V(\theta) = \frac{1}{2\alpha}\|\theta - \theta^*\|^2,$$
  where $\theta = \rho\mu + \tilde\theta$.
- (Low noise) If $\sigma \leq c\|\mu\|$, then
  $$C = \{\theta : \mu^T\theta \geq 1\}, \qquad V(\theta) = \big(\tilde M - \mu^T\theta\big)^2,$$
  where $\tilde M = c_1 + c_2\alpha\|\mu\|^2$.
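The constants in the theorem ($c_1$, $c_2$, $A$) are not specified here, so the bound itself is not reproduced numerically. As a purely illustrative check that the stopping time is small on average in the low-noise regime and remains finite in the high-noise regime, one can estimate $\mathbb{E}[T]$ by Monte Carlo, reusing the `sgd_with_termination` sketch above; the values of $\mu$, $\sigma$, and $\alpha$ below are arbitrary choices, not the talk's.

```python
import numpy as np
# Reuses sgd_with_termination from the earlier sketch.

rng = np.random.default_rng(1)
mu = np.ones(20) / np.sqrt(20)                       # ||mu|| = 1
for sigma, alpha in [(0.3, 0.1), (1.0, 0.01)]:       # low- vs. high-noise regime (illustrative)
    stops = [sgd_with_termination(mu, sigma, alpha, rng=rng)[1] for _ in range(200)]
    print(f"sigma = {sigma}: estimated E[T] ~ {np.mean(stops):.1f}")
```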