Simple early stopping rules in first-order methods

Sina Baghal

University of Waterloo

RWTH Aachen University, Chair for Mathematics of Information Processing

Part I

Optimization in machine learning

Learning via optimization over data

* Empirical risk minimization:

  min_{θ∈D} f(θ) := (1/m) ∑_{i=1}^m ℓ(x_i, y_i, θ)

  (x_i, y_i) are feature–label pairs, x_i ∈ R^d, m = # samples, and ℓ is the loss (see the sketch below).
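As a minimal illustration of the objective above (not from the talk), the sketch below evaluates the empirical risk of a linear classifier under a logistic surrogate; the data and helper names are made up.

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """f(theta) = (1/m) * sum_i loss(x_i, y_i, theta)."""
    return np.mean([loss(x, yi, theta) for x, yi in zip(X, y)])

# Logistic surrogate for labels y in {-1, +1}: log(1 + exp(-y * x^T theta)).
logistic = lambda x, y, theta: np.logaddexp(0.0, -y * (x @ theta))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # m = 100 samples, d = 3
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))         # synthetic labels
print(empirical_risk(np.zeros(3), X, y, logistic))  # risk of the zero classifier: log 2
```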

* Neural networks

  min_θ f(θ) = (1/m) ∑_{i=1}^m (y_i − ℓ(x_i; θ))²

  * θ = (W_1, ···, W_L), θ ∈ D := R^{N_1×N_0} × ··· × R^{N_L×N_{L−1}}, N_0 = d, N_L = 1.

  * ℓ(·; θ) = W_L ∘ σ ∘ ··· ∘ σ ∘ W_1
  * σ(·) = max(0, ·), etc. (see the sketch below)
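A minimal sketch of the composition ℓ(·; θ) = W_L ∘ σ ∘ ··· ∘ σ ∘ W_1 with ReLU activations; the widths and random weights are illustrative placeholders.

```python
import numpy as np

def relu(z):                                   # sigma(.) = max(0, .)
    return np.maximum(0.0, z)

def network(x, weights):
    """l(x; theta) = W_L o sigma o ... o sigma o W_1 applied to x."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return (weights[-1] @ h).item()            # N_L = 1, so the output is a scalar

widths = [4, 8, 8, 1]                          # N_0 = d = 4, N_L = 1
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(widths[:-1], widths[1:])]
print(network(rng.normal(size=widths[0]), weights))
```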

Visualization of neural networks

Courtesy of infowatch.com

Example: Binary classification

* Given samples {(x_i, y_i) : i = 1, ···, m}, find a hyperplane H with normal vector h s.t.

  h* = argmin_h |{i : sign(x_i^T h) ≠ y_i}|

  Here

  ℓ(x_i, y_i, h) = 1 if sign(x_i^T h) ≠ y_i, and ℓ(x_i, y_i, h) = 0 if sign(x_i^T h) = y_i.

* Equivalently:

  min_{h∈S^{d−1}} (1/m) ∑_{i=1}^m ℓ(x_i, y_i, h)

Convex relaxation (from error to losses)

* Replace the non-convex loss function with a convex surrogate. Common loss functions for binary classification (see the sketch below):
  * Linear regression: ℓ(x^T h, y) = (x^T h − y)²
  * Hinge: ℓ(x^T h, y) = max{0, 1 − y x^T h}
  * Logistic: ℓ(x^T h, y) = log(1 + exp(−y x^T h))

Theorem (Bartlett et al., 2006)
Excess error ≤ Excess loss
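The surrogate losses listed above, written out as one-liners; labels are taken in {−1, +1} and the margin is s = x^T h, which is one common convention rather than necessarily the talk's.

```python
import numpy as np

# Margin s = x^T h, labels y in {-1, +1}.
squared  = lambda s, y: (s - y) ** 2                  # "linear regression" surrogate
hinge    = lambda s, y: max(0.0, 1.0 - y * s)
logistic = lambda s, y: np.logaddexp(0.0, -y * s)     # log(1 + exp(-y*s)), stable
zero_one = lambda s, y: float(np.sign(s) != y)        # the non-convex loss being relaxed

for s in (-2.0, 0.5, 2.0):
    print(f"s={s:+.1f}  0/1={zero_one(s, 1):.0f}  sq={squared(s, 1):.2f}  "
          f"hinge={hinge(s, 1):.2f}  log={logistic(s, 1):.3f}")
```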

Part II

First order methods

What is a first order method?

* First order methods used to solve the optimization problem

  min_{x∈D} f(x)

  only use f(x) and ∇f(x) to perform updates.
* Low iteration costs and low memory storage are the main advantages that make first-order methods suitable for the large-scale optimization problems that appear ubiquitously in machine learning and data science applications.

Gradient descent & stochastic gradient descent

* Gradient descent

  * Initialize at x_0 ∈ D
  * x_{k+1} = x_k − α∇f(x_k)
  * Terminate when ‖∇f(x_k)‖ ≤ ε

* Stochastic gradient descent

  min_θ f(θ) := (1/m) ∑_{i=1}^m f_i(θ)

  * Gradient evaluation cost: m times more
  * At each iteration, choose an index i at random and compute ∇f_i(θ) to obtain the descent direction(?) (see the sketch below)
  * Obvious case: f_1 = ··· = f_m
  * Less obvious case: E‖∇f(θ) − ∇f_i(θ)‖² ≤ σ²
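A minimal sketch, under an illustrative least-squares instance, contrasting the full-gradient update (with the gradient-norm termination test) and the single-sample SGD update; step sizes and the iteration budget are arbitrary choices.

```python
import numpy as np

# Toy least-squares objective f(theta) = (1/m) sum_i (x_i^T theta - y_i)^2.
rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def full_grad(theta):                        # uses all m samples
    return 2.0 * X.T @ (X @ theta - y) / m

def stoch_grad(theta):                       # uses a single random sample i
    i = rng.integers(m)
    return 2.0 * X[i] * (X[i] @ theta - y[i])

alpha, eps = 0.05, 1e-6
theta_gd = np.zeros(d)
while np.linalg.norm(full_grad(theta_gd)) > eps:   # terminate when ||grad f(x_k)|| <= eps
    theta_gd -= alpha * full_grad(theta_gd)

theta_sgd = np.zeros(d)
for _ in range(20_000):                      # SGD: a fixed budget instead of a gradient test
    theta_sgd -= (alpha / 50) * stoch_grad(theta_sgd)

print(np.linalg.norm(theta_gd - theta_sgd))  # the two solutions should be close
```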

Why SGD?

Figure: O(Rσ√(2/T) + R²/T) vs. O(mR²/T)

Courtesy of F. Bach

* Bounds for SGD hold in expectation.

Convergence analysis

* To achieve these bounds, we need to use specific step-size values (a major drawback).
* There is a dichotomy between tuning step sizes in theory and in practice (line search).
* Other methods for proving convergence: Lee et al. 2016, Soltanolkotabi 2017, Bah et al. 2019, ...

More first order methods

* Frank–Wolfe: projection free; useful when specific structure, e.g. sparsity, is desired.
* Variance reduction (mini-batching, SVRG, ...): consider batches of size b > 1 to reduce the variance, tweak the algorithm, ...
* Adaptive algorithms (Adagrad, Adam, ...): update each coordinate of the gradient with a separate rate (see the sketch below).
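A rough Adagrad-style sketch of a per-coordinate step size, shown on a badly scaled quadratic; the hyperparameters are illustrative and this is not any specific paper's variant.

```python
import numpy as np

def adagrad(grad_fn, theta0, alpha=0.5, eps=1e-8, iters=500):
    theta = theta0.astype(float).copy()
    g2_sum = np.zeros_like(theta)              # running sum of squared gradients
    for _ in range(iters):
        g = grad_fn(theta)
        g2_sum += g ** 2
        theta -= alpha * g / (np.sqrt(g2_sum) + eps)   # coordinate-wise rate
    return theta

# Usage on a badly scaled quadratic, where per-coordinate rates help.
A = np.diag([100.0, 1.0, 0.01])
grad = lambda th: A @ th
print(adagrad(grad, np.array([1.0, 1.0, 1.0])))
```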

* Bypass Hessian creation and inversion, and speed up Hessian–vector products.
* See Agarwal, Bullins, Hazan 2017.

Optimization is not all!

Optimization: why can we fit? Generalization: why can we predict?

Part III

Early stopping

Why early stopping?

* A form of regularization to avoid over-fitting when iterative methods such as gradient descent are used for learning. Up to a point, training improves performance on future data, i.e. low generalization error. After that, further reduction of the training error comes at the expense of worse generalization error.

Figure: Black curve (early stopping is used), green curve (early stopping is not used)

Learning from noisy samples using early stopping

* Recap (least squares problem). Let A ∈ R^{m×n} and set b = Ax* + ξ, where ξ ∼ N(0, σ²I).
* Least squares problem: min_x (1/2)‖Ax − b‖²
* Expected value of the error term:

  E‖x_k − x*‖² = (x*)^T U diag(β_1^{2k}, ···, β_n^{2k}) U^T x* + σ² ∑_{i=1}^n (1 − β_i^k)² / λ_i

* Here β_i = 1 − αλ_i, where λ_1 ≥ ··· ≥ λ_n are the eigenvalues of A^T A. In particular, A^T A = U diag(λ_1, ···, λ_n) U^T.

Preliminary results (joint work with S. Vavasis)

* Consider the case where A is a de-blurring matrix and x* is a smooth signal. Also let x*_i := (Ux*)_i.
* Assumption: ∑_{i>r} λ_i (x*_i)² ≤ (1−δ)/8 · ∑_{i≤r} λ_i (x*_i)².
* Denote the stopping time

  T := inf{k : ‖Ax_k‖² ≥ δ‖b‖²}

* Consider the GD iterates applied to the least squares problem, with the bias–variance decomposition x_k − x* = b_k + v_k.
* Recall that E[‖v_{+∞}‖²] = σ² ∑_{i=1}^n 1/λ_i. (A minimal simulation sketch of the stopping rule T follows.)
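A small simulation sketch of gradient descent on the least-squares problem with the stopping rule T = inf{k : ‖Ax_k‖² ≥ δ‖b‖²}; the matrix, signal, and noise level are toy choices rather than the de-blurring setup.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma, delta, alpha = 80, 40, 0.05, 0.9, 0.01
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_star = np.ones(n)                          # stand-in for a "smooth" signal
b = A @ x_star + sigma * rng.normal(size=m)  # noisy observations b = A x* + xi

x = np.zeros(n)
k = 0
while np.linalg.norm(A @ x) ** 2 < delta * np.linalg.norm(b) ** 2:
    x -= alpha * A.T @ (A @ x - b)           # gradient step on (1/2)||Ax - b||^2
    k += 1

print(f"stopped at T = {k}, error ||x_T - x*|| = {np.linalg.norm(x - x_star):.3f}")
```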

Figure: Eigenvalues of A^T A (blue) and corresponding x*_i (green)

Cont.

* For σ ≤ O(√((1−δ)/δ) · ‖Ax*‖), the following is then true:

  1. ‖b_T‖² ≤ ∑_{i=1}^n exp(2 log(2(1−δ)) · λ_i/λ_r) · (x*_i)²

  2. ‖v_T‖² ≤ σ² ∑_{i=1}^n min{1/λ_i, (log((1−δ)/8))² · λ_i/λ_r²}  w.h.p.

* Proof sketch:
  1. We define deterministic stopping times τ_1 and τ_2.
  2. Using Hanson–Wright type concentration inequalities, we show that w.h.p. T lies in [τ_1, τ_2].
  3. We use the first assumption to show that

     [τ_1, τ_2] ⊆ {k : −log(2(1−δ)) · 1/λ_1 ≤ kα ≤ −log((1−δ)/8) · 1/λ_r}

  4. We next obtain bounds for

  C_b := sup_{k≥τ_1} ‖b_k‖,  and  C_v := sup_{k≤τ_2} ‖v_k‖

Double descent

Figure: Double descent curve

* See Belkin et al. 2018
* See Xie et al. 2020: implicit bias towards smooth interpolations leads to low generalization error
* See Baratin et al. 2020
* See Heckel et al. 2020

Robustness to label noise via early stopping

Theorem (Oymak, Soltanolkotabi 2019)
In over-parametrized neural networks, even with a corruption level ρ smaller than 1/16, gradient descent finds a model with perfect accuracy if early stopping is used.

Part IV

A termination criterion for SGD for binary classification

Collaborators: Stephen Vavasis, Courtney Paquette

Binary classification via expected loss

* Data comes from a mixture model with two means μ_0 and μ_1.
* Without much loss of generality, assume μ_0 + μ_1 = 0.

Figure: Re-centring

Binary classification via expected loss (Cont.)

* We minimize the expected loss function:

  f(θ) = E_{(ζ,y)∼P} ℓ(ζ^T θ, y)

  where ℓ(x, y) = log(1 + exp(x)) if y = 0, and ℓ(x, y) = log(1 + exp(−x)) if y = 1.

* We fold the data set in half and minimize (see the sketch below)

  f̂(θ) := E_{ξ∼P̂} ℓ̂(ξ^T θ)

  where ℓ̂(x) := log(1 + exp(−x)). Denote

  θ* := argmin_θ f̂(θ)
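A minimal sketch of the folding step as I read it: with the loss above, ℓ(ζ^Tθ, 0) = ℓ̂((−ζ)^Tθ), so class-0 samples are negated and a single loss ℓ̂ is minimized; the data here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m = np.array([1.0, -0.5]), 1.0, 6
y = rng.integers(0, 2, size=m)                        # labels in {0, 1}
zeta = np.where(y[:, None] == 1, mu, -mu) + sigma * rng.normal(size=(m, 2))

xi = np.where(y[:, None] == 1, zeta, -zeta)           # folded samples, xi ~ P_hat

folded_loss = lambda theta: np.mean(np.logaddexp(0.0, -xi @ theta))
print(folded_loss(np.array([2.0, -1.0])))             # f_hat at an arbitrary theta
```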

Model assumption

* We assume that the data is isotropically Gaussian distributed, i.e. P̂ ≡ N(μ, σ²I_d).

Lemma (Classification of the optimal linear classifiers)
The following is true:

  argmax_θ P(ξ^T θ > 0) = R_{++} · θ*

More importantly, θ* = ρ*μ. In the case of the logistic loss, ρ*σ² = 2 (a quick numerical check follows).
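A quick Monte Carlo sanity check of ρ*σ² = 2 along the direction μ, assuming the isotropic Gaussian model above; the sample size, dimension, and σ are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
d, sigma = 5, 1.5
mu = rng.normal(size=d)
xi = mu + sigma * rng.normal(size=(200_000, d))    # xi ~ N(mu, sigma^2 I_d)
s = xi @ mu                                        # only mu^T xi matters along theta = rho*mu

def folded_loss(rho):                              # f_hat(rho*mu) = E log(1 + exp(-rho * mu^T xi))
    return np.mean(np.logaddexp(0.0, -rho * s))

rho_star = minimize_scalar(folded_loss, bounds=(1e-3, 10.0), method="bounded").x
print(rho_star * sigma**2)                         # should come out close to 2
```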

SGD with termination

* Initialize θ_0 = 0, α > 0.
* Update as follows:

  θ_{k+1} = θ_k + αξ_{k+1} / (1 + exp(θ_k^T ξ_{k+1}))

* Terminate when

  θ_k^T ξ_{k+1} ≥ 1    (1)

  — this test is free!
* Criterion (1) is used in the experiments, but in the analysis we need the σ-algebras F(θ_1, θ_2, ···) and F(ξ_1, ξ_2, ···) to be independent. We consider the following test instead:

  θ_k^T ξ̂_k ≥ 1

Here ξ̂_1, ξ̂_2, ··· ∼ P̂ are independently generated. Formally, set

  T := inf{k : ξ̂_k^T θ_k ≥ 1}
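A minimal sketch of the SGD iteration with the free termination test (1), run on synthetic folded samples ξ ~ N(μ, σ²I); μ, σ, and α are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, alpha = 10, 0.3, 0.1
mu = np.ones(d) / np.sqrt(d)                     # ||mu|| = 1, so sigma <= 0.33||mu|| (low noise)

theta = np.zeros(d)
for k in range(1, 100_001):
    xi = mu + sigma * rng.normal(size=d)         # fresh folded sample xi_{k+1}
    if theta @ xi >= 1.0:                        # test (1): reuses xi_{k+1}, so it costs nothing
        break
    theta += alpha * xi / (1.0 + np.exp(theta @ xi))

cos = theta @ mu / (np.linalg.norm(theta) * np.linalg.norm(mu) + 1e-12)
print(f"terminated at k = {k}, cosine similarity to mu = {cos:.3f}")
```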

Results

Theorem (Bound for expected value of T)
1. (Low noise) Suppose that σ ≤ 0.33‖μ‖ and α > 0 is arbitrary. The following bound is then true:

  E[T] ≤ 2 + (2(c_1 + c_2 M)² / M) · ( Φ^c(‖μ‖/σ) + (ασ³/‖μ‖) · exp(−‖μ‖²/(2σ²)) + 1 )

Here M = α‖μ‖².

2. (High noise) Suppose that σ ≥ 0.33‖μ‖ and α satisfies α ≤ A · ‖μ‖² / (σ²(‖μ‖² + dσ²)). It then holds that E[T] < +∞.

Proof sketch

* We develop Lyapunov functions in the following sense:

Theorem (Meyn and Tweedie 2012)
Assume that there exist a set C ⊆ R^d and a non-negative function V : R^d → R_+ such that

  E[V(θ_k) | past] ≤ V(θ_{k−1}) − 1  whenever θ_{k−1} ∉ C.    (2)

The following is then true:

  E[τ_1^C | θ_0 = θ] ≤ V(θ).    (3)

Here τ_1^C denotes the stopping time at which we hit the set C. In particular, if we denote by τ_m^C the m-th time we hit C, it then holds that E[τ_m^C] ≤ O(m) in the case where C is compact.

Proof sketch (Cont.)

* Further suppose that, for some δ > 0, the test is activated with probability at least δ for any θ ∈ C.

Lemma
The following is true:

  E[T] ≤ E[T_C] ≤ ∑_{m=1}^{+∞} E[τ_m^C] (1 − δ)^m.

Here T_C denotes the first time the termination criterion triggers while simultaneously the iterate lies in C.

Proof sketch (Cont.)

* We define the Lyapunov function and also the target set C in two different regimes.
* (High noise) If σ ≥ c‖μ‖, then

  C := {θ : |ρ − ρ*| < (1/2)ρ*, σ‖θ̃‖ ≤ c_0},  V(θ) = (1/(2α)) ‖θ − θ*‖²,

  where θ = ρμ + θ̃.

* (Low noise) If σ ≤ c‖μ‖, then

  C = {θ : μ^T θ ≥ 1},  V(θ) = (M̃ − μ^T θ)²,

  where M̃ = c_1 + c_2 α‖μ‖².

Proof sketch (Cont.)

* Recall that

  E[τ_1^C | θ_0 = θ] ≤ V(θ).    (4)

* In the low noise case, the target set is unbounded.
* To go from (4) to E[τ_m^C] ≤ O(m), it suffices to bound the following probability:

  p_i := P(−i − 1 ≤ μ^T θ_{k+1} ≤ −i | θ_k ∈ C)

* Conclude by using E[τ_1^C | θ_0 ∈ C] ≤ 1 + ∑_{i=1}^{+∞} (i + 1)² · p_i.

Results (Cont.)

Theorem (Bounds for expected angle)
Let v ∈ S^{d−1} be such that v^T θ* = 0. The following is then true:

  E[v^T θ_T] ≤ σα √(2/π) · E[T].

* Proof sketch:
  * Let X_k := v^T θ_k − kσα√(2/π).
  * Show E[|X_k − X_{k−1}| | past] ≤ C for some constant C > 0.
  * Observe E[X_k | past] ≤ X_{k−1}, i.e. {X_k}_{k=0}^{+∞} forms a super-martingale.
  * Conclude by appealing to the optional stopping theorem.

Small validation set (SVS)

* The small validation set test is often used to halt the training process in machine learning.

* Consider a validation set of p elements {(ξ_1, y_1), ···, (ξ_p, y_p)} and compute the correct classification ratio of θ_m every ℓ iterations. Halt if this ratio fails to improve (a minimal sketch follows).
* In contrast to our proposed test, SVS causes a computational overhead.
* The parameter p needs to be tuned.
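A rough sketch of the SVS rule as described above; the halting-on-first-non-improvement policy, the dummy update, and the data in the usage lines are illustrative placeholders.

```python
import numpy as np

def train_with_svs(update_fn, theta, val_xi, val_y, p=128, ell=50, max_iters=100_000):
    """Run update_fn and halt when the validation accuracy stops improving."""
    best_ratio = -np.inf
    for m in range(1, max_iters + 1):
        theta = update_fn(theta)                         # one training step
        if m % ell == 0:                                 # periodic validation pass
            pred = (val_xi[:p] @ theta > 0).astype(int)  # costs p extra inner products
            ratio = np.mean(pred == val_y[:p])
            if ratio <= best_ratio:                      # halt on the first non-improvement
                return theta, m
            best_ratio = ratio
    return theta, max_iters

# Illustrative usage with a dummy update and synthetic validation data.
rng = np.random.default_rng(0)
val_xi, val_y = rng.normal(size=(512, 5)) + 1.0, np.ones(512, dtype=int)
theta_hat, halted_at = train_with_svs(lambda th: th + 0.01, np.zeros(5), val_xi, val_y)
print(halted_at)
```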

Numeric

Figure: Our test vs. SVS for p = 32, 128, 512 on synthetic data

Figure: Our test vs. SVS for p=32,128,512 on the MNIST dataset Final remark

* We introduced a novel approach for developing early stopping rules in first-order methods. We considered two examples, binary classification and the least squares problem, where stochastic gradient descent and gradient descent were used respectively.

Thank you for your attention!