Simple early stopping rules in first-order methods

Sina Baghal

University of Waterloo

RWTH Aachen University, Chair for Mathematics of Information Processing

Part I

Optimization in machine learning

Learning via optimization over data

* Empirical risk minimization:

  min_{θ∈D} f(θ) := (1/m) ∑_{i=1}^m ℓ(x_i, y_i, θ)

  (x_i, y_i) are feature–label pairs, x_i ∈ R^d, m = # samples, and ℓ is the loss (see the sketch below).
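As a minimal illustration of the objective above (not from the talk), the sketch below evaluates the empirical risk of a linear classifier under a logistic surrogate; the data and helper names are made up.

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """f(theta) = (1/m) * sum_i loss(x_i, y_i, theta)."""
    return np.mean([loss(x, yi, theta) for x, yi in zip(X, y)])

# Logistic surrogate for labels y in {-1, +1}: log(1 + exp(-y * x^T theta)).
logistic = lambda x, y, theta: np.logaddexp(0.0, -y * (x @ theta))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # m = 100 samples, d = 3
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))         # synthetic labels
print(empirical_risk(np.zeros(3), X, y, logistic))  # risk of the zero classifier: log 2
```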

* Neural networks

  min_θ f(θ) = (1/m) ∑_{i=1}^m (y_i − ℓ(x_i; θ))²

  * θ = (W_1, ···, W_L), θ ∈ D := R^{N_1×N_0} × ··· × R^{N_L×N_{L−1}}, N_0 = d, N_L = 1.

  * ℓ(·; θ) = W_L ∘ σ ∘ ··· ∘ σ ∘ W_1
  * σ(·) = max(0, ·), etc. (see the sketch below)
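A minimal sketch of the composition ℓ(·; θ) = W_L ∘ σ ∘ ··· ∘ σ ∘ W_1 with ReLU activations; the widths and random weights are illustrative placeholders.

```python
import numpy as np

def relu(z):                                   # sigma(.) = max(0, .)
    return np.maximum(0.0, z)

def network(x, weights):
    """l(x; theta) = W_L o sigma o ... o sigma o W_1 applied to x."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return (weights[-1] @ h).item()            # N_L = 1, so the output is a scalar

widths = [4, 8, 8, 1]                          # N_0 = d = 4, N_L = 1
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(widths[:-1], widths[1:])]
print(network(rng.normal(size=widths[0]), weights))
```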

Visualization of neural networks

Courtesy of infowatch.com

Example: Binary classification

* Given samples {(x_i, y_i) : i = 1, ···, m}, find a hyperplane H with normal vector h s.t.

  h* = argmin_h |{i : sign(x_i^T h) ≠ y_i}|

  Here

  ℓ(x_i, y_i, h) = 1 if sign(x_i^T h) ≠ y_i, and ℓ(x_i, y_i, h) = 0 if sign(x_i^T h) = y_i.

* Equivalently:

  min_{h∈S^{d−1}} (1/m) ∑_{i=1}^m ℓ(x_i, y_i, h)

Convex relaxation (from error to losses)

* Replace the non-convex loss function with a convex surrogate. Common loss functions for binary classification (see the sketch below):
  * Linear regression: ℓ(x^T h, y) = (x^T h − y)²
  * Hinge: ℓ(x^T h, y) = max{0, 1 − y x^T h}
  * Logistic: ℓ(x^T h, y) = log(1 + exp(−y x^T h))

Theorem (Bartlett et al., 2006)
Excess error ≤ Excess loss
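The surrogate losses listed above, written out as one-liners; labels are taken in {−1, +1} and the margin is s = x^T h, which is one common convention rather than necessarily the talk's.

```python
import numpy as np

# Margin s = x^T h, labels y in {-1, +1}.
squared  = lambda s, y: (s - y) ** 2                  # "linear regression" surrogate
hinge    = lambda s, y: max(0.0, 1.0 - y * s)
logistic = lambda s, y: np.logaddexp(0.0, -y * s)     # log(1 + exp(-y*s)), stable
zero_one = lambda s, y: float(np.sign(s) != y)        # the non-convex loss being relaxed

for s in (-2.0, 0.5, 2.0):
    print(f"s={s:+.1f}  0/1={zero_one(s, 1):.0f}  sq={squared(s, 1):.2f}  "
          f"hinge={hinge(s, 1):.2f}  log={logistic(s, 1):.3f}")
```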

Part II

First order methods

What is a first order method?

* First order methods used to solve the optimization problem

  min_{x∈D} f(x)

  only use f(x) and ∇f(x) to perform updates.
* Low iteration costs and low memory storage are the main advantages that make first-order methods suitable for the large-scale optimization problems that appear ubiquitously in machine learning and data science applications.

Gradient descent & stochastic gradient descent

* Gradient descent

  * Initialize at x_0 ∈ D
  * x_{k+1} = x_k − α∇f(x_k)
  * Terminate when ‖∇f(x_k)‖ ≤ ε

* Stochastic gradient descent

  min_θ f(θ) := (1/m) ∑_{i=1}^m f_i(θ)

  * Gradient evaluation cost: m times more
  * At each iteration, choose an index i at random and compute ∇f_i(θ) to obtain the descent direction(?) (see the sketch below)
  * Obvious case: f_1 = ··· = f_m
  * Less obvious case: E‖∇f(θ) − ∇f_i(θ)‖² ≤ σ²
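A minimal sketch, under an illustrative least-squares instance, contrasting the full-gradient update (with the gradient-norm termination test) and the single-sample SGD update; step sizes and the iteration budget are arbitrary choices.

```python
import numpy as np

# Toy least-squares objective f(theta) = (1/m) sum_i (x_i^T theta - y_i)^2.
rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def full_grad(theta):                        # uses all m samples
    return 2.0 * X.T @ (X @ theta - y) / m

def stoch_grad(theta):                       # uses a single random sample i
    i = rng.integers(m)
    return 2.0 * X[i] * (X[i] @ theta - y[i])

alpha, eps = 0.05, 1e-6
theta_gd = np.zeros(d)
while np.linalg.norm(full_grad(theta_gd)) > eps:   # terminate when ||grad f(x_k)|| <= eps
    theta_gd -= alpha * full_grad(theta_gd)

theta_sgd = np.zeros(d)
for _ in range(20_000):                      # SGD: a fixed budget instead of a gradient test
    theta_sgd -= (alpha / 50) * stoch_grad(theta_sgd)

print(np.linalg.norm(theta_gd - theta_sgd))  # the two solutions should be close
```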

Why SGD?

Figure: O(Rσ√(2/T) + R²/T) vs. O(mR²/T)

Courtesy of F. Bach

* Bounds for SGD hold in expectation.

Convergence analysis

* To achieve these bounds, we need to use specific step-size values (a major drawback).
* There is a dichotomy between tuning step sizes in theory and in practice (line search).
* Other methods for proving convergence: Lee et al. 2016, Soltanolkotabi 2017, Bah et al. 2019, ...

More first order methods

* Frank–Wolfe: projection free; useful when specific structure, e.g. sparsity, is desired.
* Variance reduction (mini-batching, SVRG, ...): consider batches of size b > 1 to reduce the variance, tweak the algorithm, ...
* Adaptive algorithms (Adagrad, Adam, ...): update each coordinate of the gradient with a separate rate (see the sketch below).
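A rough Adagrad-style sketch of a per-coordinate step size, shown on a badly scaled quadratic; the hyperparameters are illustrative and this is not any specific paper's variant.

```python
import numpy as np

def adagrad(grad_fn, theta0, alpha=0.5, eps=1e-8, iters=500):
    theta = theta0.astype(float).copy()
    g2_sum = np.zeros_like(theta)              # running sum of squared gradients
    for _ in range(iters):
        g = grad_fn(theta)
        g2_sum += g ** 2
        theta -= alpha * g / (np.sqrt(g2_sum) + eps)   # coordinate-wise rate
    return theta

# Usage on a badly scaled quadratic, where per-coordinate rates help.
A = np.diag([100.0, 1.0, 0.01])
grad = lambda th: A @ th
print(adagrad(grad, np.array([1.0, 1.0, 1.0])))
```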

* Bypass Hessian creation and inversion, and speed up Hessian–vector products.
* See Agarwal, Bullins, Hazan 2017.

Optimization is not all!

Optimization: why can we fit? Generalization: why can we predict?

Part III

Early stopping

Why early stopping?

* A form of regularization to avoid over-fitting when iterative methods such as gradient descent are used for learning. Up to a point, training improves performance on future data, i.e. low generalization error. After that, further reduction of the training error comes at the expense of worse generalization error.

Figure: Black curve (early stopping is used), green curve (early stopping is not used)

Learning from noisy samples using early stopping

* Recap (least squares problem). Let A ∈ R^{m×n} and set b = Ax* + ξ, where ξ ∼ N(0, σ²I).
* Least squares problem: min_x (1/2)‖Ax − b‖²
* Expected value of the error term:

  E‖x_k − x*‖² = (x*)^T U diag(β_1^{2k}, ···, β_n^{2k}) U^T x* + σ² ∑_{i=1}^n (1 − β_i^k)² / λ_i

* Here β_i = 1 − αλ_i, where λ_1 ≥ ··· ≥ λ_n are the eigenvalues of A^T A. In particular, A^T A = U diag(λ_1, ···, λ_n) U^T.

Preliminary results (joint work with S. Vavasis)

* Consider the case where A is a de-blurring matrix and x* is a smooth signal. Also let x*_i := (Ux*)_i.
* Assumption: ∑_{i>r} λ_i (x*_i)² ≤ (1−δ)/8 · ∑_{i≤r} λ_i (x*_i)².
* Denote the stopping time

  T := inf{k : ‖Ax_k‖² ≥ δ‖b‖²}

* Consider the GD iterates applied to the least squares problem, with the bias–variance decomposition x_k − x* = b_k + v_k.
* Recall that E[‖v_{+∞}‖²] = σ² ∑_{i=1}^n 1/λ_i. (A minimal simulation sketch of the stopping rule T follows.)
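A small simulation sketch of gradient descent on the least-squares problem with the stopping rule T = inf{k : ‖Ax_k‖² ≥ δ‖b‖²}; the matrix, signal, and noise level are toy choices rather than the de-blurring setup.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma, delta, alpha = 80, 40, 0.05, 0.9, 0.01
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_star = np.ones(n)                          # stand-in for a "smooth" signal
b = A @ x_star + sigma * rng.normal(size=m)  # noisy observations b = A x* + xi

x = np.zeros(n)
k = 0
while np.linalg.norm(A @ x) ** 2 < delta * np.linalg.norm(b) ** 2:
    x -= alpha * A.T @ (A @ x - b)           # gradient step on (1/2)||Ax - b||^2
    k += 1

print(f"stopped at T = {k}, error ||x_T - x*|| = {np.linalg.norm(x - x_star):.3f}")
```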

Figure: Eigenvalues of A^T A (blue) and corresponding x*_i (green)

Cont.

* For σ ≤ O(√((1−δ)/δ) · ‖Ax*‖), the following is then true:

  1. ‖b_T‖² ≤ ∑_{i=1}^n exp(2 log(2(1−δ)) · λ_i/λ_r) · (x*_i)²

  2. ‖v_T‖² ≤ σ² ∑_{i=1}^n min{1/λ_i, (log((1−δ)/8))² · λ_i/λ_r²}  w.h.p.

* Proof sketch:
  1. We define deterministic stopping times τ_1 and τ_2.
  2. Using Hanson–Wright type concentration inequalities, we show that w.h.p. T lies in [τ_1, τ_2].
  3. We use the first assumption to show that

     [τ_1, τ_2] ⊆ {k : −log(2(1−δ)) · 1/λ_1 ≤ kα ≤ −log((1−δ)/8) · 1/λ_r}

  4. We next obtain bounds for

  C_b := sup_{k≥τ_1} ‖b_k‖,  and  C_v := sup_{k≤τ_2} ‖v_k‖

Double descent

Figure: Double descent curve

* See Belkin et al. 2018
* See Xie et al. 2020: implicit bias towards smooth interpolations leads to low generalization error
* See Baratin et al. 2020
* See Heckel et al. 2020

Robustness to label noise via early stopping

Theorem (Oymak, Soltanolkotabi 2019)
In over-parametrized neural networks, even with a corruption level ρ smaller than 1/16, gradient descent finds a model with perfect accuracy if early stopping is used.

Part IV

A termination criterion for SGD for binary classification

Collaborators: Stephen Vavasis, Courtney Paquette

Binary classification via expected loss

* Data comes from a mixture model with two means μ_0 and μ_1.
* Without much loss of generality, assume μ_0 + μ_1 = 0.

Figure: Re-centring

Binary classification via expected loss (Cont.)

* We minimize the expected loss function:

  f(θ) = E_{(ζ,y)∼P} ℓ(ζ^T θ, y)

  where ℓ(x, y) = log(1 + exp(x)) if y = 0, and ℓ(x, y) = log(1 + exp(−x)) if y = 1.

* We fold the data set in half and minimize (see the sketch below)

  f̂(θ) := E_{ξ∼P̂} ℓ̂(ξ^T θ)

  where ℓ̂(x) := log(1 + exp(−x)). Denote

  θ* := argmin_θ f̂(θ)
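A minimal sketch of the folding step as I read it: with the loss above, ℓ(ζ^Tθ, 0) = ℓ̂((−ζ)^Tθ), so class-0 samples are negated and a single loss ℓ̂ is minimized; the data here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m = np.array([1.0, -0.5]), 1.0, 6
y = rng.integers(0, 2, size=m)                        # labels in {0, 1}
zeta = np.where(y[:, None] == 1, mu, -mu) + sigma * rng.normal(size=(m, 2))

xi = np.where(y[:, None] == 1, zeta, -zeta)           # folded samples, xi ~ P_hat

folded_loss = lambda theta: np.mean(np.logaddexp(0.0, -xi @ theta))
print(folded_loss(np.array([2.0, -1.0])))             # f_hat at an arbitrary theta
```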

Model assumption

* We assume that the data is isotropically Gaussian distributed, i.e. P̂ ≡ N(μ, σ²I_d).

Lemma (Classification of the optimal linear classifiers)
The following is true:

  argmax_θ P(ξ^T θ > 0) = R_{++} · θ*

More importantly, θ* = ρ*μ. In the case of the logistic loss, ρ*σ² = 2 (a quick numerical check follows).
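A quick Monte Carlo sanity check of ρ*σ² = 2 along the direction μ, assuming the isotropic Gaussian model above; the sample size, dimension, and σ are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
d, sigma = 5, 1.5
mu = rng.normal(size=d)
xi = mu + sigma * rng.normal(size=(200_000, d))    # xi ~ N(mu, sigma^2 I_d)
s = xi @ mu                                        # only mu^T xi matters along theta = rho*mu

def folded_loss(rho):                              # f_hat(rho*mu) = E log(1 + exp(-rho * mu^T xi))
    return np.mean(np.logaddexp(0.0, -rho * s))

rho_star = minimize_scalar(folded_loss, bounds=(1e-3, 10.0), method="bounded").x
print(rho_star * sigma**2)                         # should come out close to 2
```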

SGD with termination

* Initialize θ_0 = 0, α > 0.
* Update as follows:

  θ_{k+1} = θ_k + αξ_{k+1} / (1 + exp(θ_k^T ξ_{k+1}))

* Terminate when

  θ_k^T ξ_{k+1} ≥ 1    (1)

  — this test is free!
* Criterion (1) is used in the experiments, but in the analysis we need the σ-algebras F(θ_1, θ_2, ···) and F(ξ_1, ξ_2, ···) to be independent. We consider the following test instead:

  θ_k^T ξ̂_k ≥ 1

Here ξ̂_1, ξ̂_2, ··· ∼ P̂ are independently generated. Formally, set

  T := inf{k : ξ̂_k^T θ_k ≥ 1}
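A minimal sketch of the SGD iteration with the free termination test (1), run on synthetic folded samples ξ ~ N(μ, σ²I); μ, σ, and α are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, alpha = 10, 0.3, 0.1
mu = np.ones(d) / np.sqrt(d)                     # ||mu|| = 1, so sigma <= 0.33||mu|| (low noise)

theta = np.zeros(d)
for k in range(1, 100_001):
    xi = mu + sigma * rng.normal(size=d)         # fresh folded sample xi_{k+1}
    if theta @ xi >= 1.0:                        # test (1): reuses xi_{k+1}, so it costs nothing
        break
    theta += alpha * xi / (1.0 + np.exp(theta @ xi))

cos = theta @ mu / (np.linalg.norm(theta) * np.linalg.norm(mu) + 1e-12)
print(f"terminated at k = {k}, cosine similarity to mu = {cos:.3f}")
```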

Results

Theorem (Bound for expected value of T)
1. (Low noise) Suppose that σ ≤ 0.33‖μ‖ and α > 0 is arbitrary. The following bound is then true:

  E[T] ≤ 2 + (2(c_1 + c_2 M)² / M) · ( Φ^c(‖μ‖/σ) + (ασ³/‖μ‖) · exp(−‖μ‖²/(2σ²)) + 1 )

Here M = α‖μ‖².

2. (High noise) Suppose that σ ≥ 0.33‖μ‖ and α satisfies α ≤ A · ‖μ‖² / (σ²(‖μ‖² + dσ²)). It then holds that E[T] < +∞.

Proof sketch

* We develop Lyapunov functions in the following sense:

Theorem (Meyn and Tweedie 2012)
Assume that there exist a set C ⊆ R^d and a non-negative function V : R^d → R_+ such that

  E[V(θ_k) | past] ≤ V(θ_{k−1}) − 1  whenever θ_{k−1} ∉ C.    (2)

The following is then true:

  E[τ_1^C | θ_0 = θ] ≤ V(θ).    (3)

Here τ_1^C denotes the stopping time at which we hit the set C. In particular, if we denote by τ_m^C the m-th time we hit C, it then holds that E[τ_m^C] ≤ O(m) in the case where C is compact.

Proof sketch (Cont.)

* Further suppose that, for some δ > 0, the test is activated with probability at least δ for any θ ∈ C.

Lemma
The following is true:

  E[T] ≤ E[T_C] ≤ ∑_{m=1}^{+∞} E[τ_m^C] (1 − δ)^m.

Here T_C denotes the first time the termination criterion triggers while simultaneously the iterate lies in C.

Proof sketch (Cont.)

* We define the Lyapunov function and also the target set C in two different regimes.
* (High noise) If σ ≥ c‖μ‖, then

  C := {θ : |ρ − ρ*| < (1/2)ρ*, σ‖θ̃‖ ≤ c_0},  V(θ) = (1/(2α)) ‖θ − θ*‖²,

  where θ = ρμ + θ̃.

* (Low noise) If σ ≤ c‖μ‖, then

  C = {θ : μ^T θ ≥ 1},  V(θ) = (M̃ − μ^T θ)²,

  where M̃ = c_1 + c_2 α‖μ‖².

Proof sketch (Cont.)

* Recall that

  E[τ_1^C | θ_0 = θ] ≤ V(θ).    (4)

* In the low noise case, the target set is unbounded.
* To go from (4) to E[τ_m^C] ≤ O(m), it suffices to bound the following probability:

  p_i := P(−i − 1 ≤ μ^T θ_{k+1} ≤ −i | θ_k ∈ C)

* Conclude by using E[τ_1^C | θ_0 ∈ C] ≤ 1 + ∑_{i=1}^{+∞} (i + 1)² · p_i.

Results (Cont.)

Theorem (Bounds for expected angle)
Let v ∈ S^{d−1} be such that v^T θ* = 0. The following is then true:

  E[v^T θ_T] ≤ σα √(2/π) · E[T].

* Proof sketch:
  * Let X_k := v^T θ_k − kσα√(2/π).
  * Show E[|X_k − X_{k−1}| | past] ≤ C for some constant C > 0.
  * Observe E[X_k | past] ≤ X_{k−1}, i.e. {X_k}_{k=0}^{+∞} forms a super-martingale.
  * Conclude by appealing to the optional stopping theorem.

Small validation set (SVS)

* The small validation set test is often used to halt the training process in machine learning.

* Consider a validation set of p elements {(ξ_1, y_1), ···, (ξ_p, y_p)} and compute the correct classification ratio of θ_m every ℓ iterations. Halt if this ratio fails to improve (a minimal sketch follows).
* In contrast to our proposed test, SVS causes a computational overhead.
* The parameter p needs to be tuned.
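A rough sketch of the SVS rule as described above; the halting-on-first-non-improvement policy, the dummy update, and the data in the usage lines are illustrative placeholders.

```python
import numpy as np

def train_with_svs(update_fn, theta, val_xi, val_y, p=128, ell=50, max_iters=100_000):
    """Run update_fn and halt when the validation accuracy stops improving."""
    best_ratio = -np.inf
    for m in range(1, max_iters + 1):
        theta = update_fn(theta)                         # one training step
        if m % ell == 0:                                 # periodic validation pass
            pred = (val_xi[:p] @ theta > 0).astype(int)  # costs p extra inner products
            ratio = np.mean(pred == val_y[:p])
            if ratio <= best_ratio:                      # halt on the first non-improvement
                return theta, m
            best_ratio = ratio
    return theta, max_iters

# Illustrative usage with a dummy update and synthetic validation data.
rng = np.random.default_rng(0)
val_xi, val_y = rng.normal(size=(512, 5)) + 1.0, np.ones(512, dtype=int)
theta_hat, halted_at = train_with_svs(lambda th: th + 0.01, np.zeros(5), val_xi, val_y)
print(halted_at)
```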

Numeric

Figure: Our test vs. SVS for p = 32, 128, 512 on synthetic data

Figure: Our test vs. SVS for p=32,128,512 on the MNIST dataset Final remark

* We introduced a novel approach for developing early stopping rules in first-order methods. We considered two examples, binary classification and the least squares problem, where stochastic gradient descent and gradient descent were used respectively.

Thank you for your attention!