<<

Nonlinear Optimization: A Series of Lecture Notes at Missouri S&T

Wenqing Hu ∗

Contents

1 Background on Machine Learning: Why Nonlinear Optimization?
  1.1 Empirical Risk Minimization
  1.2 Loss Functions
  1.3 Learning Models

2 Basic Information about Convex Functions
  2.1 Optimization problem and its solutions
  2.2 Convexity
  2.3 Taylor's theorem and convexity in Taylor's expansion
  2.4 Optimality Conditions

3 Gradient Descent and Line Search Methods
  3.1 Gradient Descent
  3.2 Back Propagation and GD on Neural Networks
  3.3 Online Principal Component Analysis (PCA)
  3.4 Line Search Methods
  3.5 Convergence to Approximate Second Order Necessary Points

4 Accelerated Gradient Methods
  4.1 Polyak's heavy ball method
  4.2 Nesterov's Accelerated Gradient Descent (AGD)
  4.3 Conjugate gradient method

∗Department of Mathematics and Statistics, Missouri University of Science and Technology (formerly University of Missouri, Rolla). Web: http://web.mst.edu/~huwen/ Email: [email protected]

5 Stochastic Gradient Methods
  5.1 SGD
  5.2 Variance-reduction methods: SVRG, SAG, SAGA, SARAH, SPIDER, STORM
  5.3 Escape from saddle points and nonconvex problems, SGLD, Perturbed GD
  5.4 Coordinate Descent and Stochastic Coordinate Descent

6 Second Order Methods
  6.1 Newton's method
  6.2 Quasi-Newton: BFGS
  6.3 Stochastic Quasi-Newton

7 Dual Methods
  7.1 Dual Coordinate Descent/Ascent
  7.2 SDCA

8 Optimization with Constraints
  8.1 Subgradient Projection Method, Frank-Wolfe Method
  8.2 ADMM

9 Adaptive Methods
  9.1 SGD with momentum
  9.2 AdaGrad, RMSProp, AdaDelta
  9.3 Adam

10 Optimization with Discontinuous Objective
  10.1 Proximal operators and proximal …

11 Optimization and Generalization
  11.1 Generalization Bounds in Statistical Machine Learning
  11.2 Training Matters: Connecting Optimization with Generalization
  11.3 Experiments: Real Data Sets

A Big-O and Little-O notations

1 Background on Machine Learning: Why Nonlinear Optimization?

1.1 Empirical Risk Minimization

Supervised Learning: Given training data points (x1, y1), ..., (xn, yn), construct a learning model y = g(x, ω) that best fits the training data. Here ω stands for the parameters of the learning model.

Here (x_i, y_i) comes from an independent identically distributed family (x_i, y_i) ∼ p(x, y), where p(x, y) is the joint density. The model x → y is a black-box p(y|x), which is to be fit by g(x, ω). "Loss function" L(g(x, ω), y); for example, it can be L(g(x, ω), y) = (g(x, ω) − y)². Empirical Risk Minimization (ERM):

ω*_n = arg min_ω (1/n) Σ_{i=1}^n L(y_i, g(x_i, ω)).   (1.1)

Regularized Empirical Risk Minimization (R-ERM):

ω*_n = arg min_ω (1/n) Σ_{i=1}^n L(y_i, g(x_i, ω)) + λR(ω).   (1.2)

In general let fi(ω) = L(yi, g(xi, ω)) or fi(ω) = L(yi, g(xi, ω)) + λR(ω), then the optimization problem is

ω*_n = arg min_ω (1/n) Σ_{i=1}^n f_i(ω).   (1.3)

Key features of the nonlinear optimization problems in machine learning: large-scale, nonconvex, ... etc. Key problems in machine learning: optimization combined with generalization.
"Population Loss": E L(g(x, ω), y), with minimizer

ω* = arg min_ω E L(g(x, ω), y).

"Generalization Error": E L(y, g(x, ω*_n)). Consistency: Do we have ω*_n → ω*? At what speed?
Key problems in optimization: convergence, acceleration, variance reduction.
How can optimization be related to generalization? Vapnik–Chervonenkis dimension, Rademacher complexity, geometry of the loss landscape.
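As an illustration of (1.1)–(1.3), the following NumPy sketch (not part of the original notes; the data, dimensions, and the choice of squared loss with an L²-penalty are assumptions made only for illustration) evaluates the regularized empirical risk for a linear model g(x, ω) = ω^T x.

import numpy as np

def empirical_risk(omega, X, y):
    # f_i(omega) = (g(x_i, omega) - y_i)^2 with the linear model g(x, omega) = omega^T x
    preds = X @ omega
    return np.mean((preds - y) ** 2)

# synthetic data: n samples in dimension d (illustrative only)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
omega_true = rng.normal(size=d)
y = X @ omega_true + 0.1 * rng.normal(size=n)

# regularized empirical risk (1.2) with R(omega) = ||omega||_2^2 and lam = 0.01
lam = 0.01
def regularized_risk(omega):
    return empirical_risk(omega, X, y) + lam * np.sum(omega ** 2)

print(regularized_risk(np.zeros(d)))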

1.2 Loss Functions

Classification Problems: Choice of Loss function L(y, g), y = 1, −1. 0/1 Loss:

`0/1(y, g) = 1 if yg < 0 and `0/1(y, g) = 0 otherwise. (1) Hinge Loss. L(y, g) = max(0, 1 − yg) ; (1.4)

(2) Exponential Loss. L(y, g) = exp(−yg) ; (1.5)

(3) Cross Entropy Loss.

L(y, g) = −[ I_{y=1} ln(e^g/(e^g + e^{−g})) + I_{y=−1} ln(e^{−g}/(e^g + e^{−g})) ].   (1.6)

Regression Problems: Choice of loss function L(y, g).
(1) L²-Loss: L(y, g) = |y − g|_2².   (1.7)
(2) L¹-Loss: L(y, g) = |y − g|_1.   (1.8)
(3) L⁰-Loss: L(y, g) = |y − g|_0.   (1.9)

Regularization (penalty) term R(ω): L¹-regularized R(ω) = |ω|_1; L⁰-regularized R(ω) = |ω|_0.
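For concreteness, a small NumPy sketch of the classification losses (1.4)–(1.6) and the 0/1 loss; the vectorized implementation and the test values are illustrative assumptions, not part of the notes.

import numpy as np

def loss_01(y, g):        # 1 if y*g < 0, else 0
    return (y * g < 0).astype(float)

def loss_hinge(y, g):     # max(0, 1 - y*g), equation (1.4)
    return np.maximum(0.0, 1.0 - y * g)

def loss_exp(y, g):       # exp(-y*g), equation (1.5)
    return np.exp(-y * g)

def loss_cross_entropy(y, g):
    # -[I{y=1} ln(e^g/(e^g+e^-g)) + I{y=-1} ln(e^-g/(e^g+e^-g))], equation (1.6)
    # equals log(1 + exp(-2*y*g)), a numerically convenient form
    return np.log1p(np.exp(-2.0 * y * g))

y = np.array([1.0, -1.0, 1.0])
g = np.array([0.5, 0.3, -2.0])
print(loss_01(y, g), loss_hinge(y, g), loss_exp(y, g), loss_cross_entropy(y, g))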

1.3 Learning Models

(1) Linear regression: g(x, ω) = ω^T x. Logistic regression: g(x, ω) = 1/(1 + exp(−ω^T x)).
Least squares problem:

min_ω (1/2m) Σ_{j=1}^m (x_j^T ω − y_j)² = (1/2m) ‖Aω − y‖_2²   (1.10)

Here A is the matrix whose rows are x_1^T, ..., x_m^T. Tikhonov regularization:

min_ω (1/2m) ‖Aω − y‖_2² + λ‖ω‖_2².   (1.11)

LASSO (Least Absolute Shrinkage and Selection Operator):

min_ω (1/2m) ‖Aω − y‖_2² + λ‖ω‖_1.   (1.12)

(2) Support Vector Machines (SVM): x_j ∈ R^n, y_j ∈ {1, −1}. The goal is to seek ω ∈ R^n and β ∈ R such that ω^T x_j + β ≥ 1 if y_j = 1 and ω^T x_j + β ≤ −1 if y_j = −1. Hinge loss: L(ω, β) = (1/m) Σ_{j=1}^m max(0, 1 − y_j(x_j^T ω + β)).
Classification Problem: The goal is to find a hyperplane ω^T x + β = 0 that classifies the two kinds of data points most efficiently. The signed distance of any point x ∈ R^n to the hyperplane is r = (ω^T x + β)/‖ω‖. If the classification is good enough, we expect to have ω^T x_j + β > 0 when y_j = 1 and ω^T x_j + β < 0 when y_j = −1. We can then formulate the problem as looking for optimal ω and β such that ω^T x_j + β ≥ 1 when y_j = 1 and ω^T x_j + β ≤ −1 when y_j = −1. The closest few data points that satisfy these two inequalities with equality are called "support vectors". The distance between the two parallel hyperplanes determined by support vectors of opposite type is 2/‖ω‖. So we have the constrained optimization problem

max_{ω∈R^n, β∈R} 2/‖ω‖ such that y_j(ω^T x_j + β) ≥ 1 for j = 1, 2, ..., m.

Or, in other words, we have the constrained optimization problem

min_{ω∈R^n, β∈R} (1/2)‖ω‖² such that y_j(ω^T x_j + β) ≥ 1 for j = 1, 2, ..., m.   (1.13)

"Soft margin": We allow the SVM to make errors on some training data points, but we want to minimize the error. In fact, we allow some training data to violate y_j(ω^T x_j + β) ≥ 1, so that ideally we minimize

min_{ω∈R^n, β∈R} (1/2)‖ω‖² + C Σ_{j=1}^m ℓ_{0/1}(y_j(ω^T x_j + β) − 1).

Here the 0/1 loss is `0/1(z) = 1 if z < 0 and `0/1(z) = 0 otherwise. We can then turn the 0/1 loss to hinge loss, that is why hinge loss comes in:

min_{ω∈R^n, β∈R} (1/2)‖ω‖² + C Σ_{j=1}^m max(0, 1 − y_j(ω^T x_j + β)).   (1.14)

“slack variables” ξi ≥ 0, “Soft margin SVM”

min_{ω∈R^n, β∈R} (1/2)‖ω‖² + C Σ_{j=1}^m ξ_j  s.t.  y_j(ω^T x_j + β) ≥ 1 − ξ_j, ξ_j ≥ 0, j = 1, 2, ..., m.   (1.15)
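The soft-margin objective (1.14) can be minimized by a simple subgradient method; the following NumPy sketch is only an illustration (the step size, synthetic data, and the subgradient choice at the kink are assumptions, not part of the notes).

import numpy as np

def svm_objective(w, beta, X, y, C):
    # (1/2)||w||^2 + C * sum_j max(0, 1 - y_j (w^T x_j + beta))
    margins = y * (X @ w + beta)
    return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0.0, 1.0 - margins))

def svm_subgradient_step(w, beta, X, y, C, eta=0.01):
    margins = y * (X @ w + beta)
    viol = margins < 1.0                                  # points violating the margin
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    return w - eta * grad_w, beta - eta * grad_b

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3)); y = np.sign(rng.normal(size=20))
w, beta = np.zeros(3), 0.0
for _ in range(50):
    w, beta = svm_subgradient_step(w, beta, X, y, C=1.0)
print(svm_objective(w, beta, X, y, C=1.0))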

(3) Neural Network: "activation function" σ.
Sigmoid:

σ(z) = 1/(1 + exp(−z));   (1.16)

tanh:

σ(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z));   (1.17)

ReLU (Rectified Linear Unit):

σ(z) = max(0, z) . (1.18)

Vector-valued activation: if z ∈ R^n then σ : R^n → R^n is defined by (σ(z))_i = σ(z_i), where each σ(z_i) is the scalar activation function. Fully connected neural network prediction function:

g(x, ω) = a^T σ( W^(H) σ( W^(H−1) ⋯ σ( W^(1) x + b_1 ) ⋯ + b_{H−1} ) + b_H ).   (1.19)

Optimization:

min_ω (1/2) Σ_{i=1}^n (g(x_i, ω) − y_i)².

Many other different network structures: convolutional, recurrent (gates: GRU, LSTM), ResNet, ... A two-layer (one hidden layer) fully connected ReLU neural network has a specific loss function structure: the weight matrix W^(1) has rows ω_1^T, ..., ω_m^T and

g(x, ω) = Σ_{r=1}^m a_r max(ω_r^T x, 0).   (1.20)

The optimization problem is given by

min_ω (1/2) Σ_{i=1}^n ( Σ_{r=1}^m a_r max(ω_r^T x_i, 0) − y_i )².

Non-convexity issues: see Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz, SGD learns over-parameterized networks that provably generalize on linearly separable data, arXiv:1710.10174, ICLR, 2018. Also see problem 1 in Assignment 1.
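A minimal NumPy sketch of the two-layer ReLU model (1.20) and its least-squares loss; the width m, the synthetic data, and the fixed outer weights a_r are illustrative assumptions.

import numpy as np

def relu_net(x, W, a):
    # g(x, omega) = sum_r a_r * max(omega_r^T x, 0), rows of W are omega_r^T
    return a @ np.maximum(W @ x, 0.0)

def loss(W, a, X, y):
    # (1/2) sum_i (g(x_i, omega) - y_i)^2
    preds = np.array([relu_net(x, W, a) for x in X])
    return 0.5 * np.sum((preds - y) ** 2)

rng = np.random.default_rng(2)
d, m, n = 4, 8, 30
W = rng.normal(size=(m, d))           # hidden weights omega_1, ..., omega_m
a = rng.choice([-1.0, 1.0], size=m)   # fixed outer weights a_r (assumption)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
print(loss(W, a, X, y))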

(4) Matrix Sensing: Each A_j is an n × p matrix, and we seek another n × p matrix X such that

min_X (1/2m) Σ_{j=1}^m (⟨A_j, X⟩ − y_j)²   (1.21)

where ⟨A, B⟩ = tr(A^T B). We can think of the A_j as "probing" the unknown matrix X.

(5) Nonnegative Matrix Factorization: If the full matrix Y ∈ R^{n×p} is observed, then we seek L ∈ R^{n×r} and R ∈ R^{p×r} such that

T 2 min kLR − Y kF subject to L ≥ 0 and R ≥ 0 . (1.22) L,R

Here ‖A‖_F = (Σ_i Σ_j |a_ij|²)^{1/2} is the Frobenius norm of a matrix A. See the following review paper: Chi, Y., Lu, Y.M., Chen, Y., Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview. IEEE Transactions on Signal Processing, Vol. 67, No. 20, October 15, 2019.
(6) Principal Component Analysis (PCA):

max_{v∈R^n} v^T S v such that ‖v‖_2 = 1, ‖v‖_0 ≤ k.   (1.23)

The objective function is convex, but if you take into account the constraints, then this problem becomes non-convex. See the one-dimensional example.
Online PCA: see C.J. Li, M. Wang, H. Liu, T. Zhang, Near-Optimal Stochastic Approximation for Online Principal Component Estimation. Mathematical Programming, 167(1):75–97, 2018.
(7) Sparse inverse covariance matrix estimation: Sample covariance matrix S = (1/(m−1)) Σ_{j=1}^m a_j a_j^T. S^{−1} = X. "Graphical LASSO":

min_{X ∈ R^{n×n} symmetric, X ≻ 0} ⟨S, X⟩ − ln det X + λ‖X‖_1   (1.24)

where ‖X‖_1 = Σ_{ij} |X_ij|.

2 Basic Information about Convex Functions

2.1 Optimization problem and its solutions

Notion of minimizers: f : D ⊂ R^n → R: local minimizer, global minimizer, strict local minimizer, isolated local minimizer. Constrained optimization problem

min_{x∈Ω} f(x),   (2.1)

where Ω ⊂ D ⊂ R^n is a closed set. Local Solution. Global Solution.

2.2 Convexity

Convex Set: x, y ∈ Ω ⟹ (1 − α)x + αy ∈ Ω for all α ∈ [0, 1]. Supporting hyperplane.
Given a convex set Ω ⊂ R^n, the projection operator P : R^n → Ω is given by

P(y) = arg min_{z∈Ω} ‖z − y‖_2².   (2.2)

P(y) is the point in Ω that is closest to y in the Euclidean norm.
Convex function: φ : R^n → R ∪ {±∞} such that

φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y)   (2.3)

for all x, y ∈ R^n and all α ∈ [0, 1]. Effective domain, epigraph, proper convex function, closed proper convex function.
Strongly convex function with modulus of convexity m: a convex function φ for which there exists some m > 0 such that

φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y) − (1/2)mα(1 − α)‖x − y‖_2²   (2.4)

for all x, y in the domain of φ.
Indicator function: If Ω ⊂ R^n is convex we define the indicator function

I_Ω(x) = 0 if x ∈ Ω, and I_Ω(x) = +∞ otherwise.   (2.5)

Constrained optimization problem min f(x) is the same thing as the unconstrained op- x∈Ω timization problem min[f(x) + IΩ(x)]. Theorem 2.1 (Minimizers for convex functions). If the function f is convex and the set Ω is closed and convex, then (a) Any local solution of (2.1) is also global; (b) The set of global solutions of (2.1) is a convex set.

2.3 Taylor's theorem and convexity in Taylor's expansion

Theorem 2.2 (Taylor's theorem). Given a continuously differentiable function f : R^n → R and given x, p ∈ R^n we have

f(x + p) = f(x) + ∫_0^1 ∇f(x + γp)^T p dγ,   (2.6)
f(x + p) = f(x) + ∇f(x + γp)^T p, for some γ ∈ (0, 1).   (2.7)

If f is twice continuously differentiable, then

∇f(x + p) = ∇f(x) + ∫_0^1 ∇²f(x + γp) p dγ,   (2.8)
f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇²f(x + γp) p, for some γ ∈ (0, 1).   (2.9)

L-Lipschitz: If f : R^n → R is such that

|f(x) − f(y)| ≤ Lkx − yk (2.10)

n for L > 0 and any x, y ∈ R , then f(x) is L–Lipschitz. Optimization literatures often assume that ∇f is L–Lipschitz

k∇f(x) − ∇f(y)k ≤ Lkx − yk . (2.11)

Theorem 2.3. (1) If f is continuously differentiable and convex then

f(y) ≥ f(x) + (∇f(x))T (y − x) (2.12) for any x, y ∈ dom(f). (2) If f is differentiable and m–strongly convex then m f(y) ≥ f(x) + (∇f(x))T (y − x) + ky − xk2 (2.13) 2 for any x, y ∈ dom(f). (3) If ∇f is uniformly Lipschitz continuous with Lipschitz constant L > 0 and f is convex then L f(y) ≤ f(x) + (∇f(x))T (y − x) + ky − xk2 (2.14) 2 for any x, y ∈ dom(f).

n Theorem 2.4. Suppose that the function f is twice continuously differentiable on R . Then (1) f is strongly convex with modulus of convexity m if and only if ∇2f(x)  mI for all x; (2) ∇f is Lipschitz continuous with Lipschitz constant L if and only if ∇2f(x)  LI for all x.

Proof. (Part (1)) Suppose that the function f is m-strongly convex, with the standard inequality (2.4) above. Set z_α = (1 − α)x + αy. Then by (2.4) we have

αf(y) − αf(x) ≥ f(z_α) − f(x) + (1/2)mα(1 − α)‖x − y‖².

Making use of Taylor's expansion,

αf(y) − αf(x) ≥ (∇f(x))^T(z_α − x) + O(‖z_α − x‖²) + (1/2)mα(1 − α)‖x − y‖².

Since z_α → x as α → 0, and z_α − x = α(y − x), we can let α → 0 to get from the above that for any x, y ∈ R^n

f(y) ≥ f(x) + (∇f(x))^T(y − x) + (m/2)‖y − x‖².   (2.15)

Take u ∈ R^n and α > 0, and consider the Taylor expansion

f(x + αu) = f(x) + α∇f(x)^T u + (1/2)α² u^T ∇²f(x + tαu) u   (2.16)

for some 0 ≤ t ≤ 1. We apply (2.15) with y = x + αu so that

f(x + αu) ≥ f(x) + α(∇f(x))^T u + (m/2)α²‖u‖².   (2.17)

Comparing (2.16) and (2.17) we see that for an arbitrary choice of u ∈ R^n we have

uT ∇2f(x + tαu)u ≥ mkuk2 .

This implies that ∇²f(x) ⪰ mI as claimed. This shows the "only if" part.
Indeed one can also show the "if" part. To this end assume that ∇²f(x) ⪰ mI. Then for any z ∈ R^n we have that (x − z)^T ∇²f(z + t(x − z)) (x − z) ≥ m‖x − z‖². Thus

f(x) = f(z) + (∇f(z))^T(x − z) + (1/2)(x − z)^T ∇²f(z + t(x − z)) (x − z)
     ≥ f(z) + (∇f(z))^T(x − z) + (m/2)‖x − z‖².   (2.18)

Similarly

f(y) = f(z) + (∇f(z))^T(y − z) + (1/2)(y − z)^T ∇²f(z + t(y − z)) (y − z)
     ≥ f(z) + (∇f(z))^T(y − z) + (m/2)‖y − z‖².   (2.19)

We consider (1 − α)·(2.18) + α·(2.19) and we set z = (1 − α)x + αy. This gives

(1 − α)f(x) + αf(y)
 ≥ (α + (1 − α))f(z) + (∇f(z))^T((1 − α)(x − z) + α(y − z)) + (m/2)[(1 − α)‖x − z‖² + α‖y − z‖²]
 = f(z) + (∇f(z))^T((1 − α)(x − z) + α(y − z)) + (m/2)[(1 − α)‖x − z‖² + α‖y − z‖²].   (2.20)

Since x − z = α(x − y) and y − z = (1 − α)(y − x), we see that (1 − α)(x − z) + α(y − z) = 0. Moreover, this means that

(1 − α)‖x − z‖² + α‖y − z‖² = [(1 − α)α² + α(1 − α)²]‖x − y‖² = α(1 − α)‖x − y‖².

From these we see that (2.20) is the same as saying

(1 − α)f(x) + αf(y) ≥ f((1 − α)x + αy) + (m/2)α(1 − α)‖x − y‖²,

which is (2.4).

Theorem 2.5 (Strongly convex functions have unique minimizers). Let f be differen- tiable and strongly convex with modulus m > 0. Then the minimizer x∗ of f exists and is unique.

More technical issues...

Theorem 2.6. Let f be convex and continuously differentiable and ∇f be L–Lipschitz. Then for any x, y ∈ dom(f) the following bounds hold: 1 f(x) + ∇f(x)T (y − x) + k∇f(x) − ∇f(y)k2 ≤ f(y) , (2.21) 2L

1 k∇f(x) − ∇f(y)k2 ≤ (∇f(x) − ∇f(y))T (x − y) ≤ Lkx − yk2 . (2.22) L In addition, let f be strongly convex with modulus m and unique minimizer x∗. Then for any x, y ∈ dom(f) we have that

f(y) − f(x) ≥ −(1/2m)‖∇f(x)‖².   (2.23)

Generalized Strong Convexity: (f* = f(x*))

k∇f(x)k2 ≥ 2m[f(x) − f ∗] for some m > 0 . (2.24)

It holds in situations other than when f is strongly convex.

2.4 Optimality Conditions

Smooth unconstrained problem

min f(x) , (2.25) x∈Rn where f(x) is a smooth function.

Theorem 2.7 (Necessary conditions for smooth unconstrained optimization). We have

9 (a) (first–order necessary condition) Suppose that f is continuously differentiable. Then if x∗ is a local minimizer of (2.25), then ∇f(x∗) = 0;

(b) (second-order necessary condition) Suppose that f is twice continuously differentiable. Then if x* is a local minimizer of (2.25), then ∇f(x*) = 0 and ∇²f(x*) is positive semidefinite.

Theorem 2.8 (Sufficient conditions for convex functions in smooth unconstrained opti- mization). If f is continuously differentiable and convex, then if ∇f(x∗) = 0, then x∗ is a global minimizer of (2.25). In addition, if f is strongly convex, then x∗ is the unique global minimizer.

Theorem 2.9 (Sufficient conditions for nonconvex functions in smooth unconstrained optimization). Suppose that f is twice continuously differentiable and that for some x∗ we have ∇f(x∗) = 0 and ∇2f(x∗) is positive definite, then x∗ is a strict local minimizer of (2.25).

3 Gradient Descent and Line Search Methods

3.1 Gradient Descent

Want to solve min_{x∈R^n} f(x).
Want to construct an approximating sequence {x^k} such that f(x^{k+1}) < f(x^k), k = 0, 1, 2, ....
"Descent direction": d ∈ R^n such that f(x + td) < f(x) for all t > 0 sufficiently small.

Say f is continuously differentiable

f(x + td) = f(x) + t∇f(x + γtd)T d for some γ ∈ (0, 1) .

If d^T ∇f(x) < 0, then d is a descent direction.
"Steepest Descent": inf_{‖d‖=1} d^T ∇f(x) = −‖∇f(x)‖, and the infimum is achieved when d = −∇f(x)/‖∇f(x)‖. This is called the Gradient Descent Method.

Algorithm 3.1 Gradient Descent
1: Input: initialization x⁰, stepsizes α_k > 0, k = 0, 1, 2, ...
2: for k = 0, 1, 2, ..., K − 1 do
3:   Calculate ∇f(x^k)
4:   x^{k+1} = x^k − α_k ∇f(x^k)
5: end for
6: Output: x^K
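A minimal NumPy implementation of Algorithm 3.1 on a toy strongly convex quadratic; the test function and the stepsize α = 1/L are chosen only for illustration.

import numpy as np

def gradient_descent(grad_f, x0, alpha, K):
    # Algorithm 3.1 with constant stepsize alpha
    x = np.array(x0, dtype=float)
    for _ in range(K):
        x = x - alpha * grad_f(x)
    return x

# toy strongly convex quadratic f(x) = 0.5 x^T Q x - b^T x
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_f = lambda x: Q @ x - b
L = np.linalg.eigvalsh(Q).max()        # Lipschitz constant of the gradient
x_star = np.linalg.solve(Q, b)
x_K = gradient_descent(grad_f, np.zeros(2), alpha=1.0 / L, K=200)
print(np.linalg.norm(x_K - x_star))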

Convergence of optimization algorithms:

∗ ∗ 2 ∗ kxT − x k ≤ ε(T ) or EkxT − x k ≤ ε(T ) or Ef(xT ) − f(x ) ≤ ε(T ) .

ε(T ) → 0 as T → ∞: convergence. Convergence Rates are measured by ln ε(T )

(1) If ln ε(T ) = O(−T ), then we say the algorithm has linear convergence rate;

(2) If ln ε(T ) decays slower than O(−T ), then we say the algorithm has sub–linear convergence rate;

(3) If ln ε(T) decays faster than O(−T), then we say the algorithm has super-linear convergence rate. In particular, if ln ln ε(T) = O(−T), then we say the algorithm has second-order convergence rate.

Theorem 3.1. Assume that f is twice continuously differentiable, m-strongly convex, and ∇f is L-Lipschitz, so that mI ⪯ ∇²f(x) ⪯ LI for all x ∈ R^n. Then the GD Algorithm 3.1 with constant stepsize α_k = α, where α ∈ [(1 − β)/m, (1 + β)/L] for some 0 ≤ β < 1, has a linear convergence rate.

Proof. Use a fixed point iteration. Let Φ(x) = x − α∇f(x), α > 0. Then x^{k+1} = Φ(x^k). Calculate ‖Φ(x) − Φ(z)‖ ≤ β‖x − z‖. In order to get ‖x^T − x*‖ ≤ ε we want T ≥ ln(‖x⁰ − x*‖/ε)/|ln β|.
We want −β ≤ 1 − αL ≤ 1 − αm ≤ β; the smallest such β is attained when 1 − αm = −(1 − αL), which gives β = (L − m)/(L + m) and α = 2/(L + m).
Standard procedure in optimization theory: make use of Taylor's expansion. Consider

f(x + αd) = f(x) + α∇f(x)^T d + α ∫_0^1 [∇f(x + γαd) − ∇f(x)]^T d dγ
 ≤ f(x) + α∇f(x)^T d + α ∫_0^1 ‖∇f(x + γαd) − ∇f(x)‖ ‖d‖ dγ   (3.1)
 ≤ f(x) + α∇f(x)^T d + (L/2)α²‖d‖².

Take d = −∇f(x^k) and x = x^k in (3.1); then α = 1/L minimizes the right-hand side of (3.1), and thus for GD in particular

f(x^{k+1}) ≤ f(x^k) + (−α + (L/2)α²)‖∇f(x^k)‖² = f(x^k) − (1/2L)‖∇f(x^k)‖².   (3.2)

Estimates of type (3.2) are a standard step in the convergence analysis of iterative optimization algorithms. Using it, for GD Algorithm 3.1 with α = 1/L we have:

• General nonconvex case: min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √(2L(f(x⁰) − f(x*))/T).
  Number of iterations 0 ≤ k ≤ T until ‖∇f(x^k)‖ ≤ ε is T ≥ 2L(f(x⁰) − f(x*))/ε², sublinear;

• Convex case: f(x^k) − f(x*) ≤ (L/2k)‖x⁰ − x*‖².
  Number of iterations 0 ≤ k ≤ T until f(x^k) − f(x*) ≤ ε is T ≥ (f(x⁰) − f(x*))/ε, sublinear;

• Strongly convex case: f(x^k) − f(x*) ≤ (1 − m/L)^k (f(x⁰) − f(x*)).
  Number of iterations 0 ≤ k ≤ T until f(x^k) − f(x*) ≤ ε is T ≥ (L/m) ln((f(x⁰) − f(x*))/ε), linear.

3.2 Back Propagation and GD on Neural Networks

Classical algorithm of training multi–layer neural networks: Back–propagation. Reference: Catherine F. Higham, Desmond J. Higham, : An Intro- duction for Applied Mathematicians, arXiv:1801.05894[math.HO] Back–propagation method goes as follows. Let us set

z[l] = W [l]a[l−1] + b[l] for l = 2, 3, ..., L , in which l stands for the number of layers in the neural network. The neural network iteration can be viewed as

a[l] = σ(z[l]) for l = 2, 3, ..., L .

We can view the neural network function as a chain with a[1] = x and a[L] = g(x; ω). Let the training data be one point (x, y). Then the loss function is given by

C = C(ω, b) = (1/2)‖y − a^[L]‖².

To perform the GD algorithm as in Algorithm 3.1, we shall need the partial derivatives of the loss function C with respect to the neural-network parameters ω and b, that is ∂C/∂ω^[l]_{jk} and ∂C/∂b^[l]_j. Set the error in the j-th neuron at layer l to be δ^[l] = (δ^[l]_j) (viewed as a column vector) with δ^[l]_j = ∂C/∂z^[l]_j. Then one can calculate directly to obtain the following lemma.

n n Lemma 3.2. If x, y ∈ R , let x ◦ y ∈ R to be the Hadamard product defined by (x ◦ y)i = xiyi. We can directly calculate

δ[L] = σ0(z[L]) ◦ (a[L] − y) , (3.3)

δ^[l] = σ′(z^[l]) ∘ (W^[l+1])^T δ^[l+1], for 2 ≤ l ≤ L − 1,   (3.4)
∂C/∂b^[l]_j = δ^[l]_j, for 2 ≤ l ≤ L,   (3.5)
∂C/∂ω^[l]_{jk} = δ^[l]_j a^[l−1]_k, for 2 ≤ l ≤ L.   (3.6)

Algorithm 3.2 Gradient Descent on Neural Networks
1: Input: Training data (x, y)
2: Forward pass: x = a^[1] → a^[2] → ... → a^[L−1] → a^[L]
3: Obtain δ^[L] from (3.3)
4: Backward pass: From (3.4) obtain δ^[L] → δ^[L−1] → δ^[L−2] → ... → δ^[2]
5: Calculation of the gradient of the loss C: Use (3.5) and (3.6)

Training GD on Over–parametrized Neural Networks has been understood only recently. See the following references: Zhang, X., Yu, Y., Wang, L., Gu, Q., Learning One–hidden–layer ReLU networks via Gradient Descent, arXiv:1806.07808v1[stat.ML] Du, S.S., Zhai, X., P´oczos, B., Singh, A., Gradient Descent Provably Optimizes Over–parameterized Neural Networks, arXiv:1810.02054v1[cs.LG] Allen–Zhu, Z., Li, Y., Song, Z., A Convergence Theory for Deep Learning via Over–Parametrization, arXiv:1811.03962v3[cs.LG] Du, S.S., Lee, J.D., Li, H., Wang, L., Zhai, X., Gradient Descent Finds Global Minima of Deep Neural Networks, arXiv:1811.03804v1[cs.LG]

Neural Tangent Kernel. See the following paper: Jacot, A., Gabriel, F., Hongler, C., Neural Tangent Kernel: Convergence and generalization in neural networks, NIPS, 2018.

Computational Graph and General Backpropagation for taking derivatives: basic idea in tensorflow. https://www.tensorflow.org/tutorials/customization/autodiff
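The following NumPy sketch implements the forward and backward passes of Algorithm 3.2 using formulas (3.3)–(3.6) for a small fully connected sigmoid network; the layer sizes and random data are illustrative assumptions, not part of the notes.

import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))           # sigmoid activation
sigma_prime = lambda z: sigma(z) * (1.0 - sigma(z))   # its derivative

def backprop(x, y, Ws, bs):
    # Ws[l], bs[l] hold W^[l], b^[l] for l = 2, ..., L (stored at indices 0, ..., L-2)
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):              # forward pass
        z = W @ a + b
        zs.append(z)
        a = sigma(z)
        activations.append(a)
    delta = sigma_prime(zs[-1]) * (activations[-1] - y)    # (3.3)
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):                   # backward pass
        grads_b[l] = delta                                 # (3.5)
        grads_W[l] = np.outer(delta, activations[l])       # (3.6)
        if l > 0:
            delta = sigma_prime(zs[l - 1]) * (Ws[l].T @ delta)   # (3.4)
    return grads_W, grads_b

rng = np.random.default_rng(3)
sizes = [3, 4, 2]                         # layer widths for a^[1], a^[2], a^[3]
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(2)]
bs = [rng.normal(size=sizes[i + 1]) for i in range(2)]
gW, gb = backprop(rng.normal(size=3), rng.normal(size=2), Ws, bs)
print([g.shape for g in gW])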

3.3 Online Principal Component Analysis (PCA)

Reference: Li, C.J., Wang, M., Liu, H. and Zhang, T., Near-optimal stochastic approximation for online principal component estimation, Mathematical Programming, Ser. B (2018) 167:75–97.
Let X be a random vector in R^d with mean 0 and unknown covariance matrix

Σ = E[XX^T] ∈ R^{d×d}.

Let the eigenvalues of Σ be ordered as λ_1 > λ_2 ≥ ... ≥ λ_d ≥ 0. Principal Component Analysis (PCA) aims to find the principal eigenvector of Σ that corresponds to the largest eigenvalue λ_1, based on independent and identically distributed sample realizations X^(1), ..., X^(n). This can be cast into a nonconvex stochastic optimization problem given by

maximize u^T E[XX^T] u subject to ‖u‖ = 1, u ∈ R^d.

Here ‖·‖ denotes the Euclidean norm. Assume the principal component u* is unique.
Classical PCA:

û^(n) = arg max_{‖u‖=1} u^T Σ̂^(n) u.

Here Σb(n) is the empirical covariance matrix based on n samples:

Σ̂^(n) = (1/n) Σ_{i=1}^n X^(i)(X^(i))^T.

Online PCA: Π(u) = u/‖u‖ is the retraction onto the unit sphere.

Algorithm 3.3 Online PCA Algorithm 1: Input: Initialize u(0) and choose the stepsize β > 0 2: for n = 1, 2, ..., N do 3: Draw one sample X(n) from the (streaming) data source 4: Update the iterate u(n) by n o u(n) = Π u(n−1) + βX(n)(X(n))T u(n−1) .

5: end for 6: Output: u(N)
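A NumPy sketch of Algorithm 3.3 (an Oja-type online PCA update) on a synthetic stream; the covariance matrix, stepsize β, and sample size are illustrative assumptions.

import numpy as np

def online_pca(stream, beta, d):
    u = np.ones(d) / np.sqrt(d)          # initialize u^(0) on the unit sphere
    for x in stream:                     # one sample per step
        u = u + beta * x * (x @ u)       # gradient-type update u + beta * X X^T u
        u = u / np.linalg.norm(u)        # retraction Pi(u) = u / ||u||
    return u

rng = np.random.default_rng(4)
d, N = 5, 5000
# covariance with a dominant first principal direction e_1 (assumption)
cov = np.diag([5.0, 1.0, 0.8, 0.5, 0.2])
stream = rng.multivariate_normal(np.zeros(d), cov, size=N)
u_hat = online_pca(stream, beta=0.01, d=d)
print(np.abs(u_hat[0]))    # close to 1 if the leading eigenvector is recovered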

Oja’s algorithm. See Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3), 267-273.

3.4 Line Search Methods

Line search methods in general

k+1 k k x = x + αkd , k = 0, 1, 2, ... (3.7)

Here α_k > 0 is the stepsize and d^k ∈ R^n is the descent direction. Preliminary conditions: for some ε̄, γ_1, γ_2 > 0,

(1) −(d^k)^T ∇f(x^k) ≥ ε̄ ‖∇f(x^k)‖ · ‖d^k‖;   (3.8)
(2) γ_1‖∇f(x^k)‖ ≤ ‖d^k‖ ≤ γ_2‖∇f(x^k)‖.

k k Notice that if d = −∇f(x ), thenε ¯ = γ1 = γ2 = 1. In fact one can expand

f(x^{k+1}) = f(x^k + αd^k)
 = f(x^k) + α∇f(x^k)^T d^k + α ∫_0^1 [∇f(x^k + γαd^k) − ∇f(x^k)]^T d^k dγ   (3.9)
 ≤ f(x^k) − α(ε̄ − (L/2)αγ_2) ‖∇f(x^k)‖ · ‖d^k‖.

So if α ∈ (0, 2ε̄/(Lγ_2)), then f(x^{k+1}) < f(x^k).
Schemes of descent type produce iterates that satisfy a bound of the form

f(xk+1) ≤ f(xk) − Ck∇f(xk)k2 for some C > 0 . (3.10)

Various Line Search Methods: (1) d^k = −S^k∇f(x^k), S^k symmetric positive definite, λ(S^k) ∈ [γ_1, γ_2]; (2) Gauss–Southwell; (3) Stochastic coordinate descent; (4) Stochastic Gradient Methods; (5) Exact line search min_{α>0} f(x^k + αd^k); (6) Approximate Line Search, "weak Wolfe conditions"; (7) ...

3.5 Convergence to Approximate Second Order Necessary Points

Second–order necessary point: ∇f(x∗) = 0 and ∇2f(x∗) is positive semidefinite.

n Definition 3.1. We call point x ∈ R an (εg, εH )–approximate second order stationary point if 2 k∇f(x)k ≤ εg and λmin(∇ f(x)) ≥ −εH (3.11) for some choices of small constants εg > 0 and εH > 0.

How to search for such a second–order stationary point? Set M > 0 to be the Lipschitz bound for the Hessian ∇2f(x), so that

k∇2f(x) − ∇2f(y)k ≤ Mkx − yk for all x, y ∈ dom(f) . (3.12)

Cubic upper bound

1 1 f(x + p) ≤ f(x) + ∇f(x)T p + pT ∇2f(x)p + Mkpk3 . (3.13) 2 6 For steepest descent step we have from (3.2) that

f(x^{k+1}) ≤ f(x^k) − (1/2L)‖∇f(x^k)‖² ≤ f(x^k) − ε_g²/(2L).

Otherwise, the negative curvature (Hessian eigenvector) step will yield, by using the third-order Taylor expansion and (3.12), that

16 Algorithm 3.4 Search for second–order approximate stationary point based on Gradi- ent Descent 1: Input: Input initialization x0, k = 0, 1, 2, ... k 2: while k∇f(x )k > εg or λk < −εH do k 3: if k∇f(x )k > εg then 1 4: xk+1 = xk − ∇f(xk) L 5: else k 2 k 6: λ := λmin(∇ f(x )) 7: Choose pk to be the eigenvector corresponding to the most negative eigenvalue of ∇2f(xk) 8: Choose the size and sign of pk such that kpkk = 1 and (pk)T ∇f(xk) ≤ 0 2λk 9: xk+1 = xk + α pk, α = k k M 10: end if 11: k ← k + 1 12: end while k 13: Output: An estimate of an (εg, εH )–approximate second order stationary point x

f(x^{k+1}) ≤ f(x^k) + α_k ∇f(x^k)^T p^k + (1/2)α_k²(p^k)^T ∇²f(x^k) p^k + (1/6)Mα_k³‖p^k‖³
 ≤ f(x^k) − (1/2)(2|λ_k|/M)²|λ_k| + (1/6)M(2|λ_k|/M)³
 = f(x^k) − (2/3)|λ_k|³/M²
 ≤ f(x^k) − (2/3)ε_H³/M².

Number of iterations:

K ≤ max(2Lε_g^{−2}, (3/2)M²ε_H^{−3}) · (f(x⁰) − f(x*)).

4 Accelerated Gradient Methods

4.1 Polyak’s heavy ball method

Why acceleration in GD? The GD method is "greedy" in the sense that it steps in the direction that is most productive at the current iterate; it makes no explicit use of knowledge of f at earlier iterations. Adding "momentum" means the search direction tends to be similar to the one used in the previous step.
Say f(x) = (1/2)x^T Q x − b^T x + c is a quadratic function. It may well happen in deep neural network models that Q has a large condition number κ = L/m, see this paper:
Chaudhari, P., Soatto, S., Stochastic Gradient Descent performs variational inference, converges to limit cycles for deep networks, ICLR, 2018.
Slow convergence for κ large: for the strongly convex case we showed that β = 1 − 1/κ, which → 1 when κ → ∞. Demonstrate a picture of oscillations of GD near a local min when the Hessian has a large condition number.
Motivated from classical mechanics, a second order equation models a nonlinear oscillator:

GD: dx/dt = −∇f(x);   AGD: μ d²x/dt² = −∇f(x) − µb dx/dt.

"Small mass limit" μ → 0. The term μ d²x/dt² serves as "regularization".
Finite difference approximation:

μ (x(t + Δt) − 2x(t) + x(t − Δt))/(Δt)² ≈ −∇f(x(t)) − µb (x(t + Δt) − x(t))/Δt,

i.e.

x(t + Δt) = x(t) − α∇f(x(t)) + β(x(t) − x(t − Δt)).

"Heavy ball method" (see Polyak, B., Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, Volume 4, Issue 5, 1964, pp. 1–17.)

xk+1 = xk − α∇f(xk) + β(xk − xk−1) , x−1 = x0 .

Algorithm 4.1 Polyak's heavy ball method (GD with momentum)
1: Input: initialization x^{−1} = x⁰, p^{−1} = 0, stepsizes α, β > 0, k = 0, 1, 2, ...
2: for k = 0, 1, 2, ..., K − 1 do
3:   p^k = −α∇f(x^k) + βp^{k−1}
4:   x^{k+1} = x^k + p^k
5: end for
6: Output: x^K

"Momentum": p^k = x^{k+1} − x^k. The momentum iteration

p^k = βp^{k−1} − α∇f(x^k)

can be understood as an exponential moving average of the gradients. This is where the "memory" or "inertia" comes in, which inspires the adaptive methods.
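A NumPy sketch of Algorithm 4.1 on an ill-conditioned quadratic; the parameter choice below is one classical heavy-ball tuning for quadratics and is an assumption, not prescribed by the notes.

import numpy as np

def heavy_ball(grad_f, x0, alpha, beta, K):
    # Algorithm 4.1: p^k = -alpha*grad f(x^k) + beta*p^{k-1}, x^{k+1} = x^k + p^k
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)
    for _ in range(K):
        p = -alpha * grad_f(x) + beta * p
        x = x + p
    return x

Q = np.diag([100.0, 1.0])               # ill-conditioned quadratic, kappa = 100
b = np.array([1.0, 1.0])
grad_f = lambda x: Q @ x - b
L, m = 100.0, 1.0
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2                       # assumption
beta = ((np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))) ** 2
x_K = heavy_ball(grad_f, np.zeros(2), alpha, beta, K=300)
print(np.linalg.norm(x_K - np.linalg.solve(Q, b)))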

4.2 Nesterov’s Accelerated Gradient Descent (AGD)

“first order optimal method”: Instead of evaluating ∇f(xk), we evaluate ∇f(xk + β(xk − xk−1)). Nesterov’s accelerated gradient descent:

xk+1 = xk − α∇f(xk + β(xk − xk−1)) + β(xk − xk−1) , x−1 = x0 .

Algorithm 4.2 Nesterov’s accelerated gradient descent −1 0 1: Input: Input initialization x = x , stepsizes αk, βk > 0, k = 0, 1, 2, ... 2: for k = 0, 1, 2, ..., K − 1 do k k k k−1 3: y = x + βk(x − x ) k+1 k k 4: x = y − αk∇f(y ) 5: end for 6: Output: Output xK

By introducing the momentum variable p^k = x^{k+1} − x^k we can also rewrite Nesterov's accelerated gradient descent in terms of momentum; this is Nesterov's momentum algorithm:

Algorithm 4.3 Nesterov’s momentum (equivalent to Nesterov’s accelerated gradient descent) −1 0 −1 1: Input: Input initialization x = x , p = 0, stepsizes αk, βk > 0, k = 0, 1, 2, ... 2: for k = 0, 1, 2, ..., K − 1 do k k−1 k k−1 3: p = βkp − αk∇f(x + βkp ) 4: xk+1 = xk + pk 5: end for 6: Output: Output xK
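A NumPy sketch of Algorithm 4.2 with constant stepsizes; the choices α = 1/L and β = (√κ − 1)/(√κ + 1) anticipate Theorem 4.1 below, and the test problem is an illustrative assumption.

import numpy as np

def nesterov_agd(grad_f, x0, alpha, beta, K):
    # Algorithm 4.2 with constant stepsizes alpha, beta
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(K):
        y = x + beta * (x - x_prev)
        x_prev, x = x, y - alpha * grad_f(y)
    return x

Q = np.diag([100.0, 1.0]); b = np.array([1.0, 1.0])
grad_f = lambda x: Q @ x - b
L, m = 100.0, 1.0
kappa = L / m
alpha = 1.0 / L
beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
x_K = nesterov_agd(grad_f, np.zeros(2), alpha, beta, K=300)
print(np.linalg.norm(x_K - np.linalg.solve(Q, b)))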

Convergence proof of Nesterov’s method: (1) Convex quadratic objective function: Spectral method.

f(x) = (1/2)x^T Q x − b^T x + c,  Q ≻ 0,  0 < m = λ_n ≤ λ_{n−1} ≤ ... ≤ λ_2 ≤ λ_1 = L.

Condition number κ = L/m.
The minimizer of f is x* = Q^{−1}b and ∇f(x) = Qx − b = Q(x − x*). "Spectral method":

x^{k+1} − x* = (x^k − x*) − αQ[x^k + β(x^k − x^{k−1}) − x*] + β[(x^k − x*) − (x^{k−1} − x*)].

So we have

[x^{k+1} − x*; x^k − x*] = [ (1 + β)(I − αQ), −β(I − αQ); I, 0 ] [x^k − x*; x^{k−1} − x*],  k = 1, 2, ...

Set ω^k = (x^{k+1} − x*; x^k − x*) with x^{−1} = x⁰, and set T = [ (1 + β)(I − αQ), −β(I − αQ); I, 0 ]; then ω^k = Tω^{k−1}, so ω^k = T^k ω⁰.

1 h p i ν = (1 + β)(1 − αλ ) + i 4β(1 − αλ ) − (1 + β)2(1 − αλ )2 , i,1 2 i i i 1 h p i ν = (1 + β)(1 − αλ ) − i 4β(1 − αλ ) − (1 + β)2(1 − αλ )2 . i,2 2 i i i 1 In particular, the spectral radius ρ(T ) ≡ lim kT kk1/k ≤ 1 − √ . k→∞ κ By Theorem 4.1, we conclude for any ε > 0 that

‖ω^k‖ ≤ C‖ω⁰‖(ρ(T) + ε)^k.

Comparison of the rates with GD: when α = 1/L, GD needs O((L/m)|ln ε|) iterations to reach f(x^k) − f* ≤ ε, while Nesterov needs O(√(L/m)|ln ε|) to reach a factor ε in ‖ω^k‖. The key parameter is κ = L/m.
(2) Strongly convex objective functions: Lyapunov's method.
Lyapunov function V : R^d → R, V(ω) > 0 for all ω ≠ ω* ∈ R^d, V(ω*) = 0, V(ω^{k+1}) < ρ²V(ω^k) for 0 < ρ < 1. Indeed, f(x) − f* is a Lyapunov function for GD. What is the Lyapunov function for Nesterov?
Set x̃^k = x^k − x*, ỹ^k = y^k − y*, and set

V_k = f(x^k) − f(x*) + (L/2)‖x̃^k − ρ²x̃^{k−1}‖².

Calculations... It is possible to show that α_k = 1/L, β_k = (√κ − 1)/(√κ + 1), ρ² = 1 − 1/√κ are such that V_{k+1} ≤ ρ²V_k. This is to say we have

f(x^k) − f* ≤ (1 − 1/√κ)^k [ f(x_0) − f* + (m/2)‖x_0 − x*‖² ],

where one can verify that V_0 = f(x_0) − f* + (m/2)‖x_0 − x*‖².
(3) Convex objective functions: Lyapunov's method.
Set α_k = 1/L but let β_k be variable, and thus ρ_k is variable:

V_k = f(x^k) − f* + (L/2)‖x̃^k − ρ²_{k−1}x̃^{k−1}‖².

(Weakly) convex objective: m = 0. Obtain

V_{k+1} ≤ ρ_k²V_k + R_k^{(weak)},

where

R_k^{(weak)} = (L/2)‖ỹ^k − ρ_k²x̃^k‖² − (ρ_k²L/2)‖x̃^k − ρ²_{k−1}x̃^{k−1}‖².

Recursive relation: ρ_k² = (1 − ρ_k)²/(1 − ρ_{k−1})², k = 1, 2, .... We have 1 + β_k − ρ_k² = ρ_k and β_k = ρ_kρ_{k−1}², which makes R_k^{(weak)} = 0.
So V_k ≤ ρ²_{k−1}V_{k−1} and thus

V_k ≤ ρ²_{k−1}ρ²_{k−2}...ρ_1²V_1 = (1 − ρ_{k−1})²/(1 − ρ_0)² · V_1.

Bounding V_1 ≤ (L/2)‖x⁰ − x*‖² and showing by induction that 1 − ρ_k ≤ 2/(k + 2), we finally obtain the rate

f(x^k) − f* ≤ V_k ≤ 2L/(k + 1)² · ‖x⁰ − x*‖².

See Nesterov, Y., A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Math Dokl, 27(2), 1983.

Algorithm 4.4 Nesterov's optimal algorithm for a general convex function f
1: Input: initialization x^{−1} = x⁰, β_0 = ρ_0 = 0, and Lipschitz constant L for ∇f
2: for k = 0, 1, 2, ..., K − 1 do
3:   y^k = x^k + β_k(x^k − x^{k−1})
4:   x^{k+1} = y^k − (1/L)∇f(y^k)
5:   Define ρ_{k+1} ∈ [0, 1] to be the root of the quadratic equation

       1 + ρ_{k+1}(ρ_k² − 1) − ρ_{k+1}² = 0

6:   Set β_{k+1} = ρ_{k+1}ρ_k²
7: end for
8: Output: x^K

Connection with differential equations: Su, W., Boyd, S., Candès, E.J., A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17:143, 2016. This suggests choosing β_k = (k − 1)/(k + 2).
Su–Boyd–Candès differential equation:

Ẍ_t + (3/t)Ẋ_t + ∇f(X_t) = 0.

Nesterov's method achieves the first order optimal convergence rate: Example. See Nesterov, Y., Introductory Lectures on Convex Optimization, Springer, 2004, §2.1.2. Indeed, one can construct a carefully designed function for which no method that makes use of all gradients observed up to and including iteration k (namely ∇f(x^i), i = 0, 1, 2, ..., k) can produce a sequence {x^k} that achieves a rate better than the order O(1/k²).
Set f(x) = (1/2)x^T A x − e_1^T x, where

 2 −1 0 0 ...... 0   1      −1 2 −1 0 ...... 0   0      A =  ......  , e1 =  0  .          0 0 0 0 −1 2 −1 ... 0 0 0 0 0 −1 2 0

The optimization problem min_{x∈R^n} f(x) has a solution x* such that Ax* = e_1, and its components are x*_i = 1 − i/(n + 1) for i = 1, 2, ..., n. Let the starting point be x⁰ = 0, and the iterate

x^{k+1} = x^k + Σ_{j=0}^k γ_j ∇f(x^j)

for some coefficients γ_j, j = 0, 1, ..., k. Then each iterate x^k can have nonzero entries only in its first k components. Thus

‖x^k − x*‖² ≥ Σ_{j=k+1}^n (x*_j)² = Σ_{j=k+1}^n (1 − j/(n + 1))².

One can show that for k = 1, 2, ..., n/2 − 1 we have

‖x^k − x*‖² ≥ (1/8)‖x⁰ − x*‖².

This will lead to the bound that

f(x^k) − f(x*) ≥ 3L/(32(k + 1)²) · ‖x⁰ − x*‖²,  k = 1, 2, ..., n/2 − 1,

where L = ‖A‖_2.

4.3 Conjugate gradient method

The conjugate gradient method is a more classical method in computational mathematics. Unlike Nesterov's accelerated gradient method, the conjugate gradient method does not require knowledge of the convexity parameters L and m to compute the appropriate step sizes. Unfortunately, it does not admit a particularly rigorous analysis when the objective function f is not quadratic.
Originally this method comes from numerical linear algebra: the goal is to solve Qx = p. But this can be formulated as the optimization problem min_x f(x) with f(x) = (1/2)x^T Q x − p^T x + c.

Algorithm 4.5 Conjugate Gradient Method
1: Input: initialization x⁰, y⁰ = −∇f(x⁰), f(x) = (1/2)x^T Q x − p^T x + c, k = 0, 1, 2, ...
2: for k = 0, 1, 2, ..., K − 1 do
3:   r^k = Qx^k − p
4:   α_k = ⟨y^k, r^k⟩/⟨y^k, Qy^k⟩
5:   x^{k+1} = x^k − α_k y^k
6:   β_{k+1} = ⟨r^{k+1}, Qy^k⟩/⟨y^k, Qy^k⟩
7:   y^{k+1} = −∇f(x^{k+1}) + β_{k+1} y^k
8: end for
9: Output: x^K

Exact line search in the direction y^k: α_k = arg min_α f(x^k − αy^k).
Conjugate gradient direction: β_{k+1} is chosen so that ⟨y^{k+1}, Qy^k⟩ = 0.
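A NumPy sketch of the conjugate gradient method for f(x) = (1/2)x^T Q x − p^T x; it is written in the standard residual/direction form (an equivalent presentation chosen for clarity, not a literal transcription of the listing above), and the test problem is an illustrative assumption.

import numpy as np

def conjugate_gradient(Q, p, x0, K, tol=1e-10):
    # standard CG for Qx = p, i.e. minimizing f(x) = 0.5 x^T Q x - p^T x
    x = np.array(x0, dtype=float)
    r = p - Q @ x                  # negative gradient -grad f(x)
    d = r.copy()                   # initial search direction
    for _ in range(K):
        if np.linalg.norm(r) < tol:
            break
        Qd = Q @ d
        alpha = (r @ r) / (d @ Qd)          # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Qd
        beta = (r_new @ r_new) / (r @ r)    # makes the new direction Q-conjugate
        d = r_new + beta * d
        r = r_new
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]]); p = np.array([1.0, 2.0])
x_hat = conjugate_gradient(Q, p, np.zeros(2), K=10)
print(np.linalg.norm(Q @ x_hat - p))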

5 Stochastic Gradient Methods

5.1 SGD

Stochastic Gradient Descent (SGD) is a stochastic analogue of the Gradient Descent algorithm, aiming at finding the local or global minimizers of the function expectation parameterized by some random variable. Schematically, the algorithm can be interpreted as targeting at finding a local minimum point of the expectation of function

f(x) = E f(x; ζ),   (5.1)

where the index random variable ζ follows some prescribed distribution D, and the weight vector x ∈ R^d. SGD updates via the iteration

xk+1 = xk − αkg(xk; ζk) , (5.2) where the noisy gradient g(x; ζ) = ∇f(x; ζ), αk > 0 is the , and ζk are i.i.d. random variables that have the same distribution as ζ ∼ D. In particular, in the case of training a deep neural network, the random variable ζ samples size m (m ≤ n) mini–batches B uniformly from an index set {1, 2, ..., n}:

B ⊂ {1, 2, ..., n} and |B| = m. In this case, given loss functions f1(x), ..., fn(x) on training data, we have

f(x; ζ) = (1/m) Σ_{i∈B} f_i(x)  and  f(x) = (1/n) Σ_{i=1}^n f_i(x).

Algorithm 5.1 Stochastic Gradient Descent 1: Input: Input initialization x0, index random variable ζ ∼ D, noisy gradient

g(x; ζ) = ∇f(x; ζ) such that Eg(x; ζ) = ∇f(x), learning rate αk > 0, k = 0, 1, 2, ... 2: for k = 0, 1, 2, ..., K − 1 do

3: Sample the i.i.d. random variable ζk ∼ D

4: Calculate the noisy gradient g(xk; ζk) = ∇f(xk; ζk)

5: xk+1 = xk − αkg(xk; ζk) 6: end for

7: Output: Output xK
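A NumPy sketch of Algorithm 5.1 with "sample without replacement" minibatches on a least-squares finite sum; the batch size and decaying learning rate schedule are illustrative assumptions.

import numpy as np

def sgd(grad_fi, n, x0, alpha, K, batch_size=8, seed=0):
    # Algorithm 5.1 with minibatches of size m = batch_size
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for k in range(K):
        batch = rng.choice(n, size=batch_size, replace=False)
        g = np.mean([grad_fi(x, i) for i in batch], axis=0)   # noisy gradient g(x; zeta)
        x = x - alpha(k) * g
    return x

rng = np.random.default_rng(5)
n, d = 200, 5
A = rng.normal(size=(n, d)); y = rng.normal(size=n)
grad_fi = lambda x, i: (A[i] @ x - y[i]) * A[i]       # gradient of f_i(x) = 0.5 (a_i^T x - y_i)^2
alpha = lambda k: 0.1 / (1.0 + 0.01 * k)              # decaying learning rate (assumption)
x_hat = sgd(grad_fi, n, np.zeros(d), alpha, K=2000)
print(np.linalg.norm(A.T @ (A @ x_hat - y)) / n)      # size of the full gradient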

Convergence analysis of SGD.  ∗ 2 f is strongly convex: measure E kxk − x k ; f is convex but not necessarily ∗ strongly convex: measure f(xk) − f(x ).

Assumption 5.1. We assume that for some L_g > 0 and B > 0 we have

E_ζ[‖g(x; ζ)‖_2²] ≤ L_g²‖x − x*‖² + B²   (5.3)

for all x.

Remark 5.1. In the mini-batch stochastic gradient case, g(x; ζ) = (1/m) Σ_{i∈B} ∇f_i(x). Then we have

E_ζ‖g(x; ζ)‖_2² = E‖(1/m) Σ_{i∈B} ∇f_i(x)‖²
 = E‖(1/m) Σ_{i∈B} (∇f_i(x) − ∇f_i(x*)) + (1/m) Σ_{i∈B} ∇f_i(x*)‖²
 ≤ 2E‖(1/m) Σ_{i∈B} (∇f_i(x) − ∇f_i(x*))‖² + 2E‖(1/m) Σ_{i∈B} ∇f_i(x*)‖²
 ≤ 2E[ Σ_{i∈B} (1/m)‖∇f_i(x) − ∇f_i(x*)‖² ] + B²
 ≤ 2E‖∇f_i(x) − ∇f_i(x*)‖² + B².

So we can pick L_g = √2 L_{∇f}. If L_g = 0 in (5.3), then f cannot be m-strongly convex over an unbounded domain, since by Jensen's inequality ‖∇f(x)‖² = ‖E g(x; ξ)‖² ≤ E[‖g(x; ξ)‖²], but ⟨x − x*, ∇f(x)⟩ = ⟨x − x*, ∇f(x) − ∇f(x*)⟩ = ⟨x − x*, ∫_0^1 ∇²f(x* + γ(x − x*))(x − x*)dγ⟩ ≥ m‖x − x*‖² for all x.

Square expansion for a stochastic iteration:

‖x_{k+1} − x*‖² = ‖x_k − α_k g(x_k; ζ_k) − x*‖²
 = ‖x_k − x*‖² − 2α_k⟨g(x_k; ζ_k), x_k − x*⟩ + α_k²‖g(x_k; ζ_k)‖².   (5.4)

Set a_k = E[‖x_k − x*‖²]; then

a_{k+1} ≤ (1 + α_k²L_g²)a_k − 2α_kE[⟨∇f(x_k), x_k − x*⟩] + α_k²B².   (5.5)

Case 1: L_g = 0. Average output: x̄_K = (Σ_{j=0}^K α_j x_j)/(Σ_{j=0}^K α_j).
Set α_k = α > 0 and α_opt = D_0/(B√(T + 1)), θ = α/α_opt. Then

E[f(x̄_T) − f(x*)] ≤ (1/2)(θ + θ^{−1}) · BD_0/√(T + 1).

Case 2: B = 0. Assume that ⟨∇f(x), x − x*⟩ ≥ m‖x − x*‖² for some m > 0. Then if α_k = α ∈ (0, 2m/L_g²), we obtain a linear rate of convergence. The optimal choice is α = m/L_g²:

a_k ≤ (1 − m²/L_g²)^k D_0² < ε  ⟺  T ≥ ⌈(L_g²/m²) ln(D_0²/ε)⌉.

Case 3: B > 0 and L_g > 0, and f is strongly convex. Choose α ∈ (0, 2m/L_g²) and α_k = α:

a_k ≤ (1 − 2mα + α²L_g²)^k D_0² + αB²/(2m − αL_g²).

"Threshold value" αB²/(2m − αL_g²). "Epoch doubling", "hyper-parameter tuning".

5.2 Variance–reduction methods: SVRG, SAG, SAGA, SARAH, SPI- DER, STORM

Variance Reduction Methods: The goal is to reduce the variance produced by the stochastic terms in the iteration scheme.
SVRG: Johnson, R., Zhang, T., Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, NIPS, 2013.
The goal is to minimize min_ω P(ω) where P(ω) = (1/n) Σ_{i=1}^n ψ_i(ω).
SGD: draw i_t randomly from {1, ..., n} and set

ω^(t) = ω^(t−1) − η_t ∇ψ_{i_t}(ω^(t−1)).

How to reduce the variance? We can keep a snapshot ω̃ after every m SGD iterations, and we maintain the average gradient

μ̃ = ∇P(ω̃) = (1/n) Σ_{i=1}^n ∇ψ_i(ω̃).

Variance-reduced update:

ω^(t) = ω^(t−1) − η_t (∇ψ_{i_t}(ω^(t−1)) − ∇ψ_{i_t}(ω̃) + μ̃).

Algorithm 5.2 Stochastic Variance Reduced Gradient
1: Input: initialization ω̃_0, update frequency m and learning rate η
2: for s = 1, 2, ... do
3:   ω̃ = ω̃_{s−1}
4:   μ̃ = (1/n) Σ_{i=1}^n ∇ψ_i(ω̃)
5:   ω_0 = ω̃
6:   for t = 1, 2, ..., m do
7:     Sample i_t independently and uniformly from {1, ..., n}
8:     Update the weight ω_t = ω_{t−1} − η(∇ψ_{i_t}(ω_{t−1}) − ∇ψ_{i_t}(ω̃) + μ̃)
9:   end for
10:  ω̃_s ← ω_t for randomly chosen t ∈ {0, ..., m − 1}
11: end for
12: Output: ω̃_s
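A NumPy sketch of Algorithm 5.2 on a least-squares finite sum; the epoch length m and learning rate η are illustrative assumptions.

import numpy as np

def svrg(grad_psi, n, w0, eta, m, epochs, seed=0):
    # Algorithm 5.2: each epoch keeps a snapshot w_tilde and its full gradient mu_tilde
    rng = np.random.default_rng(seed)
    w_tilde = np.array(w0, dtype=float)
    for _ in range(epochs):
        mu_tilde = np.mean([grad_psi(w_tilde, i) for i in range(n)], axis=0)
        w = w_tilde.copy()
        iterates = []
        for _ in range(m):
            i = rng.integers(n)
            v = grad_psi(w, i) - grad_psi(w_tilde, i) + mu_tilde   # variance-reduced gradient
            w = w - eta * v
            iterates.append(w.copy())
        w_tilde = iterates[rng.integers(m)]     # random iterate of the epoch
    return w_tilde

rng = np.random.default_rng(6)
n, d = 200, 5
A = rng.normal(size=(n, d)); y = rng.normal(size=n)
grad_psi = lambda w, i: (A[i] @ w - y[i]) * A[i]
w_hat = svrg(grad_psi, n, np.zeros(d), eta=0.05, m=2 * n, epochs=20)
print(np.linalg.norm(A.T @ (A @ w_hat - y)) / n)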

Convergence analysis of SVRG. Follow the original SVRG paper.

Theorem 5.2. Consider the SVRG algorithm 5.2 and assume that each ∇ψ_i is L-Lipschitz and convex, and the function P(ω) = (1/n) Σ_{i=1}^n ψ_i(ω) is γ-strongly convex. Set ω* = arg min_ω P(ω) and assume that the epoch length m is so large that

α = 1/(γη(1 − 2Lη)m) + 2Lη/(1 − 2Lη) < 1;

then we have linear convergence in expectation of the SVRG algorithm 5.2, i.e.

E[P(ω̃_s) − P(ω*)] ≤ α^s [P(ω̃_0) − P(ω*)].

SAG: Roux, Nicolas L., Mark Schmidt, Francis R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets. NIPS, 2012.
Guaranteed linear convergence rate for the strongly convex objective P(w) = (1/n) Σ_{i=1}^n f_i(w).

n 1 P Algorithm 5.3 Stochastic Average Gradient: minimizing P (w) = fi(x) n i=1 1: Input: Input initialization x0, stepsize α > 0, d = 0, yi = 0, i = 1, 2, ..., n 2: for k = 0, 1, ..., K − 1 do 3: Sample i uniformly randomly from {1, 2, ..., n}

4: d ← d − yi + ∇fi(x)

5: yi = ∇fi(x) α 6: x ← x − d n 7: end for

8: Output: xK

27 SAGA: Defazio, A., Bach, F., Lacoste-Julien, S., SAGA: A Fast Incremental Gra- dient Method With Support for Non-Strongly Convex Composite Objectives, NIPS, 2014. n 1 P Guaranteed linear convergence rate for strongly convex objective P (w) = fi(w). n i=1

Algorithm 5.4 SAGA: minimizing P(w) = (1/n) Σ_{i=1}^n f_i(w)
1: Input: initial vector x⁰ ∈ R^d and known gradients ∇f_i(φ_i⁰) ∈ R^d with φ_i⁰ = x⁰ for each i. These derivatives are stored in an n × d matrix.
2: for k = 0, 1, ..., K − 1 do
3:   Sample j uniformly randomly from {1, 2, ..., n}
4:   Take φ_j^{k+1} = x^k and store ∇f_j(φ_j^{k+1}) in the table. All other entries in the table remain unchanged. The quantity φ_j^{k+1} is not explicitly stored.
5:   Update x^{k+1} using ∇f_j(φ_j^{k+1}), ∇f_j(φ_j^k) and the table average:

       w^{k+1} = x^k − γ[ ∇f_j(φ_j^{k+1}) − ∇f_j(φ_j^k) + (1/n) Σ_{i=1}^n ∇f_i(φ_i^k) ],
       x^{k+1} = prox_γ^h(w^{k+1}) = arg min_{x∈R^d} { h(x) + (1/2γ)‖x − w^{k+1}‖² }.

6: end for
7: Output: x^K

SARAH: Lam M. Nguyen, Jie Liu, Katya Scheinberg, Martin Tak´aˇc,SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient, Proceedings of the 34th International Conference on Machine Learning, PMLR 70:2613- 2621, 2017. n 1 P Guaranteed linear convergence rate for strongly convex objective P (w) = fi(w). n i=1

28 Algorithm 5.5 StochAstic Recursive grAdient algoritHm: minimizing P (w) = n 1 P fi(w) n i=1 1: Input: Initialization we0, the learning rate η > 0 and the inner loop size m 2: for s = 1, 2, ... do

3: w0 = ws−1 e n 1 P 4: v0 = ∇fi(w0) n i=1 5: w1 = w0 − ηv0 6: for t = 1, ..., m − 1 do

7: Sample it uniformly randomly in {1, 2, ..., n}

8: vt = ∇fit (wt) − ∇fit (wt−1) + vt−1 9: wt+1 = wt − ηvt 10: end for 11: Set wes = wt with t chosen uniformly randomly from {0, 1, ..., m − 1} 12: end for 13: Output: wes

SPIDER: Cong Fang, Chris Junchi Li, Zhouchen Lin, Tong Zhang, SPIDER: Near- Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estima- tor, NIPS, 2018. No assumption on convexity. Optimal Rate O min(n1/2ε−2, ε−3) for finding an ε–approximate first order stationary point.

Algorithm 5.6 Stochastic Path-Integrated Differential EstimatoR (SPIDER): Searching for a first order stationary point
1: Input: x⁰, q, S_1, S_2, n_0, ε and ε̃
2: for k = 0, 1, ..., K − 1 do
3:   if mod(k, q) = 0 then

4: Draw S1 samples (or compute the full gradient in the finite sum case), let k k v = ∇fS1 (x ) 5: else k k k−1 k−1 6: Draw S2 samples, and let v = ∇fS2 (x ) − ∇fS2 (x ) + v 7: end if 8: —Option I— k 9: if kv k ≤ 2εe then 10: Return xk 11: else ε 12: η = Ln0 vk 13: xk+1 = xk − η kvkk 14: end if 15: —Option II—  ε 1  16: k η = min k , Ln0kv k 2Ln0 17: xk+1 = xk − ηkvk 18: end for k K−1 19: Output: Output xe uniformly at random from {x }k=0

Remark 5.3. In the classical SGD algorithm, the minibatches B are sampled uniformly randomly from the index set {1, 2, ..., n} with fixed minibatch size m ≤ n. This kind of minibatches can be regarded as “sample without replacement” minibatches. This is the commonly understood minibatches in practice.

In the SPIDER algorithm, the samples S_1 and S_2 are drawn with replacement, so that they form minibatches S that are sampled uniformly randomly from the index set {1, 2, ..., n} but with replacement, and the minibatch has a given size, say m, counting multiplicities. This kind of minibatches can be regarded as "sample with replacement" minibatches. This is less seen in the literature.
The reason why "sample with replacement" minibatches are used in the SPIDER algorithm is the following fact: consider estimating the variance of the stochastic estimator (1/m) Σ_{i∈S} ∇f_i(x), where S is the "sample with replacement" minibatch with size

m; then

E‖(1/m) Σ_{i∈S} ∇f_i(x) − ∇f(x)‖² = E‖(1/m) Σ_{i∈S} (∇f_i(x) − ∇f(x))‖² = (1/m)E‖∇f_i(x) − ∇f(x)‖²,

where in the last expectation the variable i is taken uniformly randomly from {1, 2, ..., n}. On the contrary, if B is the "sample without replacement" minibatch with the same size m, then instead we have

E‖(1/m) Σ_{i∈B} ∇f_i(x) − ∇f(x)‖² = E‖(1/m) Σ_{i∈B} (∇f_i(x) − ∇f(x))‖² ≤ E[ Σ_{i∈B} (1/m)‖∇f_i(x) − ∇f(x)‖² ] = E‖∇f_i(x) − ∇f(x)‖².

So we see that a factor of 1/m is lost.
Previously mentioned methods typically use non-adaptive learning rates and rely on giant batch sizes to construct variance reduced gradients through the use of low-noise gradients calculated at a "checkpoint". STOchastic Recursive Momentum (STORM, Cutkosky, A., Orabona, F., Momentum-Based Variance Reduction in Non-Convex SGD, arXiv:1905.10018v2, 2020.) is a recently proposed variance-reduction method that achieves variance reduction through a variant of the momentum term. This method never needs to compute a checkpoint gradient and achieves the optimal convergence rate of O(1/T^{1/3}). In fact, the key idea in momentum-based methods is to replace the gradient estimator with an exponential moving average of the gradient estimators over all previous iteration steps, i.e.

dt = (1 − a)dt−1 + a∇f(xt, ξt) ,

xt = xt−1 − ηdt , where ∇f(x, ξ) is the usual stochastic gradient estimator. The STORM estimator in- corporates ideas in SARAH estimator to modify the momentum updates as

dt = (1 − a)dt−1 + a∇f(xt, ξt) + (1 − a)(∇f(xt, ξt) − ∇f(xt−1, ξt)) ,

xt = xt−1 − ηdt . We can think of the momentum iteration as an “exponential moving average” ver- sion of SARAH, that is

dt = (1 − a)[dt−1 + (∇f(xt, ξt) − ∇f(xt−1, ξt))] + a∇f(xt, ξt) . Here the first part is like SARAH update, but it is weighted with the stochastic gradient update. Moreover, different from the SARAH update, the STORM update does not have to compute checkpoint gradients. Rather, variance-reduction is achieved by tuning the weight parameter a appropriately. The exact algorithm goes as follows:

31 Algorithm 5.7 STORM: STOchastic Recursive Momentum 1: Input: Parameters k, w, c, initial point x1

2: Sample ξ1

3: G1 ← k∇f(x1, ξ1)k

4: d1 ← ∇f(x1, ξ1) 1/3 5: η0 ← k/w 6: for t = 1, 2, ..., T do k 7: η ← t Pt 2 1/3 (w + i=1 Gi ) 8: xt+1 ← xt − ηtdt 2 9: at+1 ← cηt 10: Sample ξt+1

11: Gt+1 ← k∇f(xt+1, ξt+1)k

12: dt+1 ← (1−at+1)dt+at+1∇f(xt+1, ξt+1)+(1−at+1)[∇f(xt+1, ξt+1)−∇f(xt, ξt+1)] 13: end for 14: Output: xb sampled uniformly randomly from x1, ..., xT (In practice, set xb = xT )

5.3 Escape from saddle points and nonconvex problems, SGLD, Per- turbed GD

Continuous point of view: Hu, W., Li, C.J., Li, L., Liu, J., On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Science and Applications, Vol. 4, No. 1 (2019), pp. 3-32. SGLD: Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky, Non-convex learn- ing via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis, Proceedings of the 2017 Conference on Learning Theory, PMLR 65:1674-1703, 2017. SGLD is a continuous version of SGD with additional moderate and isotropic noise. Convergence Guarantee: Xu, P., Chen, J., Zou, D., Gu, Q., Global convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization, NIPS, 2018.

Algorithm 5.8 Stochastic Gradient Langevin Dynamics: minimizing P(w) = (1/n) Σ_{i=1}^n f_i(w)
1: Input: initialization w_0 ∈ R^d, stepsize η > 0, inverse temperature β > 0
2: for k = 0, 1, ..., K − 1 do
3:   Sample a random minibatch B ⊂ {1, 2, ..., n}
4:   Calculate ∇f_i(w_k) for i ∈ B
5:   Sample i.i.d. ε_k ∼ N(0, I_d)
6:   w_{k+1} = w_k − (η/|B|) Σ_{i∈B} ∇f_i(w_k) + √(2ηβ^{−1}) ε_k
7: end for

8: Output: wK
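A NumPy sketch of Algorithm 5.8; the inverse temperature β, stepsize η, and the least-squares test objective are illustrative assumptions.

import numpy as np

def sgld(grad_fi, n, w0, eta, beta, K, batch_size=8, seed=0):
    # Algorithm 5.8: minibatch gradient step plus isotropic Gaussian noise
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(K):
        batch = rng.choice(n, size=batch_size, replace=False)
        g = np.mean([grad_fi(w, i) for i in batch], axis=0)
        noise = rng.normal(size=w.shape)
        w = w - eta * g + np.sqrt(2.0 * eta / beta) * noise
    return w

rng = np.random.default_rng(7)
n, d = 200, 5
A = rng.normal(size=(n, d)); y = rng.normal(size=n)
grad_fi = lambda w, i: (A[i] @ w - y[i]) * A[i]
w_hat = sgld(grad_fi, n, np.zeros(d), eta=0.01, beta=1e4, K=3000)
print(np.linalg.norm(A.T @ (A @ w_hat - y)) / n)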

Perturbed GD: Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan, How to Escape Saddle Points Efficiently, ICML 2017. Perturbed GD has guaranteed escape rate from saddle points.

Algorithm 5.9 Perturbed Gradient Descent: PGD(x_0, ℓ, ρ, ε, c, δ, Δ_f)
1: Input: initialization x_0.
2: Parameters: χ ← 3 max{ln(dℓΔ_f/(cε²δ)), 4}, η ← c/ℓ, r ← (√c/χ²)·(ε/ℓ), g_thres ← (√c/χ²)·ε, f_thres ← (c/χ³)·√(ε³/ρ), t_thres ← (χ/c²)·(ℓ/√(ρε)), t_noise ← −t_thres − 1.
3: for t = 0, 1, ..., K − 1 do
4:   if ‖∇f(x_t)‖ ≤ g_thres and t − t_noise > t_thres then
5:     x̃_t ← x_t, t_noise ← t
6:     x_t ← x̃_t + ξ_t, ξ_t uniformly ∼ B_0(r)
7:   end if
8:   if t − t_noise = t_thres and f(x_t) − f(x̃_{t_noise}) > −f_thres then
9:     RETURN x̃_{t_noise}
10:  end if
11:  x_{t+1} ← x_t − η∇f(x_t)
12: end for

13: Output: xK

5.4 Coordinate Descent and Stochastic Coordinate Descent

References: Wright, S.J., Coordinate Descent Algorithms, Mathematical Programming, 2015, 151(1): 3–34. Nesterov, Y., Efficiency of Coordinate Descent Methods on Huge–Scale Optimiza- tion Problems, SIAM Journal on Optimization, 2012, 22(2): 341.

The k-th iteration of Coordinate Descent applied to a function f : R^n → R chooses some index i_k ∈ {1, 2, ..., n} and takes the iteration

x^{k+1} = x^k + γ_k e_{i_k},

where e_{i_k} is the i_k-th unit vector and γ_k is the step.
Example 1. Gauss–Seidel method:

γ_k = arg min_γ f(x^k + γe_{i_k}).

Example 2. Coordinate Gradient Descent method:

γ_k = −α_k ∂f/∂x_{i_k}(x^k)

for some α_k > 0.

Example 3. If ik is chosen randomly the method is Stochastic Coordinate Descent. Descent methods need f(xk+1) < f(xk) for all k.

6 Second Order Methods

6.1 Newton’s method

Second order methods: produce an optimization sequence {x^k} using second-order Taylor expansions (quadratic surrogate)

min_{x∈U(x^k)} f(x) ≈ min_{x∈U(x^k)} [ f(x^k) + ∇f(x^k)^T(x − x^k) + (1/2)(x − x^k)^T∇²f(x^k)(x − x^k) ],

where U(x^k) is a convex open neighborhood of x^k. If the Hessian ∇²f(x^k) is invertible, the minimum is attained at x^k − [∇²f(x^k)]^{−1}∇f(x^k). Newton's iteration:

xk+1 = xk − [∇2f(xk)]−1∇f(xk) . (6.1)

Algorithm 6.1 Newton’s method 1: Input: Input Initialization x0. 2: for k = 0, 1, ..., K − 1 do 3: Calculate the gradient ∇f(xk) 4: Calculate the Hessian ∇2f(xk) and the inverse [∇2f(xk)]−1 5: Update xk+1 = xk − [∇2f(xk)]−1∇f(xk) 6: end for 7: Output: xK
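A NumPy sketch of Algorithm 6.1; the separable test function is an illustrative assumption, and the linear system is solved instead of forming the inverse Hessian explicitly.

import numpy as np

def newton(grad_f, hess_f, x0, K):
    # Algorithm 6.1: x^{k+1} = x^k - [hess f(x^k)]^{-1} grad f(x^k)
    x = np.array(x0, dtype=float)
    for _ in range(K):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))   # solve instead of inverting
    return x

# f(x) = sum_i (exp(x_i) - x_i), whose unique minimizer is x = 0
grad_f = lambda x: np.exp(x) - 1.0
hess_f = lambda x: np.diag(np.exp(x))
x_hat = newton(grad_f, hess_f, np.array([2.0, -1.0, 0.5]), K=10)
print(np.linalg.norm(x_hat))    # close to 0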

Convergence of Newton’s method.

Theorem 6.1 (Local superlinear convergence of the Newton’s method). Suppose that the function f is twice differentiable and the Hessian ∇2f(x) is Lipschitz–continuous with Lipschitz constant M > 0, i.e.

k∇2f(x) − ∇2f(y)k ≤ Mkx − yk .

Let x∗ be a second–order stationary point in the sense of Theorem 2.7 (b), then if the ∗ starting point x0 is sufficiently close to x , then the sequence of iterates from Newton’s method (6.1) converges to x∗ at a superlinear rate (in particular, this rate is quadratic).

Proof. See the proof of Theorem 3.5 of the textbook: Nocedal, J., Wright, S.J., Numer- ical Optimization, Second Edition, Springer, 2006.

Drawbacks of Newton’s method: (1) Inverse of Hessian [∇2f(x)]−1 hard to calculate for high–dimensional problems; (2) No positive definite guarantee of Hessian ∇2f(x).

35 6.2 Quasi–Newton: BFGS

Reference: J. E. Dennis, Jr., Jorge J. Moré, Quasi-Newton methods, motivation and theory, SIAM Review, Vol. 19, No. 1 (Jan. 1977), pp. 46–89.
Quasi-Newton method: If the Hessian ∇²f(x^k) is not positive definite or is hard to calculate, we replace it with a positive definite matrix B_k that approximates ∇²f(x^k): B_k ≈ ∇²f(x^k). Or we update the inverse of the Hessian matrix H_k ≈ [∇²f(x^k)]^{−1} iteratively, so that we do not have to recalculate the approximation of the inverse Hessian [∇²f(x^k)]^{−1} at each iteration.
Set B_k ≈ ∇²f(x^k). The quasi-Newton condition uses a quadratic surrogate around two consecutive iterates x^k and x^{k+1}, which produces an iteration that calculates B_{k+1} from the current iteration. In fact we have

f(x^k) ≈ f(x^{k+1}) + ∇f(x^{k+1})^T(x^k − x^{k+1}) + (1/2)(x^k − x^{k+1})^T∇²f(x^{k+1})(x^k − x^{k+1})

gives

∇f(x^k) ≈ ∇f(x^{k+1}) + B_{k+1}(x^k − x^{k+1}).

k+1 k k+1 k Set yk = ∇f(x ) − ∇f(x ) and sk = x − x . Then

Bk+1sk = yk . (6.2)

Equation (6.2) is called the secant equation. The secant equation will produce the next iterate B_{k+1} from the current iteration, and the update rule from B_k to B_{k+1} should always satisfy the following criteria: (1) it satisfies the secant equation; (2) it keeps symmetry and positive definiteness of B_{k+1}; (3) it is a low rank modification from B_k to B_{k+1}.
The positive definiteness is compatible with the secant equation since we have from (6.2) that s_k^T B_{k+1} s_k = s_k^T y_k > 0. This gives the condition: for any x, y ∈ Dom(f) we have

(x − y)T (∇f(x) − ∇f(y)) > 0 . (6.3)

It can be shown that when f is strongly convex then (6.3) is satisfied. This is left as a problem in Homework 4.
The secant equation is solved by a minimization such as

min_B ‖B − B_k‖ subject to B = B^T, Bs_k = y_k.

Choosing an appropriate norm, the solution is the DFP (Davidon–Fletcher–Powell) update

B_{k+1} = (I − ρ_k y_k s_k^T) B_k (I − ρ_k s_k y_k^T) + ρ_k y_k y_k^T,  ρ_k = 1/(y_k^T s_k).   (6.4)

Sherman–Morrison–Woodbury formula:

(A + uv^T)^{−1} = A^{−1} − A^{−1}uv^TA^{−1}/(1 + v^TA^{−1}u),  where u, v are column vectors.

Thus we get (try it!) the DFP iteration for H_k = B_k^{−1} as

H_{k+1} = H_k − H_k y_k y_k^T H_k/(y_k^T H_k y_k) + s_k s_k^T/(y_k^T s_k).   (6.5)

In the same way but conversely, we have the secant equation for H:

Hk+1yk = sk . (6.6) Secant equation is solved by doing minimization such as

min_H ‖H − H_k‖ subject to H = H^T, Hy_k = s_k.

BFGS (Broyden–Fletcher–Goldfarb–Shanno) update:

H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,  ρ_k = 1/(s_k^T y_k).   (6.7)

Inverting it we get

B_{k+1} = B_k − B_k s_k s_k^T B_k/(s_k^T B_k s_k) + y_k y_k^T/(y_k^T s_k).   (6.8)

Algorithm 6.2 Quasi-Newton: BFGS method
1: Input: initialization x⁰, ∇f(x⁰), H_0 = [∇²f(x⁰)]^{−1}, B_0 = I, learning rate η > 0
2: for k = 0, 1, ..., K − 1 do
3:   Calculate x^{k+1} = x^k − ηH_k∇f(x^k)
4:   Calculate y_k = ∇f(x^{k+1}) − ∇f(x^k) and s_k = x^{k+1} − x^k
5:   Update H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T
6:   Update B_{k+1} = B_k − B_k s_k s_k^T B_k/(s_k^T B_k s_k) + y_k y_k^T/(y_k^T s_k)
7: end for
8: Output: x^K
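A NumPy sketch of Algorithm 6.2 that keeps only the inverse approximation H_k (initialized as H_0 = I, a simplification of the listing above); the quadratic test problem and the curvature safeguard s_k^T y_k > 0 are illustrative assumptions.

import numpy as np

def bfgs(grad_f, x0, eta, K):
    # BFGS keeping only H_k ~ [hess f(x^k)]^{-1}, with H_0 = I
    x = np.array(x0, dtype=float)
    n = x.size
    H = np.eye(n)
    g = grad_f(x)
    for _ in range(K):
        x_new = x - eta * H @ g
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                       # curvature condition keeps H positive definite
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS update (6.7)
        x, g = x_new, g_new
    return x

Q = np.array([[3.0, 0.2], [0.2, 2.0]]); b = np.array([1.0, -1.0])
grad_f = lambda x: Q @ x - b
x_hat = bfgs(grad_f, np.zeros(2), eta=1.0, K=30)
print(np.linalg.norm(x_hat - np.linalg.solve(Q, b)))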

Superlinear convergence of BFGS method (see Theorem 6.6 in Nocedal, J., Wright, S.J., Numerical Optimization, Second Edition, Springer 2006.).

37 6.3 Stochastic Quasi–Newton

Reference: Byrd, R.H., Hansen, S.L., Nocedal, J. et al, A stochastic quasi–Newton method for large–scale optimization, SIAM Journal on Optimization, 2016, 26(2): 1008– 1031. Large–scale optimization

F(w) = (1/N) Σ_{i=1}^N f_i(w).

Stochastic gradient estimate over a sample S ⊂ {1, 2, ..., N} of size b = |S| < N:

∇̂F(w) = (1/b) Σ_{i∈S} ∇f_i(w).

It is necessary to employ a limited memory variant that is scalable in the number of variables, but enjoys only a linear rate of convergence.
Quasi-Newton updating is inherently an overwriting process rather than an averaging process, and therefore the vector y = ∇F(w^{k+1}) − ∇F(w^k) must reflect the action of the Hessian of the entire objective F, which is not achieved by differencing stochastic gradients based on small samples.
To achieve a stable Hessian approximation one needs to decouple the stochastic gradient and curvature estimation calculations.

38 Algorithm 6.3 Stochastic Quasi–Newton Method 1: Input: Input Initialization w1, positive integers M,L, and step–length sequence αk > 0 2: Set t = −1 . Records number of correction pairs currently computed

3: w¯t = 0 4: for k = 1, 2, ..., K − 1 do 5: Choose a sample S ⊂ {1, 2, ..., N} k 1 P 6: Calculate the stochastic gradient ∇ˆ F (w ) = ∇fi(w) |S| i∈S k 7: w¯t =w ¯t + w 8: if k ≤ 2L then 9: wk+1 = wk − αk∇ˆ F (wk) . Stochastic gradient iteration 10: else k+1 k k k 11: w = w − α Ht∇ˆ F (w ), where Ht is defined by Algorithm 6.4 12: end if 13: if mod (k, L) = 0 then 14: t = t + 1

15:  w̄_t = w̄_t/L
16:  if t > 0 then
17:    Choose a sample S_H ⊂ {1, 2, ..., N} to define ∇̂²F(w̄_t) ≡ (1/|S_H|) Σ_{i∈S_H} ∇²f_i(w̄_t)
18:    Compute s_t = w̄_t − w̄_{t−1}, y_t = ∇̂²F(w̄_t)(w̄_t − w̄_{t−1})   . Compute correction pairs every L iterations
19:  end if

20: w¯t = 0 21: end if 22: end for 23: Output: wK

39 Algorithm 6.4 Stochastic Quasi–Newton Method: Hessian Updating 1: Input: Updating counter t, memory parameter M, correction pairs (sj, yj), j = t − me + 1, ..., t where me = min{t, M} T s yt 2: t Set H = T I yt yt 3: for j = t − me + 1, ..., t do 1 4: ρj = T yj sj 5: Apply BFGS formula

T T T H ← (I − ρjsjyj )H(I − ρjyjsj ) + ρjsjsj

6: end for

7: Output: new matrix Ht = H

7 Dual Methods

7.1 Dual Coordinate Descent/Ascent

Reference: Chapter 13 & 14 of Nocedal, J., Wright, S.J., Numerical Optimization, Second Edition, Springer, 2006.

Constraint optimization problem P0:

min f(w) w s.t. gi(w) ≤ 0 , i = 1, 2, ..., m1 , hj(w) = 0 , j = 1, 2, ..., m2 . Let the optimal solution be p∗. Unconstrained Version:

min_w [ f(w) + Σ_{i=1}^{m_1} ∞·1_{[g_i(w)>0]} + Σ_{j=1}^{m_2} ∞·1_{[h_j(w)≠0]} ].

Approximate the indicators by linear functions, obtain the Lagrangian function

m m X1 X2 L(w, λ, ν) ≡ f(w) + λigi(w) + νjhj(w) , i=1 j=1 in which λi ≥ 0 and νj ∈ R are the Lagrange multipliers. The variable w is called the primal variable. Set the Lagrangian dual function

h(λ, ν) ≡ inf L(w, λ, ν) , λi ≥ 0 and νj ∈ R . acceptable w

Since for any acceptable w we have gi(w) ≤ 0 and hi(w) = 0, so we have

h(λ, ν) ≤ inf f(w) = p∗ . acceptable w

The variables (λ, ν) are called dual variables.

Observation: h(λ, ν) is always concave (even when the original problem P0 is not convex) in the dual variables λ and ν.

Dual Problem D0 (“primal-dual”):

max h(λ, ν) λ,ν s.t. λi ≥ 0 , i = 1, 2, ..., m . Let the optimal solution be d∗. Then d∗ ≤ p∗. If d∗ = p∗ this is called strong duality. A sufficient condition for strong duality is Slater condition and a necessary condition for strong duality is KKT condition.

41 Set d∗ = h(λ∗, ν∗). Then solve the unconstrained optimization problem

$$\min_w L(w, \lambda^*, \nu^*).$$

If the solution is acceptable then it is the optimal solution for the original problem.

Algorithm 7.1 Dual Coordinate Ascent
1: Input: Initialization ν_0, λ_0 ≥ 0
2: for k = 1, 2, ..., K do
3:   Calculate w_k = arg min_w L(w, λ_{k−1}, ν_{k−1})
4:   Update λ_{k,i} = max{λ_{k−1,i} + η_k g_i(w_k), 0}, i = 1, 2, ..., m_1
5:   Update ν_{k,j} = ν_{k−1,j} + η_k h_j(w_k), j = 1, 2, ..., m_2
6: end for
7: Output: w_K
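As a concrete illustration of Algorithm 7.1, consider the toy problem min_w ½‖w − a‖² subject to a single inequality constraint c^T w − d ≤ 0: the inner minimization of the Lagrangian has the closed form w = a − λc, so the loop fits in a few lines. This is only a sketch; the problem data, the constant step size, and all variable names below are invented for illustration.

```python
import numpy as np

# Toy problem: min 0.5*||w - a||^2  s.t.  c^T w - d <= 0   (data invented)
a = np.array([2.0, 1.0])
c = np.array([1.0, 1.0])
d = 1.0
eta = 0.5          # dual step size eta_k (kept constant here)
lam = 0.0          # dual variable lambda_0 >= 0

for k in range(200):
    # Primal step: w_k = argmin_w L(w, lambda) = a - lambda * c  (closed form)
    w = a - lam * c
    # Dual step: lambda <- max(lambda + eta * g(w), 0)
    lam = max(lam + eta * (c @ w - d), 0.0)

print(w, c @ w - d, lam)   # w approaches the projection of a onto {c^T w <= d}
```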

7.2 SDCA

Reference: Shalev–Shwartz, S., Zhang, T., Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization, Journal of Machine Learning Research, 14 (2013), pp. 567–599.

Let x_1, ..., x_n be vectors in R^d and φ_1, ..., φ_n be a sequence of scalar convex functions, and let λ > 0 be a regularization parameter. Our goal is to solve the optimization problem P(w*) = min_{w∈R^d} P(w), where

$$P(w) = \frac{1}{n}\sum_{i=1}^n \phi_i(w^T x_i) + \frac{\lambda}{2}\|w\|^2.$$

Here φ_i(w^T x_i) is the loss function for linear models. For example, given labels y_1, ..., y_n in {±1}, the SVM problem (with linear kernels and no bias term) is obtained by setting φ_i(a) = max{0, 1 − y_i a}; regularized logistic regression is obtained by setting φ_i(a) = log(1 + exp(−y_i a)); ridge regression is obtained by setting φ_i(a) = (a − y_i)²; regression with the absolute-value loss is obtained by setting φ_i(a) = |a − y_i|; support vector regression is obtained by setting φ_i(a) = max{0, |a − y_i| − ν} for some predefined insensitivity parameter ν > 0.

Dual formulation: D(α*) = max_{α∈R^n} D(α), where

$$D(\alpha) = \frac{1}{n}\sum_{i=1}^n -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\right\|^2,$$

where φ_i*(u) = max_z (zu − φ_i(z)) is the convex conjugate. In fact we have

$$
\begin{aligned}
\min_w P(w) &= \min_w \left[\frac{1}{n}\sum_{i=1}^n\big(\phi_i(w^T x_i) + \alpha_i w^T x_i\big) + \frac{\lambda}{2}\Big(\|w\|^2 - \frac{2}{\lambda n}\sum_{i=1}^n \alpha_i w^T x_i\Big)\right] \\
&= \min_w \left[\frac{1}{n}\sum_{i=1}^n\big(\phi_i(w^T x_i) + \alpha_i w^T x_i\big) + \frac{\lambda}{2}\Big\|w - \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\Big\|^2 - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\Big\|^2\right] \\
&\ge \min_w \left[\frac{1}{n}\sum_{i=1}^n\big(\phi_i(w^T x_i) + \alpha_i w^T x_i\big)\right] - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\Big\|^2 \\
&\ge \frac{1}{n}\sum_{i=1}^n \min_w \big(\phi_i(w^T x_i) + \alpha_i w^T x_i\big) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\Big\|^2 \\
&\ge -\frac{1}{n}\sum_{i=1}^n \max_z \big(-\phi_i(z) - \alpha_i z\big) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\Big\|^2 \\
&= \frac{1}{n}\sum_{i=1}^n -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\Big\|^2 = D(\alpha)
\end{aligned}
$$

for any α ∈ R^n. So min_w P(w) ≥ max_α D(α). Set

$$w(\alpha) = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i\,;$$

then strong duality holds and

$$w(\alpha^*) = w^*, \qquad P(w^*) = D(\alpha^*).$$

Algorithm 7.2 Stochastic Dual Coordinate Ascent
1: Input: Initialization α_0, w_0 = w(α_0)
2: for k = 0, 1, ..., K − 1 do
3:   Randomly sample i_k ∈ {1, 2, ..., n}
4:   Find Δα_{i_k} = arg max_z { −φ*_{i_k}(−(α_{k,i_k} + z)) − (λn/2) ‖w_k + (1/(λn)) z x_{i_k}‖² }
5:   Update α_{k+1} = α_k + Δα_{i_k} e_{i_k}, w_{k+1} = w_k + (1/(λn)) Δα_{i_k} x_{i_k}
6: end for
7: Output (Average Option): ᾱ = (1/(K − K_0)) Σ_{k=K_0+1}^{K} α_{k−1} and w̄ = w(ᾱ) = (1/(K − K_0)) Σ_{k=K_0+1}^{K} w_{k−1}; return w̄
8: Output (Random Option): let ᾱ = α_k and w̄ = w_k for some random k ∈ {K_0 + 1, ..., K}; return w̄
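For the squared loss φ_i(a) = (a − y_i)², the inner maximization in step 4 has a closed form: using φ_i*(u) = y_i u + u²/4 and setting the derivative in z to zero gives Δα_i = (y_i − x_i^T w − α_i/2)/(1/2 + ‖x_i‖²/(λn)). The NumPy sketch below runs Algorithm 7.2 with this update; the closed form is my own derivation for the squared loss (not quoted from the paper), and the regression data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, lam = 200, 5, 0.1
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = X @ w_true + 0.1 * rng.normal(size=n)        # invented ridge-regression data

alpha = np.zeros(n)
w = X.T @ alpha / (lam * n)                       # w_0 = w(alpha_0)

for k in range(20 * n):
    i = rng.integers(n)
    # Closed-form coordinate maximization for phi_i(a) = (a - y_i)^2 (step 4)
    delta = (y[i] - X[i] @ w - alpha[i] / 2.0) / (0.5 + X[i] @ X[i] / (lam * n))
    # Step 5: update alpha and keep w = w(alpha) in sync
    alpha[i] += delta
    w += delta * X[i] / (lam * n)

primal = np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w
print("primal objective:", primal)
```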

8 Optimization with Constraints

8.1 Subgradient Projection Method, Frank–Wolfe Method

References:
Levitin, E.S., Polyak, B.T., Constrained Minimization Methods, USSR Computational Mathematics and Mathematical Physics, 1966, 6(5), pp. 1–50.
Frank, M., Wolfe, P., An algorithm for quadratic programming, Naval Research Logistics (NRL), 1956, 3(1–2), pp. 95–110.

The subdifferential ∂f(x_0) of a convex function f at a point x_0 (the set of subgradients of f at x_0) is defined as

$$\partial f(x_0) = \{c : f(x) - f(x_0) \ge c^T(x - x_0)\ \text{for all } x\}.$$

Algorithm 8.1 Subgradient Projection Method
1: Input: Initialization x_0; convex set X; learning rate η > 0
2: for k = 0, 1, ..., K − 1 do
3:   Randomly sample a subgradient g_k ∈ ∂f(x_k)
4:   Update y_{k+1} = x_k − η g_k
5:   Project x_{k+1} = Π_X(y_{k+1}), where Π_X(y) = arg min_{x∈X} ‖y − x‖
6: end for
7: Output: x_K
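A minimal sketch of Algorithm 8.1, minimizing the nonsmooth convex function f(x) = ‖Ax − b‖₁ over the Euclidean unit ball, for which the projection Π_X is just a rescaling. All problem data, the constant learning rate, and the names below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)                       # invented data
f = lambda x: np.abs(A @ x - b).sum()

x = np.zeros(10)
eta = 0.01                                    # learning rate eta
for k in range(2000):
    g = A.T @ np.sign(A @ x - b)              # a subgradient of ||Ax - b||_1 at x_k
    y = x - eta * g                           # subgradient step
    x = y / max(np.linalg.norm(y), 1.0)       # projection onto the unit Euclidean ball
print(f(x))
```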

The projection Π_X(y) can be hard to calculate in practical cases. The subgradient projection method first takes a (sub)gradient descent step and then projects back onto the constraint set; the Frank–Wolfe method instead builds the constraint directly into the choice of search direction, so that no projection is needed. How is that done? GD can be viewed as an optimal choice of search direction via the first-order Taylor expansion

$$f(x + td) = f(x) + t\nabla f(x + \gamma t d)^T d \quad \text{for some } \gamma \in (0, 1).$$

If d^T ∇f(x) < 0, then d is a descent direction. "Steepest descent":

$$d_{\text{optimal}} = \arg\inf_{\|d\|=1} d^T\nabla f(x) = -\frac{\nabla f(x)}{\|\nabla f(x)\|}.$$

We want to incorporate the constraint information: how to do that?

$$d_{\text{optimal}} = \arg\inf_{d\in\mathcal{X}} d^T\nabla f(x).$$

Algorithm 8.2 Frank–Wolfe Method
1: Input: Initialization x_0 ∈ X; convex set X; step sizes γ_k ∈ (0, 1]
2: for k = 0, 1, ..., K − 1 do
3:   Calculate ∇f(x_k)
4:   Update d_k = arg inf_{d∈X} d^T ∇f(x_k)
5:   Update x_{k+1} = (1 − γ_k) x_k + γ_k d_k
6: end for
7: Output: x_K
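A sketch of Algorithm 8.2 over the ℓ1 ball X = {‖x‖₁ ≤ τ}: the linear subproblem arg min_{d∈X} d^T∇f(x_k) is attained at a signed vertex of the ball, so no projection is ever needed. The data, the radius τ, and the step rule γ_k = 2/(k+2) (a common Frank–Wolfe default, not something fixed by the notes) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(40, 15))
b = rng.normal(size=40)
tau = 2.0                                  # l1-ball radius (invented)

grad = lambda x: A.T @ (A @ x - b)         # gradient of f(x) = 0.5*||Ax - b||^2

x = np.zeros(15)
for k in range(500):
    g = grad(x)
    # Linear minimization oracle: vertex of the l1 ball most aligned with -g
    i = np.argmax(np.abs(g))
    d = np.zeros(15)
    d[i] = -tau * np.sign(g[i])
    gamma = 2.0 / (k + 2.0)                # common Frank-Wolfe step size
    x = (1 - gamma) * x + gamma * d
print(0.5 * np.linalg.norm(A @ x - b) ** 2, np.abs(x).sum())
```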

8.2 ADMM

Reference: Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., Distributed Op- timization and Statistical Learning via the Alternating Direction Method of Multipliers, Foundations and Trends in Machine Learning, Vol. 3, No. 1 (2010), pp. 1–122.

Equality–constrained convex optimization problem:

minimize f(x) subject to Ax = b .

Augmented Lagrangian.

$$L_\rho(x, y) = f(x) + y^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2.$$

The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem

$$\text{minimize }\ f(x) + \frac{\rho}{2}\|Ax - b\|_2^2 \quad \text{subject to }\ Ax = b.$$

This problem can be solved by dual gradient ascent:

$$x^{k+1} \equiv \arg\min_x L_\rho(x, y^k), \qquad y^{k+1} \equiv y^k + \rho(Ax^{k+1} - b).$$

Alternating Direction Method of Multipliers (ADMM).

Now the optimization problem has the form

minimize f(x) + g(z) subject to Ax + Bz = c .

Optimal value p∗ = inf{f(x) + g(z); Ax + Bz = c} .

Augmented Lagrangian can be formed as

$$L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2.$$

This problem can be solved by dual gradient ascent:

$$(x^{k+1}, z^{k+1}) \equiv \arg\min_{x, z} L_\rho(x, z, y^k), \qquad y^{k+1} \equiv y^k + \rho(Ax^{k+1} + Bz^{k+1} - c).$$

Unlike the classical dual gradient ascent, which minimizes the augmented Lagrangian jointly with respect to the two primal variables x and z, ADMM updates x and z in an alternating or sequential fashion, which accounts for the term "alternating direction".

Algorithm 8.3 Alternating Direction Method of Multipliers
1: Input: Initialization (x^0, z^0), dual variable y^0, parameter ρ > 0
2: for k = 0, 1, ..., K − 1 do
3:   Calculate x^{k+1} ≡ arg min_x L_ρ(x, z^k, y^k)
4:   Calculate z^{k+1} ≡ arg min_z L_ρ(x^{k+1}, z, y^k)
5:   Calculate y^{k+1} = y^k + ρ(Ax^{k+1} + Bz^{k+1} − c)
6: end for
7: Output: (x^K, z^K)
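A minimal sketch of Algorithm 8.3 for one standard specialization: f(x) = ½‖Mx − b‖², g(z) = μ‖z‖₁ with A = I, B = −I, c = 0 (a lasso-type splitting). Then the x-update is a linear solve and the z-update is coordinatewise soft thresholding. The data and parameter values are invented; this is an illustration of the splitting, not the only way to apply ADMM.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(50, 20))
b = rng.normal(size=50)
mu, rho = 0.5, 1.0                               # invented parameters

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(20); z = np.zeros(20); y = np.zeros(20)
P = np.linalg.inv(M.T @ M + rho * np.eye(20))    # factor once for the x-update
for k in range(300):
    # x-update: argmin_x L_rho(x, z^k, y^k), a linear solve here
    x = P @ (M.T @ b + rho * z - y)
    # z-update: argmin_z L_rho(x^{k+1}, z, y^k) = soft thresholding
    z = soft(x + y / rho, mu / rho)
    # dual update
    y = y + rho * (x - z)
print(0.5 * np.linalg.norm(M @ z - b) ** 2 + mu * np.abs(z).sum())
```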

9 Adaptive Methods

Adaptive algorithms are training algorithms that update the parameters using not only the current gradient information but also accumulated past gradient information, and that adaptively change the learning rate. These algorithms are sometimes said to belong to the Ada–series.

9.1 SGD with momentum

Reference: Sutskever, I., Martens, J., Dahl, G., Hinton, G., On the importance of initialization and momentum in deep learning, ICML, 2013.

Algorithm 9.1 SGD with momentum (= stochastic heavy ball method)
1: Input: Initialization w_0 = w_{−1} and v_{−1} = 0
2: for k = 0, 1, ..., K − 1 do
3:   Randomly sample a minibatch S_k ⊂ {1, 2, ..., n} such that |S_k| = m
4:   Update v_{k+1} = μ v_k − (η/m) Σ_{i∈S_k} ∇f_i(w_k)
5:   Update w_{k+1} = w_k + v_{k+1}
6: end for
7: Output: w_K
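A sketch of Algorithm 9.1 on a toy least-squares objective f_i(w) = ½(x_i^T w − y_i)². The data, minibatch size m, and hyperparameter values (μ, η) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, dim = 500, 10
X = rng.normal(size=(n, dim))
y = X @ rng.normal(size=dim)                     # invented data, f_i(w) = 0.5*(x_i^T w - y_i)^2

w = np.zeros(dim)
v = np.zeros(dim)
mu, eta, m = 0.9, 0.01, 32                       # momentum, learning rate, minibatch size

for k in range(2000):
    S = rng.choice(n, size=m, replace=False)     # random minibatch S_k
    grad = X[S].T @ (X[S] @ w - y[S]) / m        # (1/m) sum_{i in S_k} grad f_i(w_k)
    v = mu * v - eta * grad                      # v_{k+1} = mu * v_k - eta * minibatch gradient
    w = w + v                                    # w_{k+1} = w_k + v_{k+1}
print(np.mean((X @ w - y) ** 2))
```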

9.2 AdaGrad, RMSProp, AdaDelta

Reference: Duchi, J., Hazan, E., Singer, Y., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Journal of Machine Learning Research, 2011, 12(July), pp. 2121–2159.

AdaGrad not only accumulates historical information (such as squared-gradient information) but also adjusts the learning rate (stepsize) per coordinate according to the magnitude of the historical gradients, so that coordinates with relatively small gradient magnitudes receive larger learning rates. In this way the algorithm favors coordinates whose current gradient magnitude is small but which contribute importantly to the optimization search path. This is better suited to optimizing neural network models that possess many local minimizers and saddle points. Empirical evidence shows that AdaGrad can stabilize SGD. The disadvantage of AdaGrad comes from the term √g_{k+1} in the denominator: as it accumulates during training, the learning rates gradually tend to 0 and the updates eventually stall.

Algorithm 9.2 AdaGrad
1: Input: Initialization w_0 and g_0 = 0
2: for k = 0, 1, ..., K − 1 do
3:   Update g_{k+1} = g_k + (∇f(w_k))²
4:   Update w_{k+1} = w_k − η ∇f(w_k) / (√(g_{k+1}) + ε_0), with ε_0 > 0
5: end for
6: Output: w_K
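A coordinatewise NumPy sketch of Algorithm 9.2 on a toy quadratic objective; the data and hyperparameters are invented. The same skeleton turns into RMSProp (Algorithm 9.3) if the accumulation line is replaced by g = gamma * g + (1 - gamma) * grad**2.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8)                       # invented data

grad_f = lambda w: X.T @ (X @ w - y) / len(y)    # gradient of 0.5 * mean((Xw - y)^2)

w = np.zeros(8)
g = np.zeros(8)                                  # accumulated squared gradients g_k
eta, eps0 = 0.5, 1e-8
for k in range(1000):
    grad = grad_f(w)
    g = g + grad ** 2                            # g_{k+1} = g_k + (grad f(w_k))^2, coordinatewise
    w = w - eta * grad / (np.sqrt(g) + eps0)     # per-coordinate step sizes
print(np.mean((X @ w - y) ** 2))
```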

Reference: Tieleman, T., Hinton, G., Lecture 6.5 – RMSProp: Divide the Gradient by a Running Average of its Recent Magnitude, COURSERA, Neural Networks for Machine Learning, 2012, 4(2), pp. 26–31.

RMSProp is similar to AdaGrad but also makes use of the idea behind SGD with momentum: it keeps an exponentially weight-decayed average of historical information. RMSProp forms a weighted average of g_k and (∇f(w_k))², so that the weight on older historical information is small and the current gradient information is favored. This avoids the monotonic decay of the learning rate (stepsize) during training.

Algorithm 9.3 RMSProp
1: Input: Initialization w_0 and g_0 = 0
2: for k = 0, 1, ..., K − 1 do
3:   Update g_{k+1} = γ g_k + (1 − γ)(∇f(w_k))²
4:   Update w_{k+1} = w_k − η ∇f(w_k) / (√(g_{k+1}) + ε_0), with ε_0 > 0
5: end for
6: Output: w_K

Reference: Zeiler, M.D., AdaDelta: An Adaptive Learning Rate Method, arXiv:1212.5701, 2012.

AdaDelta, based on RMSProp, further adjusts the accumulated historical information according to the magnitude of past updates: the step is rescaled by a running average of the squared previous updates (v_k below) in addition to the running average of squared gradients (g_k below).

Algorithm 9.4 AdaDelta
1: Input: Initialization w_0 and u_0 = 0, v_0 = 0 and g_0 = 0
2: for k = 0, 1, ..., K − 1 do
3:   Update g_{k+1} = γ g_k + (1 − γ)(∇f(w_k))²
4:   Update v_{k+1} = γ v_k + (1 − γ) u_k²
5:   Update u_{k+1} = − √(v_k + ε_0) ∇f(w_k) / √(g_{k+1} + ε_0)
6:   Update w_{k+1} = w_k + u_{k+1}
7: end for
8: Output: w_K

9.3 Adam

Reference:
Kingma, D.P., Ba, J.L., Adam: A Method for Stochastic Optimization, ICLR, 2015.
Reddi, S.J., Kale, S., Kumar, S., On the Convergence of Adam and Beyond, ICLR, 2018.

Adam combines the ingredients of SGD with momentum, AdaGrad, RMSProp and AdaDelta: (1) the update direction is determined by accumulated historical gradient information; (2) the stepsize is corrected using accumulated squared-gradient information; (3) the accumulated information decays exponentially. Adam has been widely used in training neural networks.

Algorithm 9.5 Adam
1: Input: α: stepsize; β_1, β_2 ∈ [0, 1): exponential decay rates for the moment estimates; f(θ): stochastic objective function with parameters θ; θ_0: initial parameter vector; m_0 ← 0 (initialize first moment vector); v_0 ← 0 (initialize second moment vector); t ← 0 (initialize timestep)
2: while θ_t not converged do
3:   t ← t + 1
4:   g_t ← ∇_θ f_t(θ_{t−1}) (get gradients w.r.t. stochastic objective at timestep t)
5:   m_t ← β_1 · m_{t−1} + (1 − β_1) · g_t (update biased first moment estimate)
6:   v_t ← β_2 · v_{t−1} + (1 − β_2) · g_t² (update biased second raw moment estimate)
7:   m̂_t ← m_t/(1 − β_1^t) (compute bias-corrected first moment estimate)
8:   v̂_t ← v_t/(1 − β_2^t) (compute bias-corrected second raw moment estimate)
9:   θ_t ← θ_{t−1} − α · m̂_t/(√v̂_t + ε) (update parameters)
10: end while
11: Output: θ_t (resulting parameters)
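A direct transcription of Algorithm 9.5 in NumPy on a toy stochastic objective (minibatch least squares). The default values α = 10⁻³, β_1 = 0.9, β_2 = 0.999, ε = 10⁻⁸ follow the Adam paper, while the data and minibatch size are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 12))
y = X @ rng.normal(size=12)                          # invented data

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8    # defaults from the Adam paper
theta = np.zeros(12)
m = np.zeros(12); v = np.zeros(12); t = 0

for step in range(5000):
    t += 1
    S = rng.choice(1000, size=64, replace=False)                 # minibatch defines f_t
    g = X[S].T @ (X[S] @ theta - y[S]) / len(S)                  # g_t
    m = beta1 * m + (1 - beta1) * g                              # biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2                         # biased second raw moment
    m_hat = m / (1 - beta1 ** t)                                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)       # parameter update
print(np.mean((X @ theta - y) ** 2))
```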

10 Optimization with Discontinuous Objective Function

10.1 Proximal operators and proximal algorithms

Reference: Parikh, N., Boyd, S., Proximal Algorithms, Foundations and Trends in Optimization, Vol. 1, No. 3 (2013), pp. 123–231. Also see http://stanford.edu/~boyd/papers/pdf/prox_slides.pdf

Let R = R(v) be a convex function. With respect to the convex function R, the proximal operator prox_R(w): R^d → R^d is defined as

$$\text{prox}_R(w) = \arg\min_v\left\{R(v) + \frac{1}{2}\|w - v\|^2\right\}.$$

If R(w) = 0, then prox_0(w) = w; if R(w) = a I_C(w) (where C ⊆ R^d is usually a convex set, and the indicator function is defined as in (2.5)), then prox_{a I_C}(w) = arg min_{v∈C} ‖w − v‖² degenerates to a projection operator. We have

$$p = \text{prox}_R(w) \iff w - p \in \partial R(p).$$

Also

$$\text{prox}_{aR}(w) = \arg\min_v\left\{R(v) + \frac{1}{2a}\|w - v\|^2\right\}.$$

Consider the convex optimization problem

$$\min_{w\in\mathcal{W}} f(w) = l(w) + R(w),$$

where l(w) is a convex differentiable function and R(w) is a non-differentiable convex function (such as an L1-regularization term).

Algorithm 10.1 Proximal Gradient Descent
1: Input: Initialization w_0, convex set W, learning rates η_k > 0
2: for k = 0, 1, ..., K − 1 do
3:   Calculate ∇l(w_k)
4:   Update parameter v_{k+1} = w_k − η_k ∇l(w_k)
5:   Apply proximal operator w_{k+1} = prox_{η_k R}(v_{k+1})
6: end for
7: Output: w_K
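For R(w) = μ‖w‖₁ the proximal operator is coordinatewise soft thresholding, prox_{ηR}(v)_i = sign(v_i)·max(|v_i| − ημ, 0), so Algorithm 10.1 becomes the classical ISTA iteration for l(w) = ½‖Aw − b‖². A sketch with invented data and parameters (the step size 1/L uses the spectral norm of A):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(60, 25))
b = rng.normal(size=60)
mu = 0.3                                          # weight of R(w) = mu * ||w||_1 (invented)

eta = 1.0 / np.linalg.norm(A, 2) ** 2             # step size 1/L for l(w) = 0.5*||Aw - b||^2
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(25)
for k in range(500):
    v = w - eta * A.T @ (A @ w - b)               # gradient step on the smooth part l
    w = soft(v, eta * mu)                         # proximal step: prox_{eta R}
print(0.5 * np.linalg.norm(A @ w - b) ** 2 + mu * np.abs(w).sum())
```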

11 Optimization and Generalization

11.1 Generalization Bounds in Statistical Machine Learning

Consider the framework. Let the training sample (x1, y1), ...,

(xN , yN ) follow the same distribution (x, y) ∼ F (x, y) under probability measure P (or

P(x,y), and the corresponding expectation is E or E(x,y)). Let the loss functions be d `i(θ) = `((xi, yi), θ), θ ∈ R , i = 1, 2, ..., N. Define the empirical loss function

$$L_N(\theta) = \frac{1}{N}\sum_{i=1}^N \ell_i(\theta). \tag{11.1}$$

Let the local minimizer of L_N(θ) be θ^*_N, which may not be unique (in this case we further denote them as θ^{*,1}_N, ..., θ^{*,k}_N). Then the empirical loss function evaluated at θ^*_N is L_N(θ^*_N). The generalization gap is defined to be

$$\mathrm{Gap}_N \equiv L_N(\theta^*_N) - E_{(x,y)}\ell((x,y),\theta^*_N) = \frac{1}{N}\sum_{i=1}^N \ell((x_i,y_i),\theta^*_N) - E_{(x,y)}\ell((x,y),\theta^*_N). \tag{11.2}$$

We further define the two-sided generalization gaps as

$$\mathrm{Gap}^+_N \equiv L_N(\theta^*_N) - E_{(x,y)}\ell((x,y),\theta^*_N), \tag{11.3}$$

$$\mathrm{Gap}^-_N \equiv E_{(x,y)}\ell((x,y),\theta^*_N) - L_N(\theta^*_N). \tag{11.4}$$

In classical statistical learning theory, the basic generalization gap bounds¹ control Gap_N in (11.2) in terms of features of the family of functions consisting of the loss functions ℓ_i(θ), in which we vary both the parameters θ and the training data points (x_i, y_i), i = 1, 2, ..., N. Typical such features are the Vapnik–Chervonenkis dimension (VC dimension) and the Rademacher complexity.

(a) VC dimension. Let the loss function be ℓ((x, y), θ) = 1(h(x, θ) ≠ y) for a family of functions h(x, θ), θ ∈ Θ. Then we define the growth function as

$$\Pi_\Theta(N) = \max_{x_1,\dots,x_N} \left|\{(h(x_1,\theta), \dots, h(x_N,\theta)) : \theta \in \Theta\}\right|. \tag{11.5}$$

Then we have the following bound for the gap Gap_N in (11.2):

$$P(\mathrm{Gap}_N > \varepsilon) \le 4\,\Pi_\Theta(2N)\exp\left(-\frac{N\varepsilon^2}{8}\right). \tag{11.6}$$

The VC dimension is then defined using the growth function:

¹ See Vapnik, V., Statistical Learning Theory, Wiley–Interscience, first edition, 1998.

$$VC(\Theta) = \max\{N : \Pi_\Theta(N) = 2^N\}. \tag{11.7}$$

Suppose VC(Θ) = d; then for any N > d and 0 < δ < 1 we obtain the generalization gap estimate in terms of the VC dimension:

$$P\left(\mathrm{Gap}^-_N \le \sqrt{\frac{8d\ln\frac{2eN}{d} + 8\ln\frac{4}{\delta}}{N}}\right) \ge 1 - \delta. \tag{11.8}$$

The above bound is distribution-free and data-independent. The notion of VC dimension fails to explain the effectiveness of over-parameterized deep neural network models, since in that case the VC dimension blows up yet the generalization gap remains reasonably small.

(b) Rademacher complexity. Let the loss function be ℓ((x, y), θ) = 1(h(x, θ) ≠ y) for a family of functions h(x, θ), θ ∈ Θ. Given the training data points (x_1, y_1), ..., (x_N, y_N), the empirical Rademacher complexity is defined as

" N # 1 X Rbx ,...,x (Θ) = Eσ ,...,σ sup σih(xi, θ) , (11.9) 1 N 1 N N θ∈Θ i=1 where the Rademacher variables σ1, ..., σN independently take values 1 or −1 with prob- abilities 1/2. Then we have the gap estimate

$$P\left(\mathrm{Gap}^-_N \le 2\hat{R}_{x_1,\dots,x_N}(\Theta) + 3\sqrt{\frac{\ln\frac{2}{\delta}}{2N}}\right) \ge 1 - \delta. \tag{11.10}$$

Again, the above generalization error bound only makes use of the whole parameter space θ ∈ Θ and does not use the specific information provided by the training algorithm, such as SGD. Indeed, if we run an optimization process like SGD, which can be regarded as a sampling of θ^*_N, we may end up finding, among the many minimizers θ^{*,1}_N, ..., θ^{*,k}_N, a specific one that achieves good generalization properties.
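For a finite hypothesis class the supremum in (11.9) can be evaluated directly, so the empirical Rademacher complexity is easy to estimate by Monte Carlo over the Rademacher variables. The sketch below does this for a small invented class of one-dimensional threshold classifiers; it is only meant to make definition (11.9) concrete, not to reproduce any bound from the references.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=50)                                   # N = 50 invented data points

# Finite hypothesis class: h(x, theta) = sign(x - theta) for a grid of thresholds
thetas = np.linspace(-2, 2, 41)
H = np.sign(x[None, :] - thetas[:, None])                 # shape (|Theta|, N), entries +-1

def empirical_rademacher(H, num_mc=2000):
    """Monte Carlo estimate of E_sigma sup_theta (1/N) sum_i sigma_i h(x_i, theta)."""
    n_hyp, N = H.shape
    sigma = rng.choice([-1.0, 1.0], size=(num_mc, N))     # Rademacher variables
    # For each sigma draw, take the sup over the finite class, then average over draws
    return np.mean(np.max(sigma @ H.T, axis=1) / N)

print(empirical_rademacher(H))
```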

11.2 Training Matters: Connecting Optimization with Generalization

When applying the classical gap estimates such as (11.8) and (11.10) to deep learning models, the immense number of parameters makes complexity measurements such as the VC dimension and the Rademacher complexity very large, and thus the estimates (11.8) and (11.10) become very rough and useless in practice. This phenomenon is called the "curse of dimensionality", indicating that deep models may overfit the training data and fail to generalize well on test data. However, in practical experiments people found that deep learning models almost never exhibit such bad generalization; instead, models trained by optimization methods such as SGD generalize very well on test data, with test accuracies often reaching 80–85%+. This seemingly mysterious phenomenon has been one of the major challenges for the theory of deep learning. Currently (at the writing of these lecture notes, in the middle of 2020) we have only a shallow understanding of this phenomenon. It is commonly believed that such strange behavior may be due to the interaction of two effects: the implicit regularization of the optimization dynamics and the network architecture. A few relevant papers/preprints:

Cao, Y. and Gu, Q., Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks, NeurIPS, 2019.
Mei, S. and Montanari, A., The generalization error of random features regression: Precise asymptotics and double descent curve, arXiv:1908.05355 [math.ST].
Jacot, A., Gabriel, F., and Hongler, C., Neural Tangent Kernel: Convergence and generalization in neural networks, NIPS, 2018.
Wu, J., Hu, W., Xiong, H., Huan, J., Braverman, V. and Zhu, Z., On the Noisy Gradient Descent that Generalizes as SGD, arXiv:1906.07405 [cs.LG].

11.3 Experiments: Real Data Sets

• CIFAR-10 is a common benchmark in machine learning for image recognition. The CIFAR-10 dataset consists of 60000 32×32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The CIFAR-10 and CIFAR-100 datasets are introduced at this webpage: http://www.cs.toronto.edu/~kriz/cifar.html
• Convolutional Neural Networks: see http://cs231n.github.io/convolutional-networks/
• Use TensorFlow to train and evaluate a convolutional neural network (CNN) on both CPU and GPU. We also demonstrate how to train a CNN over multiple GPUs. Available at
https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10
https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py
Detailed instructions on how to get started are available at https://www.tensorflow.org/tutorials/images/deep_cnn
• Make use of Google Colab for GPU/TPU support. See https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d
• The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and methods on real-world data while spending minimal effort on preprocessing and formatting. See
http://yann.lecun.com/exdb/mnist/
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
• More datasets and error/accuracy results in recent papers. See http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

A Big-O and Little-O notations

Big–O and Little–O notations are used very often in asymptotic analysis, in particular to express asymptotic rates of convergence in the analysis of algorithms. See https://en.wikipedia.org/wiki/Big_O_notation#Infinite_asymptotics
The symbol O(x), pronounced "Big–O of x", is one of the Landau symbols and is used to symbolically express the asymptotic behavior of a given function. In particular, if n is an integer variable which tends to infinity and x is a continuous variable tending to some limit, if φ(n) and φ(x) are positive functions, and if f(n) and f(x) are arbitrary functions, then it is said that f ∈ O(φ) provided that |f| < Aφ for some constant A and all values of n and x. Note that Big–O notation is the inverse of Big–Ω notation, i.e., that

f(n) ∈ O(φ(n)) ⇐⇒ φ(n) ∈ Ω(f(n)).

Additionally, Big–O notation is related to Little–O notation in that f ∈ o(φ) is stronger than and implies f ∈ O(φ).
The symbol o(x), pronounced "Little–O of x," is one of the Landau symbols and is used to symbolically express the asymptotic behavior of a given function. In particular, if n is an integer variable which tends to infinity and x is a continuous variable tending to some limit, if φ(n) and φ(x) are positive functions, and if f(n) and f(x) are arbitrary functions, then it is said that f ∈ o(φ) provided that f/φ → 0. Thus, φ(n) or φ(x) grows much faster than f(n) or f(x). Note that Little–O notation is the inverse of Little–ω notation, i.e., that

f(n) ∈ o(φ(n)) ⇐⇒ φ(n) ∈ ω(f(n)) .

Additionally, Little–O notation is related to Big–O notation in that f ∈ o(φ) is stronger than and implies f ∈ O(φ).
