Nonlinear Optimization in Machine Learning A Series of Lecture Notes at Missouri S&T
Wenqing Hu ∗

∗ Department of Mathematics and Statistics, Missouri University of Science and Technology (formerly University of Missouri, Rolla). Web: http://web.mst.edu/~huwen/ Email: [email protected]
Contents
1 Background on Machine Learning: Why Nonlinear Optimization?
  1.1 Empirical Risk Minimization
  1.2 Loss Functions
  1.3 Learning Models
2 Basic Information about Convex Functions
  2.1 Optimization problem and its solutions
  2.2 Convexity
  2.3 Taylor's theorem and convexity in Taylor's expansion
  2.4 Optimality Conditions
3 Gradient Descent and Line Search Methods
  3.1 Gradient Descent
  3.2 Back Propagation and GD on Neural Networks
  3.3 Online Principal Component Analysis (PCA)
  3.4 Line Search Methods
  3.5 Convergence to Approximate Second Order Necessary Points
4 Accelerated Gradient Methods
  4.1 Polyak's heavy ball method
  4.2 Nesterov's Accelerated Gradient Descent (AGD)
  4.3 Conjugate gradient method
5 Stochastic Gradient Methods
  5.1 SGD
  5.2 Variance–reduction methods: SVRG, SAG, SAGA, SARAH, SPIDER, STORM
  5.3 Escape from saddle points and nonconvex problems, SGLD, Perturbed GD
  5.4 Coordinate Descent and Stochastic Coordinate Descent
6 Second Order Methods
  6.1 Newton's method
  6.2 Quasi–Newton: BFGS
  6.3 Stochastic Quasi–Newton
7 Dual Methods
  7.1 Dual Coordinate Descent/Ascent
  7.2 SDCA
8 Optimization with Constraints
  8.1 Subgradient Projection Method, Frank–Wolfe Method
  8.2 ADMM
9 Adaptive Methods
  9.1 SGD with momentum
  9.2 AdaGrad, RMSProp, AdaDelta
  9.3 Adam
10 Optimization with Discontinuous Objective Function
  10.1 Proximal operators and proximal algorithms
11 Optimization and Generalization
  11.1 Generalization Bounds in Statistical Machine Learning
  11.2 Training Matters: Connecting Optimization with Generalization
  11.3 Experiments: Real Data Sets
A Big-O and Little-O notations
1 Background on Machine Learning: Why Nonlinear Optimization?
1.1 Empirical Risk Minimization
Supervised Learning: given training data points (x_1, y_1), ..., (x_n, y_n), construct a learning model y = g(x, ω) that best fits the training data. Here ω stands for the parameters of the learning model.
Here the (x_i, y_i) form an independent, identically distributed (i.i.d.) family, (x_i, y_i) ∼ p(x, y), where p(x, y) is the joint density. The model x → y is a black box described by p(y|x), which is to be fit by g(x, ω). A "loss function" L(g(x, ω), y) measures the quality of the fit; for example, the squared loss L(g(x, ω), y) = (g(x, ω) − y)^2. Empirical Risk Minimization (ERM):
\[
\omega_n^* = \arg\min_{\omega} \frac{1}{n} \sum_{i=1}^n L(y_i, g(x_i, \omega)) . \tag{1.1}
\]
Regularized Empirical Risk Minimization (R-ERM):
\[
\omega_n^* = \arg\min_{\omega} \frac{1}{n} \sum_{i=1}^n L(y_i, g(x_i, \omega)) + \lambda R(\omega) . \tag{1.2}
\]
In general, let f_i(ω) = L(y_i, g(x_i, ω)) or f_i(ω) = L(y_i, g(x_i, ω)) + λR(ω); the optimization problem is then
\[
\omega_n^* = \arg\min_{\omega} \frac{1}{n} \sum_{i=1}^n f_i(\omega) . \tag{1.3}
\]
Key features of nonlinear optimization problems in machine learning: large scale, nonconvex, etc. Key problem in machine learning: optimization combined with generalization. "Population Loss": E L(g(x, ω), y), with minimizer
\[
\omega^* = \arg\min_{\omega} \mathbb{E}\, L(g(x, \omega), y) .
\]
Generalization Error: E L(y, g(x, ω_n^*)). Consistency: do we have ω_n^* → ω^*? At what speed? Key problems in optimization: convergence, acceleration, variance reduction. How can optimization be related to generalization? Vapnik–Chervonenkis dimension, Rademacher complexity, geometry of the loss landscape.
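To make the ERM objective (1.3) concrete, here is a minimal NumPy sketch that evaluates the (regularized) empirical risk for a generic parametric model; the squared loss, the L^2 penalty, the model `g`, and the synthetic data are illustrative assumptions, not part of the notes.

```python
import numpy as np

def empirical_risk(w, X, y, g, lam=0.0):
    """Regularized empirical risk (1/n) sum_i L(y_i, g(x_i, w)) + lam * R(w),
    with squared loss L(y, g) = (y - g)^2 and R(w) = |w|_2^2 (illustrative choices)."""
    preds = np.array([g(x, w) for x in X])      # g(x_i, w) for each sample
    losses = (y - preds) ** 2                   # squared loss per sample
    return losses.mean() + lam * np.dot(w, w)   # empirical risk + L2 penalty

# Example: linear model g(x, w) = w^T x on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, y, g=lambda x, w: w @ x, lam=0.01))
```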
1.2 Loss Functions
Classification Problems: choice of the loss function L(y, g), with y ∈ {1, −1}. 0/1 Loss:
\[
\ell_{0/1}(y, g) = \begin{cases} 1 & \text{if } yg < 0 , \\ 0 & \text{otherwise.} \end{cases}
\]
(1) Hinge Loss.
\[
L(y, g) = \max(0, 1 - yg) ; \tag{1.4}
\]
(2) Exponential Loss.
\[
L(y, g) = \exp(-yg) ; \tag{1.5}
\]
(3) Cross Entropy Loss.
\[
L(y, g) = -\left( I_{\{y=1\}} \ln \frac{e^{g}}{e^{g} + e^{-g}} + I_{\{y=-1\}} \ln \frac{e^{-g}}{e^{g} + e^{-g}} \right) . \tag{1.6}
\]
Regression Problems: choice of the loss function L(y, g).
(1) L^2–Loss.
\[
L(y, g) = |y - g|_2^2 . \tag{1.7}
\]
(2) L^1–Loss.
\[
L(y, g) = |y - g|_1 . \tag{1.8}
\]
(3) L^0–Loss.
\[
L(y, g) = |y - g|_0 . \tag{1.9}
\]
Regularization (penalty) term R(ω): L^1–regularized, R(ω) = |ω|_1; L^0–regularized, R(ω) = |ω|_0.
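As a compact reference, here is a NumPy sketch of the classification and regression losses above and the two regularizers; the vectorized form and variable names are my own.

```python
import numpy as np

# Classification losses, y in {+1, -1}, g a real-valued score
hinge = lambda y, g: np.maximum(0.0, 1.0 - y * g)                 # (1.4)
exp_loss = lambda y, g: np.exp(-y * g)                            # (1.5)
def cross_entropy(y, g):                                          # (1.6)
    p = np.exp(g) / (np.exp(g) + np.exp(-g))                      # model probability of y = 1
    return -np.where(y == 1, np.log(p), np.log(1.0 - p))

# Regression losses and regularizers
l2_loss = lambda y, g: (y - g) ** 2                               # (1.7)
l1_loss = lambda y, g: np.abs(y - g)                              # (1.8)
l0_loss = lambda y, g: (y != g).astype(float)                     # (1.9): 1 where y differs from g
l1_reg = lambda w: np.sum(np.abs(w))                              # R(w) = |w|_1
l0_reg = lambda w: np.count_nonzero(w)                            # R(w) = |w|_0
```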
1.3 Learning Models
(1) Linear regression: g(x, ω) = ω^T x, or, with a sigmoid link, g(x, ω) = 1/(1 + exp(−ω^T x)). Least squares problem:
\[
\min_{\omega} \frac{1}{2m} \sum_{j=1}^m (x_j^T \omega - y_j)^2 = \min_{\omega} \frac{1}{2m} |A\omega - y|_2^2 . \tag{1.10}
\]
Here A = (x_1, ..., x_m)^T, i.e., the j-th row of A is x_j^T. Tikhonov regularization:
\[
\min_{\omega} \frac{1}{2m} |A\omega - y|_2^2 + \lambda |\omega|_2^2 . \tag{1.11}
\]
LASSO (Least Absolute Shrinkage and Selection Operator):
\[
\min_{\omega} \frac{1}{2m} |A\omega - y|_2^2 + \lambda |\omega|_1 . \tag{1.12}
\]
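For reference, a short NumPy sketch of the three objectives (1.10)-(1.12); the variable names, random data, and the closed-form least-squares solve at the end are illustrative, not taken from the notes.

```python
import numpy as np

def least_squares_obj(w, A, y):
    """(1/2m) |A w - y|_2^2, as in (1.10)."""
    m = A.shape[0]
    r = A @ w - y
    return r @ r / (2 * m)

def ridge_obj(w, A, y, lam):
    """Tikhonov-regularized objective (1.11)."""
    return least_squares_obj(w, A, y) + lam * (w @ w)

def lasso_obj(w, A, y, lam):
    """LASSO objective (1.12)."""
    return least_squares_obj(w, A, y) + lam * np.sum(np.abs(w))

# The unregularized problem (1.10) can be solved directly with a least-squares routine:
A = np.random.default_rng(1).normal(size=(50, 3))
y = A @ np.array([1.0, -2.0, 0.5])
w_star, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w_star, least_squares_obj(w_star, A, y))
```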
(2) Support Vector Machines (SVM): x_j ∈ R^n, y_j ∈ {1, −1}. The goal is to seek ω ∈ R^n and β ∈ R such that ω^T x_j + β ≥ 1 if y_j = 1 and ω^T x_j + β ≤ −1 if y_j = −1. Hinge loss:
\[
L(\omega, \beta) = \frac{1}{m} \sum_{j=1}^m \max\big(0, 1 - y_j(x_j^T \omega + \beta)\big) .
\]
Classification problem: the goal is to find a hyperplane ω^T x + β = 0 that classifies the two kinds of data points most efficiently. The signed distance of any point x ∈ R^n to the hyperplane is
\[
r = \frac{\omega^T x + \beta}{|\omega|} .
\]
If the classification is good enough, we expect ω^T x_j + β > 0 when y_j = 1 and ω^T x_j + β < 0 when y_j = −1. We can then formulate the problem as looking for optimal ω and β such that ω^T x_j + β ≥ 1 when y_j = 1 and ω^T x_j + β ≤ −1 when y_j = −1. The closest few data points, which attain equality in these two inequalities, are called "support vectors". The distance between the two hyperplanes determined by support vectors of opposite type is 2/|ω|. So we have the constrained optimization problem
\[
\max_{\omega \in \mathbb{R}^n, \, \beta \in \mathbb{R}} \frac{2}{|\omega|} \quad \text{such that } y_j(\omega^T x_j + \beta) \geq 1 \text{ for } j = 1, 2, ..., m .
\]
Or, in other words, we have the constrained optimization problem
\[
\min_{\omega \in \mathbb{R}^n, \, \beta \in \mathbb{R}} \frac{1}{2} |\omega|^2 \quad \text{such that } y_j(\omega^T x_j + \beta) \geq 1 \text{ for } j = 1, 2, ..., m . \tag{1.13}
\]
"Soft margin": we allow the SVM to make errors on some training data points, but we want to keep these errors small. That is, we allow some training data to violate y_j(ω^T x_j + β) ≥ 1, so that ideally we minimize
\[
\min_{\omega \in \mathbb{R}^n, \, \beta \in \mathbb{R}} \frac{1}{2} |\omega|^2 + C \sum_{j=1}^m \ell_{0/1}\big(y_j(\omega^T x_j + \beta) - 1\big) .
\]
Here the 0/1 loss is ℓ_{0/1}(z) = 1 if z < 0 and ℓ_{0/1}(z) = 0 otherwise. Since the 0/1 loss is discontinuous, we replace it by the hinge loss as a surrogate; this is where the hinge loss comes in:
\[
\min_{\omega \in \mathbb{R}^n, \, \beta \in \mathbb{R}} \frac{1}{2} |\omega|^2 + C \sum_{j=1}^m \max\big(0, 1 - y_j(\omega^T x_j + \beta)\big) . \tag{1.14}
\]
With "slack variables" ξ_j ≥ 0, the "soft margin SVM" is
\[
\min_{\omega \in \mathbb{R}^n, \, \beta \in \mathbb{R}} \frac{1}{2} |\omega|^2 + C \sum_{j=1}^m \xi_j \quad \text{s.t. } y_j(\omega^T x_j + \beta) \geq 1 - \xi_j , \ \xi_j \geq 0 , \ j = 1, 2, ..., m . \tag{1.15}
\]
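A minimal NumPy sketch of the hinge-loss form (1.14) of the soft-margin SVM, together with a few plain subgradient steps; the step size, toy data, and the choice of subgradient descent are illustrative assumptions rather than anything prescribed in the notes.

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """(1/2)|w|^2 + C * sum_j max(0, 1 - y_j (w^T x_j + b)), as in (1.14)."""
    margins = y * (X @ w + b)
    return 0.5 * w @ w + C * np.sum(np.maximum(0.0, 1.0 - margins))

def svm_subgradient_step(w, b, X, y, C, lr=1e-3):
    """One subgradient step on (1.14); each margin violator contributes -y_j x_j."""
    margins = y * (X @ w + b)
    active = margins < 1.0                      # points violating the margin
    grad_w = w - C * (y[active] @ X[active])    # subgradient w.r.t. w
    grad_b = -C * np.sum(y[active])             # subgradient w.r.t. b
    return w - lr * grad_w, b - lr * grad_b

# Toy linearly separable data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w, b = np.zeros(2), 0.0
for _ in range(500):
    w, b = svm_subgradient_step(w, b, X, y, C=1.0)
print(svm_objective(w, b, X, y, C=1.0))
```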
(3) Neural Network: "activation function" σ. Sigmoid:
\[
\sigma(z) = \frac{1}{1 + \exp(-z)} ; \tag{1.16}
\]
tanh:
\[
\sigma(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} ; \tag{1.17}
\]
ReLU (Rectified Linear Unit):
\[
\sigma(z) = \max(0, z) . \tag{1.18}
\]
Vector–valued activation: if z ∈ R^n, then σ : R^n → R^n is defined componentwise by (σ(z))_i = σ(z_i), where σ(z_i) is the scalar activation function. Fully connected neural network prediction function:
\[
g(x, \omega) = a^T \sigma\Big( W^{(H)} \sigma\big( W^{(H-1)} ( \cdots \sigma( W^{(1)} x + b_1) \cdots ) + b_{H-1} \big) + b_H \Big) . \tag{1.19}
\]
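As a concrete reading of (1.19), here is a short NumPy sketch of the forward pass of a fully connected network; the layer sizes, the ReLU activation, and the random weights are illustrative assumptions.

```python
import numpy as np

def forward(x, weights, biases, a, sigma=lambda z: np.maximum(z, 0.0)):
    """Fully connected prediction (1.19): apply z -> sigma(W^(h) z + b_h) layer by layer,
    then take the inner product with the output vector a."""
    z = x
    for W, b in zip(weights, biases):
        z = sigma(W @ z + b)
    return a @ z

# Illustrative network: input dimension 4, two hidden layers of width 6 and 5
rng = np.random.default_rng(0)
dims = [4, 6, 5]
weights = [rng.normal(size=(dims[h + 1], dims[h])) for h in range(len(dims) - 1)]
biases = [rng.normal(size=dims[h + 1]) for h in range(len(dims) - 1)]
a = rng.normal(size=dims[-1])
print(forward(rng.normal(size=dims[0]), weights, biases, a))
```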
Optimization:
\[
\min_{\omega} \frac{1}{2} \sum_{i=1}^n (g(x_i, \omega) - y_i)^2 .
\]
There are many other network structures: convolutional, recurrent (gated: GRU, LSTM), ResNet, etc. A two-layer (one hidden layer) fully connected ReLU neural network has a specific loss function structure: here
\[
W^{(1)} = \begin{pmatrix} \omega_1^T \\ \vdots \\ \omega_m^T \end{pmatrix}
\]
and
\[
g(x, \omega) = \sum_{r=1}^m a_r \max(\omega_r^T x, 0) . \tag{1.20}
\]
The optimization problem is given by
\[
\min_{\omega} \frac{1}{2} \sum_{i=1}^n \left( \sum_{r=1}^m a_r \max(\omega_r^T x_i, 0) - y_i \right)^2 .
\]
Non–convexity issues: see Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz, SGD learns over–parameterized networks that provably generalize on linearly separable data, arXiv:1710.10174, ICLR, 2018. Also see Problem 1 in Assignment 1.
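A small NumPy sketch of the two-layer ReLU model (1.20) and its squared-error training objective; the dimensions, random data, and choice of output weights are illustrative assumptions.

```python
import numpy as np

def two_layer_relu(x, W, a):
    """g(x, w) = sum_r a_r * max(w_r^T x, 0), as in (1.20); rows of W are the w_r."""
    return a @ np.maximum(W @ x, 0.0)

def training_loss(W, a, X, y):
    """(1/2) sum_i (g(x_i, w) - y_i)^2."""
    preds = np.array([two_layer_relu(x, W, a) for x in X])
    return 0.5 * np.sum((preds - y) ** 2)

# Illustrative sizes: m hidden units, d input dimensions, n samples
rng = np.random.default_rng(0)
m, d, n = 8, 3, 20
W = rng.normal(size=(m, d))       # hidden-layer weights w_r
a = rng.choice([-1.0, 1.0], m)    # output weights a_r (illustrative choice)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
print(training_loss(W, a, X, y))
```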
(4) Matrix Completion: each A_j is an n × p matrix, and we seek another n × p matrix X such that
\[
\min_{X} \frac{1}{2m} \sum_{j=1}^m (\langle A_j, X \rangle - y_j)^2 , \tag{1.21}
\]
where ⟨A, B⟩ = tr(A^T B). We can think of the A_j as "probing" the unknown matrix X.
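A brief NumPy sketch of the objective (1.21) and its gradient; the random probe matrices, dimensions, and ground-truth matrix are illustrative assumptions.

```python
import numpy as np

def sensing_objective(X, A_list, y):
    """(1/2m) sum_j (<A_j, X> - y_j)^2 with <A, B> = tr(A^T B), as in (1.21)."""
    m = len(A_list)
    residuals = np.array([np.sum(A * X) for A in A_list]) - y   # <A_j, X> = sum of entrywise products
    return residuals @ residuals / (2 * m)

def sensing_gradient(X, A_list, y):
    """Gradient of (1.21): (1/m) sum_j (<A_j, X> - y_j) A_j."""
    m = len(A_list)
    residuals = np.array([np.sum(A * X) for A in A_list]) - y
    return sum(r * A for r, A in zip(residuals, A_list)) / m

# Illustrative data: random probes of a random ground-truth matrix
rng = np.random.default_rng(0)
n, p, m = 4, 3, 30
X_true = rng.normal(size=(n, p))
A_list = [rng.normal(size=(n, p)) for _ in range(m)]
y = np.array([np.sum(A * X_true) for A in A_list])
print(sensing_objective(X_true, A_list, y))   # 0 at the ground truth
```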
(5) Nonnegative Matrix Factorization: if the full matrix Y ∈ R^{n×p} is observed, then we seek L ∈ R^{n×r} and R ∈ R^{p×r} such that
\[
\min_{L, R} \| L R^T - Y \|_F^2 \quad \text{subject to } L \geq 0 \text{ and } R \geq 0 . \tag{1.22}
\]
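A minimal NumPy sketch of the NMF objective (1.22) with a few projected gradient steps (a gradient step followed by clipping at zero to keep the factors nonnegative); the step size and problem sizes are illustrative, and other update schemes (e.g., multiplicative updates) are also common.

```python
import numpy as np

def nmf_objective(L, R, Y):
    """|| L R^T - Y ||_F^2, as in (1.22)."""
    return np.linalg.norm(L @ R.T - Y, 'fro') ** 2

def projected_gradient_step(L, R, Y, lr=1e-3):
    """One projected gradient step: move along the negative gradient, then clip at 0."""
    E = L @ R.T - Y                               # residual matrix
    L_new = np.maximum(L - lr * 2 * E @ R, 0.0)   # grad_L = 2 (L R^T - Y) R
    R_new = np.maximum(R - lr * 2 * E.T @ L, 0.0) # grad_R = 2 (L R^T - Y)^T L
    return L_new, R_new

# Illustrative problem: factor a random nonnegative matrix of exact rank r
rng = np.random.default_rng(0)
n, p, r = 20, 15, 4
Y = rng.random((n, r)) @ rng.random((r, p))
L, R = rng.random((n, r)), rng.random((p, r))
for _ in range(2000):
    L, R = projected_gradient_step(L, R, Y)
print(nmf_objective(L, R, Y))
```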