ECS289: Scalable Machine Learning

Cho-Jui Hsieh UC Davis

Sept 29, 2016

Outline

Convex vs. Nonconvex Functions
Newton's method
Stochastic Gradient Descent

Numerical Optimization

Numerical Optimization: min_X f(X)
Can be applied to computer science, economics, control engineering, operations research, ...
Machine Learning: find a model that minimizes the prediction error.

Properties of the Function

Smooth function: a function with continuous derivative.
Example: ridge regression
    argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²
Non-smooth function: Lasso, primal SVM

Lasso: argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁
SVM: argmin_w Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + (λ/2)‖w‖²

Convex Functions

A function is convex if:

∀x_1, x_2, ∀t ∈ [0, 1]:  f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)

Every local optimum is also a global optimum (why?)

Figure from Wikipedia

Convex Functions

If f(x) is twice differentiable, then f is convex if and only if ∇²f(x) ⪰ 0 for all x
The optimal solution may not be unique: there is a set of optimal solutions S
Gradient: captures the first-order change of f:

f(x + αd) = f(x) + α∇f(x)^T d + O(α²)

If f is convex and differentiable, we have the following optimality condition: x* ∈ S if and only if ∇f(x*) = 0

Strongly Convex Functions

f is strongly convex if there exists an m > 0 such that
    f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖²   for all x, y
A strongly convex function has a unique global optimum x* (why?)
If f is twice differentiable, then

f is strongly convex if and only if ∇²f(x) ⪰ mI for all x

Gradient descent and coordinate descent converge linearly (will see later)

Nonconvex Functions

If f is nonconvex, most algorithms can only be guaranteed to converge to stationary points
x̄ is a stationary point if and only if ∇f(x̄) = 0
Three types of stationary points: (1) global optimum, (2) local optimum, (3) saddle point
Examples: matrix completion, neural networks, ...
Example: f(x, y) = (1/2)(xy − a)²
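A quick check of the last example (worked out here, not on the original slide), assuming a ≠ 0:

    ∇f(x, y) = ((xy − a)y, (xy − a)x), so ∇f(0, 0) = (0, 0)
    ∇²f(0, 0) = [ 0  −a ; −a  0 ], with eigenvalues ±a ⇒ (0, 0) is a saddle point
    Every point on the curve xy = a attains f = 0, so these are the global optima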

Coordinate Descent

Update one variable at a time
Coordinate Descent: repeatedly perform the following loop
    Step 1: pick an index i
    Step 2: compute a step size δ* by (approximately) minimizing

        argmin_δ f(x + δe_i)
    Step 3: x_i ← x_i + δ*
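A minimal runnable sketch of this loop for the ridge-regression objective shown earlier (illustrative only, not from the slides): it uses cyclic order and the exact minimizer of the one-dimensional subproblem as δ*.

    import numpy as np

    def coordinate_descent_ridge(X, y, lam, epochs=20):
        """Cyclic coordinate descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
        n, d = X.shape
        w = np.zeros(d)
        r = X @ w - y                      # residual Xw - y, maintained incrementally
        col_sq = (X ** 2).sum(axis=0)      # ||X[:, i]||^2 for every coordinate
        for _ in range(epochs):
            for i in range(d):             # Step 1: pick an index (cyclic order here)
                # Step 2: exact minimizer of argmin_delta f(w + delta * e_i)
                grad_i = X[:, i] @ r + lam * w[i]
                delta = -grad_i / (col_sq[i] + lam)
                # Step 3: update the coordinate and the residual
                w[i] += delta
                r += delta * X[:, i]
        return w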

Coordinate Descent (update sequence)

Three types of updating order:
Cyclic: update sequence
    x_1, x_2, ..., x_n,  x_1, x_2, ..., x_n,  ...
    (1st outer iteration)  (2nd outer iteration)
Randomly permute the sequence for each outer iteration (faster convergence in practice)
    Some interesting theoretical analysis for this recently:
    C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent. 2016
Random: each time pick a random coordinate to update
    Typical way: sample from the uniform distribution
    Sample from the uniform distribution vs. sample from a biased distribution:
    P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015
    D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015

Greedy Coordinate Descent

Greedy: choose the most "important" coordinate to update
How to measure the importance?

By first derivative: |∇_i f(x)|
By first and second derivative: |∇_i f(x) / ∇²_{ii} f(x)|
By maximum reduction of the objective function:
    i* = argmax_{i=1,...,n} ( f(x) − min_δ f(x + δe_i) )
Need to consider the time complexity for variable selection
Useful for kernel SVM (see lecture 6)
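A sketch of greedy selection for the same ridge-regression objective (illustrative, not from the slides), using the |∇_i f(x) / ∇²_{ii} f(x)| rule; note that the selection itself costs a full gradient, i.e. O(nnz(X)) per update.

    import numpy as np

    def greedy_cd_ridge(X, y, lam, iters=200):
        """Greedy coordinate descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
        n, d = X.shape
        w = np.zeros(d)
        r = X @ w - y
        hess_diag = (X ** 2).sum(axis=0) + lam        # per-coordinate second derivative
        for _ in range(iters):
            grad = X.T @ r + lam * w                  # full gradient (selection cost)
            i = np.argmax(np.abs(grad / hess_diag))   # most "important" coordinate
            delta = -grad[i] / hess_diag[i]           # exact 1-D minimizer
            w[i] += delta
            r += delta * X[:, i]
        return w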

Extension: block coordinate descent

Variables are divided into blocks {X_1, ..., X_p}, where each X_i is a subset of variables and

X_1 ∪ X_2 ∪ ... ∪ X_p = {1, ..., n},   X_i ∩ X_j = ∅  ∀i ≠ j

Each time, update one block X_i by (approximately) solving the subproblem within the block
Example: alternating minimization for matrix completion (2 blocks). (See lecture 7)

Coordinate Descent (convergence)

Converges to an optimal solution if f(·) is convex and smooth
Has a linear convergence rate if f(·) is strongly convex
Linear convergence: the error f(x^t) − f(x*) decays as β, β², β³, ... for some β < 1
Local linear convergence: an algorithm converges linearly once ‖x − x*‖ ≤ K for some K > 0

Coordinate Descent: other names

Alternating minimization (matrix completion)
Iterative scaling (for log-linear models)
Decomposition method (for kernel SVM)
Gauss-Seidel (for linear systems when the matrix is positive definite)
...

Gradient Descent

Gradient descent: repeatedly conduct the following update:
    x^{t+1} ← x^t − α∇f(x^t)
where α > 0 is the step size
It is a fixed-point iteration method: x − α∇f(x) = x if and only if x is an optimal solution
Step size too large ⇒ diverge; too small ⇒ slow convergence

Gradient Descent: successive approximation

At each iteration, form an approximation of f (·):

f(x^t + d) ≈ f̃_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/2) d^T ((1/α)I) d
                          = f(x^t) + ∇f(x^t)^T d + (1/(2α)) d^T d
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
d* = −α∇f(x^t) is the minimizer of argmin_d f̃_{x^t}(d) (set ∇f̃_{x^t}(d) = ∇f(x^t) + d/α = 0)
d* decreases the original objective function f if α (the step size) is small enough

More precisely, the function value will decrease if

Condition 1: f̃_x(d) ≥ f(x + d) for all d
Condition 2: f̃_x(0) = f(x)
Why?

f(x^t + d*) ≤ f̃_{x^t}(d*) ≤ f̃_{x^t}(0) = f(x^t)

Condition 2 is satisfied by the construction of f̃_{x^t}
Condition 1 is satisfied if (1/α)I ⪰ ∇²f(x) for all x

Gradient Descent: step size

A twice differentiable function has L-Lipschitz continuous gradient if and only if ∇²f(x) ⪯ LI for all x
In this case, Condition 1 is satisfied if α < 1/L
Theorem: gradient descent converges if α < 1/L
Theorem: gradient descent converges linearly with α < 1/L if f is strongly convex

Gradient Descent

In practice, we do not know L ...
Step size α too large: the algorithm diverges
Step size α too small: the algorithm converges very slowly

Gradient Descent: line search

d* is a "descent direction" if and only if (d*)^T ∇f(x) < 0
Armijo-rule backtracking line search:
    Try α = 1, 1/2, 1/4, ... until it satisfies
        f(x + αd*) ≤ f(x) + γα(d*)^T ∇f(x)
    where 0 < γ < 1
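A minimal sketch of this rule (assuming f and its gradient are supplied as Python callables; γ = 0.5 and the halving schedule follow the description above):

    import numpy as np

    def backtracking_step_size(f, grad_f, x, d, gamma=0.5, alpha0=1.0, max_halvings=50):
        """Armijo backtracking: shrink alpha until the sufficient-decrease condition holds."""
        slope = float(np.dot(d, grad_f(x)))   # (d*)^T grad f(x); must be < 0 for a descent direction
        fx = f(x)
        alpha = alpha0
        for _ in range(max_halvings):
            if f(x + alpha * d) <= fx + gamma * alpha * slope:
                return alpha                  # sufficient decrease satisfied
            alpha *= 0.5                      # try alpha = 1, 1/2, 1/4, ...
        return alpha

For gradient descent one would call it with d = -grad_f(x).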

Figure from http://ool.sourceforge.net/ool-ref.html

Gradient Descent: line search

Gradient descent with line search:
    Converges to optimal solutions if f is smooth
    Converges linearly if f is strongly convex
    However, each iteration requires evaluating f several times
Several other step-size selection approaches (an ongoing research topic, especially for stochastic gradient descent)

Gradient Descent: applying to ridge regression

Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution w* := argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²
1: t = 0
2: while not converged do
3:    Compute the gradient

      g = X^T (Xw − y) + λw

4:    Choose step size α_t
5:    Update w ← w − α_t g
6:    t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration
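A runnable sketch of this procedure (illustrative; a fixed step size below 1/L is used here, where L = ‖X‖₂² + λ, instead of line search):

    import numpy as np

    def gd_ridge(X, y, lam, step_size, iters=500):
        """Gradient descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            g = X.T @ (X @ w - y) + lam * w   # gradient; O(nnz(X)) per iteration
            w -= step_size * g
        return w

    # Example usage on random data:
    X = np.random.randn(100, 20)
    y = np.random.randn(100)
    L = np.linalg.norm(X, 2) ** 2 + 0.1        # Lipschitz constant of the gradient
    w = gd_ridge(X, y, lam=0.1, step_size=1.0 / L)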

Proximal Gradient Descent

How can we apply gradient descent to solve the Lasso problem?

argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁   (the second term is non-differentiable)

General composite function minimization:

argmin_x f(x) := g(x) + h(x)
where g is smooth and convex, and h is convex but may be non-differentiable
Usually assume h is simple (for computational efficiency)

Proximal Gradient Descent: successive approximation

At each iteration, form an approximation of f (·):

f(x^t + d) ≈ f̃_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/2) d^T ((1/α)I) d + h(x^t + d)
                          = g(x^t) + ∇g(x^t)^T d + (1/(2α)) d^T d + h(x^t + d)
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
This is called "proximal" gradient descent
Usually d* = argmin_d f̃_{x^t}(d) has a closed-form solution

Proximal Gradient Descent: ℓ1-regularization

The subproblem:

x^{t+1} = x^t + argmin_d ∇g(x^t)^T d + (1/(2α)) d^T d + λ‖x^t + d‖₁
        = argmin_u (1/2)‖u − (x^t − α∇g(x^t))‖² + λα‖u‖₁   (substituting u = x^t + d)
        = S(x^t − α∇g(x^t), αλ),

where S is the soft-thresholding operator defined by
    S(a, z) = a − z   if a > z
              a + z   if a < −z
              0       if a ∈ [−z, z]

Proximal Gradient: soft-thresholding

Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/

Proximal Gradient Descent for Lasso

Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution w* := argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁
1: t = 0
2: while not converged do
3:    Compute the gradient

      g = X^T (Xw − y)

4:    Choose step size α_t
5:    Update w ← S(w − α_t g, α_t λ)
6:    t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration
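A runnable sketch of this loop (proximal gradient / ISTA for the Lasso; the fixed step size 1/L with L = ‖X‖₂² is an illustrative choice, not from the slides):

    import numpy as np

    def soft_threshold(a, z):
        """Soft-thresholding operator S(a, z), applied element-wise."""
        return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

    def proximal_gradient_lasso(X, y, lam, iters=500):
        """Proximal gradient descent for (1/2)||Xw - y||^2 + lam * ||w||_1."""
        n, d = X.shape
        w = np.zeros(d)
        alpha = 1.0 / (np.linalg.norm(X, 2) ** 2)     # step size 1/L
        for _ in range(iters):
            g = X.T @ (X @ w - y)                     # gradient of the smooth part
            w = soft_threshold(w - alpha * g, alpha * lam)
        return w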

Newton's Method

Iteratively conduct the following updates:

x ← x − α ∇²f(x)^{-1} ∇f(x)

where α is the step size
If α = 1: converges quadratically when x^t is close enough to x*:

‖x^{t+1} − x*‖ ≤ K‖x^t − x*‖²

for some constant K
This means the error f(x^t) − f(x*) decays quadratically: β, β², β⁴, β⁸, β¹⁶, ...
Only a few iterations are needed to converge once inside this "quadratic convergence region"

Newton's Method

However, Newton's update rule is more expensive than gradient descent/coordinate descent

Newton's Method

Need to compute ∇²f(x)^{-1} ∇f(x)
Closed-form solution: O(d³) for solving a d-dimensional linear system
Usually solved by another iterative solver:
    gradient descent
    coordinate descent
    conjugate gradient method
    ...
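A minimal sketch of the damped Newton iteration with a direct O(d³) solve (assuming callables for the gradient and Hessian; np.linalg.solve could be replaced by any of the iterative solvers listed above):

    import numpy as np

    def newton_method(grad_f, hess_f, x0, alpha=1.0, iters=20):
        """Newton's method: x <- x - alpha * H(x)^{-1} grad f(x)."""
        x = x0.astype(float).copy()
        for _ in range(iters):
            g = grad_f(x)
            H = hess_f(x)
            d = np.linalg.solve(H, g)   # Newton direction via a direct linear solve
            x -= alpha * d
        return x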

Examples: primal L2-SVM/logistic regression, ℓ1-regularized logistic regression, ...

Newton's Method

At each iteration, form an approximation of f (·):

f(x^t + d) ≈ f̃_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/(2α)) d^T ∇²f(x^t) d
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
When x is far away from x*, line search is needed to guarantee convergence
Assume LI ⪰ ∇²f(x) ⪰ mI for all x; then α ≤ m/L guarantees that the objective function value decreases, because
    (L/m) ∇²f(x) ⪰ ∇²f(y)   ∀x, y
In practice, we often just use line search.

Proximal Newton

What if f(x) = g(x) + h(x) and h(x) is non-smooth (e.g., h(x) = ‖x‖₁)?
At each iteration, form an approximation of f(·):

f(x^t + d) ≈ f̃_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (α/2) d^T ∇²g(x^t) d + h(x^t + d)
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
Need another iterative solver to solve the subproblem

Stochastic Gradient Method

Stochastic Gradient Method: Motivation

Widely used for machine learning problems (with a large number of samples)

Given training samples x1,..., xn, we usually want to solve the following empirical risk minimization (ERM) problem:

argmin_w Σ_{i=1}^n ℓ_i(w),

where each ℓ_i(·) is the loss function on sample x_i
Minimize the summation of the individual losses over all samples


Stochastic Gradient Method

Assume the objective function can be written as
    f(x) = (1/n) Σ_{i=1}^n f_i(x)
Stochastic gradient method: iteratively conduct the following updates
    1. Choose an index i uniformly at random
    2. x^{t+1} ← x^t − η^t ∇f_i(x^t)
η^t > 0 is the step size
Why does SG work?
    E_i[∇f_i(x)] = (1/n) Σ_{i=1}^n ∇f_i(x) = ∇f(x)
Is it a fixed-point method? No if η > 0, because x* − η∇f_i(x*) ≠ x*
Is it a descent method? No: f(x^{t+1}) < f(x^t) does not always hold
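A minimal sketch of this method for a generic finite-sum objective (assuming a per-component gradient callable; the 1/√t step-size decay is one common choice, not from the slides):

    import numpy as np

    def sgd(grad_fi, n, x0, eta0=0.1, iters=10000, seed=0):
        """Stochastic gradient method for f(x) = (1/n) * sum_i f_i(x).

        grad_fi(x, i) should return the gradient of the i-th component f_i at x.
        """
        rng = np.random.default_rng(seed)
        x = x0.astype(float).copy()
        for t in range(1, iters + 1):
            i = rng.integers(n)          # 1. pick an index uniformly at random
            eta = eta0 / np.sqrt(t)      # decaying step size
            x -= eta * grad_fi(x, i)     # 2. step along the stochastic gradient
        return x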

Stochastic Gradient

Step size η^t has to decay to 0 (e.g., η^t = C t^{−a} for some constants a, C)
SGD converges sub-linearly
Many variants proposed recently:
    SVRG, SAGA (2013, 2014): variance reduction
    AdaGrad (2011): adaptive learning rate
    RMSProp (2012): estimate the learning rate by a running average of squared gradients
    Adam (2015): adaptive moment estimation
Widely used in machine learning

SGD in Deep Learning

AdaGrad: adaptive step size for each parameter
Update rule at the t-th iteration:
    Compute g = ∇f(x)
    Estimate the second moment: G_ii = G_ii + g_i² for all i
    Parameter update: x_i ← x_i − (η / √G_ii) g_i for all i
Proposed in:

“Adaptive subgradient methods for online learning and stochastic optimization” (JMLR 2011)

"Adaptive Bound Optimization for Online Convex Optimization" (COLT 2010)

SGD in Deep Learning

Adam (Kingma and Ba, 2015): maintain both the first moment and the second moment
Update rule at the t-th iteration:
    Compute g = ∇f(x)
    m = β₁ m + (1 − β₁) g,    m̂ = m / (1 − β₁^t)   (estimate of the first moment)
    v = β₂ v + (1 − β₂) g²,   v̂ = v / (1 − β₂^t)   (estimate of the second moment)
    x ← x − η m̂ / (√v̂ + ε)
All the operations are element-wise
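Sketches of both update rules (assuming a gradient callable; the ε safeguards and the default hyperparameters are conventional choices, not taken from the slides):

    import numpy as np

    def adagrad(grad_f, x0, eta=0.01, eps=1e-8, iters=1000):
        """AdaGrad: per-coordinate step sizes from accumulated squared gradients."""
        x = x0.astype(float).copy()
        G = np.zeros_like(x)
        for _ in range(iters):
            g = grad_f(x)
            G += g ** 2                          # G_ii <- G_ii + g_i^2
            x -= eta * g / (np.sqrt(G) + eps)    # x_i <- x_i - (eta / sqrt(G_ii)) g_i
        return x

    def adam(grad_f, x0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
        """Adam: element-wise steps from bias-corrected first/second moment estimates."""
        x = x0.astype(float).copy()
        m = np.zeros_like(x)                     # first-moment estimate
        v = np.zeros_like(x)                     # second-moment estimate
        for t in range(1, iters + 1):
            g = grad_f(x)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat = m / (1 - beta1 ** t)         # bias correction
            v_hat = v / (1 - beta2 ** t)
            x -= eta * m_hat / (np.sqrt(v_hat) + eps)
        return x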

Stochastic Gradient: Variance Reduction

SGD is not a fixed-point method: even if x* is optimal,

x* − η∇f_i(x*) ≠ x*

Reason: E_i[∇f_i(x*)] = ∇f(x*) = 0, but the variance can be large
Variance reduction: reduce the variance so that the variance → 0 as t → ∞

SVRG

SVRG (Johnson and Zhang, 2013)
For t = 1, 2, ...
    1. x̃ = x
    2. g̃ = (1/n) Σ_{i=1}^n ∇f_i(x̃)
    3. For s = 1, 2, ...
        1. Randomly pick i
        2. x ← x − η a_i,  where a_i := ∇f_i(x) − ∇f_i(x̃) + g̃
Can easily show:

E[a_i] = ∇f(x) (an unbiased gradient estimate)
Var[a_i] = 0 when x = x̃ = x*
Has a linear convergence rate
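A sketch of SVRG for a finite-sum objective (assuming a per-component gradient callable; the inner-loop length m = 2n is a common heuristic, not from the slides):

    import numpy as np

    def svrg(grad_fi, n, x0, eta=0.01, outer_iters=30, inner_iters=None, seed=0):
        """SVRG (Johnson and Zhang, 2013) for f(x) = (1/n) * sum_i f_i(x)."""
        rng = np.random.default_rng(seed)
        x = x0.astype(float).copy()
        m = inner_iters if inner_iters is not None else 2 * n
        for _ in range(outer_iters):
            x_snap = x.copy()                                        # snapshot x~
            g_full = sum(grad_fi(x_snap, i) for i in range(n)) / n   # full gradient g~
            for _ in range(m):
                i = rng.integers(n)
                a_i = grad_fi(x, i) - grad_fi(x_snap, i) + g_full    # variance-reduced direction
                x -= eta * a_i
        return x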

Stochastic Gradient: applying to ridge regression

Objective function:

argmin_w (1/n) Σ_{i=1}^n (w^T x_i − y_i)² + λ‖w‖²

How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)? How to decompose into n components?

First approach: f_i(w) = (w^T x_i − y_i)² + λ‖w‖²
Update rule:
    w^{t+1} ← w^t − 2η^t ((w^t)^T x_i − y_i) x_i − 2η^t λ w^t
            = (1 − 2η^t λ) w^t − 2η^t ((w^t)^T x_i − y_i) x_i
Need O(d) complexity per iteration even if the data is sparse

Solution: store w = s·v where s is a scalar and v is a vector, so the dense scaling (1 − 2η^t λ) only needs to update s

Second approach:
    Define Ω_i = {j | X_ij ≠ 0} for i = 1, ..., n
    Define n_j = |{i | X_ij ≠ 0}| for j = 1, ..., d
    Define f_i(w) = (w^T x_i − y_i)² + Σ_{j∈Ω_i} (λn/n_j) w_j²
Update rule when selecting index i:
    w_j^{t+1} ← w_j^t − 2η^t ((x_i)^T w^t − y_i) X_ij − 2η^t (λn/n_j) w_j^t,   ∀j ∈ Ω_i
The update can be done in O(|Ω_i|) operations
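A sketch of the second approach on a sparse design matrix (assuming scipy.sparse CSR input; illustrative only), where each update touches only the coordinates in Ω_i:

    import numpy as np
    import scipy.sparse as sp

    def sgd_ridge_sparse(X, y, lam, eta0=0.1, epochs=5, seed=0):
        """SGD for ridge regression with the regularizer split across nonzero coordinates."""
        X = sp.csr_matrix(X)
        n, d = X.shape
        n_j = np.maximum(np.bincount(X.indices, minlength=d), 1)   # n_j, floored at 1 to avoid division by zero
        w = np.zeros(d)
        rng = np.random.default_rng(seed)
        t = 0
        for _ in range(epochs):
            for _ in range(n):
                t += 1
                i = rng.integers(n)
                start, end = X.indptr[i], X.indptr[i + 1]
                cols, vals = X.indices[start:end], X.data[start:end]   # Omega_i and the values X_ij
                eta = eta0 / np.sqrt(t)
                pred = vals @ w[cols]                                  # x_i^T w using only Omega_i
                w[cols] -= 2 * eta * (pred - y[i]) * vals + 2 * eta * (lam * n / n_j[cols]) * w[cols]
        return w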

Coming up

Next class: Parallel Optimization

Questions?