ECS289: Scalable Machine Learning
Cho-Jui Hsieh UC Davis
Sept 29, 2016 Outline
Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton’s method Stochastic Gradient Descent Numerical Optimization
Numerical Optimization: minX f (X ) Can be applied to computer science, economics, control engineering, operating research, . . . Machine Learning: find a model that minimizes the prediction error. Properties of the Function
Smooth function: functions with continuous derivative. Example: ridge regression 1 λ argmin kX w − yk2 + kwk2 w 2 2 Non-smooth function: Lasso, primal SVM
1 2 Lasso: argmin kX w − yk + λkwk1 w 2 n X T λ 2 SVM: argmin max(0, 1 − yi w xi )+ kwk w 2 i=1 Convex Functions
A function is convex if:
∀x1, x2, ∀t ∈ [0, 1], f (tx1 + (1 − t)x2) ≤ tf (x1) + (1 − t)f (x2)
No local optimum (why?)
Figure from Wikipedia Convex Functions
If f (x) is twice differentiable, then f is convex if and only if ∇2f (x) 0 Optimal solution may not be unique: has a set of optimal solutions S Gradient: capture the first order change of f :
f (x + αd) = f (x) + α∇f (x)T d + O(α2)
If f is convex and differentiable, we have the following optimality condition: x∗ ∈ S if and only if ∇f (x) = 0 Strongly Convex Functions
f is strongly convex if there exists an m > 0 such that m f (y) ≥ f (x) + ∇f (x)T (y − x) + ky − xk2 2 2 A strongly convex function has a unique global optimum x∗ (why?) If f is twice differentiable, then
f is strongly convex if and only if ∇2f (x) mI for all x
Gradient descent, coordinate descent will converge linearly (will see later) Nonconvex Functions
If f is nonconvex, most algorithms can only converge to stationary points x¯ is a stationary point if and only if ∇f (x¯) = 0 Three types of stationary points: (1) Global optimum (2) Local optimum (3) Saddle point Example: matrix completion, neural network, . . . 1 2 Example: f (x, y) = 2 (xy − a) Coordinate Descent Coordinate Descent
Update one variable at a time Coordinate Descent: repeatedly perform the following loop Step 1: pick an index i Step 2: compute a step size δ∗ by (approximately) minimizing
argmin f (x + δei ) δ ∗ Step 3: xi ← xi + δ Random: each time pick a random coordinate to update Typical way: sample from uniform distribution Sample from uniform distribution vs sample from biased distribution P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015 D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015
Coordinate Descent (update sequence)
Three types of updating order: Cyclic: update sequence
x1, x2,..., xn , x1, x2,..., xn ,... | {z } | {z } 1st outer iteration 2nd outer iteration
Randomly permute the sequence for each outer iteration (faster convergence in practice) Some interesting theoretical analysis for this recently C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent. 2016 Coordinate Descent (update sequence)
Three types of updating order: Cyclic: update sequence
x1, x2,..., xn , x1, x2,..., xn ,... | {z } | {z } 1st outer iteration 2nd outer iteration
Randomly permute the sequence for each outer iteration (faster convergence in practice) Some interesting theoretical analysis for this recently C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent. 2016 Random: each time pick a random coordinate to update Typical way: sample from uniform distribution Sample from uniform distribution vs sample from biased distribution P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015 D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015 Greedy Coordinate Descent
Greedy: choose the most “important” coordinate to update How to measure the importance?
By first derivative: |∇i f (x)| 2 By first and second derivative: |∇i f (x)/∇ii f (x)| By maximum reduction of objective function ∗ i = argmax f (x) − min f (x + δei ) i=1,...,n δ Need to consider the time complexity for variable selection Useful for kernel SVM (see lecture 6) Extension: block coordinate descent
Variables are divided into blocks {X1,..., Xp}, where each Xi is a subset of variables and
X1 ∪ X2,..., Xp = {1,..., n}, Xi ∩ Xj = ϕ, ∀i, j
Each time update a Xi by (approximately) solving the subproblem within the block Example: alternating minimization for matrix completion (2 blocks). (See lecture 7) Coordinate Descent (convergence)
Converge to optimal if f (·) is convex and smooth Has a linear convergence rate if f (·) is strongly convex Linear convergence: error f (xt ) − f (x∗) decays as β, β2, β3,... for some β < 1. Local linear convergence: an algorithm converges linearly after kx − x∗k ≤ K for some K > 0 Coordinate Descent: other names
Alternating minimization (matrix completion) Iterative scaling (for log-linear models) Decomposition method (for kernel SVM) Gauss Seidel (for linear system when the matrix is positive definite) ... Gradient Descent Gradient Descent
Gradient descent algorithm: repeatedly conduct the following update: xt+1 ← xt − α∇f (xt ) where α > 0 is the step size It is a fixed point iteration method: x − α∇f (x) = x if and only if x is an optimal solution Step size too large ⇒ diverge; too small ⇒ slow convergence Gradient Descent: successive approximation
At each iteration, form an approximation of f (·):
t t t T 1 T 1 f (x + d) ≈ f˜ t (d) :=f (x ) + ∇f (x ) d + d ( I )d x 2 α 1 =f (xt ) + ∇f (xt )T d + d T d 2α t+1 t ˜ Update solution by x ← x + argmind fxt (d) ∗ t ˜ d = −α∇f (x ) is the minimizer of argmind fxt (d) d ∗ will decrease the original objective function f if α (step size) is small enough Gradient Descent: successive approximation
However, the function value will decrease if
Condition 1: f˜x (d) ≥ f (x + d) for all d Condition 2: f˜x (0) = f (x) Why?
t ∗ ˜ ∗ f (x + d ) ≤ fxt (d ) ˜ ≤ fxt (0) = f (xt )
˜ Condition 2 is satisfied by construction of fxt 1 2 Condition 1 is satisfied if α I ∇ f (x) for all x Gradient Descent: step size
A twice differentiable function has L-Lipchitz continuous gradient if and only if ∇2f (x) ≤ LI ∀x 1 In this case, Condition 2 is satisfied if α < L 1 Theorem: gradient descent converges if α < L 1 Theorem: gradient descent converges linearly with α < L if f is strongly convex Gradient Descent
In practice, we do not know L...... Step size α too large: the algorithm diverges Step size α too small: the algorithm converges very slowly Gradient Descent: line search
d ∗ is a “descent direction” if and only if (d ∗)T ∇f (x) < 0 Armijo rule bakctracking line search: 1 1 Try α = 1, 2 , 4 ,... until it staisfies f (x + αd ∗) ≤ f (x) + γα(d ∗)T ∇f (x) where 0 < γ < 1
Figure from http://ool.sourceforge.net/ool-ref.html Gradient Descent: line search
Gradient descent with line search: Converges to optimal solutions if f is smooth Converges linearly if f is strongly convex However, each iteration requires evaluating f several times Several other step-size selection approaches (an ongoing research topic, especially for stochastic gradient descent) Gradient Descent: applying to ridge regression
N×d N (0) Input: X ∈ R , y ∈ R , initial w ∗ 1 2 λ 2 Output: Solution w := argminw 2 kX w − yk + 2 kwk 1: t = 0 2: while not converged do 3: Compute the gradient
g = X T (X w − y) + λw
4: Choose step size αt 5: Update w ← w − αt g 6: t ← t + 1 7: end while Time complexity: O(nnz(X )) per iteration Proximal Gradient Descent
How can we apply gradient descent to solve the Lasso problem?
1 2 argmin kX w − yk + λ kwk1 w 2 | {z } non−differentiable
General composite function minimization:
argmin f (x) := {g(x) + h(x)} x where g is smooth and convex, h is convex but may be non-differentiable Usually assume h is simple (for computational efficiency) Proximal Gradient Descent: successive approximation
At each iteration, form an approximation of f (·):
t t t T 1 T 1 t f (x + d) ≈ f˜ t (d) :=g(x ) + ∇g(x ) d + d ( I )d + h(x + d) x 2 α 1 =g(xt ) + ∇g(xt )T d + d T d + h(xt + d) 2α t+1 t ˜ Update solution by x ← x + argmind fxt (d) This is called “proximal” gradient descent ∗ ˜ Usually d = argmind fxt (d) has a closed form solution Proximal Gradient Descent: `1-regularization
The subproblem:
t+1 t t T 1 T t x =x + argmin ∇g(x ) d + d d + λkx + dk1 d 2α 1 t t 2 = argmin ku − (x − α∇g(x ))k + λαkuk1 u 2 =S(xt − α∇g(xt ), αλ),
where S is the soft-thresholding operator defined by a − z if a > z S(a, z) = a + z if a < −z 0 if a ∈ [−z, z] Proximal Gradient: soft-thresholding
Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/ Proximal Gradient Descent for Lasso
N×d N (0) Input: X ∈ R , y ∈ R , initial w ∗ 1 2 Output: Solution w := argminw 2 kX w − yk + λkwk1 1: t = 0 2: while not converged do 3: Compute the gradient
g = X T (X w − y)
4: Choose step size αt 5: Update w ← S(w − αt g, αt λ) 6: t ← t + 1 7: end while Time complexity: O(nnz(X )) per iteration Newton’s Method Newton’s Method
Iteratively conduct the following updates:
x ← x − α∇2f (x)−1∇f (x)
where α is the step size If α = 1: converges quadratically when xt is close enough to x∗:
kxt+1 − x∗k ≤ Kkxt − x∗k2
for some constant K. This means the error f (xt ) − f (x∗) decays quadratically: β, β2, β4, β8, β16,... Only need few iterations to converge in this “quadratic convergence region” Newton’s Method
However, Newton’s update rule is more expensive than gradient descent/coordinate descent Newton’s Method
Need to compute ∇2f (x)−1∇f (x) Closed form solution: O(d3) for solving a d dimensional linear system Usually solved by another iterative solver: gradient descent coordinate descent conjugate gradient method ...
Examples: primal L2-SVM/logistic regression, `1-regularized logistic regression, . . . Newton’s Method
At each iteration, form an approximation of f (·):
t t t T 1 T 2 f (x + d) ≈ f˜ t (d) := f (x ) + ∇f (x ) d + d ∇ f (x)d x 2α t+1 t ˜ Update solution by x ← x + argmind fxt (d) When x is far away from x∗, needs line search to gaurantee convergence 2 m Assume LI ∇ f (x) mI for all x, then α ≤ L gaurantee the objective function value deacrease because L ∇2f (x) ∇2f (y) ∀x, y m In practice, we often just use line search. Proximal Newton
What if f (x) = g(x) + h(x) and h(x) is non-smooth (h(x) = kxk1)? At each iteration, form an approximation of f (·):
t t t T α T 2 f (x + d) ≈ f˜ t (d) := g(x ) + ∇g(x ) d + d ∇ g(x)d + h(x + d) x 2 t+1 t ˜ Update solution by x ← x + argmind fxt (d) Need another iterative solver for solving the subproblem Stochastic Gradient Method Stochastic Gradient Method: Motivation
Widely used for machine learning problems (with large number of samples)
Given training samples x1,..., xn, we usually want to solve the following empirical risk minimization (ERM) problem:
n X argmin `i (xi ), w i=1
where each `i (·) is the loss function Minimize the summation of individual loss on each sample Why does SG work? n 1 X E [∇f (x)] = ∇f (x) = ∇f (x) i i n i i=1
∗ ∗ ∗ Is it a fixed point method? No if η > 0 because x − η∇i f (x ) 6= x Is it a descent method? No, because f (xt+1) 6< f (xt )
Stochastic Gradient Method
Assume the objective function can be written as n 1 X f (x) = f (x) n i i=1 Stochastic gradient method: Iterative conducts the following updates 1 Choose an index i (uniform) randomly t+1 t t t 2 x ← x − η ∇fi (x ) ηt > 0 is the step size ∗ ∗ ∗ Is it a fixed point method? No if η > 0 because x − η∇i f (x ) 6= x Is it a descent method? No, because f (xt+1) 6< f (xt )
Stochastic Gradient Method
Assume the objective function can be written as n 1 X f (x) = f (x) n i i=1 Stochastic gradient method: Iterative conducts the following updates 1 Choose an index i (uniform) randomly t+1 t t t 2 x ← x − η ∇fi (x ) ηt > 0 is the step size Why does SG work? n 1 X E [∇f (x)] = ∇f (x) = ∇f (x) i i n i i=1 Stochastic Gradient Method
Assume the objective function can be written as n 1 X f (x) = f (x) n i i=1 Stochastic gradient method: Iterative conducts the following updates 1 Choose an index i (uniform) randomly t+1 t t t 2 x ← x − η ∇fi (x ) ηt > 0 is the step size Why does SG work? n 1 X E [∇f (x)] = ∇f (x) = ∇f (x) i i n i i=1
∗ ∗ ∗ Is it a fixed point method? No if η > 0 because x − η∇i f (x ) 6= x Is it a descent method? No, because f (xt+1) 6< f (xt ) Stochastic Gradient
Step size η has to decay to 0 (e.g., ηt = Ct−a for some constant a, C) SGD converges sub-linearly Many variants proposed recently SVRG, SAGA (2013, 2014): variance reduction AdaGrad (2011): adaptive learning rate RMSProp (2012): estimate learning rate by a running average of gradient. Adam (2015): adaptive moment estimation Widely used in machine learning SGD in Deep Learning
AdaGrad: adaptive step size for each parameter Update rule at the t-th iteration: Compute g = ∇f (x) 2 Estimate the second moment: Gii = Gii + gi for all i η Parameter update: xi ← xi − √ gi for all i Gii Proposed for convex optimization in:
“Adaptive subgradient methods for online learning and stochastic optimization” (JMLR 2011)
“Adaptive Bound Optimization for Online Convex Optimization” (COLT 2010) SGD in Deep Learning
AdaGrad: adaptive step size for each parameter Update rule at the t-th iteration: Compute g = ∇f (x) 2 Estimate the second moment: Gii = Gii + gi for all i η Parameter update: xi ← xi − √ gi for all i Gii Adam (Kingma and Ba, 2015): maintain both first moment and second moment Update rule at the t-th iteration: Compute g = ∇f (x) m = β1m + (1 − β1)g t mˆ = mt /(1 − β1) (estimate of first moment) 2 v = β2v + (1 − β2)g t vˆ = vt /(1 − β2) (estimate of second moment) x ← x − ηmˆ/(vˆ + ) All the operations are element-wise Stochastic Gradient: Variance Reduction
SGD is not a fixed point method: even if x∗ is optimal,
∗ ∗ ∗ x − η∇i f (x ) 6= x
∗ ∗ Reason: E[∇i f (x )] = ∇f (x ) = 0 but variance can be large Variance reduction: reduce the variance so that variance → 0 when t → ∞. SVRG
SVRG (Johnson and Zhang, 2013) For t = 1, 2, ··· 1 x˜ = x n 2 1 P g˜ = n i=1 ∇fi (x˜) 3 For s = 1, 2, ··· 1 Randomly pick i 2 x ← x − η(∇fi (x) − ∇fi (x˜) + g˜) | {z } ai Can easily show
E[ai ] = 0 (unbiased) ∗ Var[ai ] = 0 when x = x˜ = x Have linear convergence rate Stochastic Gradient: applying to ridge regression
Objective function:
n 1 X T 2 2 argmin (w xi − yi ) + λkwk w n i=1
1 Pn How to write as argminw n i=1 fi (w)? How to decompose into n components? Stochastic Gradient: applying to ridge regression
Objective function:
n 1 X T 2 2 argmin (w xi − yi ) + λkwk w n i=1
1 Pn How to write as argminw n i=1 fi (w)? T 2 2 First approach: fi (w) = (w xi − yi ) + λkwk Update rule:
t+1 t t T t w ← w − 2η (w xi − yi )xi − 2η λw t t T = (1 − 2η λ)w − 2η (w xi − yi )xi Stochastic Gradient: applying to ridge regression
Objective function:
n 1 X T 2 2 argmin (w xi − yi ) + λkwk w n i=1
1 Pn How to write as argminw n i=1 fi (w)? T 2 2 First approach: fi (w) = (w xi − yi ) + λkwk Update rule:
t+1 t t T t w ← w − 2η (w xi − yi )xi − 2η λw t t T = (1 − 2η λ)w − 2η (w xi − yi )xi
Need O(d) complexity per iteration even if data is sparse Stochastic Gradient: applying to ridge regression
Objective function:
n 1 X T 2 2 argmin (w xi − yi ) + λkwk w n i=1
1 Pn How to write as argminw n i=1 fi (w)? T 2 2 First approach: fi (w) = (w xi − yi ) + λkwk Update rule:
t+1 t t T t w ← w − 2η (w xi − yi )xi − 2η λw t t T = (1 − 2η λ)w − 2η (w xi − yi )xi
Need O(d) complexity per iteration even if data is sparse Solution: store w = sv where s is a scalar Update rule when selecting index i:
t t+1 t t T t 2η λn t wj ← wj − 2η (xi w − yi )Xij − wj , ∀j ∈ Ωi nj
Solution: update can be done in O(|Ωi |) operations
Stochastic Gradient: applying to ridge regression
Objective function:
n 1 X T 2 2 argmin (w xi − yi ) + λkwk w n i=1 Second approach:
define Ωi = {j | Xij 6= 0} for i = 1,..., n
define nj = |{i | Xij 6= 0}| for j = 1,..., d T 2 P λn 2 define fi (w) = (w xi − yi ) + w j∈Ωi nj j Stochastic Gradient: applying to ridge regression
Objective function:
n 1 X T 2 2 argmin (w xi − yi ) + λkwk w n i=1 Second approach:
define Ωi = {j | Xij 6= 0} for i = 1,..., n
define nj = |{i | Xij 6= 0}| for j = 1,..., d T 2 P λn 2 define fi (w) = (w xi − yi ) + w j∈Ωi nj j Update rule when selecting index i:
t t+1 t t T t 2η λn t wj ← wj − 2η (xi w − yi )Xij − wj , ∀j ∈ Ωi nj
Solution: update can be done in O(|Ωi |) operations Coming up
Next class: Parallel Optimization
Questions?