ECS289: Scalable Machine Learning

Cho-Jui Hsieh UC Davis

Sept 29, 2016

Outline

Convex vs. Nonconvex Functions
Newton's method
Stochastic Gradient Descent

Numerical Optimization

Numerical Optimization: min_X f(X)
Can be applied to computer science, economics, control engineering, operations research, ...
Machine Learning: find a model that minimizes the prediction error.

Properties of the Function

Smooth function: a function with continuous derivative.
Example: ridge regression
    argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²
Non-smooth function: Lasso, primal SVM

Lasso: argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁
SVM: argmin_w Σ_{i=1}^n max(0, 1 − y_i w^T x_i) + (λ/2)‖w‖²

Convex Functions

A function is convex if:

∀x_1, x_2, ∀t ∈ [0, 1]:  f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)

Every local optimum is also a global optimum (why?)

Figure from Wikipedia

Convex Functions

If f(x) is twice differentiable, then f is convex if and only if ∇²f(x) ⪰ 0 for all x
The optimal solution may not be unique: there is a set of optimal solutions S
Gradient: captures the first-order change of f:

f(x + αd) = f(x) + α∇f(x)^T d + O(α²)

If f is convex and differentiable, we have the following optimality condition: x* ∈ S if and only if ∇f(x*) = 0

Strongly Convex Functions

f is strongly convex if there exists an m > 0 such that
    f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖²   for all x, y
A strongly convex function has a unique global optimum x* (why?)
If f is twice differentiable, then

f is strongly convex if and only if ∇²f(x) ⪰ mI for all x

Gradient descent and coordinate descent converge linearly (will see later)

Nonconvex Functions

If f is nonconvex, most algorithms can only be guaranteed to converge to stationary points
x̄ is a stationary point if and only if ∇f(x̄) = 0
Three types of stationary points: (1) global optimum, (2) local optimum, (3) saddle point
Examples: matrix completion, neural networks, ...
Example: f(x, y) = (1/2)(xy − a)²
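A quick check of the last example (worked out here, not on the original slide), assuming a ≠ 0:

    ∇f(x, y) = ((xy − a)y, (xy − a)x), so ∇f(0, 0) = (0, 0)
    ∇²f(0, 0) = [ 0  −a ; −a  0 ], with eigenvalues ±a ⇒ (0, 0) is a saddle point
    Every point on the curve xy = a attains f = 0, so these are the global optima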

Coordinate Descent

Update one variable at a time
Coordinate Descent: repeatedly perform the following loop
    Step 1: pick an index i
    Step 2: compute a step size δ* by (approximately) minimizing

        argmin_δ f(x + δe_i)
    Step 3: x_i ← x_i + δ*
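A minimal runnable sketch of this loop for the ridge-regression objective shown earlier (illustrative only, not from the slides): it uses cyclic order and the exact minimizer of the one-dimensional subproblem as δ*.

    import numpy as np

    def coordinate_descent_ridge(X, y, lam, epochs=20):
        """Cyclic coordinate descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
        n, d = X.shape
        w = np.zeros(d)
        r = X @ w - y                      # residual Xw - y, maintained incrementally
        col_sq = (X ** 2).sum(axis=0)      # ||X[:, i]||^2 for every coordinate
        for _ in range(epochs):
            for i in range(d):             # Step 1: pick an index (cyclic order here)
                # Step 2: exact minimizer of argmin_delta f(w + delta * e_i)
                grad_i = X[:, i] @ r + lam * w[i]
                delta = -grad_i / (col_sq[i] + lam)
                # Step 3: update the coordinate and the residual
                w[i] += delta
                r += delta * X[:, i]
        return w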

Coordinate Descent (update sequence)

Three types of updating order:
Cyclic: update sequence
    x_1, x_2, ..., x_n,  x_1, x_2, ..., x_n,  ...
    (1st outer iteration)  (2nd outer iteration)
Randomly permute the sequence for each outer iteration (faster convergence in practice)
    Some interesting theoretical analysis for this recently:
    C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent. 2016
Random: each time pick a random coordinate to update
    Typical way: sample from the uniform distribution
    Sample from the uniform distribution vs. sample from a biased distribution:
    P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015
    D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015

Greedy Coordinate Descent

Greedy: choose the most "important" coordinate to update
How to measure the importance?

By first derivative: |∇_i f(x)|
By first and second derivative: |∇_i f(x) / ∇²_{ii} f(x)|
By maximum reduction of the objective function:
    i* = argmax_{i=1,...,n} ( f(x) − min_δ f(x + δe_i) )
Need to consider the time complexity for variable selection
Useful for kernel SVM (see lecture 6)
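A sketch of greedy selection for the same ridge-regression objective (illustrative, not from the slides), using the |∇_i f(x) / ∇²_{ii} f(x)| rule; note that the selection itself costs a full gradient, i.e. O(nnz(X)) per update.

    import numpy as np

    def greedy_cd_ridge(X, y, lam, iters=200):
        """Greedy coordinate descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
        n, d = X.shape
        w = np.zeros(d)
        r = X @ w - y
        hess_diag = (X ** 2).sum(axis=0) + lam        # per-coordinate second derivative
        for _ in range(iters):
            grad = X.T @ r + lam * w                  # full gradient (selection cost)
            i = np.argmax(np.abs(grad / hess_diag))   # most "important" coordinate
            delta = -grad[i] / hess_diag[i]           # exact 1-D minimizer
            w[i] += delta
            r += delta * X[:, i]
        return w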

Extension: block coordinate descent

Variables are divided into blocks {X_1, ..., X_p}, where each X_i is a subset of variables and

X_1 ∪ X_2 ∪ ... ∪ X_p = {1, ..., n},   X_i ∩ X_j = ∅  ∀i ≠ j

Each time, update one block X_i by (approximately) solving the subproblem within the block
Example: alternating minimization for matrix completion (2 blocks). (See lecture 7)

Coordinate Descent (convergence)

Converges to an optimal solution if f(·) is convex and smooth
Has a linear convergence rate if f(·) is strongly convex
Linear convergence: the error f(x^t) − f(x*) decays as β, β², β³, ... for some β < 1
Local linear convergence: an algorithm converges linearly once ‖x − x*‖ ≤ K for some K > 0

Coordinate Descent: other names

Alternating minimization (matrix completion)
Iterative scaling (for log-linear models)
Decomposition method (for kernel SVM)
Gauss-Seidel (for linear systems when the matrix is positive definite)
...

Gradient Descent

Gradient descent: repeatedly conduct the following update:
    x^{t+1} ← x^t − α∇f(x^t)
where α > 0 is the step size
It is a fixed-point iteration method: x − α∇f(x) = x if and only if x is an optimal solution
Step size too large ⇒ diverge; too small ⇒ slow convergence

Gradient Descent: successive approximation

At each iteration, form an approximation of f (·):

f(x^t + d) ≈ f̃_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/2) d^T ((1/α)I) d
                          = f(x^t) + ∇f(x^t)^T d + (1/(2α)) d^T d
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
d* = −α∇f(x^t) is the minimizer of argmin_d f̃_{x^t}(d) (set ∇f̃_{x^t}(d) = ∇f(x^t) + d/α = 0)
d* decreases the original objective function f if α (the step size) is small enough

More precisely, the function value will decrease if

Condition 1: f̃_x(d) ≥ f(x + d) for all d
Condition 2: f̃_x(0) = f(x)
Why?

f(x^t + d*) ≤ f̃_{x^t}(d*) ≤ f̃_{x^t}(0) = f(x^t)

Condition 2 is satisfied by the construction of f̃_{x^t}
Condition 1 is satisfied if (1/α)I ⪰ ∇²f(x) for all x

Gradient Descent: step size

A twice differentiable function has L-Lipschitz continuous gradient if and only if ∇²f(x) ⪯ LI for all x
In this case, Condition 1 is satisfied if α < 1/L
Theorem: gradient descent converges if α < 1/L
Theorem: gradient descent converges linearly with α < 1/L if f is strongly convex

Gradient Descent

In practice, we do not know L ...
Step size α too large: the algorithm diverges
Step size α too small: the algorithm converges very slowly

Gradient Descent: line search

d* is a "descent direction" if and only if (d*)^T ∇f(x) < 0
Armijo-rule backtracking line search:
    Try α = 1, 1/2, 1/4, ... until it satisfies
        f(x + αd*) ≤ f(x) + γα(d*)^T ∇f(x)
    where 0 < γ < 1
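A minimal sketch of this rule (assuming f and its gradient are supplied as Python callables; γ = 0.5 and the halving schedule follow the description above):

    import numpy as np

    def backtracking_step_size(f, grad_f, x, d, gamma=0.5, alpha0=1.0, max_halvings=50):
        """Armijo backtracking: shrink alpha until the sufficient-decrease condition holds."""
        slope = float(np.dot(d, grad_f(x)))   # (d*)^T grad f(x); must be < 0 for a descent direction
        fx = f(x)
        alpha = alpha0
        for _ in range(max_halvings):
            if f(x + alpha * d) <= fx + gamma * alpha * slope:
                return alpha                  # sufficient decrease satisfied
            alpha *= 0.5                      # try alpha = 1, 1/2, 1/4, ...
        return alpha

For gradient descent one would call it with d = -grad_f(x).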

Figure from http://ool.sourceforge.net/ool-ref.html

Gradient Descent: line search

Gradient descent with line search:
    Converges to optimal solutions if f is smooth
    Converges linearly if f is strongly convex
    However, each iteration requires evaluating f several times
Several other step-size selection approaches (an ongoing research topic, especially for stochastic gradient descent)

Gradient Descent: applying to ridge regression

Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution w* := argmin_w (1/2)‖Xw − y‖² + (λ/2)‖w‖²
1: t = 0
2: while not converged do
3:    Compute the gradient

      g = X^T (Xw − y) + λw

4:    Choose step size α_t
5:    Update w ← w − α_t g
6:    t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration
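A runnable sketch of this procedure (illustrative; a fixed step size below 1/L is used here, where L = ‖X‖₂² + λ, instead of line search):

    import numpy as np

    def gd_ridge(X, y, lam, step_size, iters=500):
        """Gradient descent for (1/2)||Xw - y||^2 + (lam/2)||w||^2."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            g = X.T @ (X @ w - y) + lam * w   # gradient; O(nnz(X)) per iteration
            w -= step_size * g
        return w

    # Example usage on random data:
    X = np.random.randn(100, 20)
    y = np.random.randn(100)
    L = np.linalg.norm(X, 2) ** 2 + 0.1        # Lipschitz constant of the gradient
    w = gd_ridge(X, y, lam=0.1, step_size=1.0 / L)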

Proximal Gradient Descent

How can we apply gradient descent to solve the Lasso problem?

argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁   (the second term is non-differentiable)

General composite function minimization:

argmin_x f(x) := g(x) + h(x)
where g is smooth and convex, and h is convex but may be non-differentiable
Usually assume h is simple (for computational efficiency)

Proximal Gradient Descent: successive approximation

At each iteration, form an approximation of f (·):

f(x^t + d) ≈ f̃_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/2) d^T ((1/α)I) d + h(x^t + d)
                          = g(x^t) + ∇g(x^t)^T d + (1/(2α)) d^T d + h(x^t + d)
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
This is called "proximal" gradient descent
Usually d* = argmin_d f̃_{x^t}(d) has a closed-form solution

Proximal Gradient Descent: ℓ1-regularization

The subproblem:

x^{t+1} = x^t + argmin_d ∇g(x^t)^T d + (1/(2α)) d^T d + λ‖x^t + d‖₁
        = argmin_u (1/2)‖u − (x^t − α∇g(x^t))‖² + λα‖u‖₁   (substituting u = x^t + d)
        = S(x^t − α∇g(x^t), αλ),

where S is the soft-thresholding operator defined by
    S(a, z) = a − z   if a > z
              a + z   if a < −z
              0       if a ∈ [−z, z]

Proximal Gradient: soft-thresholding

Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/

Proximal Gradient Descent for Lasso

Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution w* := argmin_w (1/2)‖Xw − y‖² + λ‖w‖₁
1: t = 0
2: while not converged do
3:    Compute the gradient

      g = X^T (Xw − y)

4:    Choose step size α_t
5:    Update w ← S(w − α_t g, α_t λ)
6:    t ← t + 1
7: end while
Time complexity: O(nnz(X)) per iteration
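A runnable sketch of this loop (proximal gradient / ISTA for the Lasso; the fixed step size 1/L with L = ‖X‖₂² is an illustrative choice, not from the slides):

    import numpy as np

    def soft_threshold(a, z):
        """Soft-thresholding operator S(a, z), applied element-wise."""
        return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)

    def proximal_gradient_lasso(X, y, lam, iters=500):
        """Proximal gradient descent for (1/2)||Xw - y||^2 + lam * ||w||_1."""
        n, d = X.shape
        w = np.zeros(d)
        alpha = 1.0 / (np.linalg.norm(X, 2) ** 2)     # step size 1/L
        for _ in range(iters):
            g = X.T @ (X @ w - y)                     # gradient of the smooth part
            w = soft_threshold(w - alpha * g, alpha * lam)
        return w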

Newton's Method

Iteratively conduct the following updates:

x ← x − α ∇²f(x)^{-1} ∇f(x)

where α is the step size
If α = 1: converges quadratically when x^t is close enough to x*:

‖x^{t+1} − x*‖ ≤ K‖x^t − x*‖²

for some constant K
This means the error f(x^t) − f(x*) decays quadratically: β, β², β⁴, β⁸, β¹⁶, ...
Only a few iterations are needed to converge once inside this "quadratic convergence region"

Newton's Method

However, Newton's update rule is more expensive than gradient descent/coordinate descent

Newton's Method

Need to compute ∇²f(x)^{-1} ∇f(x)
Closed-form solution: O(d³) for solving a d-dimensional linear system
Usually solved by another iterative solver:
    gradient descent
    coordinate descent
    conjugate gradient method
    ...
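A minimal sketch of the damped Newton iteration with a direct O(d³) solve (assuming callables for the gradient and Hessian; np.linalg.solve could be replaced by any of the iterative solvers listed above):

    import numpy as np

    def newton_method(grad_f, hess_f, x0, alpha=1.0, iters=20):
        """Newton's method: x <- x - alpha * H(x)^{-1} grad f(x)."""
        x = x0.astype(float).copy()
        for _ in range(iters):
            g = grad_f(x)
            H = hess_f(x)
            d = np.linalg.solve(H, g)   # Newton direction via a direct linear solve
            x -= alpha * d
        return x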

Examples: primal L2-SVM/logistic regression, ℓ1-regularized logistic regression, ...

Newton's Method

At each iteration, form an approximation of f (·):

f(x^t + d) ≈ f̃_{x^t}(d) := f(x^t) + ∇f(x^t)^T d + (1/(2α)) d^T ∇²f(x^t) d
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
When x is far away from x*, line search is needed to guarantee convergence
Assume LI ⪰ ∇²f(x) ⪰ mI for all x; then α ≤ m/L guarantees that the objective function value decreases, because
    (L/m) ∇²f(x) ⪰ ∇²f(y)   ∀x, y
In practice, we often just use line search.

Proximal Newton

What if f(x) = g(x) + h(x) and h(x) is non-smooth (e.g., h(x) = ‖x‖₁)?
At each iteration, form an approximation of f(·):

f(x^t + d) ≈ f̃_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (α/2) d^T ∇²g(x^t) d + h(x^t + d)
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
Need another iterative solver to solve the subproblem

Stochastic Gradient Method

Stochastic Gradient Method: Motivation

Widely used for machine learning problems (with a large number of samples)

Given training samples x1,..., xn, we usually want to solve the following empirical risk minimization (ERM) problem:

argmin_w Σ_{i=1}^n ℓ_i(w),

where each ℓ_i(·) is the loss function on sample x_i
Minimize the summation of the individual losses over all samples


Stochastic Gradient Method

Assume the objective function can be written as
    f(x) = (1/n) Σ_{i=1}^n f_i(x)
Stochastic gradient method: iteratively conduct the following updates
    1. Choose an index i uniformly at random
    2. x^{t+1} ← x^t − η^t ∇f_i(x^t)
η^t > 0 is the step size
Why does SG work?
    E_i[∇f_i(x)] = (1/n) Σ_{i=1}^n ∇f_i(x) = ∇f(x)
Is it a fixed-point method? No if η > 0, because x* − η∇f_i(x*) ≠ x*
Is it a descent method? No: f(x^{t+1}) < f(x^t) does not always hold
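A minimal sketch of this method for a generic finite-sum objective (assuming a per-component gradient callable; the 1/√t step-size decay is one common choice, not from the slides):

    import numpy as np

    def sgd(grad_fi, n, x0, eta0=0.1, iters=10000, seed=0):
        """Stochastic gradient method for f(x) = (1/n) * sum_i f_i(x).

        grad_fi(x, i) should return the gradient of the i-th component f_i at x.
        """
        rng = np.random.default_rng(seed)
        x = x0.astype(float).copy()
        for t in range(1, iters + 1):
            i = rng.integers(n)          # 1. pick an index uniformly at random
            eta = eta0 / np.sqrt(t)      # decaying step size
            x -= eta * grad_fi(x, i)     # 2. step along the stochastic gradient
        return x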

Stochastic Gradient

Step size η^t has to decay to 0 (e.g., η^t = C t^{−a} for some constants a, C)
SGD converges sub-linearly
Many variants proposed recently:
    SVRG, SAGA (2013, 2014): variance reduction
    AdaGrad (2011): adaptive learning rate
    RMSProp (2012): estimate the learning rate by a running average of squared gradients
    Adam (2015): adaptive moment estimation
Widely used in machine learning

SGD in Deep Learning

AdaGrad: adaptive step size for each parameter
Update rule at the t-th iteration:
    Compute g = ∇f(x)
    Estimate the second moment: G_ii = G_ii + g_i² for all i
    Parameter update: x_i ← x_i − (η / √G_ii) g_i for all i
Proposed in:

“Adaptive subgradient methods for online learning and stochastic optimization” (JMLR 2011)

"Adaptive Bound Optimization for Online Convex Optimization" (COLT 2010)

SGD in Deep Learning

Adam (Kingma and Ba, 2015): maintain both the first moment and the second moment
Update rule at the t-th iteration:
    Compute g = ∇f(x)
    m = β₁ m + (1 − β₁) g,    m̂ = m / (1 − β₁^t)   (estimate of the first moment)
    v = β₂ v + (1 − β₂) g²,   v̂ = v / (1 − β₂^t)   (estimate of the second moment)
    x ← x − η m̂ / (√v̂ + ε)
All the operations are element-wise
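Sketches of both update rules (assuming a gradient callable; the ε safeguards and the default hyperparameters are conventional choices, not taken from the slides):

    import numpy as np

    def adagrad(grad_f, x0, eta=0.01, eps=1e-8, iters=1000):
        """AdaGrad: per-coordinate step sizes from accumulated squared gradients."""
        x = x0.astype(float).copy()
        G = np.zeros_like(x)
        for _ in range(iters):
            g = grad_f(x)
            G += g ** 2                          # G_ii <- G_ii + g_i^2
            x -= eta * g / (np.sqrt(G) + eps)    # x_i <- x_i - (eta / sqrt(G_ii)) g_i
        return x

    def adam(grad_f, x0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
        """Adam: element-wise steps from bias-corrected first/second moment estimates."""
        x = x0.astype(float).copy()
        m = np.zeros_like(x)                     # first-moment estimate
        v = np.zeros_like(x)                     # second-moment estimate
        for t in range(1, iters + 1):
            g = grad_f(x)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat = m / (1 - beta1 ** t)         # bias correction
            v_hat = v / (1 - beta2 ** t)
            x -= eta * m_hat / (np.sqrt(v_hat) + eps)
        return x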

Stochastic Gradient: Variance Reduction

SGD is not a fixed-point method: even if x* is optimal,

x* − η∇f_i(x*) ≠ x*

Reason: E_i[∇f_i(x*)] = ∇f(x*) = 0, but the variance can be large
Variance reduction: reduce the variance so that the variance → 0 as t → ∞

SVRG

SVRG (Johnson and Zhang, 2013)
For t = 1, 2, ...
    1. x̃ = x
    2. g̃ = (1/n) Σ_{i=1}^n ∇f_i(x̃)
    3. For s = 1, 2, ...
        1. Randomly pick i
        2. x ← x − η a_i,  where a_i := ∇f_i(x) − ∇f_i(x̃) + g̃
Can easily show:

E[a_i] = ∇f(x) (an unbiased gradient estimate)
Var[a_i] = 0 when x = x̃ = x*
Has a linear convergence rate
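A sketch of SVRG for a finite-sum objective (assuming a per-component gradient callable; the inner-loop length m = 2n is a common heuristic, not from the slides):

    import numpy as np

    def svrg(grad_fi, n, x0, eta=0.01, outer_iters=30, inner_iters=None, seed=0):
        """SVRG (Johnson and Zhang, 2013) for f(x) = (1/n) * sum_i f_i(x)."""
        rng = np.random.default_rng(seed)
        x = x0.astype(float).copy()
        m = inner_iters if inner_iters is not None else 2 * n
        for _ in range(outer_iters):
            x_snap = x.copy()                                        # snapshot x~
            g_full = sum(grad_fi(x_snap, i) for i in range(n)) / n   # full gradient g~
            for _ in range(m):
                i = rng.integers(n)
                a_i = grad_fi(x, i) - grad_fi(x_snap, i) + g_full    # variance-reduced direction
                x -= eta * a_i
        return x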

Stochastic Gradient: applying to ridge regression

Objective function:

argmin_w (1/n) Σ_{i=1}^n (w^T x_i − y_i)² + λ‖w‖²

How to write this as argmin_w (1/n) Σ_{i=1}^n f_i(w)? How to decompose into n components?

First approach: f_i(w) = (w^T x_i − y_i)² + λ‖w‖²
Update rule:
    w^{t+1} ← w^t − 2η^t ((w^t)^T x_i − y_i) x_i − 2η^t λ w^t
            = (1 − 2η^t λ) w^t − 2η^t ((w^t)^T x_i − y_i) x_i
Need O(d) complexity per iteration even if the data is sparse

Solution: store w = s·v where s is a scalar and v is a vector, so the dense scaling (1 − 2η^t λ) only needs to update s

Second approach:
    Define Ω_i = {j | X_ij ≠ 0} for i = 1, ..., n
    Define n_j = |{i | X_ij ≠ 0}| for j = 1, ..., d
    Define f_i(w) = (w^T x_i − y_i)² + Σ_{j∈Ω_i} (λn/n_j) w_j²
Update rule when selecting index i:
    w_j^{t+1} ← w_j^t − 2η^t ((x_i)^T w^t − y_i) X_ij − 2η^t (λn/n_j) w_j^t,   ∀j ∈ Ω_i
The update can be done in O(|Ω_i|) operations
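A sketch of the second approach on a sparse design matrix (assuming scipy.sparse CSR input; illustrative only), where each update touches only the coordinates in Ω_i:

    import numpy as np
    import scipy.sparse as sp

    def sgd_ridge_sparse(X, y, lam, eta0=0.1, epochs=5, seed=0):
        """SGD for ridge regression with the regularizer split across nonzero coordinates."""
        X = sp.csr_matrix(X)
        n, d = X.shape
        n_j = np.maximum(np.bincount(X.indices, minlength=d), 1)   # n_j, floored at 1 to avoid division by zero
        w = np.zeros(d)
        rng = np.random.default_rng(seed)
        t = 0
        for _ in range(epochs):
            for _ in range(n):
                t += 1
                i = rng.integers(n)
                start, end = X.indptr[i], X.indptr[i + 1]
                cols, vals = X.indices[start:end], X.data[start:end]   # Omega_i and the values X_ij
                eta = eta0 / np.sqrt(t)
                pred = vals @ w[cols]                                  # x_i^T w using only Omega_i
                w[cols] -= 2 * eta * (pred - y[i]) * vals + 2 * eta * (lam * n / n_j[cols]) * w[cols]
        return w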

Coming up

Next class: Parallel Optimization

Questions?