Accelerated Stochastic Block Coordinate Descent with Optimal Sampling

Aston Zhang, Dept. of Computer Science, University of Illinois at Urbana-Champaign, IL, USA 61801, [email protected]
Quanquan Gu, Dept. of Systems and Information Engineering, University of Virginia, VA, USA 22904, [email protected]

ABSTRACT

We study the composite minimization problem where the objective function is the sum of two convex functions: one is the sum of a finite number of strongly convex and smooth functions, and the other is a general convex function that is non-differentiable. Specifically, we consider the case where the non-differentiable function is block separable and admits a simple proximal mapping for each block. This type of composite optimization is common in many data mining and machine learning problems, and can be solved by block coordinate descent algorithms. We propose an accelerated stochastic block coordinate descent (ASBCD) algorithm, which incorporates the incrementally averaged partial derivative into the stochastic partial derivative and exploits optimal sampling. We prove that ASBCD attains a linear rate of convergence. In contrast to uniform sampling, we reveal that the optimal non-uniform sampling can be employed to achieve a lower iteration complexity. Experimental results on different large-scale real data sets support our theory.

CCS Concepts
• Information systems → Data mining; • Computing methodologies → Machine learning;

Keywords
Stochastic block coordinate descent; Sampling

1. INTRODUCTION

We consider the problem of minimizing a composite function, which is the sum of two convex functions:

    w* = argmin_{w ∈ R^d} P(w) = F(w) + R(w),   (1.1)

where F(w) = (1/n) Σ_{i=1}^n f_i(w) is a sum of a finite number of strongly convex and smooth functions, and R(w) is a block separable non-differentiable function. To explain block separability, let {G_1, ..., G_m} be a partition of all the d coordinates, where G_j is a block of coordinates. A subvector w_{G_j} is [w_{k_1}, ..., w_{k_{|G_j|}}]^T, where G_j = {k_1, ..., k_{|G_j|}} and 1 ≤ j ≤ m. The fact that R(w) is block separable is equivalent to

    R(w) = Σ_{j=1}^m r_j(w_{G_j}).   (1.2)

The above problem is common in data mining and machine learning, such as regularized empirical risk minimization, where F(w) is the empirical loss function averaged over the training data set and R(w) is a regularization term. For example, suppose that for a data mining problem there are n instances in a training data set {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. By choosing the squared loss f_i(w) = (⟨w, x_i⟩ − y_i)²/2 and R(w) = 0, a least squares regression is obtained. If R(w) is chosen to be the sum of the absolute values of the coordinates of w, it becomes a lasso regression [46]. In general, the problem in (1.1) can be approximately solved by proximal gradient descent algorithms [32] and proximal coordinate descent algorithms [23].

Coordinate descent algorithms have received increasing attention in the past decade in data mining and machine learning due to their successful applications in high dimensional problems with structural regularizers [12, 11, 28, 2, 47]. Randomized block coordinate descent (RBCD) [31, 36, 26, 39, 4, 14, 21] is a special block coordinate descent algorithm. At each iteration, it updates a block of coordinates in the vector w based on evaluating a random feature subset over the entire set of training data instances. The iteration complexity of RBCD was established and extended to composite minimization problems [31, 36, 26]. RBCD can choose a constant step size and converge at the same rate as gradient descent [31, 36, 26]. Compared with gradient descent, the per-iteration time complexity of RBCD is much lower: RBCD computes a partial derivative restricted to only a single coordinate block at each iteration and updates just a single coordinate block of the vector w. However, RBCD is still computationally expensive because at each iteration it requires evaluation of the gradient for all the n component functions f_i: the per-iteration computational complexity scales linearly with the training data set size n.

In view of this, stochastic block coordinate descent was proposed recently [8, 51, 48, 35]. Such algorithms compute the stochastic partial derivative restricted to one coordinate block with respect to one component function, rather than the full partial derivative with respect to all the component functions. Essentially, these algorithms employ sampling of both features and data instances at each iteration. However, they can only achieve a sublinear rate of convergence.

We propose an algorithm for stochastic block coordinate descent using optimal sampling, namely accelerated stochastic block coordinate descent with optimal sampling (ASBCD). On one hand, ASBCD employs a simple gradient update with optimal non-uniform sampling, which is in sharp contrast to the aforementioned stochastic block coordinate descent algorithms based on uniform sampling. On the other hand, we incorporate the incrementally averaged partial derivative into the stochastic partial derivative to achieve a linear rate of convergence rather than a sublinear rate. To be specific, given error ε and number of coordinate blocks m, for strongly convex f_i(w) with the convexity parameter µ and the Lipschitz continuous gradient constant L_i (L_M = max_i L_i), the iteration complexity of ASBCD is

    O( m ( (1/n) Σ_{i=1}^n L_i/µ + n ) log(1/ε) ).

When taking the expectation of the squared gap between the iterate w^(t) and the optimal solution w* in (1.1) with respect to the stochastic coordinate block index, the obtained upper bound does not depend on such an index or the proximal operator. This property may lead to additional algorithmic development, and here it is important for deriving a linear rate of convergence for Algorithm 1. We prove the rate of convergence bound in Appendix A after presenting and discussing the main theory in Section 3.

Notation. Here we define and describe the notation used throughout this paper. Let w_k be the k-th element of a vector w = [w_1, ..., w_d]^T ∈ R^d. We use ||w|| = ||w||_2 = (Σ_{k=1}^d w_k²)^{1/2} to denote the ℓ2 norm of a vector w and ||w||_1 = Σ_{k=1}^d |w_k| to denote its ℓ1 norm. The subvector of w excluding w_{G_j} is denoted by w_{\G_j}. The simple proximal mapping for each coordinate block, also known as the proximal operator, is defined as

    prox_{η,j}(w) = argmin_{u ∈ R^d} (1/(2η)) ||w − u||² + r_j(u).   (1.3)
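To make the block-wise proximal mapping in (1.3) concrete, here is a minimal Python sketch for the common special case r_j(u) = λ||u||_1, whose proximal operator has the closed-form soft-thresholding solution; the function name, block partition, and parameter values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def prox_l1_block(w_block, eta, lam):
    """Proximal operator (1.3) for r_j(u) = lam * ||u||_1 on one coordinate block.

    Solves argmin_u (1/(2*eta)) * ||w_block - u||^2 + lam * ||u||_1,
    whose closed form is coordinate-wise soft-thresholding.
    """
    return np.sign(w_block) * np.maximum(np.abs(w_block) - eta * lam, 0.0)

# R(w) = sum_j r_j(w_{G_j}) with r_j = lam * ||.||_1 is block separable (1.2),
# so the proximal step can be applied to a single sampled block G_j.
w = np.array([0.3, -0.05, 1.2, -0.4])
blocks = [np.array([0, 1]), np.array([2, 3])]   # a partition {G_1, G_2} of the coordinates
j = 1                                            # a sampled block index
w[blocks[j]] = prox_l1_block(w[blocks[j]], eta=0.1, lam=0.5)
```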

2. THE PROPOSED ALGORITHM

We propose ASBCD (Algorithm 1), an accelerated algorithm for stochastic block coordinate descent with optimal sampling. It starts with known initial vectors φ_i^(0) = w^(0) ∈ R^d for all i.

Algorithm 1  ASBCD: Accelerated Stochastic Block Coordinate Descent with Optimal Sampling
1: Inputs: step size η and sampling probability set P = {p_1, ..., p_n} of component functions f_1, ..., f_n
2: Initialize: φ_i^(0) = w^(0) ∈ R^d
3: for t = 1, 2, ... do
4:   Sample a component function index i from {1, ..., n} at probability p_i ∈ P with replacement
5:   φ_i^(t) ← w^(t−1)
6:   Sample a coordinate block index j from {1, ..., m} uniformly at random with replacement
7:   w_{G_j}^(t) ← prox_{η,j}( w_{G_j}^(t−1) − η [ (np_i)^{-1} ∇_{G_j} f_i(φ_i^(t)) − (np_i)^{-1} ∇_{G_j} f_i(φ_i^(t−1)) + n^{-1} Σ_{k=1}^n ∇_{G_j} f_k(φ_k^(t−1)) ] )
8:   w_{\G_j}^(t) ← w_{\G_j}^(t−1)
9: end for

In sharp contrast to stochastic block coordinate descent with uniform sampling, ASBCD selects a component function according to non-uniform probabilities (Line 4 of Algorithm 1).

In Algorithm 1, we define the gradient of any function f(φ) with respect to a coordinate block G_j of φ as ∇_{G_j} f(φ) = [∇f(φ)]_{G_j} = [∂f(φ)/∂φ_{k_1}, ..., ∂f(φ)/∂φ_{k_{|G_j|}}]^T, where G_j = {k_1, ..., k_{|G_j|}}.

Algorithm 1 has a lower computational cost than either proximal gradient descent or RBCD at each iteration. The update at each iteration of Algorithm 1 is restricted to only a sampled component function (Line 4) and a sampled block of coordinates (Line 6). The key updating step (Line 7) with respect to a stochastic block of coordinates incorporates the incrementally averaged partial derivative into the stochastic partial derivative through the third term n^{-1} Σ_{k=1}^n ∇_{G_j} f_k(φ_k^(t−1)) within the square bracket. At each iteration with i and j sampled, this summation term Σ_{k=1}^n ∇_{G_j} f_k(φ_k^(t−1)) is efficiently updated by subtracting ∇_{G_j} f_i(φ_i^(t−2)) from it while adding ∇_{G_j} f_i(φ_i^(t−1)) to it.

REMARK 2.1. For many empirical risk minimization problems with each training data instance (x_i, y_i) and a loss function ℓ, the gradient of f_i(w) with respect to w is a multiple of x_i: ∇f_i(w) = ℓ'(⟨w, x_i⟩, y_i) x_i. Therefore, ∇f_i(φ_i) can be compactly saved in memory by only saving the scalars ℓ'(⟨φ_i, x_i⟩, y_i), with the same space cost as that of many other related algorithms, such as MRBCD, SVRG, SAGA, SDCA, and SAG, described in Section 5.

REMARK 2.2. The sampling probability of component functions f_i in Line 4 of Algorithm 1 is according to a given probability set P = {p_1, ..., p_n}. The uniform sampling scheme employed by stochastic block coordinate descent methods fits under this more generalized sampling framework as a special case, where p_i = 1/n. We reveal that the optimal non-uniform sampling can be employed to lower the iteration complexity in Section 3.
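The following is a minimal Python sketch of one way Algorithm 1 could be instantiated for the squared loss f_i(w) = (⟨w, x_i⟩ − y_i)²/2 with an ℓ1 block regularizer; the data, step size, regularization weight, and helper names are illustrative assumptions, and the bookkeeping of the stored gradients follows the incremental table described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares data: grad f_i(w) = (<w, x_i> - y_i) * x_i.
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

m = 5                                        # number of coordinate blocks
blocks = np.array_split(np.arange(d), m)     # a partition {G_1, ..., G_m}
eta, lam = 0.05, 0.01                        # step size and l1 weight (illustrative values)
p = np.full(n, 1.0 / n)                      # sampling probabilities; optimal p_i discussed in Section 3

def soft_threshold(v, thr):
    # block-wise proximal mapping (1.3) for r_j = thr-scaled l1 norm
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

w = np.zeros(d)
grad_table = (X @ w - y)[:, None] * X        # rows: grad f_k(phi_k^{(0)}) with phi_k^{(0)} = w^{(0)}
grad_sum = grad_table.sum(axis=0)            # running sum for the averaged partial derivative

for t in range(2000):
    i = rng.choice(n, p=p)                   # Line 4: sample a component function index
    g_new = (X[i] @ w - y[i]) * X[i]         # grad f_i(phi_i^{(t)}) with phi_i^{(t)} = w^{(t-1)} (Line 5)
    j = rng.integers(m)                      # Line 6: sample a coordinate block uniformly
    G = blocks[j]
    v = (g_new[G] - grad_table[i, G]) / (n * p[i]) + grad_sum[G] / n
    w[G] = soft_threshold(w[G] - eta * v, eta * lam)   # Line 7: block proximal update
    grad_sum += g_new - grad_table[i]        # maintain the running sum incrementally
    grad_table[i] = g_new                    # store grad f_i at the new phi_i
```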
3. MAIN THEORY

In this section, we present and discuss the main theory of our proposed algorithm (Algorithm 1). The proof of the main theory is presented in the appendix.

We begin with the following assumptions on F(w) and R(w) in the composite objective optimization problem characterized in (1.1). These assumptions are mild and can be verified in many regularized empirical risk minimization problems in data mining and machine learning.

ASSUMPTION 3.1 (LIPSCHITZ CONTINUOUS GRADIENT). Each gradient ∇f_i(w) is Lipschitz continuous with the constant L_i, i.e., for all w ∈ R^d and u ∈ R^d we have

    ||∇f_i(w) − ∇f_i(u)|| ≤ L_i ||w − u||.

ASSUMPTION 3.2 (STRONG CONVEXITY). Each function f_i(w) is strongly convex, i.e., there exists a positive constant µ such that for all w ∈ R^d and u ∈ R^d we have

    f_i(u) − f_i(w) − ⟨∇f_i(w), u − w⟩ ≥ (µ/2) ||u − w||².

Assumption 3.2 implies that F(w) is also strongly convex, i.e., there exists a positive constant µ such that for all w ∈ R^d and u ∈ R^d we have

    F(u) − F(w) − ⟨∇F(w), u − w⟩ ≥ (µ/2) ||u − w||².

ASSUMPTION 3.3 (BLOCK SEPARABILITY). The regularization function R(w) is convex but non-differentiable, and a closed-form solution can be obtained for the proximal operator defined in (1.3). Importantly, R(w) is block separable as defined in (1.2).

With the above assumptions in place, we now establish the linear rate of convergence for Algorithm 1, which is stated in the following theorem.

THEOREM 3.4. Let L_M = max_i L_i and p_I = min_i p_i. Suppose that Assumptions 3.1-3.3 hold. Based on Algorithm 1 and with w* defined in (1.1), by setting η = max_i np_i/[2(nµ + L_i)], ζ = np_I/(L_M η) − 1 > 0, κ = L_M m/[2nη(L_M − µ + L_M ηµζ)] > 0, and 0 < α = 1 − ηµ/m < 1, it holds that

    E_{i,j}[||w^(t) − w*||²] ≤ α^t ( ||w^(0) − w*||² + (1/κ) [ F(w^(0)) − F(w*) − ⟨∇F(w*), w^(0) − w*⟩ ] ).

REMARK 3.5. Theorem 3.4 justifies the linear rate of convergence for Algorithm 1. The parameter α depends on the number of coordinate blocks m. It may be tempting to set m = 1 for faster convergence. However, this is improper because it ignores the computational cost at each iteration. When m = 1, at each iteration the gradient is updated with respect to all coordinates. When m > 1, at each iteration of Algorithm 1 the gradient is updated with respect to only a sampled coordinate block among all coordinates, so the computational cost per iteration is lower than that of m = 1. Therefore, comparing algorithms that update the gradient with respect to different numbers of coordinates per iteration should be based on the same number of entire data passes (the least possible number of iterations for passing through the entire set of data instances with respect to all coordinates). We perform experiments to compare such different algorithms in Section 4.

REMARK 3.6. Theorem 3.4 implies a more generalized iteration complexity of Algorithm 1, which is

    O( m min_i [ (L_i/µ + n) / (np_i) ] log(1/ε) )   (3.1)

given the error ε > 0. The uniform sampling scheme fits this more generalized result with p_i = 1/n. With L_M = max_i L_i, by setting p_i = 1/n, η = 1/[2(L_M + nµ)] > 0, ζ = (L_M + 2nµ)/L_M > 0, κ = m/[2nη(1 − ηµ)] > 0, and 0 < α = 1 − µ/[2m(L_M + nµ)] < 1, Theorem 3.4 still holds. The iteration complexity of ASBCD with uniform sampling is

    O( m (L_M/µ + n) log(1/ε) ).   (3.2)

Now we show that the iteration complexity in (3.2) can be further improved by optimal sampling. To begin with, minimizing α can be achieved by maximizing η with respect to p_i. It is easy to show that η is maximized when p_i = (n + L_i/µ) / Σ_{k=1}^n (n + L_k/µ). Then, by setting η = n / [2 Σ_{i=1}^n (nµ + L_i)] > 0, we obtain the iteration complexity of ASBCD with optimal sampling:

    O( m ( (1/n) Σ_{i=1}^n L_i/µ + n ) log(1/ε) ).   (3.3)

COROLLARY 3.7. Let L_M = max_i L_i. Suppose that Assumptions 3.1-3.3 hold. Based on Algorithm 1 and with w* defined in (1.1), by setting p_i = (n + L_i/µ) / Σ_{k=1}^n (n + L_k/µ), η = n / [2 Σ_{i=1}^n (nµ + L_i)] > 0, ζ = Σ_{i=1}^n L_i^{-1} / Σ_{i=1}^n (2nµ + 2L_i)^{-1} − 1 > 0, and 0 < α = 1 − nµ / [2m Σ_{i=1}^n (nµ + L_i)] < 1, it holds that

    E_{i,j}[||w^(t) − w*||²] ≤ α^t ( ||w^(0) − w*||² + n/[m(L_M + nµ)] · [ F(w^(0)) − F(w*) − ⟨∇F(w*), w^(0) − w*⟩ ] ).

Comparing the iteration complexity of ASBCD in (3.3) and (3.2), it is clear that the optimal sampling scheme results in a lower iteration complexity than uniform sampling.
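As a small illustration of the sampling schemes discussed above, the sketch below computes the optimal probabilities p_i = (n + L_i/µ)/Σ_k(n + L_k/µ) and contrasts the complexity factors in (3.2) and (3.3); the Lipschitz constants are synthetic, and the choice L_i = ||x_i||² mentioned in a comment is only a typical example for the squared loss.

```python
import numpy as np

def optimal_sampling_probabilities(L, mu):
    """Optimal non-uniform sampling from Corollary 3.7: p_i proportional to n + L_i / mu."""
    L = np.asarray(L, dtype=float)
    weights = len(L) + L / mu
    return weights / weights.sum()

# Illustrative Lipschitz constants; e.g., for f_i(w) = (<w, x_i> - y_i)^2 / 2 one may take L_i = ||x_i||^2.
rng = np.random.default_rng(0)
L = rng.uniform(1.0, 100.0, size=1000)
mu, m, n = 0.1, 10, len(L)

p = optimal_sampling_probabilities(L, mu)

# Iteration-complexity factors, up to constants and the log(1/eps) term:
uniform_factor = m * (L.max() / mu + n)      # (3.2): governed by the largest L_i
optimal_factor = m * (L.mean() / mu + n)     # (3.3): governed by the average L_i
print(p[:3], uniform_factor, optimal_factor)
```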

4. EVALUATION

We conduct experiments to evaluate the performance of our proposed ASBCD algorithm in comparison with different algorithms on large-scale real data sets.

4.1 Problems and Measures

We define the problems and measures used in the empirical evaluation. Classification and regression are two corner-stone data mining and machine learning problems. We evaluate the performance of the proposed ASBCD algorithm in solving these two problems.

4.1.1 Classification and Regression Problems

As a case study, the classification problem is ℓ1,2-regularized logistic regression:

    w* = argmin_{w ∈ R^d} P(w) = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i ⟨w, x_i⟩)) + (λ_2/2) ||w||² + λ_1 ||w||_1.

For the regression problem in this empirical study, the elastic net is used:

    w* = argmin_{w ∈ R^d} P(w) = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n (⟨w, x_i⟩ − y_i)²/2 + (λ_2/2) ||w||² + λ_1 ||w||_1.

The regularization parameters λ_1 and λ_2 in both problems are tuned by proximal gradient descent using five-fold cross-validation on the training data sets.

4.1.2 Measures for Convergence and Testing Accuracy

Recall the problem of composite function minimization as formalized in (1.1). In evaluating algorithm performance on convergence, we use the objective gap value P(w) − P(w*) as the measure.

To further study the prediction capabilities of models trained by different algorithms, we evaluate testing accuracy using two different measures:

• AUC: For the classification problem, the area under the receiver operating characteristic curve (AUC) is measured [13]. Note that a higher testing accuracy is reflected by a higher AUC.

• MSE: For the regression problem, the mean squared error (MSE) is compared. Note that a higher testing accuracy is reflected by a lower MSE.
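As a concrete reading of the measures above, the following Python sketch evaluates the elastic-net objective of Section 4.1.1 and the objective gap value P(w) − P(w*); the reference point w* is assumed to be obtained separately (e.g., from a long high-accuracy run), and all names are illustrative.

```python
import numpy as np

def elastic_net_objective(w, X, y, lam1, lam2):
    """P(w) = (1/n) * sum_i (<w, x_i> - y_i)^2 / 2 + (lam2/2) * ||w||^2 + lam1 * ||w||_1."""
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    return loss + 0.5 * lam2 * np.dot(w, w) + lam1 * np.abs(w).sum()

def objective_gap(w, w_star, X, y, lam1, lam2):
    """Convergence measure from Section 4.1.2: P(w) - P(w*), with w* an assumed reference solution."""
    return (elastic_net_objective(w, X, y, lam1, lam2)
            - elastic_net_objective(w_star, X, y, lam1, lam2))
```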

4.2 Large-Scale Real Data Sets

The empirical studies are conducted on the following four real data sets, which are downloaded using the LIBSVM software [3]:

• KDD 2010: Bridge to Algebra data set from the KDD Cup 2010 Educational Data Mining Challenge [44].
• COVTYPE: Data set for predicting forest cover type from cartographic variables [22].
• RCV1: Reuters Corpus Volume I data set for text categorization research [20].
• E2006-TFIDF: Data set for predicting risk from financial reports from thousands of publicly traded U.S. companies [16].

Each of these real data sets has a large size in either its instance count or feature count, or both. For instance, the KDD 2010 data set has over 19 million training instances with nearly 30 million features. Summary statistics of these data sets are provided in Table 1.

Table 1: Summary statistics of the four large-scale real data sets in the experiments. These data sets are used for evaluating the performance of algorithms in solving two corner-stone data mining and machine learning problems: classification and regression.

Data Set     | #Training Instances | #Testing Instances | #Features  | Problem        | Measure for Testing Accuracy
KDD 2010     | 19,264,097          | 748,401            | 29,890,095 | Classification | AUC (↑)
COVTYPE      | 290,506             | 290,506            | 54         | Classification | AUC (↑)
RCV1         | 20,242              | 677,399            | 47,236     | Classification | AUC (↑)
E2006-TFIDF  | 16,087              | 3,308              | 150,360    | Regression     | MSE (↓)
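For reference, data in the LIBSVM sparse text format can be read, for example, with scikit-learn; the file name below is a placeholder, and this loader is an assumption about tooling rather than part of the paper.

```python
from sklearn.datasets import load_svmlight_file

# Hypothetical path; LIBSVM distributes these benchmarks in its sparse text format.
X_train, y_train = load_svmlight_file("rcv1_train.binary")
print(X_train.shape, y_train.shape)   # sparse CSR design matrix and label vector
```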

4.3 Algorithms for Comparison

We evaluate the performance of ASBCD in comparison with recently proposed competitive algorithms. To comprehensively evaluate ASBCD, we also compare variants of ASBCD with different sampling schemes. Below are the seven algorithms for comparison.

• SGD (SG): Proximal stochastic gradient descent. This algorithm has a sublinear rate of convergence. To ensure the high competitiveness of this algorithm, the implementation is based on a recent work [1].
• SBCD (SB): Stochastic block coordinate descent. It is the same as SGD except that SBCD updates the gradient with respect to a randomly sampled block of coordinates at each iteration. SBCD also converges at a sublinear rate.
• SAGA (SA): Advanced stochastic gradient method [9]. This algorithm is based on uniform sampling of component functions. It updates the gradient with respect to all coordinates at each iteration. SAGA has a linear rate of convergence.
• SVRG (SV): (Proximal) stochastic variance reduced gradient [15, 50]. This algorithm is based on uniform sampling of component functions. It updates the gradient with respect to all coordinates at each iteration. Likewise, SVRG converges to the optimum at a linear rate.
• MRBCD (MR): Mini-batch randomized block coordinate descent [54]. This algorithm uses uniform sampling of component functions. MRBCD converges linearly to the optimum.
• ASBCD-U (U): The proposed ASBCD algorithm with uniform sampling of component functions. The sampling probability p_i for component function f_i is p_i = 1/n.
• ASBCD-O (O): The proposed ASBCD algorithm with optimal sampling as described in Corollary 3.7. The sampling probability p_i for component function f_i is p_i = (n + L_i/µ) / Σ_{k=1}^n (n + L_k/µ).

4.4 Experimental Setting

Note that the algorithms SBCD, MRBCD, and ASBCD update the gradient with respect to a sampled block of coordinates at each iteration. In contrast, SGD, SAGA, and SVRG update the gradient with respect to all the coordinates per iteration. Recalling Remark 3.5, the comparison of these algorithms is based on the same number of entire data passes.

4.4.1 Equipment Configuration

We evaluate convergence and testing accuracy with respect to training time. The experiments on the KDD 2010 data set are conducted on a computer with two 14-core 2.4GHz CPUs and 256GB of RAM, while the experiments on the other data sets are conducted on a computer with an 8-core 3.4GHz CPU and 32GB of RAM.

4.4.2 Parameter Setting

Different from the other algorithms in comparison, the SVRG and MRBCD algorithms both have multiple stages with two nested loops. The inner-loop counts in SVRG and MRBCD are set to the training data instance counts, as suggested in a few recent studies [15, 50, 54].

For each algorithm, its parameters, such as the step size (η in this paper), are chosen around the theoretical values to give the fastest convergence under five-fold cross-validation. Here we describe the details. The training data set is divided into five subsets of approximately the same size. One validation takes five trials on different subsets: in each trial, one subset is left out and the remaining four subsets are used. The convergence effect in one cross-validation is estimated by the averaged performance of the five trials.
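The following is a minimal sketch of the five-fold cross-validation protocol of Section 4.4.2 for choosing the step size; run_asbcd and objective are hypothetical callables standing in for the training routine and the objective P(w), and evaluating on the held-out subset is one reasonable instantiation of "averaged performance".

```python
import numpy as np

def pick_step_size(X, y, candidate_etas, run_asbcd, objective, n_folds=5, seed=0):
    """Five-fold cross-validation over the step size eta (Section 4.4.2).

    run_asbcd(X, y, eta) trains a model and returns w; objective(w, X, y) returns P(w).
    The eta with the lowest averaged held-out objective is selected.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    averaged = []
    for eta in candidate_etas:
        trial_scores = []
        for k in range(n_folds):
            held_out = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            w = run_asbcd(X[train_idx], y[train_idx], eta)
            trial_scores.append(objective(w, X[held_out], y[held_out]))
        averaged.append(np.mean(trial_scores))
    return candidate_etas[int(np.argmin(averaged))]
```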
4.5 Experimental Results

All the experimental results are obtained from 10 replications. Both the mean and standard deviation values are reported in Tables 2-6. For clarity of exposition, Figures 1-4 plot the mean values of the results from all these replications.

4.5.1 Results on KDD 2010

For classification on KDD 2010, Figure 1 compares the convergence of all the algorithms for the same number of entire data passes. In general, among all the seven algorithms in comparison, ASBCD with optimal sampling converges fastest to the optimum for the same number of entire data passes. The sublinearly-convergent algorithms SGD and SBCD converge much more slowly than the other, linearly-convergent algorithms.

We also observe from Figure 1 that stochastic block coordinate descent algorithms generally converge faster than algorithms that do not use stochastic block coordinates. For instance, SBCD converges faster than SGD, while MRBCD converges faster than SVRG for the same number of entire data passes.

The varied convergence effects across all the seven algorithms can be visualized more clearly when they are compared for the same training time. Figure 2 exhibits such performance variations. Clearly, ASBCD with optimal sampling achieves the fastest convergence for the same training time. Similar to the results in Figure 1, for the same training time, stochastic block coordinate descent algorithms still generally converge faster than algorithms that do not use stochastic block coordinates.

It is not surprising that the convergence effects influence the testing accuracy. Tables 2 and 3 report the AUC comparison of algorithms for the same number of entire data passes and for the same training time. Consistent with the observed convergence performance in Figures 1 and 2, ASBCD with optimal sampling generally achieves the highest testing accuracy for both the same number of entire data passes and the same training time.
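For completeness, here is a hedged sketch of how the two testing-accuracy measures of Section 4.1.2 could be computed with scikit-learn; the arrays are placeholders rather than experimental outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

# Placeholder predictions; in the experiments these would come from models trained by each algorithm.
y_true_cls = np.array([1, 0, 1, 1, 0])
scores_cls = np.array([0.9, 0.2, 0.7, 0.6, 0.4])   # decision values <w, x_i> for classification
y_true_reg = np.array([1.5, -0.3, 0.8])
y_pred_reg = np.array([1.4, -0.1, 0.9])

print("AUC:", roc_auc_score(y_true_cls, scores_cls))        # higher is better
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))   # lower is better
```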

4.5.2 Results on More Data Sets

We further compare the algorithms on three more data sets, COVTYPE, RCV1, and E2006-TFIDF, as described in Section 4.2 and summarized in Table 1. COVTYPE and RCV1 are used for the classification problem, while E2006-TFIDF is used for the regression problem. All the results are reported in Tables 4-6, Figure 3, and Figure 4.

To begin with, we describe the convergence effects. Figures 3 and 4 compare the convergence of algorithms for the same number of entire data passes and for the same training time. In general, ASBCD with optimal sampling (O) converges fastest to the optimum for both the same number of entire data passes and the same training time.

Tables 4-6 present the testing accuracy comparison results for the same training time. Note that a lower MSE for regression on E2006-TFIDF indicates a higher testing accuracy. These testing accuracy results agree with the varied convergence effects of the different algorithms on the same data set. Among all the algorithms, ASBCD with optimal sampling generally achieves the highest testing accuracy for the same training time.

(Figure 1 plots the objective gap value on a log scale against the number of entire data passes for SG, SB, SA, SV, MR, U, and O.)
Figure 1: Classification on KDD 2010: Convergence comparison of algorithms for the same number of entire data passes. In general, ASBCD with optimal sampling (O) converges fastest to the optimum for the same number of entire data passes.

(Figure 2 plots the objective gap value on a log scale against training time in seconds for SG, SB, SA, SV, MR, U, and O.)
Figure 2: Classification on KDD 2010: Convergence comparison of algorithms for the same training time. In general, ASBCD with optimal sampling (O) converges fastest to the optimum for the same training time.

Table 2: Classification on KDD 2010: AUC comparison of algorithms for the same number of entire data passes. The boldfaced results denote the highest AUC among all the algorithms for the same number of entire data passes.

Method   | #Data Passes = 2 | #Data Passes = 4 | #Data Passes = 6 | #Data Passes = 8 | #Data Passes = 10
SGD      | 0.8341 ±0.0214   | 0.8343 ±0.0188   | 0.8348 ±0.0153   | 0.8350 ±0.0123   | 0.8351 ±0.0179
SBCD     | 0.8352 ±0.0278   | 0.8353 ±0.0264   | 0.8354 ±0.0315   | 0.8355 ±0.0288   | 0.8357 ±0.0352
SAGA     | 0.8360 ±0.0103   | 0.8401 ±0.0105   | 0.8481 ±0.0305   | 0.8528 ±0.0154   | 0.8542 ±0.0097
SVRG     | 0.8349 ±0.0111   | 0.8393 ±0.0146   | 0.8438 ±0.0178   | 0.8514 ±0.0097   | 0.8531 ±0.0087
MRBCD    | 0.8352 ±0.0301   | 0.8395 ±0.0232   | 0.8449 ±0.0270   | 0.8517 ±0.0162   | 0.8535 ±0.0183
ASBCD-U  | 0.8367 ±0.0153   | 0.8407 ±0.0127   | 0.8494 ±0.0212   | 0.8530 ±0.0165   | 0.8551 ±0.0168
ASBCD-O  | 0.8374 ±0.0112   | 0.8429 ±0.0133   | 0.8525 ±0.0109   | 0.8542 ±0.0098   | 0.8562 ±0.0076
*Std.: Standard Deviation

Table 3: Classification on KDD 2010: AUC comparison of algorithms for the same training time. The boldfaced results denote the highest AUC among all the algorithms for the same training time.

Method   | Time = 600s     | Time = 1200s    | Time = 1800s    | Time = 2400s    | Time = 3000s
SGD      | 0.8339 ±0.0163  | 0.8341 ±0.0153  | 0.8342 ±0.0146  | 0.8344 ±0.0154  | 0.8344 ±0.0112
SBCD     | 0.8348 ±0.0265  | 0.8350 ±0.0233  | 0.8350 ±0.0245  | 0.8352 ±0.0256  | 0.8353 ±0.0356
SAGA     | 0.8306 ±0.0223  | 0.8337 ±0.0241  | 0.8351 ±0.0153  | 0.8365 ±0.0151  | 0.8379 ±0.0099
SVRG     | 0.8293 ±0.0287  | 0.8320 ±0.0198  | 0.8340 ±0.0166  | 0.8353 ±0.0083  | 0.8368 ±0.0087
MRBCD    | 0.8309 ±0.0280  | 0.8339 ±0.0296  | 0.8356 ±0.0146  | 0.8370 ±0.0170  | 0.8390 ±0.0153
ASBCD-U  | 0.8311 ±0.0148  | 0.8346 ±0.0150  | 0.8367 ±0.0153  | 0.8385 ±0.0134  | 0.8415 ±0.0113
ASBCD-O  | 0.8314 ±0.0122  | 0.8351 ±0.0097  | 0.8371 ±0.0103  | 0.8396 ±0.0087  | 0.8432 ±0.0081
*Std.: Standard Deviation

Table 4: Classification on COVTYPE: AUC comparison of algorithms for the same training time. The boldfaced results denote the highest AUC among all the algorithms for the same training time.

Method   | Time = 100s     | Time = 200s     | Time = 300s     | Time = 400s     | Time = 500s
SGD      | 0.6998 ±0.0211  | 0.7004 ±0.0309  | 0.7007 ±0.0160  | 0.7010 ±0.0187  | 0.7010 ±0.0255
SBCD     | 0.7021 ±0.0528  | 0.7024 ±0.0352  | 0.7026 ±0.0312  | 0.7027 ±0.0319  | 0.7028 ±0.0334
SAGA     | 0.7259 ±0.0165  | 0.7321 ±0.0323  | 0.7318 ±0.0298  | 0.7332 ±0.0075  | 0.7351 ±0.0135
SVRG     | 0.7234 ±0.0176  | 0.7265 ±0.0233  | 0.7288 ±0.0236  | 0.7298 ±0.0182  | 0.7305 ±0.0208
MRBCD    | 0.7241 ±0.0313  | 0.7268 ±0.0254  | 0.7295 ±0.0216  | 0.7306 ±0.0249  | 0.7313 ±0.0185
ASBCD-U  | 0.7302 ±0.0243  | 0.7329 ±0.0249  | 0.7372 ±0.0156  | 0.7393 ±0.0175  | 0.7424 ±0.0123
ASBCD-O  | 0.7296 ±0.0258  | 0.7325 ±0.0223  | 0.7390 ±0.0149  | 0.7438 ±0.0104  | 0.7483 ±0.0112
*Std.: Standard Deviation

Table 5: Classification on RCV1: AUC comparison of algorithms for the same training time. The boldfaced results denote the highest AUC among all the algorithms for the same training time.

Method   | Time = 100s     | Time = 200s     | Time = 300s     | Time = 400s     | Time = 500s
SGD      | 0.9192 ±0.0143  | 0.9194 ±0.0213  | 0.9194 ±0.0202  | 0.9195 ±0.0143  | 0.9196 ±0.0138
SBCD     | 0.9193 ±0.0198  | 0.9194 ±0.0214  | 0.9196 ±0.0224  | 0.9197 ±0.0188  | 0.9199 ±0.0167
SAGA     | 0.9215 ±0.0204  | 0.9231 ±0.0280  | 0.9240 ±0.0198  | 0.9257 ±0.0143  | 0.9278 ±0.0321
SVRG     | 0.9201 ±0.0251  | 0.9210 ±0.0250  | 0.9221 ±0.0132  | 0.9224 ±0.0049  | 0.9228 ±0.0099
MRBCD    | 0.9206 ±0.0203  | 0.9223 ±0.0144  | 0.9226 ±0.0139  | 0.9233 ±0.0199  | 0.9242 ±0.0168
ASBCD-U  | 0.9216 ±0.0166  | 0.9228 ±0.0072  | 0.9242 ±0.0069  | 0.9259 ±0.0045  | 0.9282 ±0.0057
ASBCD-O  | 0.9218 ±0.0199  | 0.9232 ±0.0063  | 0.9247 ±0.0048  | 0.9268 ±0.0031  | 0.9291 ±0.0040
*Std.: Standard Deviation

Table 6: Regression on E2006-TFIDF: MSE comparison of algorithms for the same training time. The boldfaced results denote the lowest MSE among all the algorithms for the same training time.

Method   | Time = 300s     | Time = 600s     | Time = 900s     | Time = 1200s    | Time = 1500s
SBCD     | 0.2852 ±0.0219  | 0.2806 ±0.0118  | 0.2794 ±0.0132  | 0.2790 ±0.0143  | 0.2773 ±0.0164
SGD      | 0.2850 ±0.0142  | 0.2829 ±0.0136  | 0.2824 ±0.0136  | 0.2815 ±0.0187  | 0.2808 ±0.0238
SAGA     | 0.2386 ±0.0136  | 0.1892 ±0.0139  | 0.1629 ±0.0109  | 0.1611 ±0.0129  | 0.1598 ±0.0109
SVRG     | 0.2723 ±0.0123  | 0.2324 ±0.0134  | 0.2008 ±0.0043  | 0.1643 ±0.0085  | 0.1620 ±0.0048
MRBCD    | 0.2402 ±0.0144  | 0.1854 ±0.0098  | 0.1635 ±0.0130  | 0.1607 ±0.0092  | 0.1594 ±0.0096
ASBCD-U  | 0.2183 ±0.0043  | 0.1653 ±0.0050  | 0.1628 ±0.0087  | 0.1605 ±0.0054  | 0.1589 ±0.0042
ASBCD-O  | 0.2158 ±0.0078  | 0.1647 ±0.0064  | 0.1621 ±0.0047  | 0.1597 ±0.0057  | 0.1583 ±0.0054
*Std.: Standard Deviation

(Figure 3 panels: (a) Classification on COVTYPE, (b) Classification on RCV1, (c) Regression on E2006-TFIDF; each plots the objective gap value on a log scale against the number of entire data passes for SG, SB, SA, SV, MR, U, and O.)
Figure 3: Convergence comparison of algorithms for the same number of entire data passes for classification and regression on three data sets. In general, ASBCD with optimal sampling (O) converges fastest to the optimum for the same number of entire data passes.

(Figure 4 panels: (a) Classification on COVTYPE, (b) Classification on RCV1, (c) Regression on E2006-TFIDF; each plots the objective gap value on a log scale against training time in seconds for SG, SB, SA, SV, MR, U, and O.)
Figure 4: Convergence comparison of algorithms for the same training time for classification and regression on three data sets. In general, ASBCD with optimal sampling (O) converges fastest to the optimum for the same training time.
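As an aside, convergence curves in the style of Figures 1-4 can be produced from logged objective gap values with a few lines of matplotlib; the numbers below are placeholders, not experimental results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical logs: objective gap P(w) - P(w*) recorded after each entire data pass per algorithm.
history = {
    "U": np.array([1e-1, 3e-2, 8e-3, 2e-3, 6e-4]),
    "O": np.array([1e-1, 2e-2, 4e-3, 8e-4, 1.5e-4]),
}
passes = np.arange(1, 6)

for name, gaps in history.items():
    plt.semilogy(passes, gaps, marker="o", label=name)
plt.xlabel("Number of Entire Data Passes")
plt.ylabel("Objective Gap Value")
plt.legend()
plt.savefig("convergence.png", dpi=150)
```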

5. RELATED WORK

The first line of research in modern optimization is randomized block coordinate descent (RBCD) algorithms [11, 49, 25, 39, 36]. These algorithms exploit the block separability of the regularization function R(w). With separable coordinate blocks, such algorithms only compute the gradient of F(w) with respect to a randomly selected block at each iteration rather than the full gradient with respect to all coordinates: they are faster than full gradient descent at each iteration [11, 49, 25, 39, 36]. However, such algorithms still compute the exact partial gradient based on all the n component functions per iteration, and accessing all the component functions is computationally expensive when the training data set has a large number of instances [52].

Recently, an MRBCD algorithm was proposed for randomized block coordinate descent using mini-batches [54]. At each iteration, both a block of coordinates and a mini-batch of component functions are sampled, but there are multiple stages with two nested loops. For each iteration of the outer loop, the exact gradient is computed once; in the follow-up inner loop, a gradient estimate is computed multiple times to help adjust the exact gradient. MRBCD has a linear rate of convergence for strongly convex and smooth F(w) only when the batch size is "large enough", although batches of larger sizes increase the per-iteration computational cost [54] (Theorem 4.2). Similar algorithms and theoretical results to those of MRBCD were also proposed [48, 18]. Chen and Gu further considered related but different sparsity-constrained non-convex problems and studied stochastic optimization algorithms with block coordinate gradient descent [6]. Our work departs from the related work in the above line of research by attaining a linear rate of convergence through optimally and non-uniformly sampling a single data instance at each iteration.

The second line of research in modern optimization is proximal gradient descent. In each iteration, a proximal operator is used in the update, which can be viewed as a special case of splitting algorithms [24, 5, 35]. Proximal gradient descent is computationally expensive at each iteration, hence proximal stochastic gradient descent is often used when the data set is large. At each iteration, only one of the n component functions f_i is sampled, or a subset of the f_i is sampled, which is also known as mini-batch proximal stochastic gradient [43]. The advantage of proximal stochastic gradient descent is obvious: at each iteration, much less gradient computation is needed in comparison with proximal gradient descent. However, due to the variance introduced by estimating the gradient through stochastic sampling, proximal stochastic gradient descent has a sublinear rate of convergence even when P(w) is strongly convex and smooth.

To accelerate proximal stochastic gradient descent, variance reduction methods were proposed recently. Such accelerated algorithms include stochastic average gradient (SAG) [38], stochastic dual coordinate ascent (SDCA) [42], stochastic variance reduced gradient (SVRG) [15], semi-stochastic gradient descent (S2GD) [19], permutable incremental gradient (Finito) [10], minimization by incremental surrogate optimization (MISO) [27], and the advanced stochastic gradient method (SAGA) [9]. There are also some more recent extensions in this line of research, such as proximal SDCA (ProxSDCA) [40], accelerated mini-batch SDCA (ASDCA) [41], an adaptive variant of SDCA (AdaSDCA) [7], randomized dual coordinate ascent (Quartz) [34], mini-batch S2GD (mS2GD) [17], and proximal SVRG (ProxSVRG) [50].

Besides, several studies show that non-uniform sampling can be used to improve the rate of convergence of stochastic optimization algorithms [45, 31, 29, 50, 34, 53, 37, 33]. However, the proposed sampling schemes in these studies cannot be directly applied to our algorithm, because they are limited in at least one of the following two aspects: (1) the algorithm does not apply to composite objectives with a non-differentiable function; (2) it does not support randomized block coordinate descent.

6. CONCLUSION

Research on big data is increasingly important and common. Training data mining and machine learning models often involves minimizing empirical risk or maximizing likelihood over the training data set, especially in solving classification and regression problems. Thus, big data research often relies on optimization algorithms, such as proximal gradient descent algorithms. At each iteration, proximal gradient descent algorithms have a much higher computational cost because the update is based on all the data instances and features. Randomized block coordinate descent algorithms are still computationally expensive at each iteration when the number of data instances is large. Therefore, we focused on stochastic block coordinate descent, which samples both data instances and features at every iteration.

We proposed the ASBCD algorithm to accelerate stochastic block coordinate descent. ASBCD incorporates the incrementally averaged partial derivative into the stochastic partial derivative. For smooth and strongly convex functions with non-differentiable regularization functions, ASBCD is able to achieve a linear rate of convergence. The optimal sampling achieves a lower iteration complexity for ASBCD. The empirical evaluation with both classification and regression problems on four large-scale real data sets supported our theory.

Acknowledgement. We would like to thank the anonymous reviewers for their helpful comments. This research was partially supported by Quanquan Gu's startup funding at the Department of Systems and Information Engineering, University of Virginia.

7. REFERENCES

[1] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.
[2] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232, 2011.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[4] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear support vector machines. The Journal of Machine Learning Research, 9:1369–1398, 2008.
[5] G. H. Chen and R. Rockafellar. Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.
[6] J. Chen and Q. Gu. Accelerated stochastic block coordinate gradient descent for sparsity constrained nonconvex optimization. In Conference on Uncertainty in Artificial Intelligence, 2016.
[7] D. Csiba, Z. Qu, and P. Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. arXiv:1502.08053, 2015.
[8] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM Journal on Optimization, 25(2):856–881, 2015.
[9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[10] A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning, 2014.
[11] J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
[12] W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
[13] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[14] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408–415. ACM, 2008.
[15] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[16] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith. Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272–280. Association for Computational Linguistics, 2009.
[17] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv:1410.4744, 2014.
[18] J. Konečný, Z. Qu, and P. Richtárik. Semi-stochastic coordinate descent. arXiv:1412.6293, 2014.
[19] J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.
[20] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.
[21] Y. Li and S. Osher. Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.
[22] M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.
[23] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. arXiv:1407.1296, 2014.
[24] P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
[25] H. Liu, M. Palatucci, and J. Zhang. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 649–656. ACM, 2009.
[26] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615–642, 2015.
[27] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. arXiv:1402.4419, 2014.
[28] R. Mazumder, J. H. Friedman, and T. Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 2012.
[29] D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
[30] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2004.
[31] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[32] Y. Nesterov et al. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics, 2007.
[33] Z. Qu and P. Richtárik. Coordinate descent with arbitrary sampling I: Algorithms and complexity. arXiv:1412.8060, 2014.
[34] Z. Qu, P. Richtárik, and T. Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014.
[35] S. Reddi, A. Hefny, C. Downey, A. Dubey, and S. Sra. Large-scale randomized-coordinate descent methods with non-separable linear constraints. arXiv:1409.2617, 2014.
[36] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
[37] M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, and A. Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. arXiv:1504.04406, 2015.
[38] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
[39] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. The Journal of Machine Learning Research, 12:1865–1892, 2011.
[40] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.
[41] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 378–385, 2013.
[42] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
[43] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[44] J. Stamper, A. Niculescu-Mizil, S. Ritter, G. Gordon, and K. Koedinger. Bridge to Algebra dataset from KDD Cup 2010 Educational Data Mining Challenge. http://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp, 2010.
[45] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.
[46] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[47] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.
[48] H. Wang and A. Banerjee. Randomized block coordinate descent for online and stochastic optimization. arXiv:1407.0107, 2014.
[49] T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.
[50] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[51] Y. Xu and W. Yin. Block stochastic gradient iteration for convex and nonconvex optimization. SIAM Journal on Optimization, 25(3):1686–1716, 2015.
[52] A. Zhang, A. Goyal, R. Baeza-Yates, Y. Chang, J. Han, C. A. Gunter, and H. Deng. Towards mobile query auto-completion: An efficient mobile application-aware approach. In Proceedings of the 25th International Conference on World Wide Web, pages 579–590, 2016.
[53] P. Zhao and T. Zhang. Stochastic optimization with importance sampling. arXiv:1401.2753, 2014.
[54] T. Zhao, M. Yu, Y. Wang, R. Arora, and H. Liu. Accelerated mini-batch randomized block coordinate descent method. In Advances in Neural Information Processing Systems, pages 3329–3337, 2014.

APPENDIX

A. PROOF OF THE MAIN THEORY

We provide the proof for the main theory delivered in Section 3. Note that all the expectations are taken conditional on w^(t−1) and each φ_i^(t−1) unless otherwise stated. For brevity, we define

    g_i = (1/(np_i)) ∇f_i(φ_i^(t)) − (1/(np_i)) ∇f_i(φ_i^(t−1)) + (1/n) Σ_{k=1}^n ∇f_k(φ_k^(t−1)).   (A.1)

Let us introduce several important lemmas. To begin with, since Algorithm 1 leverages randomized coordinate blocks, the following lemma is needed for taking the expectation of the squared gap between the iterate w^(t) and the optimal solution w* in (1.1) with respect to the coordinate block index j.

LEMMA A.1. Suppose that Assumption 3.3 holds. Let j be a coordinate block index. With g_i defined in (A.1) and w* defined in (1.1), based on Algorithm 1 we have

    E_j[||w^(t) − w*||²] ≤ (1/m) [ (m − 1) ||w^(t−1) − w*||² + ||w^(t−1) − η g_i − w* + η ∇F(w*)||² ].

Lemma A.1 takes the expectation of the squared gap between the iterate w^(t) and the optimal solution w* in (1.1) with respect to the randomized coordinate block index. The obtained upper bound involves neither a randomized coordinate block index nor the proximal operator. Block separability and non-expansiveness of the proximal operator are both exploited in deriving the upper bound. This upper bound is used for deriving a linear rate of convergence for Algorithm 1.

LEMMA A.2. Based on Algorithm 1 and with g_i defined in (A.1), we have E_i[g_i] = ∇F(w^(t−1)).

Lemma A.2 guarantees that g_i is an unbiased gradient estimator of F(w). The proof follows directly from the definition of g_i in (A.1).

LEMMA A.3. With g_i defined in (A.1) and w* defined in (1.1), based on Algorithm 1 and for all ζ > 0 we have

    E_i[||g_i − ∇F(w*)||²] ≤ (1 + ζ) E_i[ || (1/(np_i)) ∇f_i(w^(t−1)) − (1/(np_i)) ∇f_i(w*) ||² ] − ζ ||∇F(w^(t−1)) − ∇F(w*)||² + (1 + ζ^{-1}) E_i[ || (1/(np_i)) ∇f_i(φ_i^(t−1)) − (1/(np_i)) ∇f_i(w*) ||² ].

Lemma A.3 makes use of the property that E[||x||²] = E[||x − E[x]||²] + ||E[x]||² for all x and the property that ||x + y||² ≤ (1 + ζ) ||x||² + (1 + ζ^{-1}) ||y||² for all x, y, and ζ > 0.

LEMMA A.4. Let f be strongly convex with the convexity parameter µ and let its gradient be Lipschitz continuous with the constant L. For all x and y, it holds that

    ⟨∇f(y), x − y⟩ ≤ f(x) − f(y) − (1/(2(L − µ))) ||∇f(x) − ∇f(y)||² − (µ/(L − µ)) ⟨∇f(x) − ∇f(y), y − x⟩ − (Lµ/(2(L − µ))) ||y − x||².

Lemma A.4 leverages properties of strongly convex functions with Lipschitz continuous gradients.

LEMMA A.5. Algorithm 1 implies that

    E_i[ (1/n) Σ_{i=1}^n (L_i/(np_i)) f_i(φ_i^(t)) ] = (1/n) Σ_{i=1}^n (L_i/n) f_i(w^(t−1)) + (1/n) Σ_{i=1}^n ((1 − p_i) L_i/(np_i)) f_i(φ_i^(t−1)).

Lemma A.5 is obtained according to the non-uniform sampling of component functions in Algorithm 1.

REMARK A.6. Similar to Lemma A.5, we have

    E_i[ (1/n) Σ_{i=1}^n (L_i/(np_i)) ⟨∇f_i(w*), φ_i^(t) − w*⟩ ] = (1/n) Σ_{i=1}^n (L_i/n) ⟨∇f_i(w*), w^(t−1) − w*⟩ + (1/n) Σ_{i=1}^n ((1 − p_i) L_i/(np_i)) ⟨∇f_i(w*), φ_i^(t−1) − w*⟩.   (A.2)

Now we develop the main theorem bounding the rate of convergence for Algorithm 1.

PROOF OF THEOREM 3.4. By applying Lemmas A.1, A.2, and A.3,

    E_{i,j}[||w^(t) − w*||²]
    ≤ (1/m) [ m ||w^(t−1) − w*||² + 2η ⟨∇F(w*), w^(t−1) − w*⟩ − 2η ⟨∇F(w^(t−1)), w^(t−1) − w*⟩
      + η²(1 + ζ) E_i[ || (1/(np_i)) ∇f_i(w^(t−1)) − (1/(np_i)) ∇f_i(w*) ||² ]
      + η²(1 + ζ^{-1}) E_i[ || (1/(np_i)) ∇f_i(φ_i^(t−1)) − (1/(np_i)) ∇f_i(w*) ||² ]
      − η² ζ ||∇F(w^(t−1)) − ∇F(w*)||² ].   (A.3)

Substituting x, y, and f with w*, w^(t−1), and f_i in Lemma A.4, and averaging both sides of the resulting inequality over i, we obtain

    −2η ⟨∇F(w^(t−1)), w^(t−1) − w*⟩
    ≤ (2η/n) Σ_{i=1}^n ((L_i − µ)/L_i) [f_i(w*) − f_i(w^(t−1))]
      − (η/n) Σ_{i=1}^n (1/L_i) ||∇f_i(w*) − ∇f_i(w^(t−1))||²
      − (2ηµ/n) Σ_{i=1}^n (1/L_i) ⟨∇f_i(w*), w^(t−1) − w*⟩
      − ηµ ||w* − w^(t−1)||².   (A.4)

Recall the property of any function f that is convex and has a Lipschitz continuous gradient with the constant L: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ||∇f(x) − ∇f(y)||²/(2L) for all x and y [30] (Theorem 2.1.5). Substituting y, x, and f with φ_i^(t−1), w*, and f_i, taking the average on both sides, and rearranging terms, we have

    E_i[ || (1/(np_i)) ∇f_i(φ_i^(t−1)) − (1/(np_i)) ∇f_i(w*) ||² ]
    ≤ Σ_{i=1}^n (2L_i/(n² p_i)) [ f_i(φ_i^(t−1)) − f_i(w*) − ⟨∇f_i(w*), φ_i^(t−1) − w*⟩ ].   (A.5)

Before further proceeding with the proof, we define

    H^(t) = (1/n) Σ_{i=1}^n (L_i/(np_i)) [ f_i(φ_i^(t)) − f_i(w*) − ⟨∇f_i(w*), φ_i^(t) − w*⟩ ] + κ ||w^(t) − w*||².   (A.6)

Following the definition in (A.6), for all α > 0,

    E_{i,j}[H^(t)] − α H^(t−1)
    = E_{i,j}[ (1/n) Σ_{i=1}^n (L_i/(np_i)) f_i(φ_i^(t)) ] − (1/n) Σ_{i=1}^n (L_i/(np_i)) f_i(w*)
      − E_{i,j}[ (1/n) Σ_{i=1}^n (L_i/(np_i)) ⟨∇f_i(w*), φ_i^(t) − w*⟩ ] + E_{i,j}[ κ ||w^(t) − w*||² ] − α H^(t−1).

Recall the property of any strongly convex function f with the convexity parameter µ that f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + ||∇f(x) − ∇f(y)||²/(2µ) for all x and y [30] (Theorem 2.1.10). We can obtain

    −||∇f_i(w^(t−1)) − ∇f_i(w*)||² ≤ −2µ [ f_i(w^(t−1)) − f_i(w*) − ⟨∇f_i(w*), w^(t−1) − w*⟩ ].

Combining (A.3) with a positive constant κ, (A.4), and (A.5), after simplifying terms, by Lemma A.5 and (A.2), and with L_M = max_i L_i and p_I = min_i p_i, we have

    E_{i,j}[H^(t)] − α H^(t−1) ≤ Σ_{k=1}^4 c_k T_k,   (A.7)

where the four constant factors are

    c_1 = (κη/(mn)) [ η(1 + ζ)/(np_I) − 1/L_M ],
    c_2 = (1/n) [ L_M/n − 2κη(L_M − µ)/(L_M m) − 2κη²µζ/m ],
    c_3 = κ (1 − ηµ/m − α),
    c_4 = (L_M/n²) [ 2κη²(1 + ζ^{-1})/(m p_I) + (1 − α)/p_I − 1 ],

and the four corresponding terms are

    T_1 = Σ_{i=1}^n ||∇f_i(w^(t−1)) − ∇f_i(w*)||²,
    T_2 = Σ_{i=1}^n [ f_i(w^(t−1)) − f_i(w*) − ⟨∇f_i(w*), w^(t−1) − w*⟩ ],
    T_3 = ||w^(t−1) − w*||²,
    T_4 = Σ_{i=1}^n [ f_i(φ_i^(t−1)) − f_i(w*) − ⟨∇f_i(w*), φ_i^(t−1) − w*⟩ ].

There are four constant factors associated with the four terms on the right-hand side of (A.7). Among the four terms, obviously T_1 ≥ 0 and T_3 ≥ 0. By the convexity of each f_i, we have T_2 ≥ 0 and T_4 ≥ 0. We choose η = max_i np_i/[2(nµ + L_i)]. By setting c_1 = 0 with ζ = np_I/(L_M η) − 1 > 0, c_2 = 0 with κ = L_M m/[2nη(L_M − µ + L_M ηµζ)] > 0, and c_3 = 0 with 0 < α = 1 − ηµ/m < 1, it can be verified that c_4 ≤ 0.

With the aforementioned constant factor setting, E_{i,j}[H^(t)] − α H^(t−1) ≤ 0, where the expectation is conditional on information from the previous iteration t − 1. Taking expectation with respect to this previous iteration gives E_{i,j}[H^(t)] ≤ α E_{i,j}[H^(t−1)]. By chaining over t iteratively, E_{i,j}[H^(t)] ≤ α^t H^(0). Since the first summation in (A.6) is non-negative by convexity, we have κ ||w^(t) − w*||² ≤ H^(t). Together with the result obtained by chaining over t, the proof is complete.