Efficient Greedy Coordinate Descent via Variable Partitioning
Huang Fang, Guanhua Fang, Tan Yu, Ping Li
Cognitive Computing Lab, Baidu Research
10900 NE 8th St., Bellevue, WA 98004, USA
{fangazq877, fanggh2018, Tanyuuynat, pingli98}@gmail.com

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

Greedy coordinate descent (GCD) is an efficient optimization algorithm for a wide range of machine learning and data mining applications. GCD can be significantly faster than randomized coordinate descent (RCD) when the two have similar per-iteration cost. Nevertheless, in some cases the greedy rule used in GCD cannot be implemented efficiently, leading to a large per-iteration cost and making GCD slower than RCD. To reduce the per-iteration cost, existing solutions rely on maximum inner product search (MIPS) as an approximate greedy rule. However, it has been shown empirically that GCD with an approximate greedy rule can suffer from slow convergence even with state-of-the-art MIPS algorithms. We propose a hybrid coordinate descent algorithm with a simple variable partitioning strategy to tackle the cases in which the greedy rule cannot be implemented efficiently. The convergence rate and theoretical properties of the new algorithm are presented. The proposed method is shown to be especially useful when the data matrix has a group structure. Numerical experiments with both synthetic and real-world data demonstrate that our new algorithm is competitive against RCD, GCD, approximate GCD with MIPS, and their accelerated variants.

1 INTRODUCTION

With the immense growth of data in recent years, classical optimization algorithms tend to struggle with the current huge amount of data. For example, some of today's social networks have hundreds of millions of users, and even solving a linear system at this scale with a classical approach is challenging. Many efficient algorithms have been proposed or rediscovered in the last two decades to solve today's large-scale optimization problems. Due to the low numerical accuracy required by most machine learning and data mining tasks, first-order methods with a cheap cost per iteration have come to dominate these fields. In particular, stochastic gradient descent (SGD) and coordinate descent (CD) are the two most important representatives. SGD and its variants are dominant for big-N problems, i.e., problems with a large number of samples (Robbins and Monro, 1951; Ghadimi and Lan, 2013; Johnson and Zhang, 2013; Defazio et al., 2014; Nguyen et al., 2017; Schmidt et al., 2017), while CD and its variants are highly effective in handling structured big-p problems, i.e., problems with a large number of features (Bertsekas and Tsitsiklis, 1989; Luo and Tseng, 1992, 1993; Tseng, 2001; Nesterov, 2012; Lee and Sidford, 2013; Shalev-Shwartz and Zhang, 2013; Richtárik and Takáč, 2014; Wright, 2015; Lu and Xiao, 2015; Allen-Zhu et al., 2016; Zhang and Xiao, 2017). Starting with Nesterov's seminal work (Nesterov, 2012), coordinate descent and its variants have been an active research topic in both the academic community and industrial practice.

Greedy coordinate descent (GCD) belongs to the family of CD. Unlike randomized coordinate descent (RCD), which randomly selects a coordinate to update in each iteration, GCD selects the coordinate that makes the most progress. This selection rule is also known as the Gauss-Southwell (GS) rule. It has been theoretically verified that GCD can converge faster than RCD if the GS rule can be implemented efficiently (Nutini et al., 2015). However, unlike RCD, which can be implemented efficiently for a wide range of problems, only a small class of problems are suitable for the GS rule (Nutini et al., 2015).
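To make the cost gap between the two selection rules concrete, the following minimal NumPy sketch (our illustration, not code from the paper; it assumes only that the current gradient vector is available) contrasts them: the RCD rule draws an index in O(1) time, while a naive implementation of the GS rule scans all d partial derivatives in every iteration.

    import numpy as np

    def rcd_pick(grad, rng):
        # RCD rule: a uniformly random coordinate, O(1) per iteration.
        return int(rng.integers(grad.shape[0]))

    def gs_pick(grad):
        # Gauss-Southwell (GS) rule: the coordinate with the largest absolute
        # partial derivative; implemented naively this costs O(d) per iteration.
        return int(np.argmax(np.abs(grad)))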
For the cases when the GS rule cannot be implemented efficiently, one can only resort to an approximate GS rule based on maximum inner product search (MIPS) algorithms (Dhillon et al., 2011; Shrivastava and Li, 2014; Karimireddy et al., 2019). However, most existing MIPS algorithms do not have strong theoretical guarantees, and extensive numerical experiments in the literature (Karimireddy et al., 2019) have shown empirically that the approximate GS rule with state-of-the-art MIPS algorithms usually performs worse than RCD on real-world datasets.

In this work, we propose a hybrid coordinate descent (hybridCD) algorithm with a simple variable partitioning strategy as an alternative to the approximate GS rule when GCD cannot be implemented efficiently. The proposed hybridCD algorithm aims to achieve the fast convergence rate of GCD while preserving the low per-iteration cost of RCD via the devised variable partitioning. On the theory side, we show that the convergence rate of our hybridCD algorithm can match that of GCD and outperform RCD when the data matrix has an underlying group structure. We also provide an accelerated hybridCD scheme which further improves the convergence rate. On the empirical side, we evaluate the new algorithm with ridge regression, logistic regression, linear support vector machine (SVM), and graph-based label propagation models. The results show that the proposed method empirically outperforms RCD, GCD, approximate GCD with MIPS, and their accelerated variants on synthetic datasets when the data have a group structure, and that it also achieves better performance on real-world datasets.

2 RELATED WORK

2.1 COORDINATE DESCENT

Coordinate descent is an old optimization method that can be traced back to the 1940s (Southwell, 1940). It has gained increasing interest in the past decade due to its superior empirical performance on structured big-p (high-dimensional) machine learning and data mining applications, including LASSO (Friedman et al., 2007; Hastie et al., 2008; Friedman et al., 2010), support vector machines (SVM) (Platt, 1999; Tseng and Yun, 2010), non-negative matrix factorization (Cichocki and Phan, 2009), graph-based label propagation (Bengio et al., 2006), Mazumder et al. (2011), etc. Despite its popularity in practice, the convergence rate of CD was not clear until Nesterov (2012)'s work, in which Nesterov presented the first non-asymptotic convergence rate of RCD. Subsequent works include block CD (Beck and Tetruashvili, 2013), proximal RCD (Richtárik and Takáč, 2014), and accelerated RCD (Lee and Sidford, 2013; Lin et al., 2015; Allen-Zhu et al., 2016; Nesterov and Stich, 2017).

2.2 GREEDY COORDINATE DESCENT

Nesterov (2012) demonstrated that GCD has the same convergence rate as RCD. However, GCD is observed to be much faster than RCD empirically, and some clever implementations of GCD do constitute the state-of-the-art solvers for machine learning problems such as kernel SVM (Joachims, 1999; Chang and Lin, 2011) and non-negative matrix factorization (Cichocki and Phan, 2009). This gap between practice and theory was filled by the refined analysis of GCD in Nutini et al. (2015) (see Tseng and Yun (2009); Nutini et al. (2017) for an extension to block GCD), in which Nutini et al. (2015) theoretically explained why GCD can be faster than RCD for a certain class of problems. Subsequent works (Karimireddy et al., 2019; Fang et al., 2020) further extended the refined analysis to composite optimization problems.

When the GS rule cannot be implemented efficiently, Dhillon et al. (2011) proposed to use a MIPS algorithm as an approximate GS rule for least squares problems, and Karimireddy et al. (2019) further discussed how to map the approximate GS rule to MIPS for a richer family of problems beyond least squares. These two papers are closely related to this work, as all of them consider how to improve GCD when the GS rule cannot be implemented efficiently.
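To see why MIPS enters the picture, consider the least squares objective $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ with columns $a_1, \dots, a_d$: the partial derivatives are $\nabla_i f(x) = a_i^{\top}(Ax - b)$, so the GS rule asks for the column whose inner product with the residual is largest in magnitude, which is exactly a maximum inner product search over $\{\pm a_1, \dots, \pm a_d\}$. The snippet below is a brute-force illustration of this correspondence (our sketch, not code from the cited papers); a MIPS index would answer the same query approximately in sublinear time.

    import numpy as np

    def gs_coordinate_least_squares(A, b, x):
        # Exact GS rule for f(x) = 0.5 * ||Ax - b||^2: the i-th partial derivative
        # is a_i^T (Ax - b), so the greedy coordinate maximizes |<a_i, residual>|.
        residual = A @ x - b
        scores = A.T @ residual  # brute force, O(nd); a MIPS structure over +/- a_i
        return int(np.argmax(np.abs(scores)))  # would approximate this argmax query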
2.3 PARALLEL COORDINATE DESCENT

Another line of work in recent years accelerates CD under multi-core or multi-machine environments (Liu et al., 2015; Richtárik and Takáč, 2016). This line of work is orthogonal to the purpose of this paper, but it is worth noting that the variable partitioning strategy used in this work has been mentioned as a heuristic for parallel greedy coordinate algorithms (Scherrer et al., 2012a,b; You et al., 2016; Moreau et al., 2018). It is also worth noting that those works apply the greedy rule within each partition in a parallel fashion, which differs from the algorithm proposed in this paper, which applies the greedy rule among all partitions.

3 PRELIMINARIES

We consider the following optimization problem:

$$\min_{x \in \mathbb{R}^d} f(x), \qquad (3.1)$$

where $d$ is the number of variables and $f: \mathbb{R}^d \to \mathbb{R}$ is a convex and smooth function. Throughout the paper, we use $x^*$ to denote a solution of problem (3.1) and $f^*$ to denote the optimal objective value. In addition, we make the following assumptions on $f$.

Assumption 3.1. $f(x + \alpha e_i)$ is $L_i$-smooth in terms of $\alpha$ for all $i \in [d]$:
$$|\nabla_i f(x + \alpha e_i) - \nabla_i f(x)| \le L_i |\alpha|, \quad \forall x \in \mathbb{R}^d,\ \alpha \in \mathbb{R},$$
where $e_i$ is the $i$-th unit vector and $\nabla_i f(x)$ denotes the $i$-th entry of $\nabla f(x)$.

Assumption 3.2. $f(x)$ is $L$-smooth:
$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|_2^2, \quad \forall x, y \in \mathbb{R}^d.$$

In the following content of this paper, we denote $L_{\max} := \max_{i \in [d]} L_i$.

Algorithm 1 Pseudocode for coordinate descent
Input: $x^{(0)}$.
for $t = 0, 1, 2, \ldots$ do
    [RCD rule] uniformly at random choose $i$ from $\{1, 2, \ldots, d\}$
    [GS rule] $i \in \arg\max_{j \in [d]} |\nabla_j f(x^{(t)})|$
    [Approx-GS rule] choose $i$ from $\{1, 2, \ldots, d\}$ by a MIPS algorithm
    $x^{(t+1)} = x^{(t)} - \frac{1}{L_i} \nabla_i f(x^{(t)}) e_i$
end for

Algorithm 2 Hybrid coordinate descent
Input: $x^{(0)}$, $\mathcal{B} = \{B_i\}_{i=1}^{k}$.
for $t = 0, 1, 2, \ldots$ do
    $I = \emptyset$
    for $j = 1, 2, \ldots, k$ do
        [Random rule] uniformly at random choose $i_j \in B_j$ and let $I = I \cup \{i_j\}$
    end for
    [Greedy rule] $i \in \arg\max_{j \in I} |\nabla_j f(x^{(t)})|$
    $x^{(t+1)} = x^{(t)} - \frac{1}{L_i} \nabla_i f(x^{(t)}) e_i$
end for
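For concreteness, the following minimal NumPy sketch implements Algorithm 2 for a generic smooth objective (our illustration; the gradient oracle grad_fn, the coordinate Lipschitz constants L, and the partition are placeholders supplied by the caller, and the full-gradient call is for clarity only; an efficient implementation would maintain the needed partial derivatives incrementally). The usage example plugs in ridge regression, one of the objectives used in the experiments, with an arbitrary equal-size partition standing in for the paper's group-structure-based partitioning.

    import numpy as np

    def hybrid_cd(grad_fn, L, x0, partition, steps, seed=0):
        # Algorithm 2 (hybrid coordinate descent) as a plain NumPy sketch.
        # grad_fn(x): full gradient of f at x (used here only for clarity).
        # L: array of coordinate-wise Lipschitz constants L_i.
        # partition: list of index arrays B_1, ..., B_k covering {0, ..., d-1}.
        rng = np.random.default_rng(seed)
        x = x0.copy()
        for _ in range(steps):
            # Random rule: draw one candidate coordinate from each block B_j.
            candidates = np.array([rng.choice(block) for block in partition])
            g = grad_fn(x)
            # Greedy rule: among the k candidates, keep the largest |partial derivative|.
            i = candidates[np.argmax(np.abs(g[candidates]))]
            # Coordinate update with step size 1/L_i.
            x[i] -= g[i] / L[i]
        return x

    # Usage sketch: ridge regression f(x) = 0.5*||Ax - b||^2 + 0.5*lam*||x||^2,
    # for which the coordinate Lipschitz constants are L_i = ||a_i||^2 + lam.
    rng = np.random.default_rng(0)
    A, b, lam = rng.standard_normal((100, 40)), rng.standard_normal(100), 0.1
    L = (A ** 2).sum(axis=0) + lam
    grad = lambda x: A.T @ (A @ x - b) + lam * x
    blocks = np.array_split(np.arange(40), 8)   # naive partition into k = 8 blocks
    x_hat = hybrid_cd(grad, L, np.zeros(40), blocks, steps=2000)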