Parallel Coordinate Descent for L1-Regularized Loss Minimization

Parallel Coordinate Descent for L1-Regularized Loss Minimization Joseph K. Bradley y [email protected] Aapo Kyrola y [email protected] Danny Bickson [email protected] Carlos Guestrin [email protected] Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA Abstract ent. As we discuss in Sec. 2, theory (Shalev-Shwartz We propose Shotgun, a parallel coordi- & Tewari, 2009) and extensive empirical results (Yuan et al., 2010) have shown that variants of Shooting are nate descent algorithm for minimizing L1- regularized losses. Though coordinate de- particularly competitive for high-dimensional data. scent seems inherently sequential, we prove The need for scalable optimization is growing as more convergence bounds for Shotgun which pre- applications use high-dimensional data, but processor dict linear speedups, up to a problem- core speeds have stopped increasing in recent years. dependent limit. We present a comprehen- Instead, computers come with more cores, and the sive empirical study of Shotgun for Lasso and new challenge is utilizing them efficiently. Yet despite sparse logistic regression. Our theoretical the many sequential optimization algorithms for L1- predictions on the potential for parallelism regularized losses, few parallel algorithms exist. closely match behavior on real data. Shot- gun outperforms other published solvers on a Some algorithms, such as interior point methods, can range of large problems, proving to be one of benefit from parallel matrix-vector operations. How- the most scalable algorithms for L . ever, we found empirically that such algorithms were 1 often outperformed by Shooting. Recent work analyzes parallel stochastic gradient de- 1. Introduction scent for multicore (Langford et al., 2009b) and dis- tributed settings (Mann et al., 2009; Zinkevich et al., Many applications use L -regularized models such as 1 2010). These methods parallelize over samples. In ap- the Lasso (Tibshirani, 1996) and sparse logistic regres- plications using L regularization, though, there are sion (Ng, 2004). L regularization biases learning to- 1 1 often many more features than samples, so paralleliz- wards sparse solutions, and it is especially useful for ing over samples may be of limited utility. high-dimensional problems with large numbers of features. For example, in logistic regression, it allows We therefore take an orthogonal approach and paral- sample complexity to scale logarithmically w.r.t. the lelize over features, with a remarkable result: we can number of irrelevant features (Ng, 2004). parallelize coordinate descent|an algorithm which seems inherently sequential|for L -regularized losses. Much effort has been put into developing optimiza- 1 In Sec. 3, we propose Shotgun, a simple multicore al- tion algorithms for L models. These algorithms range 1 gorithm which makes P coordinate updates in paral- from coordinate minimization (Fu, 1998) and stochas- lel. We prove strong convergence bounds for Shotgun tic gradient (Shalev-Shwartz & Tewari, 2009) to more which predict speedups over Shooting which are linear complex interior point methods (Kim et al., 2007). in P, up to a problem-dependent maximum P∗. More- Coordinate descent, which we call Shooting after Fu over, our theory provides an estimate for this ideal P∗ (1998), is a simple but very effective algorithm which which may be easily computed from the data. updates one coordinate per iteration. It often requires Parallel coordinate descent was also considered by no tuning of parameters, unlike, e.g., stochastic gradi- Tsitsiklis et al. (1986), but for differentiable objec- Appearing in Proceedings of the 28 th International Con- tives in the asynchronous setting. They give a very ference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s). y These authors contributed equally to this work. Parallel Coordinate Descent for L1-Regularized Loss Minimization general analysis, proving asymptotic convergence but Algorithm 1 Shooting: Sequential SCD not convergence rates. We are able to prove rates and 2d Set x = 0 2 R+ . theoretical speedups for our class of objectives. while not converged do In Sec. 4, we compare multicore Shotgun with five Choose j 2 f1;:::; 2dg uniformly at random. state-of-the-art algorithms on 35 real and synthetic Set δxj − max{−xj; −(rF (x))j/βg. datasets. The results show that in large problems Update xj − xj + δxj. Shotgun outperforms the other algorithms. Our ex- end while periments also validate the theoretical predictions by showing that Shotgun requires only about 1=P as of Shooting for solving (1). SCD (Alg. 1) randomly many iterations as Shooting. We measure the parallel chooses one weight xj to update per iteration. It com- speedup in running time and analyze the limitations putes the update xj xj + δxj via imposed by the multicore hardware. δxj = max{−xj ; −(rF (x))j /βg ; (5) 2. L1-Regularized Loss Minimization where β > 0 is a loss-dependent constant. We consider optimization problems of the form To our knowledge, Shalev-Shwartz and Tewari (2009) n provide the best known convergence bounds for SCD. X T min F (x) = L(ai x; yi) + λkxk1 ; (1) Their analysis requires a uniform upper bound on the x2 d R i=1 change in the loss F (x) from updating a single weight: 2d where L(·) is a non-negative convex loss. Each of n Assumption 2.1. Let F (x): R+ −! R be a convex d function. Assume there exists β > 0 s.t., for all x and samples has a feature vector ai 2 R and observation n d single-weight updates δx , we have: yi (where y 2 Y ). x 2 R is an unknown vector j of weights for features. λ ≥ 0 is a regularization pa- β(δx )2 n×d F (x + (δx )ej) ≤ F (x) + δx (rF (x)) + j ; rameter. Let A 2 R be the design matrix, whose j j j 2 th i row is ai. Assume w.l.o.g. that columns of A are normalized s.t. diag(AT A) = 1.1 where ej is a unit vector with 1 in its jth entry. For An instance of (1) is the Lasso (Tibshirani, 1996) (in the losses in (2) and (3), Taylor expansions give penalty form), for which Y ≡ R and 1 β = 1 (squared loss) and β = 4 (logistic loss). (6) 1 2 F (x) = kAx − yk + λkxk1 ; (2) 2 2 Using this bound, they prove the following theorem. as well as sparse logistic regression (Ng, 2004), for Theorem 2.1. (Shalev-Shwartz & Tewari, 2009) Let which Y ≡ {−1; +1g and x∗ minimize (4) and x(T ) be the output of Alg. 1 after n T iterations. If F (x) satisfies Assumption 2.1, then X T F (x) = log 1 + exp −yiai x + λkxk1 : (3) h i d(βkx∗k2 + 2F (x(0))) i=1 E F (x(T )) − F (x∗) ≤ 2 ; (7) T + 1 For analysis, we follow Shalev-Shwartz and Tewari (2009) and transform (1) into an equivalent problem where E[·] is w.r.t. the random choices of weights j. 2d with a twice-differentiable regularizer. We let x^ 2 R+ , 2d As Shalev-Shwartz and Tewari (2009) argue, Theo- use duplicated features aî = [ai; −ai] 2 R , and solve rem 2.1 indicates that SCD scales well in the dimen- n 2d sionality d of the data. For example, it achieves bet- X T X min L(aî x^; yi) + λ x^j : (4) 2d ter runtime bounds w.r.t. d than stochastic gradient x^2R + i=1 j=1 methods such as SMIDAS (Shalev-Shwartz & Tewari, 2d 2009) and truncated gradient (Langford et al., 2009a). If ^x 2 R+ minimizes (4), then x : xi =x ^d+i −xî minimizes (1). Though our analysis uses duplicate features, they are not needed for an implementation. 3. Parallel Coordinate Descent 2.1. Sequential Coordinate Descent As the dimensionality d or sample size n increase, even fast sequential algorithms become expensive. To scale Shalev-Shwartz and Tewari (2009) analyze Stochas- to larger problems, we turn to parallel computation. tic Coordinate Descent (SCD), a stochastic version In this section, we present our main theoretical contri- 1Normalizing A does not change the objective if a sep- bution: we show coordinate descent can be parallelized arate, normalized λj is used for each xj . by proving strong convergence bounds. Parallel Coordinate Descent for L1-Regularized Loss Minimization Algorithm 2 Shotgun: Parallel SCD Choose number of parallel updates P ≥ 1. 2d Set x = 0 2 R+ while not converged do In parallel on P processors Choose j 2 f1;:::; 2dg uniformly at random. Set δxj − max{−xj; −(rF (x))j/βg. Update xj − xj + δxj. end while Figure 1. Intuition for parallel coordinate descent. Con- tour plots of two objectives, with darker meaning better. Left: Features are uncorrelated; parallel updates are useful. We parallelize stochastic Shooting and call our algo- Right: Features are correlated; parallel updates conflict. rithm Shotgun (Alg. 2). Shotgun initially chooses P, the number of weights to update in parallel. On each 3.1. Shotgun Convergence Analysis iteration, it chooses P weights independently and uniformly at random from f1;:::; 2dg; these form a mul- In this section, we present our convergence result for Shotgun. The result provides a problem-specific mea- tiset Pt. It updates each xij : ij 2 Pt; in parallel using the same update as Shooting (5). Let ∆x be the sure of the potential for parallelization: the spectral P radius ρ of AT A (i.e., the maximum of the magni- collective update to x, i.e., (∆x)k = i 2P : k=i δxij . j t j tudes of eigenvalues of AT A). Moreover, this measure Intuitively, parallel updates might increase the risk of is prescriptive: ρ may be estimated via, e.g., power divergence.

Parallel Coordinate Descent for L1-Regularized Loss Minimization

A Generic Coordinate Descent Solver for Nonsmooth Convex Optimization Olivier Fercoq

12. Coordinate Descent Methods

Decomposable Submodular Function Minimization: Discrete And

Arxiv:1706.05795V4 [Math.OC] 4 Nov 2018

Research Article Stochastic Block-Coordinate Gradient Projection Algorithms for Submodular Maximization

A Hybrid Method of Combinatorial Search and Coordinate Descent For

On the Links Between Probabilistic Graphical Models and Submodular Optimisation Senanayak Sesh Kumar Karri

An Iterated Coordinate Descent Algorithm for Regularized 0/1 Loss Minimization

Differentially Private Stochastic Coordinate Descent

Large Scale Optimization for Machine Learning

Random Batch Methods (RBM) for Interacting Particle Systems

Efficient Greedy Coordinate Descent Via Variable Partitioning