A Dual Coordinate Descent Method for Large-scale Linear SVM

Cho-Jui Hsieh [email protected]
Kai-Wei Chang [email protected]
Chih-Jen Lin [email protected]
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan

S. Sathiya Keerthi [email protected]
Yahoo! Research, Santa Clara, USA

S. Sundararajan [email protected]
Yahoo! Labs, Bangalore, India

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) are one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state of the art solvers such as Pegasos, TRON, SVMperf, and a recent primal coordinate descent implementation.

1. Introduction

Support vector machines (SVM) (Boser et al., 1992) are useful for data classification. Given a set of instance-label pairs (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i ∈ {−1, +1}, SVM requires the solution of the following unconstrained optimization problem:

    min_w  (1/2) w^T w + C Σ_{i=1}^{l} ξ(w; x_i, y_i),                        (1)

where ξ(w; x_i, y_i) is a loss function, and C > 0 is a penalty parameter. Two common loss functions are:

    max(1 − y_i w^T x_i, 0)   and   max(1 − y_i w^T x_i, 0)^2.                (2)

The former is called L1-SVM, while the latter is L2-SVM. In some applications, an SVM problem appears with a bias term b. One often deals with this term by appending each instance with an additional dimension:

    x_i^T ← [x_i^T, 1],   w^T ← [w^T, b].                                     (3)

Problem (1) is often referred to as the primal form of SVM. One may instead solve its dual problem:

    min_α  f(α) = (1/2) α^T Q̄ α − e^T α
    subject to  0 ≤ α_i ≤ U, ∀i,                                              (4)

where Q̄ = Q + D, D is a diagonal matrix, and Q_ij = y_i y_j x_i^T x_j. For L1-SVM, U = C and D_ii = 0, ∀i. For L2-SVM, U = ∞ and D_ii = 1/(2C), ∀i.

An SVM usually maps training vectors into a high-dimensional space via a nonlinear function φ(x). Due to the high dimensionality of the vector variable w, one solves the dual problem (4) by the kernel trick (i.e., using a closed form of φ(x_i)^T φ(x_j)). We call such a problem a nonlinear SVM. In some applications, data appear in a rich dimensional feature space, and the performances are similar with or without the nonlinear mapping. If data are not mapped, we can often train much larger data sets. We indicate such cases as linear SVM; these are often encountered in applications such as document classification. In this paper, we aim at solving very large linear SVM problems.

Recently, many methods have been proposed for linear SVM in large-scale scenarios. For L1-SVM, Zhang (2004), Shalev-Shwartz et al. (2007), and Bottou (2007) propose various stochastic gradient descent methods. Collins et al. (2008) apply an exponentiated gradient method. SVMperf (Joachims, 2006) uses a cutting plane technique. Smola et al. (2008) apply bundle methods, and view SVMperf as a special case. For L2-SVM, Keerthi and DeCoste (2005) propose modified Newton methods. A trust region Newton method (TRON) (Lin et al., 2008) is proposed for logistic regression and L2-SVM. These algorithms focus on different aspects of the training speed.
Some aim at quickly obtaining a usable model, while others achieve fast final convergence in solving the optimization problem (1) or (4). Moreover, among these methods, Joachims (2006), Smola et al. (2008), and Collins et al. (2008) solve SVM via the dual (4), while others consider the primal form (1). The decision of using the primal or the dual is of course related to the algorithm design.

Very recently, Chang et al. (2007) propose using coordinate descent methods for solving the primal L2-SVM. Experiments show that their approach obtains a useful model more quickly than some of the above methods. Coordinate descent, a popular optimization technique, updates one variable at a time by minimizing a single-variable sub-problem. If one can efficiently solve this sub-problem, it can be a competitive optimization method. Due to the non-differentiability of the primal L1-SVM, Chang et al.'s work is restricted to L2-SVM. Moreover, as the primal L2-SVM is differentiable but not twice differentiable, certain considerations are needed in solving the single-variable sub-problem.

While the dual form (4) involves bound constraints 0 ≤ α_i ≤ U, its objective function is twice differentiable for both L1- and L2-SVM. In this paper, we investigate coordinate descent methods for the dual problem (4). We prove that an ε-optimal solution is obtained in O(log(1/ε)) iterations. We propose an implementation using a random order of sub-problems at each iteration, which leads to very fast training. Experiments indicate that our method is more efficient than the primal coordinate descent method. As Chang et al. (2007) solve the primal, they require easy access to a feature's corresponding data values. However, in practice one often has easier access to the values per instance. Solving the dual takes this advantage, so our implementation is simpler than that of Chang et al. (2007).
Early SVM papers (Mangasarian & Musicant, 1999; Friess et al., 1998) have discussed coordinate descent methods for the SVM dual form. However, they do not focus on large data using the linear kernel. Crammer and Singer (2003) proposed an online setting for multi-class SVM without considering large sparse data. Recently, Bordes et al. (2007) applied a coordinate descent method to multi-class SVM, but they focus on nonlinear kernels. In this paper, we point out that dual coordinate descent methods take crucial advantage of the linear kernel and outperform other solvers when the numbers of data and features are both large.

Coordinate descent methods for (4) are related to the popular decomposition methods for training nonlinear SVM. In this paper, we show their key differences and explain why earlier studies on decomposition methods failed to modify their algorithms in an efficient way like ours for large-scale linear SVM. We also discuss the connection to other linear SVM works such as (Crammer & Singer, 2003; Collins et al., 2008; Shalev-Shwartz et al., 2007).

This paper is organized as follows. In Section 2, we describe our proposed algorithm. Implementation issues are investigated in Section 3. Section 4 discusses the connection to other methods. In Section 5, we compare our method with state of the art implementations for large linear SVM. Results show that the new method is more efficient. Proofs can be found at http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

2. A Dual Coordinate Descent Method

In this section, we describe our coordinate descent method for L1- and L2-SVM. The optimization process starts from an initial point α^0 ∈ R^l and generates a sequence of vectors {α^k}_{k=0}^∞. We refer to the process from α^k to α^{k+1} as an outer iteration. In each outer iteration we have l inner iterations, so that α_1, α_2, ..., α_l are sequentially updated. Each outer iteration thus generates vectors α^{k,i} ∈ R^l, i = 1, ..., l+1, such that α^{k,1} = α^k, α^{k,l+1} = α^{k+1}, and

    α^{k,i} = [α_1^{k+1}, ..., α_{i−1}^{k+1}, α_i^k, ..., α_l^k]^T,  ∀i = 2, ..., l.

For updating α^{k,i} to α^{k,i+1}, we solve the following one-variable sub-problem:

    min_d  f(α^{k,i} + d e_i)   subject to  0 ≤ α_i^k + d ≤ U,                (5)

where e_i = [0, ..., 0, 1, 0, ..., 0]^T. The objective function of (5) is a simple quadratic function of d:

    f(α^{k,i} + d e_i) = (1/2) Q̄_ii d^2 + ∇_i f(α^{k,i}) d + constant,        (6)

where ∇_i f is the ith component of the gradient ∇f. One can easily see that (5) has an optimum at d = 0 (i.e., no need to update α_i) if and only if

    ∇_i^P f(α^{k,i}) = 0,                                                     (7)

where ∇^P f(α) means the projected gradient

    ∇_i^P f(α) = ∇_i f(α)          if 0 < α_i < U,
                 min(0, ∇_i f(α))   if α_i = 0,                               (8)
                 max(0, ∇_i f(α))   if α_i = U.

If (7) holds, we move to the index i+1 without updating α_i^{k,i}. Otherwise, we must find the optimal solution of (5). If Q̄_ii > 0, the solution is easily seen to be:

    α_i^{k,i+1} = min( max( α_i^{k,i} − ∇_i f(α^{k,i}) / Q̄_ii, 0 ), U ).      (9)
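To make the single-variable step concrete, the following is a minimal Python sketch of (7)–(9); it is not the authors' code, and the function name and signature are illustrative. It checks the projected-gradient condition and, if an update is needed, clips the unconstrained minimizer of (6) into [0, U].

```python
def solve_one_variable(alpha_i, grad_i, qbar_ii, U):
    """One dual coordinate step: minimize (6) subject to 0 <= alpha_i + d <= U.

    alpha_i : current value of the i-th dual variable
    grad_i  : the partial derivative nabla_i f(alpha), computed as in (10) or (12)
    qbar_ii : Qbar_ii = x_i^T x_i + D_ii (assumed > 0 here; the Qbar_ii = 0
              corner case is discussed after Algorithm 1)
    U       : upper bound (C for L1-SVM, infinity for L2-SVM)
    """
    # Projected gradient (8); if it is zero, condition (7) says alpha_i is optimal.
    if alpha_i == 0.0:
        pg = min(grad_i, 0.0)
    elif alpha_i == U:
        pg = max(grad_i, 0.0)
    else:
        pg = grad_i
    if pg == 0.0:
        return alpha_i                      # no update needed
    # Closed-form solution (9): Newton step on the quadratic, clipped to [0, U].
    return min(max(alpha_i - grad_i / qbar_ii, 0.0), U)
```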

Algorithm 1  A dual coordinate descent method for linear SVM

  • Given α and the corresponding w = Σ_i y_i α_i x_i.
  • While α is not optimal
      For i = 1, ..., l
        (a) ᾱ_i ← α_i
        (b) G = y_i w^T x_i − 1 + D_ii α_i
        (c) PG = min(G, 0)  if α_i = 0,
            PG = max(G, 0)  if α_i = U,
            PG = G          if 0 < α_i < U
        (d) If |PG| ≠ 0,
              α_i ← min(max(α_i − G/Q̄_ii, 0), U)
              w ← w + (α_i − ᾱ_i) y_i x_i

We thus need to calculate Q̄_ii and ∇_i f(α^{k,i}). First, Q̄_ii = x_i^T x_i + D_ii can be precomputed and stored in the memory. Second, to evaluate ∇_i f(α^{k,i}), we use

    ∇_i f(α) = (Q̄ α)_i − 1 = Σ_{j=1}^{l} Q̄_ij α_j − 1.                        (10)

Q̄ may be too large to be stored, so one calculates Q̄'s ith row when doing (10). If n̄ is the average number of nonzero elements per instance, and O(n̄) is needed for each kernel evaluation, then calculating the ith row of the kernel matrix takes O(l n̄). Such operations are expensive. However, for a linear SVM, we can define

    w = Σ_{j=1}^{l} y_j α_j x_j,                                              (11)

so (10) becomes

    ∇_i f(α) = y_i w^T x_i − 1 + D_ii α_i.                                    (12)

To evaluate (12), the main cost is O(n̄) for calculating w^T x_i. This is much smaller than O(l n̄). To apply (12), w must be maintained throughout the coordinate descent procedure. Calculating w by (11) takes O(l n̄) operations, which is too expensive. Fortunately, if ᾱ_i is the current value and α_i is the value after the update, we can maintain w by

    w ← w + (α_i − ᾱ_i) y_i x_i.                                              (13)

The number of operations is only O(n̄). To have the first w, one can use α^0 = 0, so that w = 0.
In the end, we obtain the optimal w of the primal problem (1), as the primal-dual relationship implies (11).

If Q̄_ii = 0, we have D_ii = 0, Q_ii = x_i^T x_i = 0, and hence x_i = 0. This occurs only in L1-SVM without the bias term by (3). From (12), if x_i = 0, then ∇_i f(α^{k,i}) = −1. As U = C < ∞ for L1-SVM, the solution of (5) makes the new α_i^{k,i+1} = U. We can easily include this case in (9) by setting 1/Q̄_ii = ∞.

Briefly, our algorithm uses (12) to compute ∇_i f(α^{k,i}), checks the optimality of the sub-problem (5) by (7), updates α_i by (9), and then maintains w by (13). A description is in Algorithm 1. The cost per iteration (i.e., from α^k to α^{k+1}) is O(l n̄). The main memory requirement is on storing x_1, ..., x_l. For the convergence, we prove the following theorem using techniques in (Luo & Tseng, 1992):

Theorem 1  For L1-SVM and L2-SVM, {α^{k,i}} generated by Algorithm 1 globally converges to an optimal solution α*. The convergence rate is at least linear: there are 0 < μ < 1 and an iteration k_0 such that

    f(α^{k+1}) − f(α*) ≤ μ ( f(α^k) − f(α*) ),  ∀k ≥ k_0.                     (14)

The global convergence result is quite remarkable. Usually for a convex but not strictly convex problem (e.g., L1-SVM), one can only obtain that any limit point is optimal. We define an ε-accurate solution α if f(α) ≤ f(α*) + ε. By (14), our algorithm obtains an ε-accurate solution in O(log(1/ε)) iterations.

3. Implementation Issues

3.1. Random Permutation of Sub-problems

In Algorithm 1, the coordinate descent algorithm solves the one-variable sub-problems in the order of α_1, ..., α_l. Past results such as (Chang et al., 2007) show that solving sub-problems in an arbitrary order may give faster convergence. This inspires us to randomly permute the sub-problems at each outer iteration. Formally, at the kth outer iteration, we permute {1, ..., l} to {π(1), ..., π(l)}, and solve sub-problems in the order of α_π(1), α_π(2), ..., α_π(l). Similar to Algorithm 1, the algorithm generates a sequence {α^{k,i}} such that α^{k,1} = α^k, α^{k,l+1} = α^{k+1,1}, and

    α_t^{k,i} = α_t^{k+1}  if π_k^{-1}(t) < i,
                α_t^k      if π_k^{-1}(t) ≥ i.

The update from α^{k,i} to α^{k,i+1} is by

    α_t^{k,i+1} = α_t^{k,i} + arg min_{0 ≤ α_t^{k,i}+d ≤ U} f(α^{k,i} + d e_t),   if π_k^{-1}(t) = i.

We prove that Theorem 1 is still valid. Hence, the new setting obtains an ε-accurate solution in O(log(1/ε)) iterations. A simple experiment reveals that this setting of permuting sub-problems is much faster than Algorithm 1. The improvement is also bigger than that observed in (Chang et al., 2007) for primal coordinate descent methods.
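A self-contained sketch of this randomized variant is given below, assuming sparse instances stored as lists of (feature index, value) pairs. It is an illustration of Algorithm 1 with the permutation of this section, not the LIBLINEAR implementation; the function name, data layout, and the simple stopping rule are our own choices.

```python
import random

def train_dual_cd(X, y, C, loss="l1", max_outer=1000, eps=1e-3, n_features=None):
    """Dual coordinate descent for linear L1-/L2-SVM (Algorithm 1 with
    randomly permuted sub-problems).  X[i] is a list of (feature, value) pairs."""
    l = len(X)
    if n_features is None:
        n_features = 1 + max(j for xi in X for j, _ in xi)
    U = C if loss == "l1" else float("inf")           # upper bound in (4)
    D_ii = 0.0 if loss == "l1" else 1.0 / (2.0 * C)   # diagonal term of Qbar
    # Qbar_ii = x_i^T x_i + D_ii is precomputed and stored (Section 2).
    qbar = [sum(v * v for _, v in xi) + D_ii for xi in X]
    alpha = [0.0] * l
    w = [0.0] * n_features                            # alpha = 0  =>  w = 0
    for k in range(max_outer):
        order = list(range(l))
        random.shuffle(order)                         # permute sub-problems
        max_violation = 0.0
        for i in order:
            xi, yi = X[i], y[i]
            # (12): nabla_i f(alpha) = y_i w^T x_i - 1 + D_ii alpha_i, an O(nbar) step.
            G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]
            # Projected gradient (8) for the optimality check (7).
            if alpha[i] == 0.0:
                PG = min(G, 0.0)
            elif alpha[i] == U:
                PG = max(G, 0.0)
            else:
                PG = G
            max_violation = max(max_violation, abs(PG))
            if PG != 0.0:
                old = alpha[i]
                alpha[i] = min(max(alpha[i] - G / qbar[i], 0.0), U)   # (9)
                for j, v in xi:                                       # (13): maintain w
                    w[j] += (alpha[i] - old) * yi * v
        if max_violation < eps:                       # crude stopping rule (ours)
            break
    return w, alpha
```

For example, calling train_dual_cd(X, y, C=1.0, loss="l2") returns (w, alpha) for an L2-SVM on data in this format; a new instance x is then classified by the sign of w^T x.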

Algorithm 2  Coordinate descent algorithm with randomly selecting one instance at a time

  • Given α and the corresponding w = Σ_i y_i α_i x_i.
  • While α is not optimal
      – Randomly choose i ∈ {1, ..., l}.
      – Do steps (a)–(d) of Algorithm 1 to update α_i.
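A minimal sketch of this online variant (used later in Section 3.3) follows; it simply repeats steps (a)–(d) of Algorithm 1 on one randomly chosen index per outer iteration. The names and the fixed update budget are illustrative, and the inner step is the same as in the batch sketch of Section 3.1.

```python
import random

def train_dual_cd_online(X, y, C, loss="l1", max_updates=10**6, n_features=None):
    """Algorithm 2: pick one random index per outer iteration and apply
    steps (a)-(d) of Algorithm 1 to it.  Same data format as train_dual_cd."""
    l = len(X)
    if n_features is None:
        n_features = 1 + max(j for xi in X for j, _ in xi)
    U = C if loss == "l1" else float("inf")
    D_ii = 0.0 if loss == "l1" else 1.0 / (2.0 * C)
    qbar = [sum(v * v for _, v in xi) + D_ii for xi in X]
    alpha = [0.0] * l
    w = [0.0] * n_features
    for _ in range(max_updates):
        i = random.randrange(l)                       # randomly choose i in {1,...,l}
        xi, yi = X[i], y[i]
        G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]   # step (b)
        if alpha[i] == 0.0:                                             # step (c)
            PG = min(G, 0.0)
        elif alpha[i] == U:
            PG = max(G, 0.0)
        else:
            PG = G
        if PG != 0.0:                                                   # step (d)
            old = alpha[i]
            alpha[i] = min(max(alpha[i] - G / qbar[i], 0.0), U)
            for j, v in xi:
                w[j] += (alpha[i] - old) * yi * v
    return w, alpha
```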

3.2. Shrinking

Eq. (4) contains the constraints 0 ≤ α_i ≤ U. If an α_i is 0 or U for many iterations, it may remain the same. To speed up decomposition methods for nonlinear SVM (discussed in Section 4.1), the shrinking technique (Joachims, 1998) reduces the size of the optimization problem without considering some bounded variables. Below we show that it is much easier to apply this technique to linear SVM than to the nonlinear case.

If A is the subset after removing some elements and Ā = {1, ..., l} \ A, then the new problem is

    min_{α_A}  (1/2) α_A^T Q̄_AA α_A + (Q̄_AĀ α_Ā − e_A)^T α_A
    subject to  0 ≤ α_i ≤ U, i ∈ A,                                           (15)

where Q̄_AA, Q̄_AĀ are sub-matrices of Q̄, and α_Ā is considered as a constant vector. Solving this smaller problem consumes less time and memory. Once (15) is solved, we must check if the vector α is optimal for (4). This check needs the whole gradient ∇f(α). Since

    ∇_i f(α) = Q̄_{i,A} α_A + Q̄_{i,Ā} α_Ā − 1,

if i ∈ A and one stores Q̄_{i,Ā} α_Ā before solving (15), we already have ∇_i f(α). However, for all i ∉ A, we must calculate the corresponding rows of Q̄. This step, referred to as the reconstruction of gradients in training nonlinear SVM, is very time consuming. It may cost up to O(l^2 n̄) if each kernel evaluation is O(n̄).

For linear SVM, in solving the smaller problem (15), we still have the vector

    w = Σ_{i∈A} y_i α_i x_i + Σ_{i∈Ā} y_i α_i x_i,

though only the first part Σ_{i∈A} y_i α_i x_i is updated. Therefore, using (12), ∇f(α) is easily available. Below we demonstrate a shrinking implementation so that reconstructing the whole ∇f(α) is never needed.

Our method is related to what LIBSVM (Chang & Lin, 2001) uses. From the optimality condition of bound-constrained problems, α is optimal for (4) if and only if ∇^P f(α) = 0, where ∇^P f(α) is the projected gradient defined in (8). We then prove the following result:

Theorem 2  Let α* be the convergent point of {α^{k,i}}.
  1. If α_i* = 0 and ∇_i f(α*) > 0, then ∃ k_i such that ∀k ≥ k_i, ∀s, α_i^{k,s} = 0.
  2. If α_i* = U and ∇_i f(α*) < 0, then ∃ k_i such that ∀k ≥ k_i, ∀s, α_i^{k,s} = U.
  3. lim_{k→∞} max_j ∇_j^P f(α^{k,j}) = lim_{k→∞} min_j ∇_j^P f(α^{k,j}) = 0.

During the optimization procedure, ∇^P f(α^k) ≠ 0, so in general max_j ∇_j^P f(α^k) > 0 and min_j ∇_j^P f(α^k) < 0. These two values measure how much the current solution violates the optimality condition. In our iterative procedure, what we have are ∇_i f(α^{k,i}), i = 1, ..., l. Hence, at the (k−1)st iteration, we obtain

    M^{k−1} ≡ max_j ∇_j^P f(α^{k−1,j}),   m^{k−1} ≡ min_j ∇_j^P f(α^{k−1,j}).

Then at each inner step of the kth iteration, before updating α_i^{k,i} to α_i^{k,i+1}, this element is shrunken if one of the following two conditions holds:

    α_i^{k,i} = 0 and ∇_i f(α^{k,i}) > M̄^{k−1},
    α_i^{k,i} = U and ∇_i f(α^{k,i}) < m̄^{k−1},                               (16)

where

    M̄^{k−1} = M^{k−1} if M^{k−1} > 0,  ∞ otherwise,
    m̄^{k−1} = m^{k−1} if m^{k−1} < 0,  −∞ otherwise.

In (16), M̄^{k−1} must be strictly positive, so we set it to ∞ if M^{k−1} ≤ 0. From Theorem 2, elements satisfying the "if condition" of properties 1 and 2 meet (16) after a certain number of iterations, and are then correctly removed from the optimization. To have more aggressive shrinking, one may multiply both M̄^{k−1} and m̄^{k−1} in (16) by a threshold smaller than one.

Property 3 of Theorem 2 indicates that, with a tolerance ε,

    M^k − m^k < ε                                                             (17)

is satisfied after a finite number of iterations. Hence (17) is a valid stopping condition. We also use it for the smaller problems (15). If at the kth iteration (17) is reached for (15), we enlarge A to {1, ..., l}, set M̄^k = ∞ and m̄^k = −∞ (so no shrinking occurs at the (k+1)st iteration), and continue regular iterations. Thus, we do shrinking without reconstructing gradients.
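The sketch below weaves the shrinking conditions (16) and the stopping rule (17) into the permuted loop from Section 3.1. It is our illustrative reading of the scheme (the aggressive-shrinking threshold is omitted), not the LIBLINEAR code.

```python
import random

def train_dual_cd_shrink(X, y, C, loss="l1", max_outer=1000, eps=1e-3, n_features=None):
    """Dual coordinate descent with the shrinking of Section 3.2 and stopping
    rule (17).  Same data format as train_dual_cd; an illustrative sketch."""
    l = len(X)
    if n_features is None:
        n_features = 1 + max(j for xi in X for j, _ in xi)
    U = C if loss == "l1" else float("inf")
    D_ii = 0.0 if loss == "l1" else 1.0 / (2.0 * C)
    qbar = [sum(v * v for _, v in xi) + D_ii for xi in X]
    alpha = [0.0] * l
    w = [0.0] * n_features
    active = list(range(l))
    M_bar, m_bar = float("inf"), float("-inf")        # no shrinking at the first iteration
    for k in range(max_outer):
        random.shuffle(active)
        M, m = float("-inf"), float("inf")            # running max/min projected gradient
        pos = 0
        while pos < len(active):
            i = active[pos]
            xi, yi = X[i], y[i]
            G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]   # (12)
            # Shrinking conditions (16): bounded variables that cannot move are removed.
            if (alpha[i] == 0.0 and G > M_bar) or (alpha[i] == U and G < m_bar):
                active[pos] = active[-1]
                active.pop()
                continue
            if alpha[i] == 0.0:
                PG = min(G, 0.0)
            elif alpha[i] == U:
                PG = max(G, 0.0)
            else:
                PG = G
            M, m = max(M, PG), min(m, PG)
            if PG != 0.0:
                old = alpha[i]
                alpha[i] = min(max(alpha[i] - G / qbar[i], 0.0), U)         # (9)
                for j, v in xi:
                    w[j] += (alpha[i] - old) * yi * v                       # (13)
            pos += 1
        if M - m < eps:                                # stopping condition (17)
            if len(active) == l:
                break                                  # optimal for the full problem (4)
            active = list(range(l))                    # un-shrink and continue
            M_bar, m_bar = float("inf"), float("-inf")
        else:
            M_bar = M if M > 0 else float("inf")
            m_bar = m if m < 0 else float("-inf")
    return w, alpha
```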
3.3. An Online Setting

In some applications, the number of instances is huge, so going over all of α_1, ..., α_l causes an expensive outer iteration. Instead, one can randomly choose an index i_k at a time, and update only α_{i_k} at the kth outer iteration. A description is in Algorithm 2. The setting is related to (Crammer & Singer, 2003; Collins et al., 2008). See also the discussion in Section 4.2.

4. Relations with Other Methods

Table 1. A comparison between decomposition methods (Decomp.) and dual coordinate descent (DCD). For both methods, we consider that one α_i is updated at a time. We assume Decomp. maintains gradients, but DCD does not. The average number of nonzeros per instance is n̄.

                        Nonlinear SVM           Linear SVM
                        Decomp.     DCD         Decomp.     DCD
    Update α_i          O(1)        O(l n̄)      O(1)        O(n̄)
    Maintain ∇f(α)      O(l n̄)      NA          O(l n̄)      NA

4.1. Decomposition Methods for Nonlinear SVM

Decomposition methods are one of the most popular approaches for training nonlinear SVM. As the kernel matrix is dense and cannot be stored in the computer memory, decomposition methods solve a sub-problem of a few variables at each iteration. Only a small number of corresponding kernel columns are needed, so the memory problem is resolved. If the number of variables is restricted to one, a decomposition method is like the online coordinate descent in Section 3.3, but it differs in the way it selects variables for updating. It has been shown (Keerthi & DeCoste, 2005) that, for linear SVM, decomposition methods are inefficient. On the other hand, here we are pointing out that dual coordinate descent is efficient for linear SVM. Therefore, it is important to discuss the relationship between decomposition methods and our method.

In early decomposition methods (Osuna et al., 1997; Platt, 1998), variables minimized at an iteration are selected by certain heuristics. However, subsequent developments (Joachims, 1998; Chang & Lin, 2001; Keerthi et al., 2001) all use gradient information to conduct the selection.
The main reason is that maintaining the whole gradient does not introduce extra cost. Here we explain the details by assuming that one variable of α is chosen and updated at a time (solvers like LIBSVM update at least two variables due to a linear constraint in their dual problems; here (4) has no such constraint, so selecting one variable is possible). To set up and solve the sub-problem (6), one uses (10) to calculate ∇_i f(α). If O(n̄) effort is needed for each kernel evaluation, obtaining the ith row of the kernel matrix takes O(l n̄) effort. If instead one maintains the whole gradient, then ∇_i f(α) is directly available. After updating α_i^{k,i} to α_i^{k,i+1}, we obtain Q̄'s ith column (same as the ith row due to the symmetry of Q̄), and calculate the new whole gradient:

    ∇f(α^{k,i+1}) = ∇f(α^{k,i}) + Q̄_{:,i} (α_i^{k,i+1} − α_i^{k,i}),          (18)

where Q̄_{:,i} is the ith column of Q̄. The cost is O(l n̄) for Q̄_{:,i} and O(l) for (18). Therefore, maintaining the whole gradient does not cost more. As using the whole gradient implies fewer iterations (i.e., faster convergence due to the ability to choose for updating the variable that violates optimality most), one should take this advantage. However, the situation for linear SVM is very different. With the different way (12) to calculate ∇_i f(α), the cost to update one α_i is only O(n̄). If we still maintained the whole gradient, evaluating (12) l times would take O(l n̄) effort. We gather this comparison of the different situations in Table 1. Clearly, for nonlinear SVM, one should use decomposition methods and maintain the whole gradient. However, for linear SVM, if l is large, the cost per iteration without maintaining gradients is much smaller than that with. Hence, the coordinate descent method can be faster than the decomposition method by using many cheap iterations. An earlier attempt to speed up decomposition methods for linear SVM is (Kao et al., 2004). However, it failed to derive our method here because it does not give up maintaining gradients.

4.2. Existing Linear SVM Methods

We discussed in Section 1 and other places the difference between our method and a primal coordinate descent method (Chang et al., 2007). Below we describe the relations with other linear SVM methods.

We mentioned in Section 3.3 that our Algorithm 2 is related to the online mode in (Collins et al., 2008). They aim at solving multi-class and structured problems. At each iteration an instance is used; then a sub-problem of several variables is solved. They approximately minimize the sub-problem, but for the two-class case, one can exactly solve it by (9). For the batch setting, our approach is different from theirs. The algorithm for multi-class problems in (Crammer & Singer, 2003) is also similar to our online setting. For the two-class case, it solves (1) with the loss function max(−y_i w^T x_i, 0), which is different from (2). They do not study data with a large number of features.

Next, we discuss the connection to stochastic gradient descent (Shalev-Shwartz et al., 2007; Bottou, 2007). The most important step of this method is the following update of w:

    w ← w − η ∇_w(y_i, x_i),                                                  (19)

where ∇_w(y_i, x_i) is the sub-gradient of the approximate objective function

    w^T w / 2 + C max(1 − y_i w^T x_i, 0),

and η is the learning rate (or the step size). While our method is dual-based, throughout the iterations we maintain w by (13). Both (13) and (19) use one single instance x_i, but they take different directions y_i x_i and ∇_w(y_i, x_i). The selection of the learning rate η may be the subtlest thing in stochastic gradient descent, but for our method this is never a concern. The step size (α_i − ᾱ_i) in (13) is governed by solving a sub-problem from the dual.
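To make the contrast concrete, here are the two single-instance updates side by side for a dense w and a sparse x_i, as an illustrative sketch; a fixed η is passed in, which is not Pegasos' actual learning-rate schedule.

```python
def sgd_primal_step(w, xi, yi, C, eta):
    """One stochastic (sub)gradient step (19) on  w^T w / 2 + C max(1 - y_i w^T x_i, 0).
    eta is a learning rate the user must choose/schedule."""
    margin = yi * sum(w[j] * v for j, v in xi)   # evaluated at the current w
    for j in range(len(w)):
        w[j] -= eta * w[j]                       # gradient of w^T w / 2
    if margin < 1.0:                             # sub-gradient of the hinge term
        for j, v in xi:
            w[j] += eta * C * yi * v

def dual_cd_step(w, xi, yi, alpha, i, qbar_ii, D_ii, U):
    """One dual coordinate step: the 'step size' alpha_i - alpha_i_old comes from
    exactly solving the one-variable sub-problem; no learning rate is involved."""
    G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]      # (12)
    old = alpha[i]
    alpha[i] = min(max(alpha[i] - G / qbar_ii, 0.0), U)                # (9)
    for j, v in xi:
        w[j] += (alpha[i] - old) * yi * v                              # (13)
```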
5. Experiments

In this section, we analyze the performance of our dual coordinate descent algorithm for L1- and L2-SVM. We compare our implementation with state of the art linear SVM solvers. We also investigate how the shrinking technique improves our algorithm.

Table 2. Left: statistics of the data sets (l is the number of instances, n the number of features). Right: training time in seconds for each solver to reduce the primal objective value to within 1% of the optimal value; see (20). DCDL1, Pegasos, and SVMperf train L1-SVM; DCDL2, PCD, and TRON train L2-SVM.

    Data set        l        n          # nonzeros     DCDL1  Pegasos  SVMperf   DCDL2  PCD   TRON
    a9a             32,561   123        451,592        0.2    1.1      6.0       0.4    0.1   0.1
    astro-physic    62,369   99,757     4,834,550      0.2    2.8      2.6       0.2    0.5   1.2
    real-sim        72,309   20,958     3,709,083      0.2    2.4      2.4       0.1    0.2   0.9
    news20          19,996   1,355,191  9,097,916      0.5    10.3     20.0      0.2    2.4   5.2
    yahoo-japan     176,203  832,026    23,506,415     1.1    12.7     69.4      1.0    2.9   38.2
    rcv1            677,399  47,236     49,556,258     2.6    21.9     72.0      2.7    5.1   18.6
    yahoo-korea     460,554  3,052,939  156,436,656    8.3    79.7     656.8     7.1    18.4  286.1

Table 2 lists the statistics of the data sets. Four of them (a9a, real-sim, news20, rcv1) are at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets. The set astro-physic is available upon request from Thorsten Joachims. Except a9a, all others are from document classification. Past results show that linear SVM performs as well as kernelized ones for such data. To estimate the testing accuracy, we use a stratified selection to split each set into 4/5 training and 1/5 testing. We briefly describe each set below. Details can be found in (Joachims, 2006) (astro-physic) and (Lin et al., 2008) (others). a9a is from the UCI "adult" data set. real-sim includes Usenet articles. astro-physic includes documents from Physics ArXiv. news20 is a collection of news documents. yahoo-japan and yahoo-korea are obtained from Yahoo!. rcv1 is an archive of manually categorized newswire stories from Reuters.

We compare six implementations of linear SVM. Three solve L1-SVM, and three solve L2-SVM.

DCDL1 and DCDL2: the dual coordinate descent method with sub-problems permuted at each outer iteration (see Section 3.1). DCDL1 solves L1-SVM while DCDL2 solves L2-SVM. We omit the shrinking setting.

Pegasos: the primal estimated sub-gradient solver (Shalev-Shwartz et al., 2007) for L1-SVM. The source is at http://ttic.uchicago.edu/~shai/code.

SVMperf (Joachims, 2006): a cutting plane method for L1-SVM. We use the latest version 2.1. The source is at http://svmlight.joachims.org/svm_perf.html.

TRON: a trust region Newton method (Lin et al., 2008) for L2-SVM. We use the software LIBLINEAR version 1.22 with option -s 2 (http://www.csie.ntu.edu.tw/~cjlin/liblinear).

PCD: a primal coordinate descent method for L2-SVM (Chang et al., 2007).

Since (Bottou, 2007) is related to Pegasos, we do not present its results. We do not compare with another online method, Vowpal Wabbit (Langford et al., 2007), either, as it has been made available only very recently. Though a code for the bundle method (Smola et al., 2008) is available, we do not include it for comparison due to its closeness to SVMperf. All sources used for our comparisons are available at http://csie.ntu.edu.tw/~cjlin/liblinear/exp.html.

We set the penalty parameter C = 1 for the comparison (the equivalent setting for Pegasos is λ = 1/(Cl); for SVMperf, its penalty parameter is C_perf = 0.01Cl). For all data sets, the testing accuracy does not increase after C ≥ 4. All the above methods are implemented in C/C++ with double precision. Some implementations such as (Bottou, 2007) use single precision to reduce training time, but numerical inaccuracy may occur. We do not include the bias term by (3).

To compare these solvers, we consider the CPU time of reducing the relative difference between the primal objective value and the optimum to within 0.01:

    |f^P(w) − f^P(w*)| / |f^P(w*)| ≤ 0.01,                                    (20)

where f^P is the objective function of (1), and f^P(w*) is the optimal value. Note that for consistency, we use primal objective values even for the dual solvers. The reference solutions of L1- and L2-SVM are respectively obtained by solving DCDL1 and DCDL2 until the duality gaps are less than 10^−6.
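As a small illustration of this criterion, (20) can be computed as below; f_opt stands for the reference optimum f^P(w*) obtained from a very accurately solved run, and the helper names are ours.

```python
def primal_objective(w, X, y, C, loss="l1"):
    """f^P(w) in (1): 0.5 * w^T w + C * sum of L1 or L2 hinge losses."""
    obj = 0.5 * sum(wj * wj for wj in w)
    for xi, yi in zip(X, y):
        slack = max(1.0 - yi * sum(w[j] * v for j, v in xi), 0.0)
        obj += C * (slack if loss == "l1" else slack * slack)
    return obj

def relative_primal_error(w, X, y, C, f_opt, loss="l1"):
    """Left-hand side of (20): |f^P(w) - f^P(w*)| / |f^P(w*)|."""
    return abs(primal_objective(w, X, y, C, loss) - f_opt) / abs(f_opt)
```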

Table 2 lists the results. Clearly, our dual coordinate descent method for both L1- and L2-SVM is significantly faster than the other solvers. To check details, we choose astro-physic, news20, and rcv1, and show the relative error along time in Figure 1. In Section 3.2, we pointed out that the shrinking technique is very suitable for DCD. In Figure 1, we also include DCDL1-S and DCDL2-S (DCDL1 and DCDL2 with shrinking) for comparison. As in Table 2, our solvers are efficient for both L1- and L2-SVM. With shrinking, the performance is even better.

[Figure 1. Time versus the relative error (20), for L1- and L2-SVM on astro-physic, news20, and rcv1. DCDL1-S and DCDL2-S are DCDL1 and DCDL2 with shrinking. The dotted line indicates the relative error 0.01. Time is in seconds.]

Another evaluation is to consider how fast a solver obtains a model with reasonable testing accuracy. Using the optimal solutions from the above experiment, we generate the reference models for L1- and L2-SVM. We evaluate the testing accuracy difference between the current model and the reference model along the training time. Figure 2 shows the results. Overall, DCDL1 and DCDL2 are more efficient than the other solvers. Note that we omit DCDL1-S and DCDL2-S in Figure 2, as the performances with/without shrinking are similar.

[Figure 2. Time versus the difference of testing accuracy between the current model and the reference model (obtained using strict stopping conditions), for L1- and L2-SVM on astro-physic, news20, and rcv1. Time is in seconds.]

Among L1-SVM solvers, SVMperf is competitive with Pegasos for small data. But in the case of a huge number of instances, Pegasos outperforms SVMperf. However, Pegasos has slower convergence than DCDL1. As discussed in Section 4.2, the learning rate of stochastic gradient descent may be the cause, but for DCDL1 we exactly solve sub-problems to obtain the step size in updating w. Also, Pegasos has a jumpy test set performance while DCDL1 gives a stable behavior.

In the comparison of L2-SVM solvers, DCDL2 and PCD are both coordinate descent methods. The former is applied to the dual, but the latter to the primal. DCDL2 has a closed form solution for each sub-problem, but PCD does not. The cost per PCD outer iteration is thus higher than that of DCDL2. Therefore, while PCD is very competitive (only second to DCDL1/DCDL2 in Table 2), DCDL2 is even better. Regarding TRON, as a Newton method, it possesses fast final convergence. However, since it takes significant effort at each iteration, it hardly generates a reasonable model quickly. From the experimental results, DCDL2 converges as fast as TRON, but also performs well in early iterations.

Due to the space limitation, we give the following observations without details. First, Figure 1 indicates that our coordinate descent method converges faster for L2-SVM than for L1-SVM. As L2-SVM has the diagonal matrix D with D_ii = 1/(2C), we suspect that its Q̄ is better conditioned, and hence leads to faster convergence. Second, all methods have slower convergence when C is large. However, small C's are usually enough as the accuracy is stable after a threshold. In practice, one thus should try from a small C. Moreover, if n ≪ l and C is too large, then our DCDL2 is slower than TRON or PCD (see problem a9a in Table 2, where the accuracy does not change after C ≥ 0.25). If n ≪ l, clearly one should solve the primal, whose number of variables is just n.
Such data are not our focus. Indeed, with a small number of features, one usually maps data to a higher-dimensional space and trains a nonlinear SVM. Third, we have checked the online Algorithm 2. Its performance is similar to DCDL1 and DCDL2 (i.e., the batch setting without shrinking). Fourth, we have investigated real document classification, which involves many two-class problems. Using the proposed method as the solver is more efficient than using others.

6. Discussion and Conclusions

We can apply the proposed method to solve regularized least square problems, which have the loss function (1 − y_i w^T x_i)^2 in (1). The dual is simply (4) without constraints, so the implementation is simpler.

In summary, we present and analyze an efficient dual coordinate descent method for large linear SVM. It is very simple to implement, and possesses sound optimization properties. Experiments show that our method is faster than state of the art implementations.

References

Bordes, A., Bottou, L., Gallinari, P., & Weston, J. (2007). Solving multiclass support vector machines with LaRank. ICML.

Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. COLT.

Bottou, L. (2007). Stochastic gradient descent examples. http://leon.bottou.org/projects/sgd.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chang, K.-W., Hsieh, C.-J., & Lin, C.-J. (2007). Coordinate descent method for large-scale L2-loss linear SVM (Technical Report). http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.

Collins, M., Globerson, A., Koo, T., Carreras, X., & Bartlett, P. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR. To appear.

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. JMLR, 3, 951–991.

Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. ICML.

Joachims, T. (1998). Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning. Cambridge, MA: MIT Press.

Joachims, T. (2006). Training linear SVMs in linear time. ACM KDD.

Kao, W.-C., Chung, K.-M., Sun, C.-L., & Lin, C.-J. (2004). Decomposition methods for linear support vector machines. Neural Comput., 16, 1689–1704.

Keerthi, S. S., & DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. JMLR, 6, 341–361.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput., 13, 637–649.

Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit. http://hunch.net/~vw.

Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2008). Trust region Newton method for large-scale logistic regression. JMLR, 9, 623–646.

Luo, Z.-Q., & Tseng, P. (1992). On the convergence of coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72, 7–35.

Mangasarian, O. L., & Musicant, D. R. (1999). Successive overrelaxation for support vector machines. IEEE Trans. Neural Networks, 10, 1032–1037.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. CVPR.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods - Support Vector Learning. Cambridge, MA: MIT Press.

Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: primal estimated sub-gradient solver for SVM. ICML.

Smola, A. J., Vishwanathan, S. V. N., & Le, Q. (2008). Bundle methods for machine learning. NIPS.
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. ICML.