A Dual Coordinate Descent Method for Large-scale Linear SVM

Cho-Jui Hsieh [email protected]
Kai-Wei Chang [email protected]
Chih-Jen Lin [email protected]
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan

S. Sathiya Keerthi [email protected]
Yahoo! Research, Santa Clara, USA

S. Sundararajan [email protected]
Yahoo! Labs, Bangalore, India

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) are one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an ε-accurate solution in O(log(1/ε)) iterations. Experiments indicate that our method is much faster than state of the art solvers such as Pegasos, TRON, SVMperf, and a recent primal coordinate descent implementation.

1. Introduction

Support vector machines (SVM) (Boser et al., 1992) are useful for data classification. Given a set of instance-label pairs (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i ∈ {−1, +1}, SVM requires the solution of the following unconstrained optimization problem:

    min_w  (1/2) w^T w + C Σ_{i=1}^{l} ξ(w; x_i, y_i),                        (1)

where ξ(w; x_i, y_i) is a loss function, and C > 0 is a penalty parameter. Two common loss functions are:

    max(1 − y_i w^T x_i, 0)   and   max(1 − y_i w^T x_i, 0)^2.                (2)

The former is called L1-SVM, while the latter is L2-SVM. In some applications, an SVM problem appears with a bias term b. One often deals with this term by appending each instance with an additional dimension:

    x_i^T ← [x_i^T, 1],   w^T ← [w^T, b].                                     (3)

Problem (1) is often referred to as the primal form of SVM. One may instead solve its dual problem:

    min_α  f(α) = (1/2) α^T Q̄ α − e^T α
    subject to  0 ≤ α_i ≤ U, ∀i,                                              (4)

where Q̄ = Q + D, D is a diagonal matrix, and Q_ij = y_i y_j x_i^T x_j. For L1-SVM, U = C and D_ii = 0, ∀i. For L2-SVM, U = ∞ and D_ii = 1/(2C), ∀i.

An SVM usually maps training vectors into a high-dimensional space via a nonlinear function φ(x). Due to the high dimensionality of the vector variable w, one solves the dual problem (4) by the kernel trick (i.e., using a closed form of φ(x_i)^T φ(x_j)). We call such a problem a nonlinear SVM. In some applications, data appear in a rich dimensional feature space, and the performances are similar with or without the nonlinear mapping. If data are not mapped, we can often train much larger data sets. We indicate such cases as linear SVM; these are often encountered in applications such as document classification. In this paper, we aim at solving very large linear SVM problems.

Recently, many methods have been proposed for linear SVM in large-scale scenarios. For L1-SVM, Zhang (2004), Shalev-Shwartz et al. (2007), and Bottou (2007) propose various stochastic gradient descent methods. Collins et al. (2008) apply an exponentiated gradient method. SVMperf (Joachims, 2006) uses a cutting plane technique. Smola et al. (2008) apply bundle methods, and view SVMperf as a special case. For L2-SVM, Keerthi and DeCoste (2005) propose modified Newton methods. A trust region Newton method (TRON) (Lin et al., 2008) is proposed for logistic regression and L2-SVM. These algorithms focus on different aspects of the training speed.
Some aim at quickly obtaining a usable model, while others achieve fast final convergence in solving the optimization problem (1) or (4). Moreover, among these methods, Joachims (2006), Smola et al. (2008), and Collins et al. (2008) solve SVM via the dual (4), while others consider the primal form (1). The decision of using the primal or the dual is of course related to the algorithm design.

Very recently, Chang et al. (2007) propose using coordinate descent methods for solving the primal L2-SVM. Experiments show that their approach obtains a useful model more quickly than some of the above methods. Coordinate descent, a popular optimization technique, updates one variable at a time by minimizing a single-variable sub-problem. If one can efficiently solve this sub-problem, it can be a competitive optimization method. Due to the non-differentiability of the primal L1-SVM, Chang et al.'s work is restricted to L2-SVM. Moreover, as the primal L2-SVM is differentiable but not twice differentiable, certain considerations are needed in solving the single-variable sub-problem.

While the dual form (4) involves bound constraints 0 ≤ α_i ≤ U, its objective function is twice differentiable for both L1- and L2-SVM. In this paper, we investigate coordinate descent methods for the dual problem (4). We prove that an ε-optimal solution is obtained in O(log(1/ε)) iterations. We propose an implementation using a random order of sub-problems at each iteration, which leads to very fast training. Experiments indicate that our method is more efficient than the primal coordinate descent method. As Chang et al. (2007) solve the primal, they require easy access to a feature's corresponding data values. However, in practice one often has easier access to the values per instance. Solving the dual takes this advantage, so our implementation is simpler than that of Chang et al. (2007).
Early SVM papers (Mangasarian & Musicant, 1999; Friess et al., 1998) have discussed coordinate descent methods for the SVM dual form. However, they do not focus on large data using the linear kernel. Crammer and Singer (2003) proposed an online setting for multi-class SVM without considering large sparse data. Recently, Bordes et al. (2007) applied a coordinate descent method to multi-class SVM, but they focus on nonlinear kernels. In this paper, we point out that dual coordinate descent methods take crucial advantage of the linear kernel and outperform other solvers when the numbers of data and features are both large.

Coordinate descent methods for (4) are related to the popular decomposition methods for training nonlinear SVM. In this paper, we show their key differences and explain why earlier studies on decomposition methods failed to modify their algorithms in an efficient way like ours for large-scale linear SVM. We also discuss the connection to other linear SVM works such as (Crammer & Singer, 2003; Collins et al., 2008; Shalev-Shwartz et al., 2007).

This paper is organized as follows. In Section 2, we describe our proposed algorithm. Implementation issues are investigated in Section 3. Section 4 discusses the connection to other methods. In Section 5, we compare our method with state of the art implementations for large linear SVM. Results show that the new method is more efficient. Proofs can be found at http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

2. A Dual Coordinate Descent Method

In this section, we describe our coordinate descent method for L1- and L2-SVM. The optimization process starts from an initial point α^0 ∈ R^l and generates a sequence of vectors {α^k}_{k=0}^∞. We refer to the process from α^k to α^{k+1} as an outer iteration. In each outer iteration we have l inner iterations, so that α_1, α_2, ..., α_l are sequentially updated. Each outer iteration thus generates vectors α^{k,i} ∈ R^l, i = 1, ..., l+1, such that α^{k,1} = α^k, α^{k,l+1} = α^{k+1}, and

    α^{k,i} = [α_1^{k+1}, ..., α_{i−1}^{k+1}, α_i^k, ..., α_l^k]^T,  ∀i = 2, ..., l.

For updating α^{k,i} to α^{k,i+1}, we solve the following one-variable sub-problem:

    min_d  f(α^{k,i} + d e_i)   subject to  0 ≤ α_i^k + d ≤ U,                (5)

where e_i = [0, ..., 0, 1, 0, ..., 0]^T. The objective function of (5) is a simple quadratic function of d:

    f(α^{k,i} + d e_i) = (1/2) Q̄_ii d^2 + ∇_i f(α^{k,i}) d + constant,        (6)

where ∇_i f is the ith component of the gradient ∇f. One can easily see that (5) has an optimum at d = 0 (i.e., no need to update α_i) if and only if

    ∇_i^P f(α^{k,i}) = 0,                                                     (7)

where ∇^P f(α) means the projected gradient

    ∇_i^P f(α) = ∇_i f(α)          if 0 < α_i < U,
                 min(0, ∇_i f(α))   if α_i = 0,                               (8)
                 max(0, ∇_i f(α))   if α_i = U.

If (7) holds, we move to the index i+1 without updating α_i^{k,i}. Otherwise, we must find the optimal solution of (5). If Q̄_ii > 0, the solution is easily seen to be:

    α_i^{k,i+1} = min( max( α_i^{k,i} − ∇_i f(α^{k,i}) / Q̄_ii, 0 ), U ).      (9)
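To make the single-variable step concrete, the following is a minimal Python sketch of (7)–(9); it is not the authors' code, and the function name and signature are illustrative. It checks the projected-gradient condition and, if an update is needed, clips the unconstrained minimizer of (6) into [0, U].

```python
def solve_one_variable(alpha_i, grad_i, qbar_ii, U):
    """One dual coordinate step: minimize (6) subject to 0 <= alpha_i + d <= U.

    alpha_i : current value of the i-th dual variable
    grad_i  : the partial derivative nabla_i f(alpha), computed as in (10) or (12)
    qbar_ii : Qbar_ii = x_i^T x_i + D_ii (assumed > 0 here; the Qbar_ii = 0
              corner case is discussed after Algorithm 1)
    U       : upper bound (C for L1-SVM, infinity for L2-SVM)
    """
    # Projected gradient (8); if it is zero, condition (7) says alpha_i is optimal.
    if alpha_i == 0.0:
        pg = min(grad_i, 0.0)
    elif alpha_i == U:
        pg = max(grad_i, 0.0)
    else:
        pg = grad_i
    if pg == 0.0:
        return alpha_i                      # no update needed
    # Closed-form solution (9): Newton step on the quadratic, clipped to [0, U].
    return min(max(alpha_i - grad_i / qbar_ii, 0.0), U)
```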

Algorithm 1  A dual coordinate descent method for linear SVM

  • Given α and the corresponding w = Σ_i y_i α_i x_i.
  • While α is not optimal
      For i = 1, ..., l
        (a) ᾱ_i ← α_i
        (b) G = y_i w^T x_i − 1 + D_ii α_i
        (c) PG = min(G, 0)  if α_i = 0,
            PG = max(G, 0)  if α_i = U,
            PG = G          if 0 < α_i < U
        (d) If |PG| ≠ 0,
              α_i ← min(max(α_i − G/Q̄_ii, 0), U)
              w ← w + (α_i − ᾱ_i) y_i x_i

We thus need to calculate Q̄_ii and ∇_i f(α^{k,i}). First, Q̄_ii = x_i^T x_i + D_ii can be precomputed and stored in the memory. Second, to evaluate ∇_i f(α^{k,i}), we use

    ∇_i f(α) = (Q̄ α)_i − 1 = Σ_{j=1}^{l} Q̄_ij α_j − 1.                        (10)

Q̄ may be too large to be stored, so one calculates Q̄'s ith row when doing (10). If n̄ is the average number of nonzero elements per instance, and O(n̄) is needed for each kernel evaluation, then calculating the ith row of the kernel matrix takes O(l n̄). Such operations are expensive. However, for a linear SVM, we can define

    w = Σ_{j=1}^{l} y_j α_j x_j,                                              (11)

so (10) becomes

    ∇_i f(α) = y_i w^T x_i − 1 + D_ii α_i.                                    (12)

To evaluate (12), the main cost is O(n̄) for calculating w^T x_i. This is much smaller than O(l n̄). To apply (12), w must be maintained throughout the coordinate descent procedure. Calculating w by (11) takes O(l n̄) operations, which is too expensive. Fortunately, if ᾱ_i is the current value and α_i is the value after the update, we can maintain w by

    w ← w + (α_i − ᾱ_i) y_i x_i.                                              (13)

The number of operations is only O(n̄). To have the first w, one can use α^0 = 0, so that w = 0.
In the end, we obtain the optimal w of the primal problem (1), as the primal-dual relationship implies (11).

If Q̄_ii = 0, we have D_ii = 0, Q_ii = x_i^T x_i = 0, and hence x_i = 0. This occurs only in L1-SVM without the bias term by (3). From (12), if x_i = 0, then ∇_i f(α^{k,i}) = −1. As U = C < ∞ for L1-SVM, the solution of (5) makes the new α_i^{k,i+1} = U. We can easily include this case in (9) by setting 1/Q̄_ii = ∞.

Briefly, our algorithm uses (12) to compute ∇_i f(α^{k,i}), checks the optimality of the sub-problem (5) by (7), updates α_i by (9), and then maintains w by (13). A description is in Algorithm 1. The cost per iteration (i.e., from α^k to α^{k+1}) is O(l n̄). The main memory requirement is on storing x_1, ..., x_l. For the convergence, we prove the following theorem using techniques in (Luo & Tseng, 1992):

Theorem 1  For L1-SVM and L2-SVM, {α^{k,i}} generated by Algorithm 1 globally converges to an optimal solution α*. The convergence rate is at least linear: there are 0 < μ < 1 and an iteration k_0 such that

    f(α^{k+1}) − f(α*) ≤ μ ( f(α^k) − f(α*) ),  ∀k ≥ k_0.                     (14)

The global convergence result is quite remarkable. Usually for a convex but not strictly convex problem (e.g., L1-SVM), one can only obtain that any limit point is optimal. We define an ε-accurate solution α if f(α) ≤ f(α*) + ε. By (14), our algorithm obtains an ε-accurate solution in O(log(1/ε)) iterations.

3. Implementation Issues

3.1. Random Permutation of Sub-problems

In Algorithm 1, the coordinate descent algorithm solves the one-variable sub-problems in the order of α_1, ..., α_l. Past results such as (Chang et al., 2007) show that solving sub-problems in an arbitrary order may give faster convergence. This inspires us to randomly permute the sub-problems at each outer iteration. Formally, at the kth outer iteration, we permute {1, ..., l} to {π(1), ..., π(l)}, and solve sub-problems in the order of α_π(1), α_π(2), ..., α_π(l). Similar to Algorithm 1, the algorithm generates a sequence {α^{k,i}} such that α^{k,1} = α^k, α^{k,l+1} = α^{k+1,1}, and

    α_t^{k,i} = α_t^{k+1}  if π_k^{-1}(t) < i,
                α_t^k      if π_k^{-1}(t) ≥ i.

The update from α^{k,i} to α^{k,i+1} is by

    α_t^{k,i+1} = α_t^{k,i} + arg min_{0 ≤ α_t^{k,i}+d ≤ U} f(α^{k,i} + d e_t),   if π_k^{-1}(t) = i.

We prove that Theorem 1 is still valid. Hence, the new setting obtains an ε-accurate solution in O(log(1/ε)) iterations. A simple experiment reveals that this setting of permuting sub-problems is much faster than Algorithm 1. The improvement is also bigger than that observed in (Chang et al., 2007) for primal coordinate descent methods.
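A self-contained sketch of this randomized variant is given below, assuming sparse instances stored as lists of (feature index, value) pairs. It is an illustration of Algorithm 1 with the permutation of this section, not the LIBLINEAR implementation; the function name, data layout, and the simple stopping rule are our own choices.

```python
import random

def train_dual_cd(X, y, C, loss="l1", max_outer=1000, eps=1e-3, n_features=None):
    """Dual coordinate descent for linear L1-/L2-SVM (Algorithm 1 with
    randomly permuted sub-problems).  X[i] is a list of (feature, value) pairs."""
    l = len(X)
    if n_features is None:
        n_features = 1 + max(j for xi in X for j, _ in xi)
    U = C if loss == "l1" else float("inf")           # upper bound in (4)
    D_ii = 0.0 if loss == "l1" else 1.0 / (2.0 * C)   # diagonal term of Qbar
    # Qbar_ii = x_i^T x_i + D_ii is precomputed and stored (Section 2).
    qbar = [sum(v * v for _, v in xi) + D_ii for xi in X]
    alpha = [0.0] * l
    w = [0.0] * n_features                            # alpha = 0  =>  w = 0
    for k in range(max_outer):
        order = list(range(l))
        random.shuffle(order)                         # permute sub-problems
        max_violation = 0.0
        for i in order:
            xi, yi = X[i], y[i]
            # (12): nabla_i f(alpha) = y_i w^T x_i - 1 + D_ii alpha_i, an O(nbar) step.
            G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]
            # Projected gradient (8) for the optimality check (7).
            if alpha[i] == 0.0:
                PG = min(G, 0.0)
            elif alpha[i] == U:
                PG = max(G, 0.0)
            else:
                PG = G
            max_violation = max(max_violation, abs(PG))
            if PG != 0.0:
                old = alpha[i]
                alpha[i] = min(max(alpha[i] - G / qbar[i], 0.0), U)   # (9)
                for j, v in xi:                                       # (13): maintain w
                    w[j] += (alpha[i] - old) * yi * v
        if max_violation < eps:                       # crude stopping rule (ours)
            break
    return w, alpha
```

For example, calling train_dual_cd(X, y, C=1.0, loss="l2") returns (w, alpha) for an L2-SVM on data in this format; a new instance x is then classified by the sign of w^T x.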

Algorithm 2  Coordinate descent algorithm with randomly selecting one instance at a time

  • Given α and the corresponding w = Σ_i y_i α_i x_i.
  • While α is not optimal
      – Randomly choose i ∈ {1, ..., l}.
      – Do steps (a)–(d) of Algorithm 1 to update α_i.
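A minimal sketch of this online variant (used later in Section 3.3) follows; it simply repeats steps (a)–(d) of Algorithm 1 on one randomly chosen index per outer iteration. The names and the fixed update budget are illustrative, and the inner step is the same as in the batch sketch of Section 3.1.

```python
import random

def train_dual_cd_online(X, y, C, loss="l1", max_updates=10**6, n_features=None):
    """Algorithm 2: pick one random index per outer iteration and apply
    steps (a)-(d) of Algorithm 1 to it.  Same data format as train_dual_cd."""
    l = len(X)
    if n_features is None:
        n_features = 1 + max(j for xi in X for j, _ in xi)
    U = C if loss == "l1" else float("inf")
    D_ii = 0.0 if loss == "l1" else 1.0 / (2.0 * C)
    qbar = [sum(v * v for _, v in xi) + D_ii for xi in X]
    alpha = [0.0] * l
    w = [0.0] * n_features
    for _ in range(max_updates):
        i = random.randrange(l)                       # randomly choose i in {1,...,l}
        xi, yi = X[i], y[i]
        G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]   # step (b)
        if alpha[i] == 0.0:                                             # step (c)
            PG = min(G, 0.0)
        elif alpha[i] == U:
            PG = max(G, 0.0)
        else:
            PG = G
        if PG != 0.0:                                                   # step (d)
            old = alpha[i]
            alpha[i] = min(max(alpha[i] - G / qbar[i], 0.0), U)
            for j, v in xi:
                w[j] += (alpha[i] - old) * yi * v
    return w, alpha
```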

3.2. Shrinking

Eq. (4) contains the constraints 0 ≤ α_i ≤ U. If an α_i is 0 or U for many iterations, it may remain the same. To speed up decomposition methods for nonlinear SVM (discussed in Section 4.1), the shrinking technique (Joachims, 1998) reduces the size of the optimization problem without considering some bounded variables. Below we show that it is much easier to apply this technique to linear SVM than to the nonlinear case.

If A is the subset after removing some elements and Ā = {1, ..., l} \ A, then the new problem is

    min_{α_A}  (1/2) α_A^T Q̄_AA α_A + (Q̄_AĀ α_Ā − e_A)^T α_A
    subject to  0 ≤ α_i ≤ U, i ∈ A,                                           (15)

where Q̄_AA, Q̄_AĀ are sub-matrices of Q̄, and α_Ā is considered as a constant vector. Solving this smaller problem consumes less time and memory. Once (15) is solved, we must check if the vector α is optimal for (4). This check needs the whole gradient ∇f(α). Since

    ∇_i f(α) = Q̄_{i,A} α_A + Q̄_{i,Ā} α_Ā − 1,

if i ∈ A and one stores Q̄_{i,Ā} α_Ā before solving (15), we already have ∇_i f(α). However, for all i ∉ A, we must calculate the corresponding rows of Q̄. This step, referred to as the reconstruction of gradients in training nonlinear SVM, is very time consuming. It may cost up to O(l^2 n̄) if each kernel evaluation is O(n̄).

For linear SVM, in solving the smaller problem (15), we still have the vector

    w = Σ_{i∈A} y_i α_i x_i + Σ_{i∈Ā} y_i α_i x_i,

though only the first part Σ_{i∈A} y_i α_i x_i is updated. Therefore, using (12), ∇f(α) is easily available. Below we demonstrate a shrinking implementation so that reconstructing the whole ∇f(α) is never needed.

Our method is related to what LIBSVM (Chang & Lin, 2001) uses. From the optimality condition of bound-constrained problems, α is optimal for (4) if and only if ∇^P f(α) = 0, where ∇^P f(α) is the projected gradient defined in (8). We then prove the following result:

Theorem 2  Let α* be the convergent point of {α^{k,i}}.
  1. If α_i* = 0 and ∇_i f(α*) > 0, then ∃ k_i such that ∀k ≥ k_i, ∀s, α_i^{k,s} = 0.
  2. If α_i* = U and ∇_i f(α*) < 0, then ∃ k_i such that ∀k ≥ k_i, ∀s, α_i^{k,s} = U.
  3. lim_{k→∞} max_j ∇_j^P f(α^{k,j}) = lim_{k→∞} min_j ∇_j^P f(α^{k,j}) = 0.

During the optimization procedure, ∇^P f(α^k) ≠ 0, so in general max_j ∇_j^P f(α^k) > 0 and min_j ∇_j^P f(α^k) < 0. These two values measure how much the current solution violates the optimality condition. In our iterative procedure, what we have are ∇_i f(α^{k,i}), i = 1, ..., l. Hence, at the (k−1)st iteration, we obtain

    M^{k−1} ≡ max_j ∇_j^P f(α^{k−1,j}),   m^{k−1} ≡ min_j ∇_j^P f(α^{k−1,j}).

Then at each inner step of the kth iteration, before updating α_i^{k,i} to α_i^{k,i+1}, this element is shrunken if one of the following two conditions holds:

    α_i^{k,i} = 0 and ∇_i f(α^{k,i}) > M̄^{k−1},
    α_i^{k,i} = U and ∇_i f(α^{k,i}) < m̄^{k−1},                               (16)

where

    M̄^{k−1} = M^{k−1} if M^{k−1} > 0,  ∞ otherwise,
    m̄^{k−1} = m^{k−1} if m^{k−1} < 0,  −∞ otherwise.

In (16), M̄^{k−1} must be strictly positive, so we set it to ∞ if M^{k−1} ≤ 0. From Theorem 2, elements satisfying the "if condition" of properties 1 and 2 meet (16) after a certain number of iterations, and are then correctly removed from the optimization. To have more aggressive shrinking, one may multiply both M̄^{k−1} and m̄^{k−1} in (16) by a threshold smaller than one.

Property 3 of Theorem 2 indicates that, with a tolerance ε,

    M^k − m^k < ε                                                             (17)

is satisfied after a finite number of iterations. Hence (17) is a valid stopping condition. We also use it for the smaller problems (15). If at the kth iteration (17) is reached for (15), we enlarge A to {1, ..., l}, set M̄^k = ∞ and m̄^k = −∞ (so no shrinking occurs at the (k+1)st iteration), and continue regular iterations. Thus, we do shrinking without reconstructing gradients.
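The sketch below weaves the shrinking conditions (16) and the stopping rule (17) into the permuted loop from Section 3.1. It is our illustrative reading of the scheme (the aggressive-shrinking threshold is omitted), not the LIBLINEAR code.

```python
import random

def train_dual_cd_shrink(X, y, C, loss="l1", max_outer=1000, eps=1e-3, n_features=None):
    """Dual coordinate descent with the shrinking of Section 3.2 and stopping
    rule (17).  Same data format as train_dual_cd; an illustrative sketch."""
    l = len(X)
    if n_features is None:
        n_features = 1 + max(j for xi in X for j, _ in xi)
    U = C if loss == "l1" else float("inf")
    D_ii = 0.0 if loss == "l1" else 1.0 / (2.0 * C)
    qbar = [sum(v * v for _, v in xi) + D_ii for xi in X]
    alpha = [0.0] * l
    w = [0.0] * n_features
    active = list(range(l))
    M_bar, m_bar = float("inf"), float("-inf")        # no shrinking at the first iteration
    for k in range(max_outer):
        random.shuffle(active)
        M, m = float("-inf"), float("inf")            # running max/min projected gradient
        pos = 0
        while pos < len(active):
            i = active[pos]
            xi, yi = X[i], y[i]
            G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]   # (12)
            # Shrinking conditions (16): bounded variables that cannot move are removed.
            if (alpha[i] == 0.0 and G > M_bar) or (alpha[i] == U and G < m_bar):
                active[pos] = active[-1]
                active.pop()
                continue
            if alpha[i] == 0.0:
                PG = min(G, 0.0)
            elif alpha[i] == U:
                PG = max(G, 0.0)
            else:
                PG = G
            M, m = max(M, PG), min(m, PG)
            if PG != 0.0:
                old = alpha[i]
                alpha[i] = min(max(alpha[i] - G / qbar[i], 0.0), U)         # (9)
                for j, v in xi:
                    w[j] += (alpha[i] - old) * yi * v                       # (13)
            pos += 1
        if M - m < eps:                                # stopping condition (17)
            if len(active) == l:
                break                                  # optimal for the full problem (4)
            active = list(range(l))                    # un-shrink and continue
            M_bar, m_bar = float("inf"), float("-inf")
        else:
            M_bar = M if M > 0 else float("inf")
            m_bar = m if m < 0 else float("-inf")
    return w, alpha
```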
3.3. An Online Setting

In some applications, the number of instances is huge, so going over all of α_1, ..., α_l causes an expensive outer iteration. Instead, one can randomly choose an index i_k at a time, and update only α_{i_k} at the kth outer iteration. A description is in Algorithm 2. The setting is related to (Crammer & Singer, 2003; Collins et al., 2008). See also the discussion in Section 4.2.

4. Relations with Other Methods

Table 1. A comparison between decomposition methods (Decomp.) and dual coordinate descent (DCD). For both methods, we consider that one α_i is updated at a time. We assume Decomp. maintains gradients, but DCD does not. The average number of nonzeros per instance is n̄.

                        Nonlinear SVM           Linear SVM
                        Decomp.     DCD         Decomp.     DCD
    Update α_i          O(1)        O(l n̄)      O(1)        O(n̄)
    Maintain ∇f(α)      O(l n̄)      NA          O(l n̄)      NA

4.1. Decomposition Methods for Nonlinear SVM

Decomposition methods are one of the most popular approaches for training nonlinear SVM. As the kernel matrix is dense and cannot be stored in the computer memory, decomposition methods solve a sub-problem of a few variables at each iteration. Only a small number of corresponding kernel columns are needed, so the memory problem is resolved. If the number of variables is restricted to one, a decomposition method is like the online coordinate descent in Section 3.3, but it differs in the way it selects variables for updating. It has been shown (Keerthi & DeCoste, 2005) that, for linear SVM, decomposition methods are inefficient. On the other hand, here we are pointing out that dual coordinate descent is efficient for linear SVM. Therefore, it is important to discuss the relationship between decomposition methods and our method.

In early decomposition methods (Osuna et al., 1997; Platt, 1998), variables minimized at an iteration are selected by certain heuristics. However, subsequent developments (Joachims, 1998; Chang & Lin, 2001; Keerthi et al., 2001) all use gradient information to conduct the selection.
The main reason is that maintaining the whole gradient does not introduce extra cost. Here we explain the details by assuming that one variable of α is chosen and updated at a time (solvers like LIBSVM update at least two variables due to a linear constraint in their dual problems; here (4) has no such constraint, so selecting one variable is possible). To set up and solve the sub-problem (6), one uses (10) to calculate ∇_i f(α). If O(n̄) effort is needed for each kernel evaluation, obtaining the ith row of the kernel matrix takes O(l n̄) effort. If instead one maintains the whole gradient, then ∇_i f(α) is directly available. After updating α_i^{k,i} to α_i^{k,i+1}, we obtain Q̄'s ith column (same as the ith row due to the symmetry of Q̄), and calculate the new whole gradient:

    ∇f(α^{k,i+1}) = ∇f(α^{k,i}) + Q̄_{:,i} (α_i^{k,i+1} − α_i^{k,i}),          (18)

where Q̄_{:,i} is the ith column of Q̄. The cost is O(l n̄) for Q̄_{:,i} and O(l) for (18). Therefore, maintaining the whole gradient does not cost more. As using the whole gradient implies fewer iterations (i.e., faster convergence due to the ability to choose for updating the variable that violates optimality most), one should take this advantage. However, the situation for linear SVM is very different. With the different way (12) to calculate ∇_i f(α), the cost to update one α_i is only O(n̄). If we still maintained the whole gradient, evaluating (12) l times would take O(l n̄) effort. We gather this comparison of the different situations in Table 1. Clearly, for nonlinear SVM, one should use decomposition methods and maintain the whole gradient. However, for linear SVM, if l is large, the cost per iteration without maintaining gradients is much smaller than that with. Hence, the coordinate descent method can be faster than the decomposition method by using many cheap iterations. An earlier attempt to speed up decomposition methods for linear SVM is (Kao et al., 2004). However, it failed to derive our method here because it does not give up maintaining gradients.

4.2. Existing Linear SVM Methods

We discussed in Section 1 and other places the difference between our method and a primal coordinate descent method (Chang et al., 2007). Below we describe the relations with other linear SVM methods.

We mentioned in Section 3.3 that our Algorithm 2 is related to the online mode in (Collins et al., 2008). They aim at solving multi-class and structured problems. At each iteration an instance is used; then a sub-problem of several variables is solved. They approximately minimize the sub-problem, but for the two-class case, one can exactly solve it by (9). For the batch setting, our approach is different from theirs. The algorithm for multi-class problems in (Crammer & Singer, 2003) is also similar to our online setting. For the two-class case, it solves (1) with the loss function max(−y_i w^T x_i, 0), which is different from (2). They do not study data with a large number of features.

Next, we discuss the connection to stochastic gradient descent (Shalev-Shwartz et al., 2007; Bottou, 2007). The most important step of this method is the following update of w:

    w ← w − η ∇_w(y_i, x_i),                                                  (19)

where ∇_w(y_i, x_i) is the sub-gradient of the approximate objective function

    w^T w / 2 + C max(1 − y_i w^T x_i, 0),

and η is the learning rate (or the step size). While our method is dual-based, throughout the iterations we maintain w by (13). Both (13) and (19) use one single instance x_i, but they take different directions y_i x_i and ∇_w(y_i, x_i). The selection of the learning rate η may be the subtlest thing in stochastic gradient descent, but for our method this is never a concern. The step size (α_i − ᾱ_i) in (13) is governed by solving a sub-problem from the dual.
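To make the contrast concrete, here are the two single-instance updates side by side for a dense w and a sparse x_i, as an illustrative sketch; a fixed η is passed in, which is not Pegasos' actual learning-rate schedule.

```python
def sgd_primal_step(w, xi, yi, C, eta):
    """One stochastic (sub)gradient step (19) on  w^T w / 2 + C max(1 - y_i w^T x_i, 0).
    eta is a learning rate the user must choose/schedule."""
    margin = yi * sum(w[j] * v for j, v in xi)   # evaluated at the current w
    for j in range(len(w)):
        w[j] -= eta * w[j]                       # gradient of w^T w / 2
    if margin < 1.0:                             # sub-gradient of the hinge term
        for j, v in xi:
            w[j] += eta * C * yi * v

def dual_cd_step(w, xi, yi, alpha, i, qbar_ii, D_ii, U):
    """One dual coordinate step: the 'step size' alpha_i - alpha_i_old comes from
    exactly solving the one-variable sub-problem; no learning rate is involved."""
    G = yi * sum(w[j] * v for j, v in xi) - 1.0 + D_ii * alpha[i]      # (12)
    old = alpha[i]
    alpha[i] = min(max(alpha[i] - G / qbar_ii, 0.0), U)                # (9)
    for j, v in xi:
        w[j] += (alpha[i] - old) * yi * v                              # (13)
```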
5. Experiments

In this section, we analyze the performance of our dual coordinate descent algorithm for L1- and L2-SVM. We compare our implementation with state of the art linear SVM solvers. We also investigate how the shrinking technique improves our algorithm.

Table 2. Left: statistics of the data sets (l is the number of instances, n the number of features). Right: training time in seconds for each solver to reduce the primal objective value to within 1% of the optimal value; see (20). DCDL1, Pegasos, and SVMperf train L1-SVM; DCDL2, PCD, and TRON train L2-SVM.

    Data set        l        n          # nonzeros     DCDL1  Pegasos  SVMperf   DCDL2  PCD   TRON
    a9a             32,561   123        451,592        0.2    1.1      6.0       0.4    0.1   0.1
    astro-physic    62,369   99,757     4,834,550      0.2    2.8      2.6       0.2    0.5   1.2
    real-sim        72,309   20,958     3,709,083      0.2    2.4      2.4       0.1    0.2   0.9
    news20          19,996   1,355,191  9,097,916      0.5    10.3     20.0      0.2    2.4   5.2
    yahoo-japan     176,203  832,026    23,506,415     1.1    12.7     69.4      1.0    2.9   38.2
    rcv1            677,399  47,236     49,556,258     2.6    21.9     72.0      2.7    5.1   18.6
    yahoo-korea     460,554  3,052,939  156,436,656    8.3    79.7     656.8     7.1    18.4  286.1

Table 2 lists the statistics of the data sets. Four of them (a9a, real-sim, news20, rcv1) are at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets. The set astro-physic is available upon request from Thorsten Joachims. Except a9a, all others are from document classification. Past results show that linear SVM performs as well as kernelized ones for such data. To estimate the testing accuracy, we use a stratified selection to split each set into 4/5 training and 1/5 testing. We briefly describe each set below. Details can be found in (Joachims, 2006) (astro-physic) and (Lin et al., 2008) (others). a9a is from the UCI "adult" data set. real-sim includes Usenet articles. astro-physic includes documents from Physics ArXiv. news20 is a collection of news documents. yahoo-japan and yahoo-korea are obtained from Yahoo!. rcv1 is an archive of manually categorized newswire stories from Reuters.

We compare six implementations of linear SVM. Three solve L1-SVM, and three solve L2-SVM.

DCDL1 and DCDL2: the dual coordinate descent method with sub-problems permuted at each outer iteration (see Section 3.1). DCDL1 solves L1-SVM while DCDL2 solves L2-SVM. We omit the shrinking setting.

Pegasos: the primal estimated sub-gradient solver (Shalev-Shwartz et al., 2007) for L1-SVM. The source is at http://ttic.uchicago.edu/~shai/code.

SVMperf (Joachims, 2006): a cutting plane method for L1-SVM. We use the latest version 2.1. The source is at http://svmlight.joachims.org/svm_perf.html.

TRON: a trust region Newton method (Lin et al., 2008) for L2-SVM. We use the software LIBLINEAR version 1.22 with option -s 2 (http://www.csie.ntu.edu.tw/~cjlin/liblinear).

PCD: a primal coordinate descent method for L2-SVM (Chang et al., 2007).

Since (Bottou, 2007) is related to Pegasos, we do not present its results. We do not compare with another online method, Vowpal Wabbit (Langford et al., 2007), either, as it has been made available only very recently. Though a code for the bundle method (Smola et al., 2008) is available, we do not include it for comparison due to its closeness to SVMperf. All sources used for our comparisons are available at http://csie.ntu.edu.tw/~cjlin/liblinear/exp.html.

We set the penalty parameter C = 1 for the comparison (the equivalent setting for Pegasos is λ = 1/(Cl); for SVMperf, its penalty parameter is C_perf = 0.01Cl). For all data sets, the testing accuracy does not increase after C ≥ 4. All the above methods are implemented in C/C++ with double precision. Some implementations such as (Bottou, 2007) use single precision to reduce training time, but numerical inaccuracy may occur. We do not include the bias term by (3).

To compare these solvers, we consider the CPU time of reducing the relative difference between the primal objective value and the optimum to within 0.01:

    |f^P(w) − f^P(w*)| / |f^P(w*)| ≤ 0.01,                                    (20)

where f^P is the objective function of (1), and f^P(w*) is the optimal value. Note that for consistency, we use primal objective values even for the dual solvers. The reference solutions of L1- and L2-SVM are respectively obtained by solving DCDL1 and DCDL2 until the duality gaps are less than 10^−6.
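As a small illustration of this criterion, (20) can be computed as below; f_opt stands for the reference optimum f^P(w*) obtained from a very accurately solved run, and the helper names are ours.

```python
def primal_objective(w, X, y, C, loss="l1"):
    """f^P(w) in (1): 0.5 * w^T w + C * sum of L1 or L2 hinge losses."""
    obj = 0.5 * sum(wj * wj for wj in w)
    for xi, yi in zip(X, y):
        slack = max(1.0 - yi * sum(w[j] * v for j, v in xi), 0.0)
        obj += C * (slack if loss == "l1" else slack * slack)
    return obj

def relative_primal_error(w, X, y, C, f_opt, loss="l1"):
    """Left-hand side of (20): |f^P(w) - f^P(w*)| / |f^P(w*)|."""
    return abs(primal_objective(w, X, y, C, loss) - f_opt) / abs(f_opt)
```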

Table 2 lists the results. Clearly, our dual coordinate descent method for both L1- and L2-SVM is significantly faster than the other solvers. To check details, we choose astro-physic, news20, and rcv1, and show the relative error along time in Figure 1. In Section 3.2, we pointed out that the shrinking technique is very suitable for DCD. In Figure 1, we also include DCDL1-S and DCDL2-S (DCDL1 and DCDL2 with shrinking) for comparison. As in Table 2, our solvers are efficient for both L1- and L2-SVM. With shrinking, the performance is even better.

[Figure 1. Time versus the relative error (20), for L1- and L2-SVM on astro-physic, news20, and rcv1. DCDL1-S and DCDL2-S are DCDL1 and DCDL2 with shrinking. The dotted line indicates the relative error 0.01. Time is in seconds.]

Another evaluation is to consider how fast a solver obtains a model with reasonable testing accuracy. Using the optimal solutions from the above experiment, we generate the reference models for L1- and L2-SVM. We evaluate the testing accuracy difference between the current model and the reference model along the training time. Figure 2 shows the results. Overall, DCDL1 and DCDL2 are more efficient than the other solvers. Note that we omit DCDL1-S and DCDL2-S in Figure 2, as the performances with/without shrinking are similar.

[Figure 2. Time versus the difference of testing accuracy between the current model and the reference model (obtained using strict stopping conditions), for L1- and L2-SVM on astro-physic, news20, and rcv1. Time is in seconds.]

Among L1-SVM solvers, SVMperf is competitive with Pegasos for small data. But in the case of a huge number of instances, Pegasos outperforms SVMperf. However, Pegasos has slower convergence than DCDL1. As discussed in Section 4.2, the learning rate of stochastic gradient descent may be the cause, but for DCDL1 we exactly solve sub-problems to obtain the step size in updating w. Also, Pegasos has a jumpy test set performance while DCDL1 gives a stable behavior.

In the comparison of L2-SVM solvers, DCDL2 and PCD are both coordinate descent methods. The former is applied to the dual, but the latter to the primal. DCDL2 has a closed form solution for each sub-problem, but PCD does not. The cost per PCD outer iteration is thus higher than that of DCDL2. Therefore, while PCD is very competitive (only second to DCDL1/DCDL2 in Table 2), DCDL2 is even better. Regarding TRON, as a Newton method, it possesses fast final convergence. However, since it takes significant effort at each iteration, it hardly generates a reasonable model quickly. From the experimental results, DCDL2 converges as fast as TRON, but also performs well in early iterations.

Due to the space limitation, we give the following observations without details. First, Figure 1 indicates that our coordinate descent method converges faster for L2-SVM than for L1-SVM. As L2-SVM has the diagonal matrix D with D_ii = 1/(2C), we suspect that its Q̄ is better conditioned, and hence leads to faster convergence. Second, all methods have slower convergence when C is large. However, small C's are usually enough as the accuracy is stable after a threshold. In practice, one thus should try from a small C. Moreover, if n ≪ l and C is too large, then our DCDL2 is slower than TRON or PCD (see problem a9a in Table 2, where the accuracy does not change after C ≥ 0.25). If n ≪ l, clearly one should solve the primal, whose number of variables is just n.
Such data are not our focus. Indeed, with a small number of features, one usually maps data to a higher-dimensional space and trains a nonlinear SVM. Third, we have checked the online Algorithm 2. Its performance is similar to DCDL1 and DCDL2 (i.e., the batch setting without shrinking). Fourth, we have investigated real document classification, which involves many two-class problems. Using the proposed method as the solver is more efficient than using others.

6. Discussion and Conclusions

We can apply the proposed method to solve regularized least square problems, which have the loss function (1 − y_i w^T x_i)^2 in (1). The dual is simply (4) without constraints, so the implementation is simpler.

In summary, we present and analyze an efficient dual coordinate descent method for large linear SVM. It is very simple to implement, and possesses sound optimization properties. Experiments show that our method is faster than state of the art implementations.

References

Bordes, A., Bottou, L., Gallinari, P., & Weston, J. (2007). Solving multiclass support vector machines with LaRank. ICML.

Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. COLT.

Bottou, L. (2007). Stochastic gradient descent examples. http://leon.bottou.org/projects/sgd.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chang, K.-W., Hsieh, C.-J., & Lin, C.-J. (2007). Coordinate descent method for large-scale L2-loss linear SVM (Technical Report). http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.

Collins, M., Globerson, A., Koo, T., Carreras, X., & Bartlett, P. (2008). Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR. To appear.

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. JMLR, 3, 951–991.

Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. ICML.

Joachims, T. (1998). Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning. Cambridge, MA: MIT Press.

Joachims, T. (2006). Training linear SVMs in linear time. ACM KDD.

Kao, W.-C., Chung, K.-M., Sun, C.-L., & Lin, C.-J. (2004). Decomposition methods for linear support vector machines. Neural Comput., 16, 1689–1704.

Keerthi, S. S., & DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. JMLR, 6, 341–361.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput., 13, 637–649.

Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit. http://hunch.net/~vw.

Lin, C.-J., Weng, R. C., & Keerthi, S. S. (2008). Trust region Newton method for large-scale logistic regression. JMLR, 9, 623–646.

Luo, Z.-Q., & Tseng, P. (1992). On the convergence of coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72, 7–35.

Mangasarian, O. L., & Musicant, D. R. (1999). Successive overrelaxation for support vector machines. IEEE Trans. Neural Networks, 10, 1032–1037.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. CVPR.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods - Support Vector Learning. Cambridge, MA: MIT Press.

Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: primal estimated sub-gradient solver for SVM. ICML.

Smola, A. J., Vishwanathan, S. V. N., & Le, Q. (2008). Bundle methods for machine learning. NIPS.
Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. ICML.