Very Large SVM Training using Core Vector Machines

Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong

Abstract

Standard SVM training has O(m³) time and O(m²) space complexities, where m is the training set size. In this paper, we scale up kernel methods by exploiting the "approximateness" in practical SVM implementations. We formulate many kernel methods as equivalent minimum enclosing ball problems in computational geometry, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. Our proposed Core Vector Machine (CVM) algorithm has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is much faster and can handle much larger data sets than existing scale-up methods. In particular, on our PC with only 512M RAM, the CVM with Gaussian kernel can process the checkerboard data set with 1 million points in less than 13 seconds.

1 Introduction

In recent years, there has been a lot of interest in using kernels in various machine learning problems, with the support vector machine (SVM) being the most prominent example. Many of these kernel methods are formulated as quadratic programming (QP) problems. Denote the number of training patterns by m. The training time complexity of QP is O(m³) and its space complexity is at least quadratic. Hence, a major stumbling block is in scaling up these QP's to large data sets, such as those commonly encountered in data mining applications.

To reduce the time and space complexities, a popular technique is to obtain low-rank approximations of the kernel matrix, by using the Nyström method (Williams & Seeger, 2001), greedy approximation (Smola & Schölkopf, 2000) or matrix decompositions (Fine & Scheinberg, 2001). However, on very large data sets, the resulting rank of the kernel matrix may still be too high to be handled efficiently.

Another approach to scale up kernel methods is by chunking or more sophisticated decomposition methods. However, chunking needs to optimize the entire set of non-zero Lagrange multipliers that have been identified, and the resultant kernel matrix may still be too large to fit into memory. Osuna et al. (1997) suggested optimizing only a fixed-size subset of the training data (the working set) each time, while the variables corresponding to the other patterns are frozen. Going to the extreme, the sequential minimal optimization (SMO) algorithm (Platt, 1999) breaks a large QP into a series of smallest possible QPs, each involving only two variables. In the context of classification, Mangasarian and Musicant (2001) proposed the Lagrangian SVM (LSVM), which avoids the QP (or LP) altogether; instead, the solution is obtained by a fast iterative scheme. However, for nonlinear kernels (which are the focus in this paper), it still requires the inversion of an m × m matrix. Further speed-up is possible by employing the reduced SVM (RSVM) (Lee & Mangasarian, 2001), which uses a rectangular subset of the kernel matrix. However, this may lead to performance degradation (Lin & Lin, 2003).

In practice, state-of-the-art SVM implementations typically have a training time complexity that scales between O(m) and O(m^2.3) (Platt, 1999). This can be further driven down to O(m) with the use of a parallel mixture (Collobert et al., 2002). However, these are only empirical observations, not theoretical guarantees. For reliable scaling behavior to very large data sets, our goal is to develop an algorithm that can be proved (using tools in the analysis of algorithms) to be asymptotically efficient in both time and space.

Moreover, practical SVM implementations, as in many numerical routines, only approximate the optimal solution by an iterative strategy. Typically, the stopping criterion utilizes either the precision of the Lagrange multipliers (e.g., (Joachims, 1999; Platt, 1999)) or the duality gap (e.g., (Smola & Schölkopf, 2004)). However, while approximation algorithms (with provable performance guarantees) have been extensively used in tackling computationally difficult problems such as NP-complete problems (Garey & Johnson, 1979), such "approximateness" has never been exploited in the design of SVM implementations.
In this paper, we first transform the SVM optimization problem (with a possibly nonlinear kernel) to the minimum enclosing ball (MEB) problem in computational geometry. The MEB problem computes the ball of minimum radius enclosing a given set of points (or, more generally, balls). Traditional algorithms for finding exact MEBs do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms. Lately, a breakthrough was obtained by Bădoiu and Clarkson (2002), who showed that a (1 + ε)-approximation of the MEB can be efficiently obtained using core-sets. Generally speaking, in an optimization problem, a core-set is a subset of the input points such that we can get a good approximation (with an approximation ratio¹ specified by a user-defined parameter ε) to the original input by solving the optimization problem directly on the core-set. Moreover, a surprising property shown in (Bădoiu & Clarkson, 2002) is that the size of the core-set is independent of both d and the size of the point set. This independence of d is important when applying the algorithm to kernel methods (Section 3), as the kernel-induced feature space can be infinite-dimensional. As for the independence of m, it allows both the time and space complexities of our algorithm to grow slowly, as will be shown in Section 4.3.

¹Let C be the cost (or value of the objective function) of the solution returned by an approximate algorithm, and C* be the cost of the optimal solution. Then the approximate algorithm has an approximation ratio ρ(n) for an input size n if max(C/C*, C*/C) ≤ ρ(n). Intuitively, this measures how bad the approximate solution is compared with the optimal one. A large (small) approximation ratio means the solution is much worse than (more or less the same as) the optimal solution. Observe that ρ(n) is always ≥ 1. If the ratio does not depend on n, we may just write ρ and call the algorithm a ρ-approximation algorithm.

Inspired by this core-set-based approximate MEB algorithm, we develop an approximation algorithm for SVM training that has an approximation ratio of (1 + ε)². Its time complexity is linear in m, while its space complexity is independent of m. The rest of this paper is organized as follows. Section 2 gives a short introduction on the MEB problem and its approximation algorithm.
The connection between kernel methods and the MEB problem is given in Section 3. Section 4 then describes our proposed Core Vector Machine (CVM) algorithm. Experimental results are presented in Section 5, and the last section gives some concluding remarks.

2 MEB in Computational Geometry

Given a set of points S = {x_1, ..., x_m}, where each x_i ∈ R^d, the minimum enclosing ball of S (denoted MEB(S)) is the smallest ball that contains all the points in S. The MEB problem has found applications in diverse areas such as computer graphics (e.g., collision detection, visibility culling), machine learning (e.g., similarity search) and facility location problems.

Here, we will focus on approximate MEB algorithms based on core-sets. Let B(c, R) be the ball with center c and radius R. Given ε > 0, a ball B(c, (1 + ε)R) is a (1 + ε)-approximation of MEB(S) if R ≤ r_MEB(S) and S ⊂ B(c, (1 + ε)R). A subset X ⊆ S is a core-set of S if an expansion by a factor (1 + ε) of its MEB contains S, i.e., S ⊂ B(c, (1 + ε)r), where B(c, r) = MEB(X) (Figure 1).

Figure 1: The inner circle is the MEB of the set of squares, and its (1 + ε) expansion (the outer circle) covers all the points. The set of squares is thus a core-set.

To obtain such a (1 + ε)-approximation, Bădoiu and Clarkson (2002) proposed a simple iterative scheme: at the t-th iteration, the current estimate B(c_t, r_t) is expanded incrementally by including the furthest point outside the (1 + ε)-ball B(c_t, (1 + ε)r_t). This is repeated until all the points in S are covered by B(c_t, (1 + ε)r_t). Despite its simplicity, Bădoiu and Clarkson (2002) showed that the number of iterations, and hence the size of the final core-set, depends only on ε but not on d or m.
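To make the scheme concrete, here is a minimal Euclidean-space sketch in Python/NumPy. It is not the authors' implementation: the MEB of the current core-set is obtained by solving the small SVDD dual of Section 3.1 (with a linear kernel), using SciPy's general-purpose SLSQP solver as a stand-in for a dedicated QP solver, and the helper names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def meb_of_coreset(P):
    """MEB of the rows of P via the SVDD dual (Section 3.1, linear kernel):
    max alpha' diag(K) - alpha' K alpha  s.t.  alpha >= 0, sum(alpha) = 1."""
    K = P @ P.T
    dK = np.diag(K)
    n = len(P)
    res = minimize(lambda a: a @ K @ a - a @ dK,          # minimize the negated dual
                   np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    alpha = res.x
    c = alpha @ P                                          # center, as in Eq. (3)
    R = np.sqrt(max(alpha @ dK - alpha @ K @ alpha, 0.0))  # radius, as in Eq. (3)
    return c, R

def approx_meb(X, eps=1e-2):
    """(1+eps)-approximate MEB of the rows of X by the core-set scheme."""
    core = [0]                           # start from an arbitrary point
    c, R = X[0].astype(float), 0.0
    while True:
        d = np.linalg.norm(X - c, axis=1)
        far = int(np.argmax(d))
        if d[far] <= (1.0 + eps) * R or far in core:
            return c, R, core            # covered, up to the inner solver's tolerance
        core.append(far)                 # add the furthest point to the core-set
        c, R = meb_of_coreset(X[core])   # re-solve the MEB of the small core-set
```

In the kernel setting of Sections 3 and 4, the only change is that the inner products above are replaced by (transformed) kernel evaluations.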

3 MEB Problems and Kernel Methods

Obviously, the MEB is equivalent to the hard-margin support vector data description (SVDD) (Tax & Duin, 1999), which will be briefly reviewed in Section 3.1. The MEB problem can also be used for finding the radius component of the radius-margin bound (Chapelle et al., 2002). Thus, as pointed out by Kumar et al. (2003), the MEB problem is useful in support vector clustering and SVM parameter tuning. However, we will show in Section 3.2 that other kernel-related problems, including the training of soft-margin one-class and two-class L2-SVMs, can also be viewed as MEB problems.

3.1 Hard-Margin SVDD

Given a kernel k with the associated feature map φ, let the MEB in the kernel-induced feature space be B(c, R). The primal problem in the hard-margin SVDD is

  \min_{R, c} R^2 \;:\; \|c - \varphi(x_i)\|^2 \le R^2, \quad i = 1, \ldots, m.   (1)

The corresponding dual is

  \max_{\alpha} \; \alpha'\,\mathrm{diag}(K) - \alpha' K \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1,   (2)

where α = [α_1, ..., α_m]′ are the Lagrange multipliers, 0 = [0, ..., 0]′, 1 = [1, ..., 1]′ and K_{m×m} = [k(x_i, x_j)] = [φ(x_i)′φ(x_j)] is the kernel matrix. As is well known, this is a QP problem. The primal variables can be recovered from the optimal α as

  c = \sum_{i=1}^{m} \alpha_i \varphi(x_i), \qquad R = \sqrt{\alpha'\,\mathrm{diag}(K) - \alpha' K \alpha}.   (3)

3.2 Viewing Kernel Methods as MEB Problems

In this paper, we consider the situation where

  k(x, x) = \kappa,   (4)

a constant². This is the case when either (1) the isotropic kernel k(x, y) = K(‖x − y‖) (e.g., the Gaussian kernel) is used; or (2) the dot-product kernel k(x, y) = K(x′y) (e.g., the polynomial kernel) is used with normalized inputs; or (3) any normalized kernel k(x, y) = K(x, y)/(√K(x, x) √K(y, y)) is used. Using the condition α′1 = 1 in (2), we have α′diag(K) = κ. Dropping this constant term from the dual objective in (2), we obtain the simpler optimization problem

  \max_{\alpha} \; -\alpha' K \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1.   (5)

Conversely, when the kernel k satisfies (4), QP's of the form (5) can always be regarded as a MEB problem (1). Note that (2) and (5) yield the same set of α's. Moreover, if d_1* and d_2* denote the optimal dual objectives in (2) and (5) respectively, then, obviously,

  d_1^* = d_2^* + \kappa.   (6)

²In this case, it can be shown that the hard (soft) margin SVDD yields the same solution as the hard (soft) margin one-class SVM (Schölkopf et al., 2001). Moreover, the weight w in the one-class SVM solution is equal to the center c in the SVDD solution.
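To spell out the step from (2) to (5): under (4), every diagonal entry of K equals κ, so on the feasible set of (2) the linear term is a constant,

  \alpha'\,\mathrm{diag}(K) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x_i) = \kappa \sum_{i=1}^{m} \alpha_i = \kappa,

and (2) and (5) therefore differ only by this additive κ, which is exactly the relation recorded in (6).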
In the following, we will show that when (4) is satisfied, the duals in a number of kernel methods can be rewritten in the form of (5). While the 1-norm error has been commonly used for the SVM, our main focus will be on the 2-norm error. In theory, this could be less robust in the presence of outliers. However, experimentally, its generalization performance is often comparable to that of the L1-SVM (e.g., (Lee & Mangasarian, 2001; Mangasarian & Musicant, 2001)). Besides, the 2-norm error is more advantageous here because a soft-margin L2-SVM can be transformed to a hard-margin one. While the 2-norm error has been used in classification (Section 3.2.2), we will also extend its use for novelty detection (Section 3.2.1).

3.2.1 One-Class L2-SVM

Given a set of unlabeled patterns {z_i}_{i=1}^m, where z_i only has the input part x_i, the one-class L2-SVM separates the outliers from the normal data by solving the primal problem

  \min_{w, \rho, \xi_i} \|w\|^2 - 2\rho + C \sum_{i=1}^{m} \xi_i^2 \;:\; w'\varphi(x_i) \ge \rho - \xi_i,

where w′φ(x) = ρ is the desired hyperplane and C is a user-defined parameter. Note that the constraints ξ_i ≥ 0 are not needed for the L2-SVM. The corresponding dual is

  \max \; -\alpha' \Big( K + \tfrac{1}{C} I \Big) \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1
  \;=\; \max \; -\alpha' \tilde{K} \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1,   (7)

where I is the m × m identity matrix and K̃ = [k̃(z_i, z_j)] = [k(x_i, x_j) + δ_ij/C]. It is thus of the form in (5). Since k(x, x) = κ, k̃(z, z) = κ + 1/C ≡ κ̃ is also a constant. This one-class SVM thus corresponds to the MEB problem (1), in which φ is replaced by the nonlinear map φ̃ satisfying φ̃(z_i)′φ̃(z_j) = k̃(z_i, z_j). From the Karush-Kuhn-Tucker (KKT) conditions, we can recover w = Σ_{i=1}^m α_i φ(x_i) and ξ_i = α_i/C, and ρ = w′φ(x_i) + α_i/C from any support vector x_i.

3.2.2 Two-Class L2-SVM

Given a training set {z_i = (x_i, y_i)}_{i=1}^m with y_i ∈ {−1, 1}, the primal of the two-class L2-SVM is

  \min_{w, b, \rho, \xi_i} \|w\|^2 + b^2 - 2\rho + C \sum_{i=1}^{m} \xi_i^2
  \;\text{s.t.}\; y_i (w'\varphi(x_i) + b) \ge \rho - \xi_i.   (8)

The corresponding dual is

  \max_{\mathbf{0} \le \alpha} \; -\alpha' \Big( K \odot yy' + yy' + \tfrac{1}{C} I \Big) \alpha \;:\; \alpha'\mathbf{1} = 1
  \;=\; \max \; -\alpha' \tilde{K} \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1,   (9)

where ⊙ denotes the Hadamard product, y = [y_1, ..., y_m]′ and K̃ = [k̃(z_i, z_j)] with

  \tilde{k}(z_i, z_j) = y_i y_j k(x_i, x_j) + y_i y_j + \frac{\delta_{ij}}{C},   (10)

involving both input and label information. Again, this is of the form in (5), with k̃(z, z) = κ + 1 + 1/C ≡ κ̃, a constant. Again, we can recover

  w = \sum_{i=1}^{m} \alpha_i y_i \varphi(x_i), \qquad b = \sum_{i=1}^{m} \alpha_i y_i, \qquad \xi_i = \frac{\alpha_i}{C},   (11)

from the optimal α, and ρ = y_i(w′φ(x_i) + b) + α_i/C from any support vector z_i. Note that all the support vectors of this L2-SVM, including those defining the margin and those that are misclassified, now reside on the surface of the ball in the feature space induced by k̃. A similar relationship connecting one-class classification and binary classification is also described in (Schölkopf et al., 2001).
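As a quick illustration of (10) (our own sketch, not part of the paper's implementation; the function name is ours), the transformed kernel matrix K̃ can be assembled directly from K, y and C, and its diagonal is then the constant κ̃ = κ + 1 + 1/C, as required by (4):

```python
import numpy as np

def transformed_kernel(K, y, C):
    """K~ of Eq. (10): ktilde(z_i, z_j) = y_i y_j k(x_i, x_j) + y_i y_j + delta_ij / C."""
    yy = np.outer(y, y)
    return yy * K + yy + np.eye(len(y)) / C

# Toy check with a Gaussian kernel, for which kappa = k(x, x) = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
C = 10.0
Kt = transformed_kernel(K, y, C)
print(np.allclose(np.diag(Kt), 1 + 1 + 1 / C))   # True: ktilde(z, z) = kappa + 1 + 1/C
```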
4 Core Vector Machine (CVM)

After formulating the kernel method as a MEB problem, we obtain a transformed kernel k̃, together with the associated feature space F̃, mapping φ̃ and constant κ̃ = k̃(z, z). To solve this kernel-induced MEB problem, we adopt the approximation algorithm³ described in the proof of Theorem 2.2 in (Bădoiu & Clarkson, 2002). As mentioned in Section 2, the idea is to incrementally expand the ball by including the point furthest away from the current center. In the following, we denote the core-set, the ball's center and its radius at the t-th iteration by S_t, c_t and R_t respectively. Also, the center and radius of a ball B are denoted by c_B and r_B. Given an ε > 0, the CVM works as follows:

1. Initialize S_0, c_0 and R_0.
2. Terminate if there is no φ̃(z) (where z is a training point) falling outside the (1 + ε)-ball B(c_t, (1 + ε)R_t).
3. Find z such that φ̃(z) is furthest away from c_t. Set S_{t+1} = S_t ∪ {z}.
4. Find the new MEB(S_{t+1}) from (5), and set c_{t+1} = c_MEB(S_{t+1}) and R_{t+1} = r_MEB(S_{t+1}) using (3).
5. Increment t by 1 and go back to step 2.

³A similar algorithm is also described in (Kumar et al., 2003).

In the sequel, points that are added to the core-set will be called core vectors. Details of each of the above steps are described in Section 4.1. Despite its simplicity, CVM has an approximation guarantee (Section 4.2) and also provably small time and space complexities (Section 4.3).

4.1 Detailed Procedure

4.1.1 Initialization

Bădoiu and Clarkson (2002) simply used an arbitrary point z ∈ S to initialize S_0 = {z}. However, a good initialization may lead to fewer updates, and so we follow the scheme in (Kumar et al., 2003). We start with an arbitrary point z ∈ S and find z_a ∈ S that is furthest away from z in the feature space F̃. Then, we find another point z_b ∈ S that is furthest away from z_a in F̃. The initial core-set is then set to S_0 = {z_a, z_b}. Obviously, MEB(S_0) (in F̃) has center c_0 = (φ̃(z_a) + φ̃(z_b))/2. On using (3), we thus have α_a = α_b = 1/2 and all the other α_i's zero. The initial radius is

  R_0 = \tfrac{1}{2}\|\tilde\varphi(z_a) - \tilde\varphi(z_b)\| = \tfrac{1}{2}\sqrt{\|\tilde\varphi(z_a)\|^2 + \|\tilde\varphi(z_b)\|^2 - 2\tilde\varphi(z_a)'\tilde\varphi(z_b)} = \tfrac{1}{2}\sqrt{2\tilde\kappa - 2\tilde{k}(z_a, z_b)}.

In a classification problem, one may further require z_a and z_b to come from different classes. On using (10), R_0 then becomes \tfrac{1}{2}\sqrt{2\kappa + 4 + 2/C + 2k(x_a, x_b)}. As κ and C are constants, choosing the pair (x_a, x_b) that maximizes R_0 is then equivalent to choosing the closest pair belonging to opposing classes, which is also the heuristic used in initializing the SimpleSVM (Vishwanathan et al., 2003).

4.1.2 Distance Computations

Steps 2 and 3 involve computing ‖c_t − φ̃(z_ℓ)‖ for z_ℓ ∈ S. On using (3),

  \|c_t - \tilde\varphi(z_\ell)\|^2 = \sum_{z_i, z_j \in S_t} \alpha_i \alpha_j \tilde{k}(z_i, z_j) - 2 \sum_{z_i \in S_t} \alpha_i \tilde{k}(z_i, z_\ell) + \tilde{k}(z_\ell, z_\ell).   (12)

Hence, the computations are based on kernel evaluations instead of the explicit φ̃(z_i)'s, which may be infinite-dimensional. Note that, in contrast, existing MEB algorithms only consider finite-dimensional spaces. However, in the feature space, c_t cannot be obtained as an explicit point, but rather as a convex combination of (at most) |S_t| φ̃(z_i)'s. Computing (12) for all m training points takes O(|S_t|² + m|S_t|) = O(m|S_t|) time at the t-th iteration. This becomes very expensive when m is large. Here, we use the probabilistic speedup method in (Smola & Schölkopf, 2000). The idea is to randomly sample a sufficiently large subset S′ from S, and then take the point in S′ that is furthest away from c_t as the approximate furthest point over S. As shown in (Smola & Schölkopf, 2000), by using a small random sample of, say, size 59, the furthest point obtained from S′ is with probability 0.95 among the furthest 5% of points from the whole S. Instead of taking O(m|S_t|) time, this randomized method only takes O(|S_t|² + |S_t|) = O(|S_t|²) time, which is much faster as |S_t| ≪ m. This trick can also be used in initialization.
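Both operations reduce to a handful of kernel evaluations. The sketch below is ours and purely illustrative; `ktilde(i, j)` is an assumed callable returning the transformed kernel value for training indices i and j, and `alpha[a]` is the dual weight of core vector `core[a]`:

```python
import numpy as np

def dist2_to_center(alpha, core, ell, ktilde):
    """Squared distance ||c_t - phi~(z_ell)||^2 of Eq. (12), from kernel values only."""
    cc = sum(alpha[a] * alpha[b] * ktilde(core[a], core[b])
             for a in range(len(core)) for b in range(len(core)))
    cl = sum(alpha[a] * ktilde(core[a], ell) for a in range(len(core)))
    return cc - 2.0 * cl + ktilde(ell, ell)

def sampled_furthest_point(alpha, core, ktilde, m, rng, sample_size=59):
    """Probabilistic speedup (Smola & Schoelkopf, 2000): scan only a random sample of
    59 points; the winner is, with probability 0.95, among the furthest 5% of all m."""
    candidates = rng.choice(m, size=min(sample_size, m), replace=False)
    d2 = [dist2_to_center(alpha, core, ell, ktilde) for ell in candidates]
    return int(candidates[int(np.argmax(d2))])
```

Note that the first double sum in (12) does not depend on z_ℓ, so in an actual implementation it would be computed once per iteration rather than once per candidate point.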
4.1.3 Adding the Furthest Point

Points outside MEB(S_t) have zero α_i's (Section 4.1.1) and so violate the KKT conditions of the dual problem. As in (Osuna et al., 1997), one can simply add any such violating point to S_t. Our step 3, however, takes a greedy approach by including the point furthest away from the current center. In the classification case⁴ (Section 3.2.2), we have, on using (10), (11) and (12),

  \arg\max_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} \|c_t - \tilde\varphi(z_\ell)\|^2
  = \arg\min_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} \sum_{z_i \in S_t} \alpha_i y_i y_\ell (k(x_i, x_\ell) + 1)
  = \arg\min_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} y_\ell (w'\varphi(x_\ell) + b).   (13)

Hence, (13) chooses the worst violating pattern with respect to the constraint in (8). Also, as the dual objective in (9) has gradient −2K̃α, for a pattern ℓ currently outside the ball we have, on using (10), (11) and α_ℓ = 0,

  (\tilde{K}\alpha)_\ell = \sum_{i=1}^{m} \alpha_i \Big( y_i y_\ell k(x_i, x_\ell) + y_i y_\ell + \frac{\delta_{i\ell}}{C} \Big) = y_\ell (w'\varphi(x_\ell) + b).

Thus, the pattern chosen in (13) also makes the most progress towards maximizing the dual objective. This subset selection heuristic has been commonly used by various decomposition algorithms (e.g., (Chang & Lin, 2004; Joachims, 1999; Platt, 1999)).

⁴The case of one-class classification (Section 3.2.1) is similar.
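A corresponding sketch of this greedy step in the classification case (again ours and only illustrative; `K` is the plain kernel matrix and `outside` holds the indices of points currently outside the (1 + ε)-ball):

```python
import numpy as np

def worst_violator(alpha, y, K, core, outside):
    """Greedy step 3 via Eq. (13): among the points outside the (1+eps)-ball, return
    the index minimizing y_l (w' phi(x_l) + b), i.e. the worst violator of (8).
    alpha[a] is the dual weight of core vector core[a]."""
    margins = [y[ell] * sum(alpha[a] * y[core[a]] * (K[core[a], ell] + 1.0)
                            for a in range(len(core)))
               for ell in outside]
    return outside[int(np.argmin(margins))]
```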
4.1.4 Finding the MEB

At each iteration of step 4, we find the MEB by using the QP formulation in Section 3.2. As the size |S_t| of the core-set is much smaller than m in practice (Section 5), the computational complexity of each QP sub-problem is much lower than that of solving the whole QP. Besides, as only one core vector is added at each iteration, efficient rank-one update procedures (Cauwenberghs & Poggio, 2001; Vishwanathan et al., 2003) can also be used. The cost then becomes quadratic rather than cubic. In the current implementation (Section 5), we use SMO. As only one point is added each time, the new QP is just a slight perturbation of the original. Hence, by using the MEB solution obtained from the previous iteration as the starting point (warm start), SMO can often converge in a small number of iterations.

4.2 Convergence to (Approximate) Optimality

First, consider ε = 0. The proof in (Bădoiu & Clarkson, 2002) does not apply, as it requires ε > 0. Nevertheless, as the number of core vectors increases by one at each iteration and the training set size is finite, CVM must terminate in a finite number (say, τ) of iterations. With ε = 0, MEB(S_τ) is an enclosing ball for all the points on termination. Because S_τ is a subset of the whole training set, and the MEB of a subset cannot be larger than the MEB of the whole set, MEB(S_τ) must also be the exact MEB of the whole (φ̃-transformed) training set. In other words, when ε = 0, CVM outputs the exact solution of the kernel problem.

Now, consider ε > 0. Assume that the algorithm terminates at the τ-th iteration. Then

  R_\tau \le r_{MEB(S)} \le (1 + \epsilon) R_\tau   (14)

by definition. Recall that the optimal primal objective p* of the kernel problem in Section 3.2.1 (or 3.2.2) is equal to the optimal dual objective d_2* in (7) (or (9)), which in turn is related to the optimal dual objective d_1* = r_{MEB(S)}² in (2) by (6). Together with (14), we can then bound p* as

  R_\tau^2 \le p^* + \tilde\kappa \le (1 + \epsilon)^2 R_\tau^2.   (15)

Hence, max(R_τ²/(p* + κ̃), (p* + κ̃)/R_τ²) ≤ (1 + ε)², and thus CVM is a (1 + ε)²-approximation algorithm. This also holds with high probability when the probabilistic speedup is used.

As mentioned in Section 1, practical SVM implementations also output approximate solutions only. Typically, a parameter similar to our ε is required at termination. For example, in SMO and SVMlight (Joachims, 1999), training stops when the KKT conditions are fulfilled within ε. Experience with these implementations indicates that near-optimal solutions are often good enough in practical applications. Moreover, it can also be shown that when the CVM terminates, all the points satisfy loose KKT conditions, as in SMO and SVMlight.

4.3 Time and Space Complexities

Existing decomposition algorithms cannot guarantee the number of iterations, and consequently the overall time complexity (Chang & Lin, 2004). In this section, we show how this can be obtained for CVM. In the following, we assume that a plain QP implementation, which takes O(m³) time and O(m²) space for m patterns, is used for the MEB sub-problem in Section 4.1.4. Moreover, we assume that each kernel evaluation takes constant time.

As proved in (Bădoiu & Clarkson, 2002), CVM converges in at most 2/ε iterations. In other words, the total number of iterations, and consequently the size of the final core-set, are of τ = O(1/ε). In practice, it has often been observed that the size of the core-set is much smaller than this worst-case theoretical upper bound (Kumar et al., 2003). This will also be corroborated by our experiments in Section 5.

Consider first the case where the probabilistic speedup of Section 4.1.2 is not used. As only one core vector is added at each iteration, |S_t| = t + 2. Initialization takes O(m) time, while the distance computations in steps 2 and 3 take O((t + 2)² + tm) = O(t² + tm) time. Finding the MEB in step 4 takes O((t + 2)³) = O(t³) time, and the other operations take constant time. Hence, the t-th iteration takes O(tm + t³) time, and the overall time for τ = O(1/ε) iterations is

  \sum_{t=1}^{\tau} O(tm + t^3) = O(\tau^2 m + \tau^4) = O\!\left( \frac{m}{\epsilon^2} + \frac{1}{\epsilon^4} \right),

which is linear in m for a fixed ε.
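For completeness, the arithmetic behind this bound is just the standard sums \sum_{t=1}^{\tau} t = \tau(\tau+1)/2 and \sum_{t=1}^{\tau} t^3 = (\tau(\tau+1)/2)^2:

  \sum_{t=1}^{\tau} (tm + t^3) = m \cdot \frac{\tau(\tau+1)}{2} + \left( \frac{\tau(\tau+1)}{2} \right)^2 = O(\tau^2 m + \tau^4), \qquad \tau = O(1/\epsilon).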
As for space⁵, since only the core vectors are involved in the QP, the space complexity for the t-th iteration is O(|S_t|²). As τ = O(1/ε), the space complexity for the whole procedure is O(1/ε²), which is independent of m for a fixed ε.

⁵As the patterns may be stored out of core, we ignore the O(m) space required for storing the m patterns.

On the other hand, when the probabilistic speedup is used, initialization only takes O(1) time, while the distance computations in steps 2 and 3 take O((t + 2)²) = O(t²) time. The time for the other operations remains the same. Hence, the t-th iteration takes O(t³) time, and the whole procedure takes

  \sum_{t=1}^{\tau} O(t^3) = O(\tau^4) = O\!\left( \frac{1}{\epsilon^4} \right).

For a fixed ε, it is thus constant, independent of m. The space complexity, which depends only on the number of iterations τ, is still O(1/ε²).

If more efficient QP solvers were used in the MEB sub-problem of Section 4.1.4, both the time and space complexities could be further improved. For example, with SMO, the space complexity for the t-th iteration is reduced to O(|S_t|), and that for the whole procedure is driven down to O(1/ε).

Note that when ε decreases, the CVM solution becomes closer to the exact optimal solution, but at the expense of higher time and space complexities. Such a tradeoff between efficiency and approximation quality is typical of all approximation schemes. Moreover, be cautioned that the O-notation is used for studying the asymptotic efficiency of algorithms. As we are interested in handling very large data sets, an algorithm that is asymptotically more efficient (in time and space) will be the best choice. However, on smaller problems, it may be outperformed by algorithms that are not as efficient asymptotically. This will be demonstrated experimentally in Section 5.

5 Experiments

In this section, we implement the two-class L2-SVM in Section 3.2.2 and illustrate the scaling behavior of CVM (in C++) on both toy and real-world data sets. For comparison, we also run the following SVM implementations⁶:

1. L2-SVM: LIBSVM implementation (in C++);
2. L2-SVM: LSVM implementation (in MATLAB), with low-rank approximation (Fine & Scheinberg, 2001) of the kernel matrix added;
3. L2-SVM: RSVM (Lee & Mangasarian, 2001) implementation (in MATLAB). The RSVM addresses the scale-up issue by solving a smaller optimization problem that involves a random m̄ × m rectangular subset of the kernel matrix. Here, m̄ is set to 10% of m;
4. L1-SVM: LIBSVM implementation (in C++);
5. L1-SVM: SimpleSVM (Vishwanathan et al., 2003) implementation (in MATLAB).

⁶Our CVM implementation can be downloaded from http://www.cs.ust.hk/~jamesk/cvm.zip. LIBSVM can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/; LSVM from http://www.cs.wisc.edu/dmi/lsvm; and SimpleSVM from http://asi.insa-rouen.fr/~gloosli/. Moreover, we followed http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html in adapting the LIBSVM package for the L2-SVM.

Parameters are used in their default settings unless otherwise specified. All experiments are performed on a 3.2GHz Pentium-4 machine with 512M RAM, running Windows XP. Since our focus is on nonlinear kernels, we use the Gaussian kernel k(x, y) = exp(−‖x − y‖²/β), with β = (1/m²) Σ_{i,j=1}^m ‖x_i − x_j‖².
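This data-dependent width can be computed without forming all m² pairwise distances, using the identity \sum_{i,j} \|x_i - x_j\|^2 = 2m \sum_i \|x_i\|^2 - 2\|\sum_i x_i\|^2. A small NumPy sketch (ours, for illustration only):

```python
import numpy as np

def default_beta(X):
    """beta = (1/m^2) sum_{i,j} ||x_i - x_j||^2, the Gaussian kernel width used here."""
    m = len(X)
    sq_norms = (X ** 2).sum(axis=1)
    # sum_{i,j} ||x_i - x_j||^2 = 2 m sum_i ||x_i||^2 - 2 ||sum_i x_i||^2
    return (2.0 * m * sq_norms.sum() - 2.0 * (X.sum(axis=0) ** 2).sum()) / (m * m)

def gaussian_kernel(A, B, beta):
    """k(x, y) = exp(-||x - y||^2 / beta); note k(x, x) = 1, so kappa = 1 in (4)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / beta)
```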
Our CVM implementation is adapted from LIBSVM, and uses SMO for each QP sub-problem in Section 4.1.4. As in LIBSVM, our CVM also uses caching (with the same cache size as in the other LIBSVM implementations above) and stores all training patterns in main memory. For simplicity, shrinking is not used in our current CVM implementation. Moreover, we employ the probabilistic speedup (Section 4.1.2) and set ε = 10⁻⁶ in all the experiments. As in other decomposition methods, the use of a very stringent stopping criterion is not necessary in practice. Preliminary studies show that ε = 10⁻⁶ is acceptable for most tasks. Using an even smaller ε does not show improved generalization performance, but may increase the training time unnecessarily.

5.1 Checkerboard Data

We first experiment on the 4 × 4 checkerboard data used by Lee and Mangasarian (2001) for evaluating large-scale SVM implementations. We use training sets with a maximum of 1 million points, and 2,000 independent points for testing. Of course, this problem does not need so many points for training, but it is convenient for illustrating the scaling properties. Experimentally, the L2-SVM with low-rank approximation does not yield satisfactory performance on this data set, and so its result is not reported here. RSVM, on the other hand, has to keep a rectangular kernel matrix of size m̄ × m and cannot be run on our machine when m exceeds 10K. Similarly, the SimpleSVM has to store the kernel matrix of the active set, and runs into storage problems when m exceeds 30K.

As can be seen from Figure 2, CVM is as accurate as the others. Besides, it is much faster⁷ and produces far fewer support vectors (which implies faster testing) on large data sets. In particular, one million patterns can be processed in under 13s. On the other hand, for relatively small training sets with fewer than 10K patterns, LIBSVM is faster. This, however, is to be expected, as LIBSVM uses more sophisticated heuristics and so will be more efficient on small- to medium-sized data sets. Figure 2(b) also shows the core-set size, which can be seen to be small, and its curve basically overlaps with that of the CVM. Thus, almost all the core vectors are useful support vectors. Moreover, it also confirms our theoretical finding that both time and space are constant w.r.t. the training set size, when it is large enough.

⁷As some implementations are in MATLAB, not all the CPU time measurements can be directly compared. However, it is still useful to note the constant scaling exhibited by the CVM and its speed advantage over the other C++ implementations when the data set is large.
Figure 2: Results on the checkerboard data set: (a) CPU time; (b) number of SV's; (c) testing error. Except for the CVM, all the other implementations have to terminate early because of insufficient memory and/or excessive training time. Note that the CPU time, the number of support vectors, and the size of the training set are in log scale.

5.2 Forest Cover Type Data⁸

This data set has been used for large-scale SVM training by Collobert et al. (2002). Following (Collobert et al., 2002), we aim at separating class 2 from the other classes. 1%–90% of the whole data set (with a maximum of 522,911 patterns) is used for training, while the remaining patterns are used for testing. We set β = 10000 for the Gaussian kernel. Preliminary studies show that the number of support vectors is over ten thousand. Consequently, RSVM and SimpleSVM cannot be run on our machine. Similarly, for the low-rank approximation, preliminary studies show that thousands of basis vectors are required for a good approximation. Therefore, only the two LIBSVM implementations will be compared with the CVM here.

⁸http://kdd.ics.uci.edu/databases/covertype/covertype.html

Figure 3 shows that CVM is, again, as accurate as the others. Note that when the training set is small, more training patterns bring in additional information useful for classification, and so the number of core vectors increases with the training set size. However, after processing around 100K patterns, both the time and space requirements of CVM begin to exhibit a constant scaling with the training set size. With hindsight, one might simply sample 100K training patterns and hope to obtain comparable results⁹. However, for satisfactory classification performance, different problems require samples of different sizes, and CVM has the important advantage that the required sample size does not have to be pre-specified. Without such prior knowledge, random sampling gives poor testing results, as has been demonstrated in (Lee & Mangasarian, 2001).

⁹In fact, we tried both LIBSVM implementations on a random sample of 100K training patterns, but their testing accuracies are inferior to that of CVM.

Figure 3: Results on the forest cover type data set: (a) CPU time; (b) number of SV's; (c) testing error. Note that the y-axes in (a) and (b) are in log scale.

5.3 Relatively Small Data Sets: UCI Adult Data¹⁰

Following (Platt, 1999), we use training sets with up to 32,562 patterns. As can be seen in Figure 4, CVM is still among the most accurate methods. However, as this data set is relatively small, more training patterns do carry more classification information. Hence, as discussed in Section 5.2, the number of iterations, the core-set size and consequently the CPU time all increase with the number of training patterns. From another perspective, recall that the worst-case core-set size is 2/ε, independent of m (Section 4.3). For the value of ε = 10⁻⁶ used here, 2/ε = 2 × 10⁶. Although we have seen that the actual size of the core-set is often much smaller than this worst-case value, when m ≪ 2/ε the number of core vectors can still depend on m. Moreover, as has been observed in Section 5.1, the CVM is slower than the more sophisticated LIBSVM on these smaller data sets.

¹⁰http://research.microsoft.com/users/jplatt/smo.html

Figure 4: Results on the UCI adult data set: (a) CPU time; (b) number of SV's; (c) testing error. The other implementations have to terminate early because of insufficient memory and/or excessive training time. Note that the CPU time, the number of SV's and the size of the training set are in log scale.

6 Conclusion

In this paper, we exploit the "approximateness" in SVM implementations. We formulate kernel methods as equivalent MEB problems, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. The proposed CVM procedure is simple, and does not require the sophisticated heuristics used in other decomposition methods. Moreover, despite its simplicity, CVM has small asymptotic time and space complexities. In particular, for a fixed ε, its asymptotic time complexity is linear in the training set size m, while its space complexity is independent of m. When the probabilistic speedup is used, it even has constant asymptotic time and space complexities for a fixed ε, independent of the training set size m. Experimentally, on large data sets, it is much faster and produces far fewer support vectors (and thus allows faster testing) than existing methods. On the other hand, on relatively small data sets where m ≪ 2/ε, SMO can be faster. CVM can also be used for other kernel methods, such as support vector regression; details will be reported elsewhere.

References

Bădoiu, M., & Clarkson, K. (2002). Optimal core-sets for balls.
DIMACS Workshop on Computational Geometry.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.



Chang, C.-C., & Lin, C.-J. (2004). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.

Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14, 1105–1114.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representation. Journal of Machine Learning Research, 2, 243–264.

Garey, M., & Johnson, D. (1979). Computers and intractability: A guide to the theory of NP-completeness. W.H. Freeman.

Joachims, T. (1999). Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods – Support vector learning, 169–184. Cambridge, MA: MIT Press.

Kumar, P., Mitchell, J., & Yildirim, A. (2003). Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics, 8.

Lee, Y.-J., & Mangasarian, O. (2001). RSVM: Reduced support vector machines. Proceedings of the First SIAM International Conference on Data Mining.

Lin, K.-M., & Lin, C.-J. (2003). A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14, 1449–1459.

Mangasarian, O., & Musicant, D. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: an application to face detection. Proceedings of Computer Vision and Pattern Recognition (pp. 130–136). San Juan, Puerto Rico.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods – Support vector learning, 185–208. Cambridge, MA: MIT Press.

Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., & Williamson, R. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443–1471.

Smola, A., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 911–918). Stanford, CA, USA.

Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.

Tax, D., & Duin, R. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.

Vishwanathan, S., Smola, A., & Murty, M. (2003). SimpleSVM. Proceedings of the Twentieth International Conference on Machine Learning (pp. 760–767). Washington, D.C., USA.

Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.