Very Large SVM Training using Core Vector Machines

Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong

Abstract

Standard SVM training has O(m³) time and O(m²) space complexities, where m is the training set size. In this paper, we scale up kernel methods by exploiting the "approximateness" in practical SVM implementations. We formulate many kernel methods as equivalent minimum enclosing ball problems in computational geometry, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. Our proposed Core Vector Machine (CVM) algorithm has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is much faster and can handle much larger data sets than existing scale-up methods. In particular, on our PC with only 512M RAM, the CVM with Gaussian kernel can process the checkerboard data set with 1 million points in less than 13 seconds.

1 Introduction

In recent years, there has been a lot of interest in using kernels in various machine learning problems, with the support vector machine (SVM) being the most prominent example. Many of these kernel methods are formulated as quadratic programming (QP) problems. Denote the number of training patterns by m. The training time complexity of QP is O(m³) and its space complexity is at least quadratic. Hence, a major stumbling block is in scaling up these QP's to large data sets, such as those commonly encountered in data mining applications.

To reduce the time and space complexities, a popular technique is to obtain low-rank approximations of the kernel matrix, by using the Nyström method (Williams & Seeger, 2001), greedy approximation (Smola & Schölkopf, 2000) or matrix decompositions (Fine & Scheinberg, 2001). However, on very large data sets, the resulting rank of the kernel matrix may still be too high to be handled efficiently.

Another approach to scale up kernel methods is by chunking or more sophisticated decomposition methods. However, chunking needs to optimize the entire set of non-zero Lagrange multipliers that have been identified, and the resultant kernel matrix may still be too large to fit into memory. Osuna et al. (1997) suggested optimizing only a fixed-size subset of the training data (the working set) each time, while the variables corresponding to the other patterns are frozen. Going to the extreme, the sequential minimal optimization (SMO) algorithm (Platt, 1999) breaks a large QP into a series of smallest possible QPs, each involving only two variables. In the context of classification, Mangasarian and Musicant (2001) proposed the Lagrangian SVM (LSVM), which avoids the QP (or LP) altogether; instead, the solution is obtained by a fast iterative scheme. However, for nonlinear kernels (which are the focus in this paper), it still requires the inversion of an m × m matrix. Further speed-up is possible by employing the reduced SVM (RSVM) (Lee & Mangasarian, 2001), which uses a rectangular subset of the kernel matrix. However, this may lead to performance degradation (Lin & Lin, 2003).

In practice, state-of-the-art SVM implementations typically have a training time complexity that scales between O(m) and O(m^2.3) (Platt, 1999). This can be further driven down to O(m) with the use of a parallel mixture (Collobert et al., 2002). However, these are only empirical observations, not theoretical guarantees. For reliable scaling behavior to very large data sets, our goal is to develop an algorithm that can be proved (using tools in the analysis of algorithms) to be asymptotically efficient in both time and space.

Moreover, practical SVM implementations, as in many numerical routines, only approximate the optimal solution by an iterative strategy. Typically, the stopping criterion utilizes either the precision of the Lagrange multipliers (e.g., (Joachims, 1999; Platt, 1999)) or the duality gap (e.g., (Smola & Schölkopf, 2004)). However, while approximation algorithms (with provable performance guarantees) have been extensively used in tackling computationally difficult problems such as NP-complete problems (Garey & Johnson, 1979), such "approximateness" has never been exploited in the design of SVM implementations.
In this paper, we first transform the SVM optimization problem (with a possibly nonlinear kernel) to the minimum enclosing ball (MEB) problem in computational geometry. The MEB problem computes the ball of minimum radius enclosing a given set of points (or, more generally, balls). Traditional algorithms for finding exact MEBs do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms. Lately, a breakthrough was obtained by Bădoiu and Clarkson (2002), who showed that a (1 + ε)-approximation of the MEB can be efficiently obtained using core-sets. Generally speaking, in an optimization problem, a core-set is a subset of the input points such that we can get a good approximation (with an approximation ratio¹ specified by a user-defined parameter ε) to the original input by solving the optimization problem directly on the core-set. Moreover, a surprising property shown in (Bădoiu & Clarkson, 2002) is that the size of the core-set is independent of both d and the size of the point set. This independence of d is important when applying the algorithm to kernel methods (Section 3), as the kernel-induced feature space can be infinite-dimensional. As for the independence of m, it allows both the time and space complexities of our algorithm to grow slowly, as will be shown in Section 4.3.

¹Let C be the cost (or value of the objective function) of the solution returned by an approximate algorithm, and C* be the cost of the optimal solution. Then the approximate algorithm has an approximation ratio ρ(n) for an input size n if max(C/C*, C*/C) ≤ ρ(n). Intuitively, this measures how bad the approximate solution is compared with the optimal one. A large (small) approximation ratio means the solution is much worse than (more or less the same as) the optimal solution. Observe that ρ(n) is always ≥ 1. If the ratio does not depend on n, we may just write ρ and call the algorithm a ρ-approximation algorithm.

Inspired by this core-set-based approximate MEB algorithm, we develop an approximation algorithm for SVM training that has an approximation ratio of (1 + ε)². Its time complexity is linear in m, while its space complexity is independent of m. The rest of this paper is organized as follows. Section 2 gives a short introduction on the MEB problem and its approximation algorithm.
The connection between kernel methods and the MEB problem is given in Section 3. Section 4 then describes our proposed Core Vector Machine (CVM) algorithm. Experimental results are presented in Section 5, and the last section gives some concluding remarks.

2 MEB in Computational Geometry

Given a set of points S = {x_1, ..., x_m}, where each x_i ∈ R^d, the minimum enclosing ball of S (denoted MEB(S)) is the smallest ball that contains all the points in S. The MEB problem has found applications in diverse areas such as computer graphics (e.g., collision detection, visibility culling), machine learning (e.g., similarity search) and facility location problems.

Here, we will focus on approximate MEB algorithms based on core-sets. Let B(c, R) be the ball with center c and radius R. Given ε > 0, a ball B(c, (1 + ε)R) is a (1 + ε)-approximation of MEB(S) if R ≤ r_MEB(S) and S ⊂ B(c, (1 + ε)R). A subset X ⊆ S is a core-set of S if an expansion by a factor (1 + ε) of its MEB contains S, i.e., S ⊂ B(c, (1 + ε)r), where B(c, r) = MEB(X) (Figure 1).

Figure 1: The inner circle is the MEB of the set of squares, and its (1 + ε) expansion (the outer circle) covers all the points. The set of squares is thus a core-set.

To obtain such a (1 + ε)-approximation, Bădoiu and Clarkson (2002) proposed a simple iterative scheme: at the t-th iteration, the current estimate B(c_t, r_t) is expanded incrementally by including the furthest point outside the (1 + ε)-ball B(c_t, (1 + ε)r_t). This is repeated until all the points in S are covered by B(c_t, (1 + ε)r_t). Despite its simplicity, Bădoiu and Clarkson (2002) showed that the number of iterations, and hence the size of the final core-set, depends only on ε but not on d or m.
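To make the scheme concrete, here is a minimal Euclidean-space sketch in Python/NumPy. It is not the authors' implementation: the MEB of the current core-set is obtained by solving the small SVDD dual of Section 3.1 (with a linear kernel), using SciPy's general-purpose SLSQP solver as a stand-in for a dedicated QP solver, and the helper names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def meb_of_coreset(P):
    """MEB of the rows of P via the SVDD dual (Section 3.1, linear kernel):
    max alpha' diag(K) - alpha' K alpha  s.t.  alpha >= 0, sum(alpha) = 1."""
    K = P @ P.T
    dK = np.diag(K)
    n = len(P)
    res = minimize(lambda a: a @ K @ a - a @ dK,          # minimize the negated dual
                   np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    alpha = res.x
    c = alpha @ P                                          # center, as in Eq. (3)
    R = np.sqrt(max(alpha @ dK - alpha @ K @ alpha, 0.0))  # radius, as in Eq. (3)
    return c, R

def approx_meb(X, eps=1e-2):
    """(1+eps)-approximate MEB of the rows of X by the core-set scheme."""
    core = [0]                           # start from an arbitrary point
    c, R = X[0].astype(float), 0.0
    while True:
        d = np.linalg.norm(X - c, axis=1)
        far = int(np.argmax(d))
        if d[far] <= (1.0 + eps) * R or far in core:
            return c, R, core            # covered, up to the inner solver's tolerance
        core.append(far)                 # add the furthest point to the core-set
        c, R = meb_of_coreset(X[core])   # re-solve the MEB of the small core-set
```

In the kernel setting of Sections 3 and 4, the only change is that the inner products above are replaced by (transformed) kernel evaluations.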

3 MEB Problems and Kernel Methods

Obviously, the MEB is equivalent to the hard-margin support vector data description (SVDD) (Tax & Duin, 1999), which will be briefly reviewed in Section 3.1. The MEB problem can also be used for finding the radius component of the radius-margin bound (Chapelle et al., 2002). Thus, as pointed out by Kumar et al. (2003), the MEB problem is useful in support vector clustering and SVM parameter tuning. However, we will show in Section 3.2 that other kernel-related problems, including the training of soft-margin one-class and two-class L2-SVMs, can also be viewed as MEB problems.

3.1 Hard-Margin SVDD

Given a kernel k with the associated feature map φ, let the MEB in the kernel-induced feature space be B(c, R). The primal problem in the hard-margin SVDD is

  \min_{R, c} R^2 \;:\; \|c - \varphi(x_i)\|^2 \le R^2, \quad i = 1, \ldots, m.   (1)

The corresponding dual is

  \max_{\alpha} \; \alpha'\,\mathrm{diag}(K) - \alpha' K \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1,   (2)

where α = [α_1, ..., α_m]′ are the Lagrange multipliers, 0 = [0, ..., 0]′, 1 = [1, ..., 1]′ and K_{m×m} = [k(x_i, x_j)] = [φ(x_i)′φ(x_j)] is the kernel matrix. As is well known, this is a QP problem. The primal variables can be recovered from the optimal α as

  c = \sum_{i=1}^{m} \alpha_i \varphi(x_i), \qquad R = \sqrt{\alpha'\,\mathrm{diag}(K) - \alpha' K \alpha}.   (3)

3.2 Viewing Kernel Methods as MEB Problems

In this paper, we consider the situation where

  k(x, x) = \kappa,   (4)

a constant². This is the case when either (1) the isotropic kernel k(x, y) = K(‖x − y‖) (e.g., the Gaussian kernel) is used; or (2) the dot-product kernel k(x, y) = K(x′y) (e.g., the polynomial kernel) is used with normalized inputs; or (3) any normalized kernel k(x, y) = K(x, y)/(√K(x, x) √K(y, y)) is used. Using the condition α′1 = 1 in (2), we have α′diag(K) = κ. Dropping this constant term from the dual objective in (2), we obtain the simpler optimization problem

  \max_{\alpha} \; -\alpha' K \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1.   (5)

Conversely, when the kernel k satisfies (4), QP's of the form (5) can always be regarded as a MEB problem (1). Note that (2) and (5) yield the same set of α's. Moreover, if d_1* and d_2* denote the optimal dual objectives in (2) and (5) respectively, then, obviously,

  d_1^* = d_2^* + \kappa.   (6)

²In this case, it can be shown that the hard (soft) margin SVDD yields the same solution as the hard (soft) margin one-class SVM (Schölkopf et al., 2001). Moreover, the weight w in the one-class SVM solution is equal to the center c in the SVDD solution.
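To spell out the step from (2) to (5): under (4), every diagonal entry of K equals κ, so on the feasible set of (2) the linear term is a constant,

  \alpha'\,\mathrm{diag}(K) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x_i) = \kappa \sum_{i=1}^{m} \alpha_i = \kappa,

and (2) and (5) therefore differ only by this additive κ, which is exactly the relation recorded in (6).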
In the following, we will show that when (4) is satisfied, the duals in a number of kernel methods can be rewritten in the form of (5). While the 1-norm error has been commonly used for the SVM, our main focus will be on the 2-norm error. In theory, this could be less robust in the presence of outliers. However, experimentally, its generalization performance is often comparable to that of the L1-SVM (e.g., (Lee & Mangasarian, 2001; Mangasarian & Musicant, 2001)). Besides, the 2-norm error is more advantageous here because a soft-margin L2-SVM can be transformed to a hard-margin one. While the 2-norm error has been used in classification (Section 3.2.2), we will also extend its use for novelty detection (Section 3.2.1).

3.2.1 One-Class L2-SVM

Given a set of unlabeled patterns {z_i}_{i=1}^m, where z_i only has the input part x_i, the one-class L2-SVM separates the outliers from the normal data by solving the primal problem

  \min_{w, \rho, \xi_i} \|w\|^2 - 2\rho + C \sum_{i=1}^{m} \xi_i^2 \;:\; w'\varphi(x_i) \ge \rho - \xi_i,

where w′φ(x) = ρ is the desired hyperplane and C is a user-defined parameter. Note that the constraints ξ_i ≥ 0 are not needed for the L2-SVM. The corresponding dual is

  \max \; -\alpha' \Big( K + \tfrac{1}{C} I \Big) \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1
  \;=\; \max \; -\alpha' \tilde{K} \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1,   (7)

where I is the m × m identity matrix and K̃ = [k̃(z_i, z_j)] = [k(x_i, x_j) + δ_ij/C]. It is thus of the form in (5). Since k(x, x) = κ, k̃(z, z) = κ + 1/C ≡ κ̃ is also a constant. This one-class SVM thus corresponds to the MEB problem (1), in which φ is replaced by the nonlinear map φ̃ satisfying φ̃(z_i)′φ̃(z_j) = k̃(z_i, z_j). From the Karush-Kuhn-Tucker (KKT) conditions, we can recover w = Σ_{i=1}^m α_i φ(x_i) and ξ_i = α_i/C, and ρ = w′φ(x_i) + α_i/C from any support vector x_i.

3.2.2 Two-Class L2-SVM

Given a training set {z_i = (x_i, y_i)}_{i=1}^m with y_i ∈ {−1, 1}, the primal of the two-class L2-SVM is

  \min_{w, b, \rho, \xi_i} \|w\|^2 + b^2 - 2\rho + C \sum_{i=1}^{m} \xi_i^2
  \;\text{s.t.}\; y_i (w'\varphi(x_i) + b) \ge \rho - \xi_i.   (8)

The corresponding dual is

  \max_{\mathbf{0} \le \alpha} \; -\alpha' \Big( K \odot yy' + yy' + \tfrac{1}{C} I \Big) \alpha \;:\; \alpha'\mathbf{1} = 1
  \;=\; \max \; -\alpha' \tilde{K} \alpha \;:\; \mathbf{0} \le \alpha, \; \alpha'\mathbf{1} = 1,   (9)

where ⊙ denotes the Hadamard product, y = [y_1, ..., y_m]′ and K̃ = [k̃(z_i, z_j)] with

  \tilde{k}(z_i, z_j) = y_i y_j k(x_i, x_j) + y_i y_j + \frac{\delta_{ij}}{C},   (10)

involving both input and label information. Again, this is of the form in (5), with k̃(z, z) = κ + 1 + 1/C ≡ κ̃, a constant. Again, we can recover

  w = \sum_{i=1}^{m} \alpha_i y_i \varphi(x_i), \qquad b = \sum_{i=1}^{m} \alpha_i y_i, \qquad \xi_i = \frac{\alpha_i}{C},   (11)

from the optimal α, and ρ = y_i(w′φ(x_i) + b) + α_i/C from any support vector z_i. Note that all the support vectors of this L2-SVM, including those defining the margin and those that are misclassified, now reside on the surface of the ball in the feature space induced by k̃. A similar relationship connecting one-class classification and binary classification is also described in (Schölkopf et al., 2001).
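As a quick illustration of (10) (our own sketch, not part of the paper's implementation; the function name is ours), the transformed kernel matrix K̃ can be assembled directly from K, y and C, and its diagonal is then the constant κ̃ = κ + 1 + 1/C, as required by (4):

```python
import numpy as np

def transformed_kernel(K, y, C):
    """K~ of Eq. (10): ktilde(z_i, z_j) = y_i y_j k(x_i, x_j) + y_i y_j + delta_ij / C."""
    yy = np.outer(y, y)
    return yy * K + yy + np.eye(len(y)) / C

# Toy check with a Gaussian kernel, for which kappa = k(x, x) = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
C = 10.0
Kt = transformed_kernel(K, y, C)
print(np.allclose(np.diag(Kt), 1 + 1 + 1 / C))   # True: ktilde(z, z) = kappa + 1 + 1/C
```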
4 Core Vector Machine (CVM)

After formulating the kernel method as a MEB problem, we obtain a transformed kernel k̃, together with the associated feature space F̃, mapping φ̃ and constant κ̃ = k̃(z, z). To solve this kernel-induced MEB problem, we adopt the approximation algorithm³ described in the proof of Theorem 2.2 in (Bădoiu & Clarkson, 2002). As mentioned in Section 2, the idea is to incrementally expand the ball by including the point furthest away from the current center. In the following, we denote the core-set, the ball's center and its radius at the t-th iteration by S_t, c_t and R_t respectively. Also, the center and radius of a ball B are denoted by c_B and r_B. Given an ε > 0, the CVM works as follows:

1. Initialize S_0, c_0 and R_0.
2. Terminate if there is no φ̃(z) (where z is a training point) falling outside the (1 + ε)-ball B(c_t, (1 + ε)R_t).
3. Find z such that φ̃(z) is furthest away from c_t. Set S_{t+1} = S_t ∪ {z}.
4. Find the new MEB(S_{t+1}) from (5), and set c_{t+1} = c_MEB(S_{t+1}) and R_{t+1} = r_MEB(S_{t+1}) using (3).
5. Increment t by 1 and go back to step 2.

³A similar algorithm is also described in (Kumar et al., 2003).

In the sequel, points that are added to the core-set will be called core vectors. Details of each of the above steps are described in Section 4.1. Despite its simplicity, CVM has an approximation guarantee (Section 4.2) and also provably small time and space complexities (Section 4.3).

4.1 Detailed Procedure

4.1.1 Initialization

Bădoiu and Clarkson (2002) simply used an arbitrary point z ∈ S to initialize S_0 = {z}. However, a good initialization may lead to fewer updates, and so we follow the scheme in (Kumar et al., 2003). We start with an arbitrary point z ∈ S and find z_a ∈ S that is furthest away from z in the feature space F̃. Then, we find another point z_b ∈ S that is furthest away from z_a in F̃. The initial core-set is then set to S_0 = {z_a, z_b}. Obviously, MEB(S_0) (in F̃) has center c_0 = (φ̃(z_a) + φ̃(z_b))/2. On using (3), we thus have α_a = α_b = 1/2 and all the other α_i's zero. The initial radius is

  R_0 = \tfrac{1}{2}\|\tilde\varphi(z_a) - \tilde\varphi(z_b)\| = \tfrac{1}{2}\sqrt{\|\tilde\varphi(z_a)\|^2 + \|\tilde\varphi(z_b)\|^2 - 2\tilde\varphi(z_a)'\tilde\varphi(z_b)} = \tfrac{1}{2}\sqrt{2\tilde\kappa - 2\tilde{k}(z_a, z_b)}.

In a classification problem, one may further require z_a and z_b to come from different classes. On using (10), R_0 then becomes \tfrac{1}{2}\sqrt{2\kappa + 4 + 2/C + 2k(x_a, x_b)}. As κ and C are constants, choosing the pair (x_a, x_b) that maximizes R_0 is then equivalent to choosing the closest pair belonging to opposing classes, which is also the heuristic used in initializing the SimpleSVM (Vishwanathan et al., 2003).

4.1.2 Distance Computations

Steps 2 and 3 involve computing ‖c_t − φ̃(z_ℓ)‖ for z_ℓ ∈ S. On using (3),

  \|c_t - \tilde\varphi(z_\ell)\|^2 = \sum_{z_i, z_j \in S_t} \alpha_i \alpha_j \tilde{k}(z_i, z_j) - 2 \sum_{z_i \in S_t} \alpha_i \tilde{k}(z_i, z_\ell) + \tilde{k}(z_\ell, z_\ell).   (12)

Hence, the computations are based on kernel evaluations instead of the explicit φ̃(z_i)'s, which may be infinite-dimensional. Note that, in contrast, existing MEB algorithms only consider finite-dimensional spaces. However, in the feature space, c_t cannot be obtained as an explicit point, but rather as a convex combination of (at most) |S_t| φ̃(z_i)'s. Computing (12) for all m training points takes O(|S_t|² + m|S_t|) = O(m|S_t|) time at the t-th iteration. This becomes very expensive when m is large. Here, we use the probabilistic speedup method in (Smola & Schölkopf, 2000). The idea is to randomly sample a sufficiently large subset S′ from S, and then take the point in S′ that is furthest away from c_t as the approximate furthest point over S. As shown in (Smola & Schölkopf, 2000), by using a small random sample of, say, size 59, the furthest point obtained from S′ is with probability 0.95 among the furthest 5% of points from the whole S. Instead of taking O(m|S_t|) time, this randomized method only takes O(|S_t|² + |S_t|) = O(|S_t|²) time, which is much faster as |S_t| ≪ m. This trick can also be used in initialization.
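Both operations reduce to a handful of kernel evaluations. The sketch below is ours and purely illustrative; `ktilde(i, j)` is an assumed callable returning the transformed kernel value for training indices i and j, and `alpha[a]` is the dual weight of core vector `core[a]`:

```python
import numpy as np

def dist2_to_center(alpha, core, ell, ktilde):
    """Squared distance ||c_t - phi~(z_ell)||^2 of Eq. (12), from kernel values only."""
    cc = sum(alpha[a] * alpha[b] * ktilde(core[a], core[b])
             for a in range(len(core)) for b in range(len(core)))
    cl = sum(alpha[a] * ktilde(core[a], ell) for a in range(len(core)))
    return cc - 2.0 * cl + ktilde(ell, ell)

def sampled_furthest_point(alpha, core, ktilde, m, rng, sample_size=59):
    """Probabilistic speedup (Smola & Schoelkopf, 2000): scan only a random sample of
    59 points; the winner is, with probability 0.95, among the furthest 5% of all m."""
    candidates = rng.choice(m, size=min(sample_size, m), replace=False)
    d2 = [dist2_to_center(alpha, core, ell, ktilde) for ell in candidates]
    return int(candidates[int(np.argmax(d2))])
```

Note that the first double sum in (12) does not depend on z_ℓ, so in an actual implementation it would be computed once per iteration rather than once per candidate point.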
4.1.3 Adding the Furthest Point

Points outside MEB(S_t) have zero α_i's (Section 4.1.1) and so violate the KKT conditions of the dual problem. As in (Osuna et al., 1997), one can simply add any such violating point to S_t. Our step 3, however, takes a greedy approach by including the point furthest away from the current center. In the classification case⁴ (Section 3.2.2), we have, on using (10), (11) and (12),

  \arg\max_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} \|c_t - \tilde\varphi(z_\ell)\|^2
  = \arg\min_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} \sum_{z_i \in S_t} \alpha_i y_i y_\ell (k(x_i, x_\ell) + 1)
  = \arg\min_{z_\ell \notin B(c_t, (1+\epsilon)R_t)} y_\ell (w'\varphi(x_\ell) + b).   (13)

Hence, (13) chooses the worst violating pattern with respect to the constraint in (8). Also, as the dual objective in (9) has gradient −2K̃α, for a pattern ℓ currently outside the ball we have, on using (10), (11) and α_ℓ = 0,

  (\tilde{K}\alpha)_\ell = \sum_{i=1}^{m} \alpha_i \Big( y_i y_\ell k(x_i, x_\ell) + y_i y_\ell + \frac{\delta_{i\ell}}{C} \Big) = y_\ell (w'\varphi(x_\ell) + b).

Thus, the pattern chosen in (13) also makes the most progress towards maximizing the dual objective. This subset selection heuristic has been commonly used by various decomposition algorithms (e.g., (Chang & Lin, 2004; Joachims, 1999; Platt, 1999)).

⁴The case of one-class classification (Section 3.2.1) is similar.
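A corresponding sketch of this greedy step in the classification case (again ours and only illustrative; `K` is the plain kernel matrix and `outside` holds the indices of points currently outside the (1 + ε)-ball):

```python
import numpy as np

def worst_violator(alpha, y, K, core, outside):
    """Greedy step 3 via Eq. (13): among the points outside the (1+eps)-ball, return
    the index minimizing y_l (w' phi(x_l) + b), i.e. the worst violator of (8).
    alpha[a] is the dual weight of core vector core[a]."""
    margins = [y[ell] * sum(alpha[a] * y[core[a]] * (K[core[a], ell] + 1.0)
                            for a in range(len(core)))
               for ell in outside]
    return outside[int(np.argmin(margins))]
```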
4.1.4 Finding the MEB

At each iteration of step 4, we find the MEB by using the QP formulation in Section 3.2. As the size |S_t| of the core-set is much smaller than m in practice (Section 5), the computational complexity of each QP sub-problem is much lower than that of solving the whole QP. Besides, as only one core vector is added at each iteration, efficient rank-one update procedures (Cauwenberghs & Poggio, 2001; Vishwanathan et al., 2003) can also be used. The cost then becomes quadratic rather than cubic. In the current implementation (Section 5), we use SMO. As only one point is added each time, the new QP is just a slight perturbation of the original. Hence, by using the MEB solution obtained from the previous iteration as the starting point (warm start), SMO can often converge in a small number of iterations.

4.2 Convergence to (Approximate) Optimality

First, consider ε = 0. The proof in (Bădoiu & Clarkson, 2002) does not apply, as it requires ε > 0. Nevertheless, as the number of core vectors increases by one at each iteration and the training set size is finite, CVM must terminate in a finite number (say, τ) of iterations. With ε = 0, MEB(S_τ) is an enclosing ball for all the points on termination. Because S_τ is a subset of the whole training set, and the MEB of a subset cannot be larger than the MEB of the whole set, MEB(S_τ) must also be the exact MEB of the whole (φ̃-transformed) training set. In other words, when ε = 0, CVM outputs the exact solution of the kernel problem.

Now, consider ε > 0. Assume that the algorithm terminates at the τ-th iteration. Then

  R_\tau \le r_{MEB(S)} \le (1 + \epsilon) R_\tau   (14)

by definition. Recall that the optimal primal objective p* of the kernel problem in Section 3.2.1 (or 3.2.2) is equal to the optimal dual objective d_2* in (7) (or (9)), which in turn is related to the optimal dual objective d_1* = r_{MEB(S)}² in (2) by (6). Together with (14), we can then bound p* as

  R_\tau^2 \le p^* + \tilde\kappa \le (1 + \epsilon)^2 R_\tau^2.   (15)

Hence, max(R_τ²/(p* + κ̃), (p* + κ̃)/R_τ²) ≤ (1 + ε)², and thus CVM is a (1 + ε)²-approximation algorithm. This also holds with high probability when the probabilistic speedup is used.

As mentioned in Section 1, practical SVM implementations also output approximate solutions only. Typically, a parameter similar to our ε is required at termination. For example, in SMO and SVMlight (Joachims, 1999), training stops when the KKT conditions are fulfilled within ε. Experience with these implementations indicates that near-optimal solutions are often good enough in practical applications. Moreover, it can also be shown that when the CVM terminates, all the points satisfy loose KKT conditions, as in SMO and SVMlight.

4.3 Time and Space Complexities

Existing decomposition algorithms cannot guarantee the number of iterations, and consequently the overall time complexity (Chang & Lin, 2004). In this section, we show how this can be obtained for CVM. In the following, we assume that a plain QP implementation, which takes O(m³) time and O(m²) space for m patterns, is used for the MEB sub-problem in Section 4.1.4. Moreover, we assume that each kernel evaluation takes constant time.

As proved in (Bădoiu & Clarkson, 2002), CVM converges in at most 2/ε iterations. In other words, the total number of iterations, and consequently the size of the final core-set, are of τ = O(1/ε). In practice, it has often been observed that the size of the core-set is much smaller than this worst-case theoretical upper bound (Kumar et al., 2003). This will also be corroborated by our experiments in Section 5.

Consider first the case where the probabilistic speedup of Section 4.1.2 is not used. As only one core vector is added at each iteration, |S_t| = t + 2. Initialization takes O(m) time, while the distance computations in steps 2 and 3 take O((t + 2)² + tm) = O(t² + tm) time. Finding the MEB in step 4 takes O((t + 2)³) = O(t³) time, and the other operations take constant time. Hence, the t-th iteration takes O(tm + t³) time, and the overall time for τ = O(1/ε) iterations is

  \sum_{t=1}^{\tau} O(tm + t^3) = O(\tau^2 m + \tau^4) = O\!\left( \frac{m}{\epsilon^2} + \frac{1}{\epsilon^4} \right),

which is linear in m for a fixed ε.
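For completeness, the arithmetic behind this bound is just the standard sums \sum_{t=1}^{\tau} t = \tau(\tau+1)/2 and \sum_{t=1}^{\tau} t^3 = (\tau(\tau+1)/2)^2:

  \sum_{t=1}^{\tau} (tm + t^3) = m \cdot \frac{\tau(\tau+1)}{2} + \left( \frac{\tau(\tau+1)}{2} \right)^2 = O(\tau^2 m + \tau^4), \qquad \tau = O(1/\epsilon).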
As for space⁵, since only the core vectors are involved in the QP, the space complexity for the t-th iteration is O(|S_t|²). As τ = O(1/ε), the space complexity for the whole procedure is O(1/ε²), which is independent of m for a fixed ε.

⁵As the patterns may be stored out of core, we ignore the O(m) space required for storing the m patterns.

On the other hand, when the probabilistic speedup is used, initialization only takes O(1) time, while the distance computations in steps 2 and 3 take O((t + 2)²) = O(t²) time. The time for the other operations remains the same. Hence, the t-th iteration takes O(t³) time, and the whole procedure takes

  \sum_{t=1}^{\tau} O(t^3) = O(\tau^4) = O\!\left( \frac{1}{\epsilon^4} \right).

For a fixed ε, it is thus constant, independent of m. The space complexity, which depends only on the number of iterations τ, is still O(1/ε²).

If more efficient QP solvers were used in the MEB sub-problem of Section 4.1.4, both the time and space complexities could be further improved. For example, with SMO, the space complexity for the t-th iteration is reduced to O(|S_t|), and that for the whole procedure is driven down to O(1/ε).

Note that when ε decreases, the CVM solution becomes closer to the exact optimal solution, but at the expense of higher time and space complexities. Such a tradeoff between efficiency and approximation quality is typical of all approximation schemes. Moreover, be cautioned that the O-notation is used for studying the asymptotic efficiency of algorithms. As we are interested in handling very large data sets, an algorithm that is asymptotically more efficient (in time and space) will be the best choice. However, on smaller problems, it may be outperformed by algorithms that are not as efficient asymptotically. This will be demonstrated experimentally in Section 5.

5 Experiments

In this section, we implement the two-class L2-SVM in Section 3.2.2 and illustrate the scaling behavior of CVM (in C++) on both toy and real-world data sets. For comparison, we also run the following SVM implementations⁶:

1. L2-SVM: LIBSVM implementation (in C++);
2. L2-SVM: LSVM implementation (in MATLAB), with low-rank approximation (Fine & Scheinberg, 2001) of the kernel matrix added;
3. L2-SVM: RSVM (Lee & Mangasarian, 2001) implementation (in MATLAB). The RSVM addresses the scale-up issue by solving a smaller optimization problem that involves a random m̄ × m rectangular subset of the kernel matrix. Here, m̄ is set to 10% of m;
4. L1-SVM: LIBSVM implementation (in C++);
5. L1-SVM: SimpleSVM (Vishwanathan et al., 2003) implementation (in MATLAB).

⁶Our CVM implementation can be downloaded from http://www.cs.ust.hk/~jamesk/cvm.zip. LIBSVM can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/; LSVM from http://www.cs.wisc.edu/dmi/lsvm; and SimpleSVM from http://asi.insa-rouen.fr/~gloosli/. Moreover, we followed http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html in adapting the LIBSVM package for the L2-SVM.

Parameters are used in their default settings unless otherwise specified. All experiments are performed on a 3.2GHz Pentium-4 machine with 512M RAM, running Windows XP. Since our focus is on nonlinear kernels, we use the Gaussian kernel k(x, y) = exp(−‖x − y‖²/β), with β = (1/m²) Σ_{i,j=1}^m ‖x_i − x_j‖².
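This data-dependent width can be computed without forming all m² pairwise distances, using the identity \sum_{i,j} \|x_i - x_j\|^2 = 2m \sum_i \|x_i\|^2 - 2\|\sum_i x_i\|^2. A small NumPy sketch (ours, for illustration only):

```python
import numpy as np

def default_beta(X):
    """beta = (1/m^2) sum_{i,j} ||x_i - x_j||^2, the Gaussian kernel width used here."""
    m = len(X)
    sq_norms = (X ** 2).sum(axis=1)
    # sum_{i,j} ||x_i - x_j||^2 = 2 m sum_i ||x_i||^2 - 2 ||sum_i x_i||^2
    return (2.0 * m * sq_norms.sum() - 2.0 * (X.sum(axis=0) ** 2).sum()) / (m * m)

def gaussian_kernel(A, B, beta):
    """k(x, y) = exp(-||x - y||^2 / beta); note k(x, x) = 1, so kappa = 1 in (4)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / beta)
```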
Our CVM implementation is adapted from LIBSVM, and uses SMO for each QP sub-problem in Section 4.1.4. As in LIBSVM, our CVM also uses caching (with the same cache size as in the other LIBSVM implementations above) and stores all training patterns in main memory. For simplicity, shrinking is not used in our current CVM implementation. Moreover, we employ the probabilistic speedup (Section 4.1.2) and set ε = 10⁻⁶ in all the experiments. As in other decomposition methods, the use of a very stringent stopping criterion is not necessary in practice. Preliminary studies show that ε = 10⁻⁶ is acceptable for most tasks. Using an even smaller ε does not show improved generalization performance, but may increase the training time unnecessarily.

5.1 Checkerboard Data

We first experiment on the 4 × 4 checkerboard data used by Lee and Mangasarian (2001) for evaluating large-scale SVM implementations. We use training sets with a maximum of 1 million points, and 2,000 independent points for testing. Of course, this problem does not need so many points for training, but it is convenient for illustrating the scaling properties. Experimentally, the L2-SVM with low-rank approximation does not yield satisfactory performance on this data set, and so its result is not reported here. RSVM, on the other hand, has to keep a rectangular kernel matrix of size m̄ × m and cannot be run on our machine when m exceeds 10K. Similarly, the SimpleSVM has to store the kernel matrix of the active set, and runs into storage problems when m exceeds 30K.

As can be seen from Figure 2, CVM is as accurate as the others. Besides, it is much faster⁷ and produces far fewer support vectors (which implies faster testing) on large data sets. In particular, one million patterns can be processed in under 13s. On the other hand, for relatively small training sets with fewer than 10K patterns, LIBSVM is faster. This, however, is to be expected, as LIBSVM uses more sophisticated heuristics and so will be more efficient on small- to medium-sized data sets. Figure 2(b) also shows the core-set size, which can be seen to be small, and its curve basically overlaps with that of the CVM. Thus, almost all the core vectors are useful support vectors. Moreover, it also confirms our theoretical finding that both time and space are constant w.r.t. the training set size, when it is large enough.

⁷As some implementations are in MATLAB, not all the CPU time measurements can be directly compared. However, it is still useful to note the constant scaling exhibited by the CVM and its speed advantage over the other C++ implementations when the data set is large.
Figure 2: Results on the checkerboard data set: (a) CPU time; (b) number of SV's; (c) testing error. Except for the CVM, all the other implementations have to terminate early because of insufficient memory and/or excessive training time. Note that the CPU time, the number of support vectors, and the size of the training set are in log scale.

5.2 Forest Cover Type Data⁸

This data set has been used for large-scale SVM training by Collobert et al. (2002). Following (Collobert et al., 2002), we aim at separating class 2 from the other classes. 1%–90% of the whole data set (with a maximum of 522,911 patterns) is used for training, while the remaining patterns are used for testing. We set β = 10000 for the Gaussian kernel. Preliminary studies show that the number of support vectors is over ten thousand. Consequently, RSVM and SimpleSVM cannot be run on our machine. Similarly, for the low-rank approximation, preliminary studies show that thousands of basis vectors are required for a good approximation. Therefore, only the two LIBSVM implementations will be compared with the CVM here.

⁸http://kdd.ics.uci.edu/databases/covertype/covertype.html

Figure 3 shows that CVM is, again, as accurate as the others. Note that when the training set is small, more training patterns bring in additional information useful for classification, and so the number of core vectors increases with the training set size. However, after processing around 100K patterns, both the time and space requirements of CVM begin to exhibit a constant scaling with the training set size. With hindsight, one might simply sample 100K training patterns and hope to obtain comparable results⁹. However, for satisfactory classification performance, different problems require samples of different sizes, and CVM has the important advantage that the required sample size does not have to be pre-specified. Without such prior knowledge, random sampling gives poor testing results, as has been demonstrated in (Lee & Mangasarian, 2001).

⁹In fact, we tried both LIBSVM implementations on a random sample of 100K training patterns, but their testing accuracies are inferior to that of CVM.

Figure 3: Results on the forest cover type data set: (a) CPU time; (b) number of SV's; (c) testing error. Note that the y-axes in (a) and (b) are in log scale.

5.3 Relatively Small Data Sets: UCI Adult Data¹⁰

Following (Platt, 1999), we use training sets with up to 32,562 patterns. As can be seen in Figure 4, CVM is still among the most accurate methods. However, as this data set is relatively small, more training patterns do carry more classification information. Hence, as discussed in Section 5.2, the number of iterations, the core-set size and consequently the CPU time all increase with the number of training patterns. From another perspective, recall that the worst-case core-set size is 2/ε, independent of m (Section 4.3). For the value of ε = 10⁻⁶ used here, 2/ε = 2 × 10⁶. Although we have seen that the actual size of the core-set is often much smaller than this worst-case value, when m ≪ 2/ε the number of core vectors can still depend on m. Moreover, as has been observed in Section 5.1, the CVM is slower than the more sophisticated LIBSVM on these smaller data sets.

¹⁰http://research.microsoft.com/users/jplatt/smo.html

Figure 4: Results on the UCI adult data set: (a) CPU time; (b) number of SV's; (c) testing error. The other implementations have to terminate early because of insufficient memory and/or excessive training time. Note that the CPU time, the number of SV's and the size of the training set are in log scale.

6 Conclusion

In this paper, we exploit the "approximateness" in SVM implementations. We formulate kernel methods as equivalent MEB problems, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. The proposed CVM procedure is simple, and does not require the sophisticated heuristics used in other decomposition methods. Moreover, despite its simplicity, CVM has small asymptotic time and space complexities. In particular, for a fixed ε, its asymptotic time complexity is linear in the training set size m, while its space complexity is independent of m. When the probabilistic speedup is used, it even has constant asymptotic time and space complexities for a fixed ε, independent of the training set size m. Experimentally, on large data sets, it is much faster and produces far fewer support vectors (and thus allows faster testing) than existing methods. On the other hand, on relatively small data sets where m ≪ 2/ε, SMO can be faster. CVM can also be used for other kernel methods, such as support vector regression; details will be reported elsewhere.

References

Bădoiu, M., & Clarkson, K. (2002). Optimal core-sets for balls.
DIMACS Workshop on Computational Geometry.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.



Chang, C.-C., & Lin, C.-J. (2004). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.

Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14, 1105–1114.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representation. Journal of Machine Learning Research, 2, 243–264.

Garey, M., & Johnson, D. (1979). Computers and intractability: A guide to the theory of NP-completeness. W.H. Freeman.

Joachims, T. (1999). Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods – Support vector learning, 169–184. Cambridge, MA: MIT Press.

Kumar, P., Mitchell, J., & Yildirim, A. (2003). Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics, 8.

Lee, Y.-J., & Mangasarian, O. (2001). RSVM: Reduced support vector machines. Proceedings of the First SIAM International Conference on Data Mining.

Lin, K.-M., & Lin, C.-J. (2003). A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14, 1449–1459.

Mangasarian, O., & Musicant, D. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: an application to face detection. Proceedings of Computer Vision and Pattern Recognition (pp. 130–136). San Juan, Puerto Rico.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods – Support vector learning, 185–208. Cambridge, MA: MIT Press.

Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., & Williamson, R. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443–1471.

Smola, A., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 911–918). Stanford, CA, USA.

Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.

Tax, D., & Duin, R. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.

Vishwanathan, S., Smola, A., & Murty, M. (2003). SimpleSVM. Proceedings of the Twentieth International Conference on Machine Learning (pp. 760–767). Washington, D.C., USA.

Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.