Streamed Learning: One-Pass SVMs

Piyush Rai, Hal Daumé III, Suresh Venkatasubramanian
University of Utah, School of Computing
{piyush,hal,suresh}@cs.utah.edu

Abstract

We present a streaming model for large-scale classification (in the context of the ℓ2-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The ℓ2-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB) problem, and an efficient algorithm based on the idea of core sets exists (CVM) [Tsang et al., 2005]. CVM learns a (1+ε)-approximate MEB for a set of points and yields an approximate solution to the corresponding SVM instance. However, CVM works in batch mode, requiring multiple passes over the data. This paper presents a single-pass SVM which is based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector in a way similar to using online stochastic gradient updates. Our algorithm performs polylogarithmic computation at each example, and requires very small and constant storage. Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and get accuracies comparable to other state-of-the-art SVM solvers (batch and online). We also give an analysis of the algorithm, and discuss some open issues and possible extensions.

1 Introduction

Learning in a streaming model poses the restriction that we are constrained both in terms of time and storage. Such scenarios are quite common, for example when analyzing network traffic data that arrives in a streamed fashion at a very high rate. The streaming model also applies to disk-resident datasets that are too large to be stored in memory. Unfortunately, standard learning algorithms do not scale well in such cases. To address such scenarios, we propose applying the stream model of computation [Muthukrishnan, 2005] to supervised learning problems. In the stream model, we are allowed only one pass (or a small number of passes) over an ordered data set, with polylogarithmic storage and polylogarithmic computation per element.

In spite of the severe limitations imposed by the streaming framework, streaming algorithms have been successfully employed in many different domains [Guha et al., 2003]. Many problems in geometry can be adapted to the streaming setting, and since many learning problems have equivalent geometric formulations, streaming algorithms naturally motivate the development of efficient techniques for solving (or approximating) large-scale batch learning problems.

In this paper, we study the application of the stream model to the problem of maximum-margin classification, in the context of ℓ2-SVMs [Vapnik, 1998; Cristianini and Shawe-Taylor, 2000]. Since the support vector machine is a widely used classification framework, we believe success here will encourage further research into other frameworks. SVMs are known to have a natural formulation in terms of the minimum enclosing ball problem in a high dimensional space [Tsang et al., 2005; 2007]. This latter problem has been extensively studied in the computational geometry literature and admits natural streaming algorithms [Zarrabi-Zadeh and Chan, 2006; Agarwal et al., 2004]. We adapt these algorithms to the classification setting, provide some extensions, and outline some open issues. Our experiments show that we can learn efficiently in just one pass and get competitive classification accuracies on synthetic and real datasets.

2 Scaling up SVM Training

Support Vector Machines (SVMs) are maximum-margin kernel-based linear classifiers [Cristianini and Shawe-Taylor, 2000] that are known to provide provably good generalization bounds [Vapnik, 1998]. Traditional SVM training is formulated as a quadratic program (QP), which is typically optimized by a numerical solver. For a training set of N points, the typical time complexity is O(N³) and the storage required is O(N²); such requirements make SVMs prohibitively expensive for large-scale applications. Typical approaches to large-scale SVMs, such as chunking [Vapnik, 1998], decomposition methods [Chang and Lin, 2001] and SMO [Platt, 1999], work by dividing the original problem into smaller subtasks or by scaling down the training data in some manner [Yu et al., 2003; Lee and Mangasarian, 2001]. However, these approaches are typically heuristic in nature: they may converge very slowly and do not provide rigorous guarantees on training complexity [Tsang et al., 2005].

There has been a recent surge of interest in the online learning literature on SVMs, due to the success of various gradient descent approaches such as stochastic gradient methods [Zhang, 2004] and stochastic sub-gradient methods [Shalev-Shwartz et al., 2007]. These methods solve the SVM optimization problem iteratively in steps, are quite efficient, and have very small computational requirements. Another recent online algorithm, LASVM [Bordes et al., 2005], combines online learning with active sampling and yields considerably good performance with a single pass (or more passes) over the data. However, although fast and easy to train, most of the stochastic gradient based approaches do not suffice with a single pass over the data; they usually require several iterations before converging to a reasonable solution.

3 Two-Class Soft Margin SVM as the MEB Problem

A minimum enclosing ball (MEB) instance is defined by a set of points x_1, ..., x_N ∈ R^D and a metric d : R^D × R^D → R^{≥0}. The goal is to find a point (the center) c ∈ R^D that minimizes the radius R = max_n d(x_n, c).

The 2-class ℓ2-SVM [Tsang et al., 2005] is defined by a hypothesis f(x) = w^T ϕ(x), and a training set consisting of N points {z_n = (x_n, y_n)}, n = 1, ..., N, with y_n ∈ {−1, 1} and x_n ∈ R^D. The primal of the two-class ℓ2-SVM (we consider the unbiased case; the extension is straightforward) can be written as

  min_{w,ξ}  ||w||² + C Σ_{i=1}^{N} ξ_i²   (1)
  s.t.  y_i (w^T ϕ(x_i)) ≥ 1 − ξ_i,  i = 1, ..., N   (2)

The only difference between the ℓ2-SVM and the standard SVM is that the penalty term has the form (C Σ_n ξ_n²) rather than (C Σ_n ξ_n).

We assume a kernel K with an associated nonlinear feature map ϕ. We further assume that K has the property K(x, x) = κ, where κ is a fixed constant [Tsang et al., 2005]. Most standard kernels, such as the isotropic, dot product (normalized inputs), and normalized kernels, satisfy this criterion.

Suppose we replace the mapping ϕ(x_n) on x_n by another nonlinear mapping ϕ̃(z_n) on z_n such that (for the unbiased case)

  ϕ̃(z_n) = [y_n ϕ(x_n); C^{−1/2} e_n]   (3)

The mapping is done in a way that the label information y_n is subsumed in the new feature map ϕ̃ (essentially converting a supervised learning problem into an unsupervised one). The first term in the mapping corresponds to the feature term and the second term accounts for a regularization effect, where C is the misclassification cost. Here e_n is a vector of dimension N with all entries zero, except the n-th entry, which is equal to one.

It was shown in [Tsang et al., 2005] that the MEB instance (ϕ̃(z_1), ϕ̃(z_2), ..., ϕ̃(z_N)), with the metric defined by the induced inner product, is dual to the corresponding ℓ2-SVM instance (1). The weight vector w of the maximum margin hypothesis can then be obtained from the center c of the MEB using the constraints induced by the Lagrangian [Tsang et al., 2007].
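To make the ℓ2-SVM-to-MEB reduction concrete for the linear-kernel case (ϕ(x) = x), the following sketch builds the augmented points ϕ̃(z_n) of equation (3) explicitly and reads a weight vector off the first D coordinates of a ball center. The function names and the NumPy layout are illustrative assumptions of ours, not code from the paper.

```python
import numpy as np

def build_augmented_points(X, y, C):
    """Map each labelled example (x_n, y_n) to
    phi_tilde(z_n) = [y_n * x_n ; C^(-1/2) * e_n]   (linear kernel, unbiased case).
    The minimum enclosing ball of the returned (N, D + N) array is dual to the
    two-class L2-SVM instance on (X, y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, D = X.shape
    Z = np.zeros((N, D + N))
    Z[:, :D] = y[:, None] * X                        # feature part with the label folded in
    Z[np.arange(N), D + np.arange(N)] = C ** -0.5    # mutually orthogonal regularization coordinates
    return Z

def weight_from_center(center, D):
    """Up to the scaling fixed by the Lagrangian of the MEB dual, the SVM weight
    vector is read off the first D coordinates of the ball's center."""
    return np.asarray(center, dtype=float)[:D]
```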

4 Approximate and Streaming MEBs

The minimum enclosing ball problem has been extensively studied in the computational geometry literature. An instance of MEB, with a metric defined by an inner product, can be solved using quadratic programming [Boyd and Vandenberghe, 2004]. However, this becomes prohibitively expensive as the dimensionality and cardinality of the data increase; for an N-point SVM instance in D dimensions, the resulting MEB instance consists of N points in N + D dimensions.

Thus, attention has turned to efficient approximate solutions for the MEB. A δ-approximate solution to the MEB (δ > 1) is a point c such that max_n d(x_n, c) ≤ δR*, where R* is the radius of the true MEB solution. For example, a (1 + ε)-approximation for the MEB can be obtained by extracting a very small subset (of size O(1/ε)) of the input called a core-set [Agarwal et al., 2005], and running an exact MEB algorithm on this set [Bădoiu and Clarkson, 2002]. This is the method originally employed in the CVM [Tsang et al., 2005]. [Har-Peled et al., 2007] take a more direct approach, constructing an explicit core set for the (approximate) maximum-margin hyperplane, without relying on the MEB formulation. Both these algorithms take linear training time and require very small storage. Note that a δ-approximation for the MEB directly yields a δ-approximation for the regularized cost function associated with the SVM problem.

Unfortunately, the core-set approach cannot be adapted to a streaming setting, since it requires O(1/ε) passes over the training data. Two one-pass streaming algorithms for the MEB problem are known. The first [Agarwal et al., 2004] finds a (1 + ε)-approximation using O((1/ε)^{D/2}) storage and O((1/ε)^{D/2} N) time. Unfortunately, the exponential dependence on D makes this algorithm impractical. At the other end of the space-approximation tradeoff, the second algorithm [Zarrabi-Zadeh and Chan, 2006] stores only the center and the radius of the current ball, requiring O(D) space. This algorithm yields a 3/2-approximation to the optimal enclosing ball radius.

4.1 The StreamSVM Algorithm

We adapt the algorithm of [Zarrabi-Zadeh and Chan, 2006] for computing an approximate maximum margin classifier. The algorithm initializes with a single point (and therefore an MEB of radius zero). When a new point is read off the stream, the algorithm checks whether or not the current MEB can enclose this point. If so, the point is discarded. If not, the point is used to suitably update the center and radius of the current MEB. All such selected points define a core set of the original point set.

Let p_i be the input point causing an update to the MEB and B_i be the resulting ball after the update. From Figure 1, it is easy to verify that the new center c_i lies on the line joining the old center c_{i−1} and the new point p_i. The radius r_i and the center c_i of the resulting MEB can be defined by simple update equations:

  r_i = r_{i−1} + δ_i   (4)
  ||c_i − c_{i−1}|| = δ_i   (5)

Here 2δ_i = ||p_i − c_{i−1}|| − r_{i−1} is the closest distance of the new point p_i from the old ball B_{i−1}. Using these, we can define a closed-form analytical update equation for the new ball B_i:

  c_i = c_{i−1} + (δ_i / ||p_i − c_{i−1}||) (p_i − c_{i−1})   (6)

Figure 1: Ball updates

It can be shown that, for adversarially constructed data, the radius of the MEB computed by the algorithm has a lower bound of (1 + √2)/2 and a worst-case upper bound of 3/2 [Zarrabi-Zadeh and Chan, 2006].

We adapt these updates in a natural way in the augmented feature space ϕ̃ (see Algorithm 1). Each selected point belongs to the core set for the MEB, and the support vectors of the corresponding SVM instance come from this set. It is easy to verify that the update equations for the weight vector (w) and the margin (R) in StreamSVM correspond to the center and radius updates for the ball in equations 6 and 4, respectively. The ξ² term in the distance calculation is included to account for the fact that the distance computations are being done in the (D + N)-dimensional augmented feature space ϕ̃ which, for the linear kernel case, is given by

  ϕ̃(z_n) = [y_n x_n; C^{−1/2} e_n].   (7)

Algorithm 1 StreamSVM
1: Input: examples (x_n, y_n), n = 1, ..., N; slack parameter C
2: Output: weights (w), radius (R), number of support vectors (M)
3: Initialize: M = 1; R = 0; ξ² = 1; w = y_1 x_1
4: for n = 2 to N do
5:   Compute distance to center: d² = ||w − y_n x_n||² + (ξ² + 1)/C
6:   if d ≥ R then
7:     w = w + ½(1 − R/d)(y_n x_n − w)
8:     R = R + ½(d − R)
9:     ξ² = ξ²(1 − ½(1 − R/d))² + (½(1 − R/d))²
10:    M = M + 1
11:  end if
12: end for

Also note that, because we perform only a single pass over the data and the e_n components are all mutually orthogonal, we never need to explicitly store them. The number of updates to the weight vector is limited by the number of core vectors of the MEB, which we have experimentally found to be much smaller than for other algorithms (such as the Perceptron). The space complexity of StreamSVM is small, since only the weight vector and the radius need be stored.
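A minimal runnable sketch of Algorithm 1 for the linear-kernel case is given below. It follows the updates as reconstructed above; in particular, we apply the same mixing factor ½(1 − R/d) to the feature part and to the implicit e_n bookkeeping, and we track that bookkeeping in the scaled form that makes the (ξ² + 1)/C distance term consistent with the initialization ξ² = 1. The function name stream_svm and the NumPy details are ours, purely for illustration.

```python
import numpy as np

def stream_svm(stream, C):
    """One-pass StreamSVM sketch (Algorithm 1, linear kernel).
    `stream` yields (x_n, y_n) pairs with y_n in {-1, +1}."""
    it = iter(stream)
    x1, y1 = next(it)
    w = y1 * np.asarray(x1, dtype=float)   # feature part of the ball's center
    R = 0.0                                # current ball radius
    xi2 = 1.0                              # bookkeeping for the implicit e_n coordinates
    M = 1                                  # number of updates (core vectors)
    for x, y in it:
        p = y * np.asarray(x, dtype=float)
        # distance to the center in the (D + N)-dimensional augmented space
        d = np.sqrt(np.dot(w - p, w - p) + (xi2 + 1.0) / C)
        if d >= R:                         # point falls outside the current ball
            a = 0.5 * (1.0 - R / d)        # mixing factor from the MEB center update
            w = w + a * (p - w)
            R = R + 0.5 * (d - R)
            xi2 = xi2 * (1.0 - a) ** 2 + a ** 2
            M += 1
    return w, R, M

# Prediction with the learned (unbiased) hyperplane:
#   y_hat = np.sign(X_test @ w)
```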

4.2 Kernelized StreamSVM

Although our main exposition and experiments are with linear kernels, it is straightforward to extend the algorithm to nonlinear kernels. In that case, Algorithm 1, instead of storing the weight vector w, stores an N-dimensional vector of Lagrange coefficients α initialized as [y_1, 0, ..., 0]. The distance computation in line 5 is replaced by

  d² = Σ_{l,m} α_l α_m k(x_l, x_m) + k(x_n, x_n) − 2 y_n Σ_m α_m k(x_m, x_n) + (ξ² + 1)/C,

and the weight vector update in line 7 is replaced by the Lagrange coefficient updates α_{1:n−1} = α_{1:n−1}(1 − ½(1 − R/d)) and α_n = ½(1 − R/d) y_n.

4.3 StreamSVM approximation bounds and extension to multiple balls

It was shown in [Zarrabi-Zadeh and Chan, 2006] that any streaming MEB algorithm that uses only O(D) storage obtains a lower bound of (1 + √2)/2 and an upper bound of 3/2 on the quality of the solution (i.e., the radius of the final MEB). Clearly, this is a conservative approximation and would affect the obtained margin of the resulting SVM classifier (and hence the classification performance). In order to do better in just a single pass, one possible conjecture is that the algorithm must remember more. To this end, we extended Algorithm 1 to simultaneously store L weight vectors (or "balls"). The space complexity of this algorithm is L(D + 1) floats and it still makes only a single pass over the data. In the MEB setting, our algorithm chooses, with each arriving data point (that is not already enclosed in any of the balls), how the current L + 1 balls (the L balls plus the new data point) should be merged, resulting again in a set of L balls. At the end, the final set of L balls is merged to give the final MEB. A special variant of the L-balls case is when all but one of the L balls have zero radius. This amounts to storing one ball of non-zero radius and keeping a buffer of L data points (we call this the lookahead algorithm, Algorithm 2). Any incoming point, if not already enclosed in the current ball, is stored in the buffer. We solve the MEB problem (using a quadratic program of size L) whenever the buffer is full. Note that Algorithm 1 is a special case of Algorithm 2 with L = 1, with the MEB updates available in closed analytical form (rather than having to solve a QP).

Algorithm 2 StreamSVM with lookahead L
Input: examples (x_n, y_n), n = 1, ..., N; slack parameter C; lookahead parameter L ≥ 1
Output: weights (w), radius (R), upper bound on number of support vectors (M)
1: Initialize: M = 1; R = 0; ξ² = 1; S = ∅; w = y_1 x_1
2: for n = 2 to N do
3:   Compute distance to center: d² = ||w − y_n x_n||² + (ξ² + 1)/C
4:   if d ≥ R then
5:     Add example n to the active set: S = S ∪ {y_n x_n}
6:     if |S| = L then
7:       Update w, R, ξ² to enclose the ball (w, R, ξ²) and all points in S
8:       M = M + L; S = ∅
9:     end if
10:  end if
11: end for
12: if |S| > 0 then
13:   Update w, R, ξ² to enclose the ball (w, R, ξ²) and all points in S
14:   M = M + |S|
15: end if

Algorithm 1 takes linear time in the input size. Algorithm 2, which uses a lookahead of L, solves a quadratic program of size L whenever the buffer gets full. This step takes O(L³) time. The number of such updates is O(N/L) (in practice, it is considerably less than N/L), and thus the overall complexity for the lookahead case is O(NL²). For small lookaheads, this is roughly O(N).
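As a companion sketch for the kernelized variant of Section 4.2, the code below never forms w explicitly: the center is kept as coefficients over the stored core vectors, so only kernel evaluations are needed. The incremental caching of the center's squared norm and the same (ξ² + 1)/C bookkeeping as above are our own illustrative choices, not taken from a released implementation.

```python
import numpy as np

def kernel_stream_svm(stream, C, k):
    """Kernelized StreamSVM sketch (Section 4.2): the center of the ball is
    kept implicitly as coefficients `alpha` over the stored core vectors."""
    it = iter(stream)
    x1, y1 = next(it)
    sv = [np.asarray(x1, dtype=float)]     # stored core (support) vectors
    alpha = [float(y1)]                    # their coefficients
    cc = k(sv[0], sv[0])                   # <center, center> in feature space
    R, xi2 = 0.0, 1.0
    for x, y in it:
        x = np.asarray(x, dtype=float)
        kxx = k(x, x)
        cx = sum(a * k(s, x) for a, s in zip(alpha, sv))   # <center, phi(x)>
        d = np.sqrt(cc + kxx - 2.0 * y * cx + (xi2 + 1.0) / C)
        if d >= R:
            m = 0.5 * (1.0 - R / d)        # mixing factor, as in Algorithm 1
            cc = (1 - m) ** 2 * cc + 2 * (1 - m) * m * y * cx + m ** 2 * kxx
            alpha = [(1 - m) * a for a in alpha] + [m * y]
            sv.append(x)
            R += 0.5 * (d - R)
            xi2 = xi2 * (1 - m) ** 2 + m ** 2
    return sv, alpha, R

# Example kernel (RBF with bandwidth 0.5, an arbitrary illustrative choice):
#   k = lambda a, b: np.exp(-np.dot(a - b, a - b) / (2 * 0.5 ** 2))
```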

5 Experiments

We evaluate our algorithm on several synthetic and real datasets and compare it against several state-of-the-art SVM solvers. We use three criteria for evaluation: a) single-pass classification accuracies compared against a single pass of online SVM solvers such as the iterative sub-gradient solver Pegasos [Shalev-Shwartz et al., 2007], LASVM [Bordes et al., 2005], and the Perceptron [Rosenblatt, 1988]; b) comparison with CVM [Tsang et al., 2005], which is a batch SVM algorithm based on the MEB formulation; c) the effect of using lookahead in StreamSVM. For fairness, all the algorithms used a linear kernel.

5.1 Single-Pass Classification Accuracies

The single-pass classification accuracies of StreamSVM and other online SVM solvers are shown in Table 1, along with details of the datasets used. To get a sense of how good the single-pass approximation of our algorithm is, we also report the classification accuracies of the batch-mode (i.e., all data in memory, and multiple passes) libSVM solver with a linear kernel on all the datasets. The results suggest that our single-pass algorithm StreamSVM, using a small reasonable lookahead, performs comparably to the batch-mode libSVM, and does significantly better than a single pass of the other online SVM solvers.

Data Set        Dim   Train    Test    libSVM   Perceptron  Pegasos  Pegasos  LASVM   StreamSVM  StreamSVM
                                       (batch)              (k=1)    (k=20)           (Algo-1)   (Algo-2)
Synthetic A     2     20,000   200     96.5     95.5        83.8     89.9     96.5    95.5       97.0
Synthetic B     3     20,000   200     66.0     68.0        57.05    65.85    64.5    64.4       68.5
Synthetic C     5     20,000   200     93.2     77.0        55.0     73.2     68.0    73.1       87.5
Waveform        21    4000     1000    89.4     72.5        77.34    78.12    77.6    74.3       78.4
MNIST (0vs1)    784   12,665   2115    99.52    99.47       95.06    99.48    98.82   99.34      99.71
MNIST (8vs9)    784   11,800   1983    96.57    95.9        69.41    90.62    90.32   84.75      94.7
IJCNN           22    35,000   91,701  91.64    64.82       67.35    88.9     74.27   85.32      87.81
w3a             300   44,837   4912    98.29    89.27       57.36    87.28    96.95   88.56      89.06

Table 1: Single-pass classification accuracies of various algorithms (all using a linear kernel). The synthetic datasets (A, B, C) were generated using normally distributed clusters, and were of about 85% separability. libSVM, used as the absolute benchmark, was run in batch mode (all data in memory). StreamSVM Algo-2 used a small lookahead (∼10). Note: we make the Pegasos implementation do a single sweep over the data with a user-chosen block size k for subgradient computations (we used k=1 and k=20, the latter akin to using a lookahead of 20). Perceptron and LASVM are also run for a single pass and do not need block sizes to be specified. All results are averaged over 20 runs (w.r.t. random orderings of the stream).

5.2 Comparison with CVM

We compared our algorithm with CVM which, like our algorithm, is based on an MEB formulation. CVM is highly efficient for large datasets but it operates in batch mode, making one pass through the data for each core vector. We are interested in knowing how many passes the CVM must make over the data before it achieves an accuracy comparable to our streaming algorithm. For that purpose, we compared the accuracy of our single-pass StreamSVM against two and more passes of CVM to see how long it takes for CVM to beat StreamSVM (we note here that CVM requires at least two passes over the data to return a solution). We used a linear kernel for both. Shown in Figure 2 are the results on the MNIST 8vs9 data: it turns out that it takes several hundreds of passes of CVM to beat the single-pass accuracy of StreamSVM. Similar results were obtained for other datasets but are not reported here due to space limitations.

Figure 2: MNIST 8vs9 data: number of passes CVM takes before achieving the comparable single-pass accuracy of StreamSVM. The X axis represents the number of passes of CVM and the Y axis represents classification accuracy.

5.3 Effect of Lookahead

We also investigated the effect of doing higher-order lookaheads on the data. For this, we varied L (the lookahead parameter) and, for each L, tested Algorithm 2 on 100 random permutations of the data stream order, also recording the standard deviation of the classification accuracies with respect to the data-order permutations. Note that the algorithm still performs a single pass over the data. Figure 3 shows the results on the MNIST 8vs9 data (similar results were obtained for other datasets but are not shown due to space limitations). In this figure, we see two effects. Firstly, as the lookahead increases, performance goes up. This is to be expected since, in the limit, as the lookahead approaches the data set size, we will solve the exact MEB problem (albeit at a high computational cost). The important thing to note here is that even with a small lookahead of 10, the performance converges. Secondly, we see that the standard deviation of the result decreases as the lookahead increases. This shows experimentally that higher lookaheads make the algorithm less susceptible to badly ordered data. This is interesting from an empirical perspective.

Figure 3: Single pass with varying lookahead on MNIST 8vs9 data: performance w.r.t. random ordering of the stream. The X axis represents the lookahead parameter and the Y axis represents classification accuracy. Vertical bars represent the standard deviations in accuracies for a given lookahead.

The worst-case guarantees, however, are the same for the lookahead algorithm as for the no-lookahead algorithm. To obtain the 3/2 upper-bound result, one can show a construction nearly identical to that of [Zarrabi-Zadeh and Chan, 2006], where L − 1 points are packed in a small, carefully constructed cloud near the boundary of the true MEB.

Alternatively, one can analyze these algorithms in the random stream setting. Here, the input points are chosen adversarially, but their order is permuted randomly. The lookahead model is not strengthened in this setting either: we can show both that the lower bound for no-lookahead algorithms, as well as the 3/2 upper bound for the specific no-lookahead algorithm described, generalize. For the former, see Figure 4. We place (N − 1)/2 points around (0, 1), (N − 1)/2 points around (0, −1), and one point at (1 + √2, 0).
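As an illustration of this configuration (our own, with the far point arriving last rather than at a random position in the stream), the small simulation below runs the simple center/radius update of [Zarrabi-Zadeh and Chan, 2006] on the point set just described and compares the resulting radius to the exact MEB radius; for this ordering the ratio comes out at (1 + √2)/2 ≈ 1.207, matching the lower bound.

```python
import numpy as np

def streaming_meb(points):
    """One-pass MEB update of [Zarrabi-Zadeh and Chan, 2006]:
    keep only the current center and radius."""
    it = iter(points)
    c = np.asarray(next(it), dtype=float)
    r = 0.0
    for p in it:
        p = np.asarray(p, dtype=float)
        d = np.linalg.norm(p - c)
        if d > r:                     # point falls outside the current ball
            a = 0.5 * (1.0 - r / d)
            c = c + a * (p - c)
            r = r + 0.5 * (d - r)
    return c, r

# Adversarial configuration: two clouds at (0, 1) and (0, -1), plus one far point.
far = np.array([1.0 + np.sqrt(2.0), 0.0])
cloud = [np.array([0.0, 1.0]), np.array([0.0, -1.0])] * 200
c_stream, r_stream = streaming_meb(cloud + [far])   # far point arrives last

# Exact MEB: by symmetry its center lies on the x-axis, so a grid search suffices.
distinct = np.array([[0.0, 1.0], [0.0, -1.0], [1.0 + np.sqrt(2.0), 0.0]])
xs = np.linspace(0.0, far[0], 10001)
r_opt = min(np.max(np.linalg.norm(distinct - np.array([x, 0.0]), axis=1)) for x in xs)

print(r_stream / r_opt)   # ~1.207, i.e. (1 + sqrt(2)) / 2
```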

A natural extension of our approach is to replace the minimum enclosing ball with a minimum volume enclosing ellipsoid (MVE) [Kumar et al., 2005]. An ellipsoid in R^D is defined as {x : (x − c)^T A (x − c) ≤ 1}, where c ∈ R^D, A ∈ R^{D×D}, and A is positive semi-definite.

Note that a ball, upon inclusion of a new point, expands equally in all dimensions, which may be unnecessary. An ellipsoid, on the other hand, can have several axes and scales of variation (modulated by the matrix A). This allows the ellipsoid to expand only along those directions where it is needed. In addition, such an approach can also be seen along the lines of confidence-weighted linear classifiers [Dredze et al., 2008]. The confidence-weighted (CW) method assumes a Gaussian distribution over the space of weight vectors and updates the mean and covariance parameters upon witnessing each incoming example. Just as CW maintains the model's uncertainty using a Gaussian, an ellipsoid generalization can model the uncertainty using the covariance matrix A. Recent work has shown that there exist streaming possibilities for the MVE [Mukhopadhyay and Greene, 2008]. The approximation guarantees, however, are very conservative. It would be interesting to come up with improved streaming algorithms for the MVE case and adapt them for classification settings.

7 Conclusion

Within the streaming framework for learning, we have presented an efficient, single-pass ℓ2-SVM learning algorithm using a streaming algorithm for the minimum enclosing ball problem. We have also extended this algorithm to use a lookahead to increase robustness against poorly ordered data. Our algorithm, StreamSVM, satisfies a proven theoretical bound: it provides a 3/2-approximation to the optimal solution. Despite this conservative bound, our algorithm is experimentally competitive with alternative techniques in terms of accuracy, and learns much simpler solutions. We believe that a careful study of stream-based learning would lead to high-quality scalable solutions for other classification problems, possibly with alternative losses and with tighter approximation bounds.

References

[Agarwal et al., 2004] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating extent measures of points. Journal of the ACM, 51:606–635, 2004.

[Agarwal et al., 2005] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric approximations via coresets. Combinatorial and Computational Geometry, MSRI Publications, 52:1–30, 2005.

[Bordes et al., 2005] Antoine Bordes, Seyda Ertekin, Jason Weston, and Leon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6, 2005.

[Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[Bădoiu and Clarkson, 2002] Mihai Bădoiu and Kenneth L. Clarkson. Optimal core-sets for balls. In Proc. of DIMACS Workshop on Computational Geometry, 2002.

[Chang and Lin, 2001] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.

[Cristianini and Shawe-Taylor, 2000] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[Dredze et al., 2008] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In Proc. ICML, pages 264–271, New York, NY, USA, 2008. ACM.

[Guha et al., 2003] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.

[Har-Peled et al., 2007] Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. In IJCAI, 2007.

[Kumar et al., 2005] P. Kumar and E. A. Yildirim. Minimum volume enclosing ellipsoids and core sets. Journal of Optimization Theory and Applications, 126:1–21, 2005.

[Lee and Mangasarian, 2001] Yuh-Jye Lee and Olvi L. Mangasarian. RSVM: Reduced support vector machines. In Proc. of Symposium on Data Mining (SDM), 2001.

[Mukhopadhyay and Greene, 2008] Asish Mukhopadhyay and Eugene Greene. A streaming algorithm for computing an approximate minimum spanning ellipse. In 18th Fall Workshop on Computational Geometry, 2008.

[Muthukrishnan, 2005] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.

[Platt, 1999] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pages 185–208, Cambridge, MA, USA, 1999. MIT Press.

[Rosenblatt, 1988] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. MIT Press, 1988.

[Shalev-Shwartz et al., 2007] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. ICML, pages 807–814, New York, NY, USA, 2007. ACM Press.

[Tsang et al., 2005] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6, 2005.

[Tsang et al., 2007] Ivor W. Tsang, Andras Kocsor, and James T. Kwok. Simpler core vector machines with enclosing balls. In Proc. ICML, pages 911–918, New York, NY, USA, 2007. ACM Press.

[Vapnik, 1998] Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.

[Yu et al., 2003] Hwanjo Yu, Jiong Yang, and Jiawei Han. Classifying large data sets using SVMs with hierarchical clusters. In Proc. ACM KDD, pages 306–315, New York, NY, USA, 2003. ACM Press.

[Zarrabi-Zadeh and Chan, 2006] Hamid Zarrabi-Zadeh and Timothy M. Chan. A simple streaming algorithm for minimum enclosing balls. In Proc. of Canadian Conference on Computational Geometry (CCCG), 2006.

[Zhang, 2004] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proc. ICML, page 116, New York, NY, USA, 2004. ACM.
