Streamed Learning: One-Pass SVMs

Piyush Rai, Hal Daumé III, Suresh Venkatasubramanian
University of Utah, School of Computing
{piyush,hal,suresh}@cs.utah.edu

Abstract

We present a streaming model for large-scale classification (in the context of the ℓ2-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The ℓ2-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB) problem, and an efficient algorithm based on the idea of core sets exists (CVM) [Tsang et al., 2005]. CVM learns a (1+ε)-approximate MEB for a set of points and yields an approximate solution to the corresponding SVM instance. However, CVM works in batch mode, requiring multiple passes over the data. This paper presents a single-pass SVM which is based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector in a way similar to using online stochastic gradient updates. Our algorithm performs polylogarithmic computation at each example, and requires very small and constant storage. Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and get accuracies comparable to other state-of-the-art SVM solvers (batch and online). We also give an analysis of the algorithm, and discuss some open issues and possible extensions.

1 Introduction

Learning in a streaming model poses the restriction that we are constrained both in terms of time and storage. Such scenarios are quite common, for example when analyzing network traffic data that arrives in a streamed fashion at a very high rate. The streaming model also applies to disk-resident datasets that are too large to be stored in memory. Unfortunately, standard learning algorithms do not scale well in such cases. To address such scenarios, we propose applying the stream model of computation [Muthukrishnan, 2005] to supervised learning problems. In the stream model, we are allowed only one pass (or a small number of passes) over an ordered data set, with polylogarithmic storage and polylogarithmic computation per element.

In spite of the severe limitations imposed by the streaming framework, streaming algorithms have been successfully employed in many different domains [Guha et al., 2003]. Many problems in geometry can be adapted to the streaming setting, and since many learning problems have equivalent geometric formulations, streaming algorithms naturally motivate the development of efficient techniques for solving (or approximating) large-scale batch learning problems.

In this paper, we study the application of the stream model to the problem of maximum-margin classification, in the context of ℓ2-SVMs [Vapnik, 1998; Cristianini and Shawe-Taylor, 2000]. Since the support vector machine is a widely used classification framework, we believe success here will encourage further research into other frameworks. SVMs are known to have a natural formulation in terms of the minimum enclosing ball problem in a high dimensional space [Tsang et al., 2005; 2007]. This latter problem has been extensively studied in the computational geometry literature and admits natural streaming algorithms [Zarrabi-Zadeh and Chan, 2006; Agarwal et al., 2004]. We adapt these algorithms to the classification setting, provide some extensions, and outline some open issues. Our experiments show that we can learn efficiently in just one pass and get competitive classification accuracies on synthetic and real datasets.

2 Scaling up SVM Training

Support Vector Machines (SVMs) are maximum-margin kernel-based linear classifiers [Cristianini and Shawe-Taylor, 2000] that are known to provide provably good generalization bounds [Vapnik, 1998]. Traditional SVM training is formulated as a quadratic program (QP), which is typically optimized by a numerical solver. For a training set of N points, the typical time complexity is O(N³) and the storage required is O(N²); such requirements make SVMs prohibitively expensive for large-scale applications. Typical approaches to large-scale SVMs, such as chunking [Vapnik, 1998], decomposition methods [Chang and Lin, 2001] and SMO [Platt, 1999], work by dividing the original problem into smaller subtasks or by scaling down the training data in some manner [Yu et al., 2003; Lee and Mangasarian, 2001]. However, these approaches are typically heuristic in nature: they may converge very slowly and do not provide rigorous guarantees on training complexity [Tsang et al., 2005].

There has been a recent surge of interest in the online learning literature on SVMs, due to the success of various gradient descent approaches such as stochastic gradient methods [Zhang, 2004] and stochastic sub-gradient methods [Shalev-Shwartz et al., 2007]. These methods solve the SVM optimization problem iteratively in steps, are quite efficient, and have very small computational requirements. Another recent online algorithm, LASVM [Bordes et al., 2005], combines online learning with active sampling and yields considerably good performance with a single pass (or more passes) over the data. However, although fast and easy to train, most of the stochastic gradient based approaches do not suffice with a single pass over the data; they usually require several iterations before converging to a reasonable solution.

3 Two-Class Soft Margin SVM as the MEB Problem

A minimum enclosing ball (MEB) instance is defined by a set of points x_1, ..., x_N ∈ R^D and a metric d : R^D × R^D → R^{≥0}. The goal is to find a point (the center) c ∈ R^D that minimizes the radius R = max_n d(x_n, c).

The 2-class ℓ2-SVM [Tsang et al., 2005] is defined by a hypothesis f(x) = w^T ϕ(x), and a training set consisting of N points {z_n = (x_n, y_n)}, n = 1, ..., N, with y_n ∈ {−1, 1} and x_n ∈ R^D. The primal of the two-class ℓ2-SVM (we consider the unbiased case; the extension is straightforward) can be written as

  min_{w,ξ}  ||w||² + C Σ_{i=1}^{N} ξ_i²   (1)
  s.t.  y_i (w^T ϕ(x_i)) ≥ 1 − ξ_i,  i = 1, ..., N   (2)

The only difference between the ℓ2-SVM and the standard SVM is that the penalty term has the form (C Σ_n ξ_n²) rather than (C Σ_n ξ_n).

We assume a kernel K with an associated nonlinear feature map ϕ. We further assume that K has the property K(x, x) = κ, where κ is a fixed constant [Tsang et al., 2005]. Most standard kernels, such as the isotropic, dot product (normalized inputs), and normalized kernels, satisfy this criterion.

Suppose we replace the mapping ϕ(x_n) on x_n by another nonlinear mapping ϕ̃(z_n) on z_n such that (for the unbiased case)

  ϕ̃(z_n) = [y_n ϕ(x_n); C^{−1/2} e_n]   (3)

The mapping is done in a way that the label information y_n is subsumed in the new feature map ϕ̃ (essentially converting a supervised learning problem into an unsupervised one). The first term in the mapping corresponds to the feature term and the second term accounts for a regularization effect, where C is the misclassification cost. Here e_n is a vector of dimension N with all entries zero, except the n-th entry, which is equal to one.

It was shown in [Tsang et al., 2005] that the MEB instance (ϕ̃(z_1), ϕ̃(z_2), ..., ϕ̃(z_N)), with the metric defined by the induced inner product, is dual to the corresponding ℓ2-SVM instance (1). The weight vector w of the maximum margin hypothesis can then be obtained from the center c of the MEB using the constraints induced by the Lagrangian [Tsang et al., 2007].
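To make the ℓ2-SVM-to-MEB reduction concrete for the linear-kernel case (ϕ(x) = x), the following sketch builds the augmented points ϕ̃(z_n) of equation (3) explicitly and reads a weight vector off the first D coordinates of a ball center. The function names and the NumPy layout are illustrative assumptions of ours, not code from the paper.

```python
import numpy as np

def build_augmented_points(X, y, C):
    """Map each labelled example (x_n, y_n) to
    phi_tilde(z_n) = [y_n * x_n ; C^(-1/2) * e_n]   (linear kernel, unbiased case).
    The minimum enclosing ball of the returned (N, D + N) array is dual to the
    two-class L2-SVM instance on (X, y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, D = X.shape
    Z = np.zeros((N, D + N))
    Z[:, :D] = y[:, None] * X                        # feature part with the label folded in
    Z[np.arange(N), D + np.arange(N)] = C ** -0.5    # mutually orthogonal regularization coordinates
    return Z

def weight_from_center(center, D):
    """Up to the scaling fixed by the Lagrangian of the MEB dual, the SVM weight
    vector is read off the first D coordinates of the ball's center."""
    return np.asarray(center, dtype=float)[:D]
```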

4 Approximate and Streaming MEBs

The minimum enclosing ball problem has been extensively studied in the computational geometry literature. An instance of MEB, with a metric defined by an inner product, can be solved using quadratic programming [Boyd and Vandenberghe, 2004]. However, this becomes prohibitively expensive as the dimensionality and cardinality of the data increase; for an N-point SVM instance in D dimensions, the resulting MEB instance consists of N points in N + D dimensions.

Thus, attention has turned to efficient approximate solutions for the MEB. A δ-approximate solution to the MEB (δ > 1) is a point c such that max_n d(x_n, c) ≤ δR*, where R* is the radius of the true MEB solution. For example, a (1 + ε)-approximation for the MEB can be obtained by extracting a very small subset (of size O(1/ε)) of the input called a core-set [Agarwal et al., 2005], and running an exact MEB algorithm on this set [Bădoiu and Clarkson, 2002]. This is the method originally employed in the CVM [Tsang et al., 2005]. [Har-Peled et al., 2007] take a more direct approach, constructing an explicit core set for the (approximate) maximum-margin hyperplane, without relying on the MEB formulation. Both these algorithms take linear training time and require very small storage. Note that a δ-approximation for the MEB directly yields a δ-approximation for the regularized cost function associated with the SVM problem.

Unfortunately, the core-set approach cannot be adapted to a streaming setting, since it requires O(1/ε) passes over the training data. Two one-pass streaming algorithms for the MEB problem are known. The first [Agarwal et al., 2004] finds a (1 + ε)-approximation using O((1/ε)^{D/2}) storage and O((1/ε)^{D/2} N) time. Unfortunately, the exponential dependence on D makes this algorithm impractical. At the other end of the space-approximation tradeoff, the second algorithm [Zarrabi-Zadeh and Chan, 2006] stores only the center and the radius of the current ball, requiring O(D) space. This algorithm yields a 3/2-approximation to the optimal enclosing ball radius.

4.1 The StreamSVM Algorithm

We adapt the algorithm of [Zarrabi-Zadeh and Chan, 2006] for computing an approximate maximum margin classifier. The algorithm initializes with a single point (and therefore an MEB of radius zero). When a new point is read off the stream, the algorithm checks whether or not the current MEB can enclose this point. If so, the point is discarded. If not, the point is used to suitably update the center and radius of the current MEB. All such selected points define a core set of the original point set.

Let p_i be the input point causing an update to the MEB and B_i be the resulting ball after the update. From Figure 1, it is easy to verify that the new center c_i lies on the line joining the old center c_{i−1} and the new point p_i. The radius r_i and the center c_i of the resulting MEB can be defined by simple update equations:

  r_i = r_{i−1} + δ_i   (4)
  ||c_i − c_{i−1}|| = δ_i   (5)

Here 2δ_i = ||p_i − c_{i−1}|| − r_{i−1} is the closest distance of the new point p_i from the old ball B_{i−1}. Using these, we can define a closed-form analytical update equation for the new ball B_i:

  c_i = c_{i−1} + (δ_i / ||p_i − c_{i−1}||) (p_i − c_{i−1})   (6)

Figure 1: Ball updates

It can be shown that, for adversarially constructed data, the radius of the MEB computed by the algorithm has a lower bound of (1 + √2)/2 and a worst-case upper bound of 3/2 [Zarrabi-Zadeh and Chan, 2006].

We adapt these updates in a natural way in the augmented feature space ϕ̃ (see Algorithm 1). Each selected point belongs to the core set for the MEB, and the support vectors of the corresponding SVM instance come from this set. It is easy to verify that the update equations for the weight vector (w) and the margin (R) in StreamSVM correspond to the center and radius updates for the ball in equations 6 and 4, respectively. The ξ² term in the distance calculation is included to account for the fact that the distance computations are being done in the (D + N)-dimensional augmented feature space ϕ̃ which, for the linear kernel case, is given by

  ϕ̃(z_n) = [y_n x_n; C^{−1/2} e_n].   (7)

Algorithm 1 StreamSVM
1: Input: examples (x_n, y_n), n = 1, ..., N; slack parameter C
2: Output: weights (w), radius (R), number of support vectors (M)
3: Initialize: M = 1; R = 0; ξ² = 1; w = y_1 x_1
4: for n = 2 to N do
5:   Compute distance to center: d² = ||w − y_n x_n||² + (ξ² + 1)/C
6:   if d ≥ R then
7:     w = w + ½(1 − R/d)(y_n x_n − w)
8:     R = R + ½(d − R)
9:     ξ² = ξ²(1 − ½(1 − R/d))² + (½(1 − R/d))²
10:    M = M + 1
11:  end if
12: end for

Also note that, because we perform only a single pass over the data and the e_n components are all mutually orthogonal, we never need to explicitly store them. The number of updates to the weight vector is limited by the number of core vectors of the MEB, which we have experimentally found to be much smaller than for other algorithms (such as the Perceptron). The space complexity of StreamSVM is small, since only the weight vector and the radius need be stored.
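A minimal runnable sketch of Algorithm 1 for the linear-kernel case is given below. It follows the updates as reconstructed above; in particular, we apply the same mixing factor ½(1 − R/d) to the feature part and to the implicit e_n bookkeeping, and we track that bookkeeping in the scaled form that makes the (ξ² + 1)/C distance term consistent with the initialization ξ² = 1. The function name stream_svm and the NumPy details are ours, purely for illustration.

```python
import numpy as np

def stream_svm(stream, C):
    """One-pass StreamSVM sketch (Algorithm 1, linear kernel).
    `stream` yields (x_n, y_n) pairs with y_n in {-1, +1}."""
    it = iter(stream)
    x1, y1 = next(it)
    w = y1 * np.asarray(x1, dtype=float)   # feature part of the ball's center
    R = 0.0                                # current ball radius
    xi2 = 1.0                              # bookkeeping for the implicit e_n coordinates
    M = 1                                  # number of updates (core vectors)
    for x, y in it:
        p = y * np.asarray(x, dtype=float)
        # distance to the center in the (D + N)-dimensional augmented space
        d = np.sqrt(np.dot(w - p, w - p) + (xi2 + 1.0) / C)
        if d >= R:                         # point falls outside the current ball
            a = 0.5 * (1.0 - R / d)        # mixing factor from the MEB center update
            w = w + a * (p - w)
            R = R + 0.5 * (d - R)
            xi2 = xi2 * (1.0 - a) ** 2 + a ** 2
            M += 1
    return w, R, M

# Prediction with the learned (unbiased) hyperplane:
#   y_hat = np.sign(X_test @ w)
```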

4.2 Kernelized StreamSVM

Although our main exposition and experiments are with linear kernels, it is straightforward to extend the algorithm to nonlinear kernels. In that case, Algorithm 1, instead of storing the weight vector w, stores an N-dimensional vector of Lagrange coefficients α initialized as [y_1, 0, ..., 0]. The distance computation in line 5 is replaced by

  d² = Σ_{l,m} α_l α_m k(x_l, x_m) + k(x_n, x_n) − 2 y_n Σ_m α_m k(x_m, x_n) + (ξ² + 1)/C,

and the weight vector update in line 7 is replaced by the Lagrange coefficient updates α_{1:n−1} = α_{1:n−1}(1 − ½(1 − R/d)) and α_n = ½(1 − R/d) y_n.

4.3 StreamSVM approximation bounds and extension to multiple balls

It was shown in [Zarrabi-Zadeh and Chan, 2006] that any streaming MEB algorithm that uses only O(D) storage obtains a lower bound of (1 + √2)/2 and an upper bound of 3/2 on the quality of the solution (i.e., the radius of the final MEB). Clearly, this is a conservative approximation and would affect the obtained margin of the resulting SVM classifier (and hence the classification performance). In order to do better in just a single pass, one possible conjecture is that the algorithm must remember more. To this end, we extended Algorithm 1 to simultaneously store L weight vectors (or "balls"). The space complexity of this algorithm is L(D + 1) floats and it still makes only a single pass over the data. In the MEB setting, our algorithm chooses, with each arriving data point (that is not already enclosed in any of the balls), how the current L + 1 balls (the L balls plus the new data point) should be merged, resulting again in a set of L balls. At the end, the final set of L balls is merged to give the final MEB. A special variant of the L-balls case is when all but one of the L balls have zero radius. This amounts to storing one ball of non-zero radius and keeping a buffer of L data points (we call this the lookahead algorithm, Algorithm 2). Any incoming point, if not already enclosed in the current ball, is stored in the buffer. We solve the MEB problem (using a quadratic program of size L) whenever the buffer is full. Note that Algorithm 1 is a special case of Algorithm 2 with L = 1, with the MEB updates available in closed analytical form (rather than having to solve a QP).

Algorithm 2 StreamSVM with lookahead L
Input: examples (x_n, y_n), n = 1, ..., N; slack parameter C; lookahead parameter L ≥ 1
Output: weights (w), radius (R), upper bound on number of support vectors (M)
1: Initialize: M = 1; R = 0; ξ² = 1; S = ∅; w = y_1 x_1
2: for n = 2 to N do
3:   Compute distance to center: d² = ||w − y_n x_n||² + (ξ² + 1)/C
4:   if d ≥ R then
5:     Add example n to the active set: S = S ∪ {y_n x_n}
6:     if |S| = L then
7:       Update w, R, ξ² to enclose the ball (w, R, ξ²) and all points in S
8:       M = M + L; S = ∅
9:     end if
10:  end if
11: end for
12: if |S| > 0 then
13:   Update w, R, ξ² to enclose the ball (w, R, ξ²) and all points in S
14:   M = M + |S|
15: end if

Algorithm 1 takes linear time in the input size. Algorithm 2, which uses a lookahead of L, solves a quadratic program of size L whenever the buffer gets full. This step takes O(L³) time. The number of such updates is O(N/L) (in practice, it is considerably less than N/L), and thus the overall complexity for the lookahead case is O(NL²). For small lookaheads, this is roughly O(N).
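As a companion sketch for the kernelized variant of Section 4.2, the code below never forms w explicitly: the center is kept as coefficients over the stored core vectors, so only kernel evaluations are needed. The incremental caching of the center's squared norm and the same (ξ² + 1)/C bookkeeping as above are our own illustrative choices, not taken from a released implementation.

```python
import numpy as np

def kernel_stream_svm(stream, C, k):
    """Kernelized StreamSVM sketch (Section 4.2): the center of the ball is
    kept implicitly as coefficients `alpha` over the stored core vectors."""
    it = iter(stream)
    x1, y1 = next(it)
    sv = [np.asarray(x1, dtype=float)]     # stored core (support) vectors
    alpha = [float(y1)]                    # their coefficients
    cc = k(sv[0], sv[0])                   # <center, center> in feature space
    R, xi2 = 0.0, 1.0
    for x, y in it:
        x = np.asarray(x, dtype=float)
        kxx = k(x, x)
        cx = sum(a * k(s, x) for a, s in zip(alpha, sv))   # <center, phi(x)>
        d = np.sqrt(cc + kxx - 2.0 * y * cx + (xi2 + 1.0) / C)
        if d >= R:
            m = 0.5 * (1.0 - R / d)        # mixing factor, as in Algorithm 1
            cc = (1 - m) ** 2 * cc + 2 * (1 - m) * m * y * cx + m ** 2 * kxx
            alpha = [(1 - m) * a for a in alpha] + [m * y]
            sv.append(x)
            R += 0.5 * (d - R)
            xi2 = xi2 * (1 - m) ** 2 + m ** 2
    return sv, alpha, R

# Example kernel (RBF with bandwidth 0.5, an arbitrary illustrative choice):
#   k = lambda a, b: np.exp(-np.dot(a - b, a - b) / (2 * 0.5 ** 2))
```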

5 Experiments

We evaluate our algorithm on several synthetic and real datasets and compare it against several state-of-the-art SVM solvers. We use three criteria for evaluation: a) single-pass classification accuracies compared against a single pass of online SVM solvers such as the iterative sub-gradient solver Pegasos [Shalev-Shwartz et al., 2007], LASVM [Bordes et al., 2005], and the Perceptron [Rosenblatt, 1988]; b) comparison with CVM [Tsang et al., 2005], which is a batch SVM algorithm based on the MEB formulation; c) the effect of using lookahead in StreamSVM. For fairness, all the algorithms used a linear kernel.

5.1 Single-Pass Classification Accuracies

The single-pass classification accuracies of StreamSVM and other online SVM solvers are shown in Table 1, along with details of the datasets used. To get a sense of how good the single-pass approximation of our algorithm is, we also report the classification accuracies of the batch-mode (i.e., all data in memory, and multiple passes) libSVM solver with a linear kernel on all the datasets. The results suggest that our single-pass algorithm StreamSVM, using a small reasonable lookahead, performs comparably to the batch-mode libSVM, and does significantly better than a single pass of the other online SVM solvers.

Data Set        Dim   Train    Test    libSVM   Perceptron  Pegasos  Pegasos  LASVM   StreamSVM  StreamSVM
                                       (batch)              (k=1)    (k=20)           (Algo-1)   (Algo-2)
Synthetic A     2     20,000   200     96.5     95.5        83.8     89.9     96.5    95.5       97.0
Synthetic B     3     20,000   200     66.0     68.0        57.05    65.85    64.5    64.4       68.5
Synthetic C     5     20,000   200     93.2     77.0        55.0     73.2     68.0    73.1       87.5
Waveform        21    4000     1000    89.4     72.5        77.34    78.12    77.6    74.3       78.4
MNIST (0vs1)    784   12,665   2115    99.52    99.47       95.06    99.48    98.82   99.34      99.71
MNIST (8vs9)    784   11,800   1983    96.57    95.9        69.41    90.62    90.32   84.75      94.7
IJCNN           22    35,000   91,701  91.64    64.82       67.35    88.9     74.27   85.32      87.81
w3a             300   44,837   4912    98.29    89.27       57.36    87.28    96.95   88.56      89.06

Table 1: Single-pass classification accuracies of various algorithms (all using a linear kernel). The synthetic datasets (A, B, C) were generated using normally distributed clusters, and were of about 85% separability. libSVM, used as the absolute benchmark, was run in batch mode (all data in memory). StreamSVM Algo-2 used a small lookahead (∼10). Note: we make the Pegasos implementation do a single sweep over the data with a user-chosen block size k for subgradient computations (we used k=1 and k=20, the latter akin to using a lookahead of 20). Perceptron and LASVM are also run for a single pass and do not need block sizes to be specified. All results are averaged over 20 runs (w.r.t. random orderings of the stream).

5.2 Comparison with CVM

We compared our algorithm with CVM which, like our algorithm, is based on an MEB formulation. CVM is highly efficient for large datasets but it operates in batch mode, making one pass through the data for each core vector. We are interested in knowing how many passes the CVM must make over the data before it achieves an accuracy comparable to our streaming algorithm. For that purpose, we compared the accuracy of our single-pass StreamSVM against two and more passes of CVM to see how long it takes for CVM to beat StreamSVM (we note here that CVM requires at least two passes over the data to return a solution). We used a linear kernel for both. Shown in Figure 2 are the results on the MNIST 8vs9 data: it turns out that it takes several hundreds of passes of CVM to beat the single-pass accuracy of StreamSVM. Similar results were obtained for other datasets but are not reported here due to space limitations.

Figure 2: MNIST 8vs9 data: number of passes CVM takes before achieving the comparable single-pass accuracy of StreamSVM. The X axis represents the number of passes of CVM and the Y axis represents classification accuracy.

5.3 Effect of Lookahead

We also investigated the effect of doing higher-order lookaheads on the data. For this, we varied L (the lookahead parameter) and, for each L, tested Algorithm 2 on 100 random permutations of the data stream order, also recording the standard deviation of the classification accuracies with respect to the data-order permutations. Note that the algorithm still performs a single pass over the data. Figure 3 shows the results on the MNIST 8vs9 data (similar results were obtained for other datasets but are not shown due to space limitations). In this figure, we see two effects. Firstly, as the lookahead increases, performance goes up. This is to be expected since, in the limit, as the lookahead approaches the data set size, we will solve the exact MEB problem (albeit at a high computational cost). The important thing to note here is that even with a small lookahead of 10, the performance converges. Secondly, we see that the standard deviation of the result decreases as the lookahead increases. This shows experimentally that higher lookaheads make the algorithm less susceptible to badly ordered data. This is interesting from an empirical perspective.

Figure 3: Single pass with varying lookahead on MNIST 8vs9 data: performance w.r.t. random ordering of the stream. The X axis represents the lookahead parameter and the Y axis represents classification accuracy. Vertical bars represent the standard deviations in accuracies for a given lookahead.

The worst-case guarantees, however, are the same for the lookahead algorithm as for the no-lookahead algorithm. To obtain the 3/2 upper-bound result, one can show a construction nearly identical to that of [Zarrabi-Zadeh and Chan, 2006], where L − 1 points are packed in a small, carefully constructed cloud near the boundary of the true MEB.

Alternatively, one can analyze these algorithms in the random stream setting. Here, the input points are chosen adversarially, but their order is permuted randomly. The lookahead model is not strengthened in this setting either: we can show both that the lower bound for no-lookahead algorithms, as well as the 3/2 upper bound for the specific no-lookahead algorithm described, generalize. For the former, see Figure 4. We place (N − 1)/2 points around (0, 1), (N − 1)/2 points around (0, −1), and one point at (1 + √2, 0).
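As an illustration of this configuration (our own, with the far point arriving last rather than at a random position in the stream), the small simulation below runs the simple center/radius update of [Zarrabi-Zadeh and Chan, 2006] on the point set just described and compares the resulting radius to the exact MEB radius; for this ordering the ratio comes out at (1 + √2)/2 ≈ 1.207, matching the lower bound.

```python
import numpy as np

def streaming_meb(points):
    """One-pass MEB update of [Zarrabi-Zadeh and Chan, 2006]:
    keep only the current center and radius."""
    it = iter(points)
    c = np.asarray(next(it), dtype=float)
    r = 0.0
    for p in it:
        p = np.asarray(p, dtype=float)
        d = np.linalg.norm(p - c)
        if d > r:                     # point falls outside the current ball
            a = 0.5 * (1.0 - r / d)
            c = c + a * (p - c)
            r = r + 0.5 * (d - r)
    return c, r

# Adversarial configuration: two clouds at (0, 1) and (0, -1), plus one far point.
far = np.array([1.0 + np.sqrt(2.0), 0.0])
cloud = [np.array([0.0, 1.0]), np.array([0.0, -1.0])] * 200
c_stream, r_stream = streaming_meb(cloud + [far])   # far point arrives last

# Exact MEB: by symmetry its center lies on the x-axis, so a grid search suffices.
distinct = np.array([[0.0, 1.0], [0.0, -1.0], [1.0 + np.sqrt(2.0), 0.0]])
xs = np.linspace(0.0, far[0], 10001)
r_opt = min(np.max(np.linalg.norm(distinct - np.array([x, 0.0]), axis=1)) for x in xs)

print(r_stream / r_opt)   # ~1.207, i.e. (1 + sqrt(2)) / 2
```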

A natural extension of our approach is to replace the minimum enclosing ball with a minimum volume enclosing ellipsoid (MVE) [Kumar et al., 2005]. An ellipsoid in R^D is defined as {x : (x − c)^T A (x − c) ≤ 1}, where c ∈ R^D, A ∈ R^{D×D}, and A is positive semi-definite.

Note that a ball, upon inclusion of a new point, expands equally in all dimensions, which may be unnecessary. An ellipsoid, on the other hand, can have several axes and scales of variation (modulated by the matrix A). This allows the ellipsoid to expand only along those directions where it is needed. In addition, such an approach can also be seen along the lines of confidence-weighted linear classifiers [Dredze et al., 2008]. The confidence-weighted (CW) method assumes a Gaussian distribution over the space of weight vectors and updates the mean and covariance parameters upon witnessing each incoming example. Just as CW maintains the model's uncertainty using a Gaussian, an ellipsoid generalization can model the uncertainty using the covariance matrix A. Recent work has shown that there exist streaming possibilities for the MVE [Mukhopadhyay and Greene, 2008]. The approximation guarantees, however, are very conservative. It would be interesting to come up with improved streaming algorithms for the MVE case and adapt them for classification settings.

7 Conclusion

Within the streaming framework for learning, we have presented an efficient, single-pass ℓ2-SVM learning algorithm using a streaming algorithm for the minimum enclosing ball problem. We have also extended this algorithm to use a lookahead to increase robustness against poorly ordered data. Our algorithm, StreamSVM, satisfies a proven theoretical bound: it provides a 3/2-approximation to the optimal solution. Despite this conservative bound, our algorithm is experimentally competitive with alternative techniques in terms of accuracy, and learns much simpler solutions. We believe that a careful study of stream-based learning would lead to high-quality scalable solutions for other classification problems, possibly with alternative losses and with tighter approximation bounds.

References

[Agarwal et al., 2004] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating extent measures of points. Journal of the ACM, 51:606–635, 2004.

[Agarwal et al., 2005] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric approximations via coresets. Combinatorial and Computational Geometry, MSRI Publications, 52:1–30, 2005.

[Bordes et al., 2005] Antoine Bordes, Seyda Ertekin, Jason Weston, and Leon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6, 2005.

[Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[Bădoiu and Clarkson, 2002] Mihai Bădoiu and Kenneth L. Clarkson. Optimal core-sets for balls. In Proc. of DIMACS Workshop on Computational Geometry, 2002.

[Chang and Lin, 2001] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.

[Cristianini and Shawe-Taylor, 2000] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[Dredze et al., 2008] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In Proc. ICML, pages 264–271, New York, NY, USA, 2008. ACM.

[Guha et al., 2003] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.

[Har-Peled et al., 2007] Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. In IJCAI, 2007.

[Kumar et al., 2005] P. Kumar and E. A. Yildirim. Minimum volume enclosing ellipsoids and core sets. Journal of Optimization Theory and Applications, 126:1–21, 2005.

[Lee and Mangasarian, 2001] Yuh-Jye Lee and Olvi L. Mangasarian. RSVM: Reduced support vector machines. In Proc. of Symposium on Data Mining (SDM), 2001.

[Mukhopadhyay and Greene, 2008] Asish Mukhopadhyay and Eugene Greene. A streaming algorithm for computing an approximate minimum spanning ellipse. In 18th Fall Workshop on Computational Geometry, 2008.

[Muthukrishnan, 2005] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.

[Platt, 1999] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pages 185–208, Cambridge, MA, USA, 1999. MIT Press.

[Rosenblatt, 1988] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. MIT Press, 1988.

[Shalev-Shwartz et al., 2007] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. ICML, pages 807–814, New York, NY, USA, 2007. ACM Press.

[Tsang et al., 2005] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6, 2005.

[Tsang et al., 2007] Ivor W. Tsang, Andras Kocsor, and James T. Kwok. Simpler core vector machines with enclosing balls. In Proc. ICML, pages 911–918, New York, NY, USA, 2007. ACM Press.

[Vapnik, 1998] Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.

[Yu et al., 2003] Hwanjo Yu, Jiong Yang, and Jiawei Han. Classifying large data sets using SVMs with hierarchical clusters. In Proc. ACM KDD, pages 306–315, New York, NY, USA, 2003. ACM Press.

[Zarrabi-Zadeh and Chan, 2006] Hamid Zarrabi-Zadeh and Timothy M. Chan. A simple streaming algorithm for minimum enclosing balls. In Proc. of Canadian Conference on Computational Geometry (CCCG), 2006.

[Zhang, 2004] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proc. ICML, page 116, New York, NY, USA, 2004. ACM.
