Online Passive-Aggressive Algorithms on a Budget
Zhuang Wang
Dept. of Computer and Information Sciences, Temple University, USA
[email protected]

Slobodan Vucetic
Dept. of Computer and Information Sciences, Temple University, USA
[email protected]

Abstract

In this paper a kernel-based online learning algorithm, which has both constant space and update time, is proposed. The approach is based on the popular online Passive-Aggressive (PA) algorithm. When used in conjunction with a kernel function, the number of support vectors in PA grows without bound when learning from noisy data streams. This implies unlimited memory and ever-increasing model update and prediction time. To address this issue, the proposed budgeted PA algorithm maintains only a fixed number of support vectors. By introducing an additional constraint to the original PA optimization problem, a closed-form solution was derived for the support vector removal and model update. Using the hinge loss we developed several budgeted PA algorithms that can trade between accuracy and update cost. We also developed the ramp loss versions of both original and budgeted PA and showed that the resulting algorithms can be interpreted as the combination of active learning and hinge loss PA. All proposed algorithms were comprehensively tested on 7 benchmark data sets. The experiments showed that they are superior to the existing budgeted online algorithms. Even with modest budgets, the budgeted PA achieved accuracies very competitive with the non-budgeted PA and kernel perceptron algorithms.

1 INTRODUCTION

The introduction of Support Vector Machines (Cortes and Vapnik, 1995) generated significant interest in applying kernel methods to online learning. A large number of online algorithms (e.g. perceptron (Rosenblatt, 1958) and PA (Crammer et al., 2006)) can be easily kernelized. Online kernel algorithms have been popular because they are simple, can learn nonlinear mappings, and can achieve state-of-the-art accuracies. Perhaps surprisingly, online kernel classifiers can place a heavy burden on computational resources. The main reason is that the number of support vectors that need to be stored as part of the prediction model grows without limit as the algorithm progresses. In addition to the potential of exceeding the physical memory, this property also implies an unlimited growth in model update and prediction time.

Budgeted online kernel classifiers have been proposed to address the problem of unbounded growth in computational cost by maintaining a limited number of support vectors during training. The originally proposed budgeted algorithm (Crammer et al., 2004) achieves this by removing a support vector every time the budget is exceeded. The algorithm has constant space and prediction time. Support vector removal is the underlying idea of several other budgeted algorithms. These include removal of a randomly selected support vector (Cesa-Bianchi and Gentile, 2007; Vucetic et al., 2009), the oldest support vector (Dekel et al., 2008), the support vector with the smallest coefficient (Cheng et al., 2007), and the support vector resulting in the smallest error on the validation data (Weston et al., 2005). Recently studied alternatives to removal include projecting an SV prior to its removal (Orabona et al., 2008) and merging two SVs into a new one (Wang and Vucetic, 2010).
In practice, the assigned budget depends on the specific application requirements (e.g. memory limitations, processing speed, data throughput).

In this paper, we propose a family of budgeted online classifiers based on the online Passive-Aggressive algorithm (PA). PA (Crammer et al., 2006) is a popular kernel-based online algorithm that has solid theoretical guarantees and is as fast as, and more accurate than, kernel perceptrons. Upon receiving a new example, PA updates the model such that it remains similar to the current one while having a low loss on the new example. When equipped with a kernel, PA suffers from the same computational problems as other kernelized online algorithms, which limits its applicability to large-scale learning problems and memory-constrained environments. In the proposed budgeted PA algorithms, model updating and budget maintenance are solved as a joint optimization problem. By adding an additional constraint to the PA optimization problem, the algorithm can explicitly bound the number of support vectors by removing one of them and properly updating the rest. The new optimization problem has a closed-form solution. Under the same underlying algorithmic structure, we develop several algorithms with different tradeoffs between accuracy and computational cost.

We also propose to replace the standard hinge loss with the recently proposed ramp loss in the optimization problem, such that the resulting algorithms are more robust to noisy data. We use an iterative method, the ConCave Convex Procedure (CCCP) (Yuille and Rangarajan, 2002), to solve the resulting non-convex problem. We show that CCCP converges after a single iteration and results in a closed-form solution that is identical to its hinge loss counterpart coupled with active learning.

We evaluated the proposed algorithms against the state-of-the-art budgeted online algorithms on a large number of benchmark data sets with different problem complexity, dimensionality and noise levels.

2 PROBLEM SETTING AND BACKGROUND

We consider online learning for binary classification, where we are given a data stream D consisting of examples (x_1, y_1), ..., (x_N, y_N), where x \in R^M is an M-dimensional attribute vector and y \in \{+1, -1\} is the associated binary label. PA is based on the linear prediction model of the form f(x) = w^T x, where w \in R^M is the weight vector. PA first initializes the weight to the zero vector (w_1 = 0) and, after observing the t-th example, the new weight w_{t+1} is obtained by minimizing the objective function Q(w) defined as

    Q(w) = \frac{1}{2} ||w - w_t||^2 + C \cdot H(w; (x_t, y_t)),    (1)

where H(w; (x_t, y_t)) = \max(0, 1 - y_t w^T x_t) is the hinge loss and C is a user-specified non-negative parameter. (Here we study the PA-I version of PA.) The intuitive goal of PA is to minimally change the existing weight w_t while making as accurate a prediction as possible on the t-th example. Parameter C serves to balance these two competing objectives. Minimizing Q(w) can be redefined as the following constrained optimization problem,

    \arg\min_w \frac{1}{2} ||w - w_t||^2 + C \cdot \xi
    s.t.  1 - y_t w^T x_t \le \xi,  \xi \ge 0,    (2)

where the hinge loss is replaced with the slack variable \xi and two inequality constraints.

The solution of (2) can be expressed in closed form as

    w_{t+1} = w_t + \alpha_t x_t,  \alpha_t = y_t \cdot \min\{C, H(w_t; (x_t, y_t)) / ||x_t||^2\}.    (3)

It can be seen that w_{t+1} is obtained by adding the weighted new example only if the hinge loss of the current predictor is positive. Following the terminology of SVM, the examples with nonzero weights (i.e. \alpha \ne 0) are called the support vectors (SVs).
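To make one PA-I round (Eqs. (1)-(3)) concrete, below is a minimal Python sketch of the linear update; the function name pa1_update and the NumPy representation are our illustration, not code from the paper.

```python
import numpy as np

def pa1_update(w, x, y, C):
    """One round of the linear PA-I update (Eqs. (1)-(3)).

    w : current weight vector, shape (M,)
    x : attribute vector of the new example, shape (M,); assumed nonzero
    y : binary label, +1 or -1
    C : non-negative aggressiveness parameter
    """
    # Hinge loss of the current predictor on (x, y).
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w  # passive step: the margin constraint already holds
    # Closed-form solution of (2), Eq. (3): step size clipped at C.
    alpha = y * min(C, loss / np.dot(x, x))
    return w + alpha * x
```

Starting from w_1 = 0, training on a stream amounts to calling w = pa1_update(w, x_t, y_t, C) for t = 1, ..., N.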
When using \Phi(x) instead of x in the prediction model, where \Phi denotes a mapping from the original input space R^M to the feature space F, the PA predictor becomes able to solve nonlinear problems. Instead of directly computing and saving \Phi(x), which might be infinitely dimensional, after introducing the Mercer kernel function k such that k(x, x') = \Phi(x)^T \Phi(x'), the model weight at the moment of receiving the t-th example, denoted as w_t, can be represented by the linear combination of SVs,

    w_t = \sum_{i \in I_t} \alpha_i \Phi(x_i),    (4)

where I_t = \{i : i < t, \alpha_i \ne 0\} is the set of indices of the SVs. In this case, the resulting predictor, denoted as f_t(x), can be represented as

    f_t(x) = w_t^T \Phi(x) = \sum_{i \in I_t} \alpha_i k(x_i, x).    (5)

While kernelization gives the ability to solve nonlinear classification problems, it also increases the computational burden, both in time and space. In order to use the prediction model (4), all SVs and their weights should be stored in memory. The unbounded growth in the number of support vectors, which is inevitable on noisy data, implies an unlimited memory budget and runtime. To solve this issue, we modify the PA algorithm such that the number of support vectors is strictly bounded by a predefined budget.

3 BUDGETED PA (BPA)

Let us assume the current weight w_t is composed of B support vectors, where B is the predefined budget. If the hinge loss on the newly received example (x_t, y_t) is positive, the budgeted PA (BPA) attempts to achieve accurate prediction on the new example and produce a new weight w_{t+1} that is 1) similar to w_t, 2) has a small loss on the t-th example, and 3) spanned by only B of the available B + 1 SVs (B + 1 because of the addition of the new example). In what follows, k_{ij} = k(x_i, x_j); K = [k_{ij}], i, j \in S, is the kernel matrix of the SVs in S; k_i = [k_{ij}]^T, j \in S; and \beta = [\beta_i]^T, i \in S.

Proof. We introduce the Lagrangian of the optimization problem and derive the solution that satisfies the KKT conditions. After replacing w with the right-hand side of (6), the Lagrangian of (2)+(6) can be expressed as

    L(\beta, \xi, \tau_1, \tau_2) = ...
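To illustrate how the kernel expansion (4)-(5) and a budget of B support vectors interact, here is a minimal Python sketch, assuming a generic Mercer kernel passed as a callable. The class name and, in particular, the budget-maintenance rule are our assumptions: for brevity the sketch simply drops the SV with the smallest |\alpha| once the budget is exceeded (the heuristic of Cheng et al., 2007, mentioned in the introduction), whereas BPA removes one SV and re-weights the remaining ones through the joint optimization problem developed in this section.

```python
import numpy as np

class BudgetedKernelPA:
    """Kernelized PA-I (Eqs. (3)-(5)) with a simplistic budget hook.

    The smallest-|alpha| removal below is a placeholder, NOT the BPA
    update: BPA solves a joint optimization that removes one SV and
    updates the weights of the remaining ones in closed form.
    """

    def __init__(self, kernel, C, budget):
        self.kernel = kernel  # Mercer kernel k(x, x')
        self.C = C            # aggressiveness parameter
        self.B = budget       # maximal number of SVs
        self.sv = []          # support vectors x_i
        self.alpha = []       # their weights alpha_i

    def score(self, x):
        # Eq. (5): f_t(x) = sum_{i in I_t} alpha_i * k(x_i, x)
        return sum(a * self.kernel(xi, x) for a, xi in zip(self.alpha, self.sv))

    def update(self, x, y):
        loss = max(0.0, 1.0 - y * self.score(x))
        if loss == 0.0:
            return  # passive step
        # Kernel form of Eq. (3), using ||Phi(x)||^2 = k(x, x).
        self.sv.append(x)
        self.alpha.append(y * min(self.C, loss / self.kernel(x, x)))
        if len(self.sv) > self.B:
            # Placeholder budget maintenance: drop the smallest-|alpha| SV.
            j = int(np.argmin(np.abs(self.alpha)))
            del self.sv[j], self.alpha[j]
```

For instance, kernel = lambda a, b: float(np.exp(-np.dot(a - b, a - b))) instantiates the sketch with an RBF kernel of unit width.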