The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Online Learning from Data Streams with Varying Feature Spaces

Ege Beyazit Jeevithan Alagurajah Xindong Wu University of Louisiana at Lafayette University of Louisiana at Lafayette University of Louisiana at Lafayette Lafayette, LA, USA Lafayette, LA, USA Lafayette, LA, USA [email protected] [email protected] [email protected]

Abstract remains constant. On the other hand, in a wide range of ap- plications, feature spaces dynamically change over time. For We study the problem of online learning with varying fea- example in a spam e-mail classification task, each e-mail can ture spaces. The problem is challenging because, unlike tra- ditional online learning problems, varying feature spaces can have a different set of words and words that are being used introduce new features or stop having some features without in spam e-mails can change over time. In social networks, following a pattern. Other existing methods such as online the features provided by each user can be different than each streaming (Wu et al. 2013), online learning other. In these scenarios, absence of a feature can also con- from trapezoidal data streams (Zhang et al. 2016), and learn- vey information and should not be treated as missing data. ing with feature evolvable streams (Hou, Zhang, and Zhou When the feature space in a data stream keeps changing, we 2017) are not capable to learn from arbitrarily varying fea- refer the feature space as a varying feature space. ture spaces because they make assumptions about the fea- To enable learning from data with varying feature spaces, ture space dynamics. In this paper, we propose a novel online we propose the Online Learning from Varying Features learning algorithm OLVF to learn from data with arbitrarily varying feature spaces. The OLVF algorithm learns to clas- (OLVF) algorithm, which will also be referred to as ‘the al- sify the feature spaces and the instances from feature spaces gorithm’ in this paper. To learn an instance classifier, the simultaneously. To classify an instance, the algorithm dynam- OLVF algorithm projects the existing classifier and the cur- ically projects the instance classifier and the training instance rent instance onto their shared feature subspace, and makes onto their shared feature subspace. The feature space clas- a prediction. Based on the prediction, a loss will be suffered sifier predicts the projection confidences for a given feature and the classifier will be updated according to an empirical space. The instance classifier will be updated by following risk minimization principle with adaptive constraints. The the empirical risk minimization principle and the strength of algorithm re-weights the constraints based on the confidence the constraints will be scaled by the projection confidences. of the classifier and the training instance’s projection given Afterwards, a feature sparsity method is applied to reduce the the feature space. To learn a feature space classifier, the al- model complexity. Experiments on 10 datasets with varying feature spaces have been conducted to demonstrate the per- gorithm follows a maximum likelihood principle, using the formance of the proposed OLVF algorithm. Moreover, ex- current feature space information and the output of the in- periments with trapezoidal data streams on the same datasets stance classifier. After each update, a sparsity step is applied have been conducted to show that OLVF performs better than to decrease the classifier size. The contributions of this paper the state-of-the-art learning algorithm (Zhang et al. 2016). can be listed as follows: • The problem of real-time learning from data where the Introduction feature space arbitrarily changes has been studied. The goal of learning a classifier in an online setup, where • A new learning algorithm, Online Learning from Varying the training instances arrive one by one, has been studied Features (OLVF), that is capable of learning from varying extensively and a wide variety of algorithms have been de- feature spaces on a single pass is proposed. veloped for online learning (Leite, Costa, and Gomide 2010; Yu, Neely, and Wei 2017; Agarwal, Saradhi, and Karnick • Performance of the OLVF algorithm is empirically vali- 2008). These algorithms have made it possible to learn in dated on 10 datasets in two different scenarios: data with applications where it is computationally not feasible to train varying feature spaces and trapezoidal data streams. over the entire dataset. Additionally, online learning algo- rithms have been used in applications where the data contin- Related Work uously grow over time and new patterns need to be extracted Because of the dynamic nature of its feature space and the (Shalev-Shwartz and others 2012). However, online learn- streaming instances, learning from data with varying feature ing algorithms assume that the feature space they learn from spaces can be related to online learning, online learning from Copyright c 2019, Association for the Advancement of Artificial trapezoidal data streams and learning with feature evolvable Intelligence (www.aaai.org). All rights reserved. streams.

3232 Online Learning Learning with Feature Evolvable Streams The online learning is performed in consecutive rounds and Learning with feature evolvable streams (Hou, Zhang, and at each round, the learner is given a training example which Zhou 2017) tackle the problems of classification and regres- can be seen as a question with its answer hidden. The sion for data streams where features vanish and new features learner makes a prediction on this question and the answer occur. However, they assume that the feature space evolves is revealed. Based on the difference between the learner’s in a strictly sequential manner: after the new features arrive, prediction and the actual answer, a loss is suffered. The there exists a time period that both the new and old features aim of the learner is to minimize the total loss through all are available, before some of the old features vanish. Based rounds. Many different algorithms have been developed for on this assumption, they use the overlapping time period to learning in online setting (Tsang, Kocsor, and Kwok 2007; learn a projection matrix from old features to the new ones Roughgarden and Schrijvers 2017; Domingos and Hulten by minimizing a loss. Then to learn a predictor, 2000; Kuzborskij and Cesa-Bianchi 2017). These algorithms they propose two methods: weighted combination and dy- can be broadly grouped into two categories: first-order and namic selection. Weighted combination follows an ensemble second-order methods. First-order algorithms use first-order strategy by combining the outputs of a set of predictors with derivative information for updates (Rosenblatt 1958; Ying weights based on exponential of the cumulative loss. On the and Pontil 2008). Many first order algorithms (Crammer other hand, dynamic selection selects the output of the pre- et al. 2006; Gentile 2001; Kivinen, Smola, and Williamson dictor of larger weight with higher probability. Even though 2004) use the maximum margin principle to find a discrim- learning with feature evolvable streams focuses on feature inant function that maximizes the separation between dif- spaces that can dynamically grow and shrink, the assump- ferent classes. By taking the curvature of the space into ac- tion of sequential evolution is too strong and application count, second-order methods improve the convergence and specific. If there is no overlap period of or linear relation- are able to minimize a quadratic function in a finite num- ship between the old and new features, the projection does ber of steps (Cesa-Bianchi, Conconi, and Gentile 2005; not carry any useful information. Consequently, the algo- Crammer, Dredze, and Pereira 2009; Crammer, Kulesza, rithm reduces to an ensemble based online learning method and Dredze 2009; Wang, Zhao, and Hoi 2016; Orabona and with a runtime overhead. Moreover, it is not always compu- Crammer 2010). Second-order methods are more noise sen- tationally feasible to learn a projection matrix and multiple sitive and more expensive than first-order methods, because predictors if the data stream is high dimensional. they require the calculation of the second derivatives. Tra- Learning from data with varying feature spaces is more ditional online learning methods can not learn from varying challenging than the problems stated above since the classi- feature spaces since they assume the feature space they learn fier needs to adapt itself to a dynamically changing feature from remains the same. space without any assumptions on the dynamics of the fea- ture space or interactions of features. The algorithm needs Online Learning from Trapezoidal Data Streams to continuously learn from data while extracting knowledge when arbitrary features stop being generated as well as new Online learning from trapezoidal data streams (Zhang et al. features arrive, in one pass. The existing learning methods 2015; 2016) is closely related to data streams with varying are not able to solve learning from varying feature spaces feature spaces. Trapezoidal data streams are streams with since none of them handle such dynamic and unconstrained a doubly growing nature: the number of instances and the feature spaces while learning a classifier. number of features provided by instances can grow simulta- neously. In trapezoidal streams, the current training instance Preliminaries either has the same features as previous instances or ad- ditional new features. In other words the feature space at In this section, we first present two preliminary concepts each iteration contains the feature spaces from previous iter- that we make use of to design our algorithm: online learn- ing of linear classifiers, and soft-margin classification. Then, ations. The proposed algorithms OLSF − I and OLSF − II learn a linear classifier from trapezoidal data streams by fol- we discuss the problem setting for learning binary classifiers lowing the passive-aggressive update rule. At each iteration, from varying feature spaces. they receive a training instance and try to make a predic- tion. If the prediction is correct, they do not do any classi- Online Learning of Linear Classifiers fier update, else they update their classifier with the weights We consider binary classification using linear classifiers. At d that minimize the error and are close to the previous clas- each round t, the algorithm receives an instance xt ∈ R sifier weights. While doing this, they also learn additional and tries to predict the true label yt ∈ {−1, +1} of the d weights for the new features, if any introduced by the cur- instance by using its classifier wt ∈ R . Prediction is rent training instance. Moreover, sparsity is introduced by done by using the function yˆt = sign(xt · wt). After pre- applying projection and truncation to the classifier at each diction, the true label of the instance is revealed. Based round, to decrease the model complexity and improve gen- on the difference between prediction yˆt and true label yt, eralization. However OLSF and its variants can not handle a loss l(yt, yˆt) is suffered. One of the widely used loss data with varying feature spaces, where the feature space can functions is the Hinge loss (Gentile and Warmuth 1999; also shrink and the instances can have completely different Wu and Liu 2007; Bartlett and Wegkamp 2008) defined as sets of features from each other. l(yt, yˆt) = max{0, 1−y(xt ·wt)}. It is based on the margin

3233 between the example xt and the classifier wt. The margin spaces is represented by a one and, each feature in the ex- is calculated by y(xt · wt). Because online learning is an isting feature space represented by a zero. incremental task, it is important to minimize the loss while keeping the change to the current model minimum, in order Online Learning from Varying Feature Spaces to preserve the knowledge from previous instances. More- In this section, we explain the building blocks of the OLVF over when the data is noisy, forcing the classifier to correctly algorithm and discuss the motivation behind their design. predict for every instance leads to overfitting and poor gen- Then, we derive the soft-margin classifier update rules for eralization. Therefore, it is more preferable to learn a clas- the instance and feature space classifiers in a binary classi- sifier which is able to separate the bulk of the data while fication setting. Note that the algorithm can easily be gen- ignoring the noise. To accomplish this, a soft decision mar- eralized to a multiclass setting by converting the problem gin (Shawe-Taylor and Cristianini 2002) is used to let the to multiple binary classification problems (Bolon-Canedo, classifier make a few mistakes. Many online learning algo- Sanchez-Marono, and Alonso-Betanzos 2011), using the rithms combine the constraints stated above and formulate One vs Rest or One vs One (Saez´ et al. 2014; Xu 2011) the learning of the weights as an optimization problem: strategies. Finally, we combine the building blocks to form the main steps and discuss the running time complexity of 1 2 wt+1 = argmin kw − wtk + Cξ. the OLVF algorithm. w: 2 (1) l(y ,yˆ )≤ξ t t Learning to Predict Projection Confidences By introducing a slack variable, Equation (1) adds nonlin- In traditional online learning problems the feature space re- earity to the discriminator to allow a certain amount of error mains constant. When a classifier makes a poor prediction, it to be made. This error is bounded by ξ. The parameter C needs to be updated aggressively. However in data with vary- adjusts the slackness of the constraint. Since real life data ing feature spaces, our instance classifier can make poor pre- is noisy, we use the soft-margin approach to model learning dictions because of the unfair conditions caused by varying from varying feature spaces. feature spaces: the instance classifier often needs to predict the labels for instances that belong to feature spaces differ- The Problem Setting and Notation ent than its own. For example, when our learner receives an A feature space that keeps changing on different instances instance that does not have some of the features included by in a data stream is referred to as a varying feature space. the current instance classifier, the learner will not able to use We consider the problem of learning a classifier from data its full potential, since the instance classifier will only be with varying feature spaces, by doing single pass over each using the features in the shared feature space. Similarly, if instance. Let input to the learning algorithm consist of a se- the learner receives an instance with new features, the in- quence of examples (xt, yt) where t = 1, ..., T . Each in- stance loses the information provided by the new feature xt stance xt ∈ R is a vector that contains an arbitrary set space. Hence, if the instance classifier makes a bad predic- of features and yt ∈ {−1, +1} is a binary class label for tion, the reason can also be the high difference between the all t. Let wt be the weight vector for the instance classi- feature spaces of itself and the current instance. fier at round t and yˆt = sign(xt · wt) be the prediction We measure the loss of information after projecting a vec- of the instance classifier for the instance xt. Let w¯t be the tor onto a feature space with projection confidence: proba- weight vector for the feature space classifier at round t and bility that the instance classifier makes a correct prediction xt pt = σ(R · w¯t) is the prediction of the feature space clas- given the vectors projected feature space. We learn to esti- sifier for the feature space Rxt . mate projection confidences by training a feature space clas- In data with varying feature spaces, the feature space can sifier w¯t. Let I(y, yˆ) be an indicator function defined as: dynamically change; new features can be introduced or some 1, if y =y ˆ of the existing features can cease to exist. At each round I(y, yˆ) = (2) 0, t, when an instance arrives, we divide the explored feature otherwise space into three groups: existing, shared and new features. We model the probability that the instance classifier wt The existing features are only contained by the current clas- makes a correct prediction, y =y ˆ, given the feature space xt sifiers, the new features are only contained by the current R parameterized by the feature space classifier w¯t with the instance, and the shared features are the intersection of cur- likelihood function rent classifiers’ and the current instance’s feature spaces. To xt P (I(yt, yˆt)|R ;w ¯t) = make a prediction, the learning algorithm projects its cur- e s n xt e xt s xt n I(yt,yˆt) rent classifiers and the current training instance onto their σ(R · w¯t + R · w¯t + R · w¯t ) + e s n shared feature space. We denote the projection of a classi- xt e xt s xt n 1−I(yt,yˆt) σ(R · w¯t + R · w¯t + R · w¯t ) . (3) fier at round t onto the existing feature space as w e, shared t After taking the negative logarithm of the likelihood func- feature space as w s and the new feature space as w n. This t t tion above, we rearrange and define our to train notation also applies for the projections of other vectors such e s n the feature space classifier w¯t as: as training instances (xt , xt , xt ), and feature space clas- e s n ¯ xt sifiers (w¯t , w¯t , w¯t ). lt(R ,I(yt, yˆt), w¯t) = We define a feature space as a binary bag of features s xs n xn R −I(yt,yˆt)w ¯ t −I(yt,yˆt)w ¯ t vector, where each feature in the shared and new feature log(1 + e t R e t R ) (4)

3234 Since we learn the feature space classifier in an online set- find τ: ting, merely minimizing the loss function can lead to a poor ¯ 2 ¯ 2 performance. As the learner receives new instances and up- 1 2 ∂lt 1 2 ∂lt L(τ) = τ + τ + dates the feature space classifier, it is important to preserve 2 ∂w¯s 2 ∂w¯n the knowledge of previously seen feature spaces. Moreover τ¯l ( xt ,I(y , yˆ ), w¯ ) (9) when a set of new features arrive and need to be added to the t R t t t current feature space classifier, there can be infinitely many −¯l ( xt ,I(y , yˆ ), w¯ ) τ = min{C, t R t t t }. (10) solutions because of the lack of information about those new ¯ 2 ¯ 2 ∂lt + ∂lt features. To avoid overfitting, we need to learn the smallest ∂w¯s ∂w¯n weights possible that minimize the loss. Also, because real- To make a prediction given an instance xt, the learner life data is noisy, we want our algorithm to follow a soft- makes projections of both the instance classifier and the in- margin strategy to avoid poor generalization. As a result, we stance onto the current shared feature space. We use the formulate the problem of learning the feature space classifier feature space classifier w¯t to estimate the projection con- as constrained optimization, using the loss function defined fidences and adaptively re-weight the different parts of the in Equation (4): objective function corresponding to the different pieces of e s n the instance classifier weights wt = [wt , wt , wt ]. This re- 1 e e 2 w¯t+1 = argmin kw¯ − w¯t k + weighting strategy helps to adjust the priorities of each con- w¯=[w ¯e,w¯s,w¯n]: 2 straint in the objective function, scales the regularization ¯ lt≤ξ,ξ≥0 weights for each instance xt, and improves the learning con- 1 1 kw¯s − w¯sk2 + kw¯nk2 + Cξ,¯ (5) vergence. 2 t 2 ¯ Learning to Classify Instances from Varying Note that we use the slack variable C¯ to bound the loss lt. We ¯ Feature Spaces solve the optimization problem with the inequalities lt ≤ ξ and ξ ≥ 0 defined in Equation (5) by using a Lagrangian In order to adapt the instance classifier to the dynamic na- function with K.K.T. conditions (Luo and Yu 2006): ture of the varying feature spaces, we modify the Hinge loss by using the projections of the instance classifier and the in- 1 1 stance: L(w, ¯ τ, α, ξ) = kw¯e − w¯ek2 + kw¯s − w¯sk2 + 2 t 2 t s s n n lt = max{0, 1 − y(xt · w ) − y(xt · w )}. (11) 1 kw¯nk2 + Cξ¯ + τ(¯l − ξ) − αξ, (6) 2 t Motivated by the same reasons discussed for the objective function defined in Equation (5), we formulate the problem where τ and α are Lagrange multipliers. Setting the deriva- of learning the instance classifier from varying features as tives of L(w, ¯ τ, α, ξ) with respect to w¯e, w¯s, w¯n and ξ, we obtain the following conditions: 1 e e 2 wt+1 = argmin kw − wt k + e s n 2 w¯e =w ¯e w=[w ,w ,w ]: t lt≤ξ,ξ≥0 ¯ xt s s ∂lt(R ,I(yt, yˆt), w¯t) 1 1 w¯ =w ¯ + τ pw kws − w sk2 + px kwnk2 + Cξ, (12) t ∂w¯s 2 t t 2 t ¯ xt ∂lt( ,I(yt, yˆt), w¯t) w wt wn = −τ R where pt = σ(R · w¯t) is the projection confidence of the n x xt ∂w¯ instance classifier wt and pt = σ(R · w¯t) is the projec- α = C − τ, tion confidence of the instance xt. Note that the objectives for ws and wn are weighted using these projection con- where fidences. When the projection confidence of the classifier onto the shared subspace pw is close to 1, then the shared ∂¯l ( xt ,I(y , yˆ ), w¯ ) t t R t t t = feature space is not very different than the feature space of ∂w¯s s n the current classifier. Therefore, the objective of minimizing −I(y ,yˆ )(w ¯s xt +w ¯n xt ) 2 log(e t t t R t R ) s s s xt kw − wt k is important in order to remember the informa- − s n I(yt, yˆt) , (7) −I(y ,yˆ )(w ¯s xt +w ¯n xt ) R log(1 + e t t t R t R ) tion received from previous instances. Moreover if the pro- jection confidence of the training instance onto the shared and x n 2 subspace pt is close to 1, the objective of minimizing kw k is easier and can be emphasized more to avoid overfitting. ∂¯l ( xt ,I(y , yˆ ), w¯ ) t R t t t = On the other hand low projection confidences imply a sud- ∂w¯n s n den change in the feature space. In this case, the objective −I(y ,yˆ )(w ¯s xt +w ¯n xt ) log(e t t t R t R ) n function must focus on minimizing the loss more than try- xt − s n I(yt, yˆt) (8) −I(y ,yˆ )(w ¯s xt +w ¯n xt ) R ing to keep the classifier weights small or close to the pre- log(1 + e t t t R t R ) vious weights. We solve the optimization problem with in- Using the conditions and partial derivatives defined above, equalities defined above by using another Lagrangian func- we obtain the following representation of our problem and tion with K.K.T. conditions and obtain the following update

3235 rules: Algorithm 1: The OLVF Algorithm e e Input : C, C¯ > 0 w = wt : Loss bounding parameters λ > 0: Regularization parameter ws = w s + τpwy x s t t t t B ∈ (0, 1]: Proportion of selected features n x n w = τp y x x1 t t t Initialize: w1 = (0, ..., 0) ∈ R x1 α = C − τ. Initialize: w¯1 = (0, ..., 0) ∈ R 1 for t = 1,2,... T do xt Using these conditions, we obtain the following representa- 2 Receive the current instance xt ∈ R s w ∩x tion of our problem and find τ: 3 Identify the shared feature space R = R t t s s s 4 Project wt, xt onto R : wt , xt s s 1 2 w 2 s 2 1 2 x 2 n 2 5 Predict the class label: yˆt = sign(wt · xt ) L(τ) = τ (pt ) kxt k + τ (pt ) kxt k + 2 2 6 Receive the correct label: yt ∈ {+1, −1} x s s 7 Update feature space classifier with y , yˆ , t ; τ(1 − yt(wt · xt )), (13) t t R using Equation (7). 8 Predict the projection confidences: −lt w w 9 p = σ( t · w¯ ) τ = min{C, 2 }. (14) t R t x xt kxtk 10 pt = σ(R · w¯t) w x 11 Update the instance classifier with xt, yt, pt , pt ; After rearranging, we can define the update strategy as: using Equation (15). 12 Project wt+1 using Equation (16) and λ. w = [we , ws , wn ] = t+1 t+1 t+1 t+1 13 Truncate wt+1 and w¯t+1 using B. w s e s w s pt ltytxt 14 end [w , wt + min{Cpt ytxt , 2 }, kxtk pxl y x n min{Cpxy x n, t t t t }]. (15) t t t 2 Time Complexity of OLVF kxtk The pseudocode of our online learning from varying features Model Sparsity (OLVF) algorithm is shown in Algorithm 1. The time com- plexity of the OLVF algorithm is as follows. Assuming at Because the feature space can dynamically change, the num- round t, |wt| = |w¯t| is the number of features in the cur- ber of features kept in the classifiers are not bounded. In rent classifiers, |xt| is the number of features arrived with its current design, the learner keeps remembering even if the training instance, and |Rs| is the number of features in the features stop being generated. This can act as a bias on the shared feature space. Since the operations are being done feature space and instance classifier weights when the fea- for both the weight vectors and the current training instance, tures are reintroduced. Moreover, when vanishing features the time complexity for identifying the shared feature space re-emerge, there is a possibility that their meanings are dif- and projecting the weight vectors with the training instance ferent. As a result, this bias can lead to a poor performance. onto their shared feature space is O(|wt| + |xt|). The time Furthermore, because there is no feature selection mecha- complexity for making a prediction, calculating the projec- nism in the learner, the learner keeps using even the least tion confidences and suffering losses is O(|Rs|), because all important features. Therefore, to improve the memory us- of these steps use the shared feature space. Finally, the time age and running time efficiency, we do not want to keep the complexity for the sparsity step is O(|wt|), since the size of classifier weights for every feature received by the learner. the instance classifier is always equal to the size of the fea- Simply truncating the smallest weights from the instance ture space classifier. For a single round, the worst case time classifier leads to poor performance because it introduces a s complexity of the OLVF algorithm is O(|wt| + |xt| + |R |). sudden change to the result of the dot product. On the other s Because |R | is always less than |wt| + |xt|, we can further hand, the feature space classifier can tolerate this change be- simplify the complexity as O(|wt| + |xt|). Considering the cause of the saturating sigmoid function. Therefore to spar- fact that at any round t the feature sets of wt and xt can be sify the instance classifier, we introduce the following pro- completely disjoint, we do not apply further simplification jection step before truncation: for the time complexity. In other words, the running time of the algorithm linearly scales with the number of features that λ wt := min{1, }wt, (16) are being used at each round t. wt · w¯t where λ is a regularization parameter. With a ratio of B, we Experiments truncate the smallest elements from the classifiers. Since w¯t In this section we empirically evaluate the performance of carries the information of which features contribute to the the OLVF algorithm in three different scenarios: data with classification task, we use it to scale the weights of the in- simulated varying feature spaces, simulated trapezoidal data stance classifier. This strategy helps to identify and truncate streams (Zhang et al. 2016) and real-life varying feature the features that are either redundant or vanishing. spaces. In each scenario, the instances are provided to the

3236 Table 1: Numbers of samples and features of 10 datasets. Table 2: Average number of errors made by OLV F on 9 Dataset # Samples # Features UCI datasets in simulated varying feature spaces. wpbc 198 34 Rem. german ionosphere spambase ionosphere 351 35 0.25 333.4 ± 9.7 77.2 ± 7.1 659.8 ± 14.5 wdbc 569 31 0.5 350.9 ± 7.8 79.5 ± 7.4 864 ± 20.6 wbc 699 10 0.75 365.15 ± 3.6 79.7 ± 5.9 1375.8 ± 21.5 german 1,000 24 Rem. magic04 svmguide3 wbc svmguide3 1,234 21 0.25 6152.4 ± 54.7 346.4 ± 11.6 25.3 ± 1.4 spambase 4,601 57 0.5 6775 ± 27.4 367.2 ± 11.9 60.6 ± 5.3 magic04 19,020 10 0.75 7136.4 ± 2.7 371.2 ± 11.7 123.1 ± 3.6 imdb 25,000 7500 Rem. wpbc wdbc a8a a8a 32,561 123 0.25 88.5 ± 5.8 40.8 ± 3.5 8993.8 ± 40.3 0.5 90.2 ± 5.1 55.2 ± 5.4 9585.8 ± 53.8 0.75 107.7 ± 7.7 202.85 ± 7.5 12453 ± 74.4 classifier incrementally and only a single pass is allowed. Similarly, the classifier does not have any initial knowledge Table 3: Average number of errors made by non-sparse about the full feature space. We use 9 different UCI datasets OLV F and sparse OLV F on 9 UCI datasets in simulated to simulate these scenarios. Additionally, we demonstrate varying feature spaces. the effectiveness of the proposed sparse strategy. Finally Alg german ionosphere spambase we evaluate the performance of OLVF using the real-world ns. 333.4 ± 9.7 77.2 ± 7.1 659.8 ± 14.5 dataset IMDB movie reviews (Maas et al. 2011). The num- s. 358 ± 9.2 79 ± 15.9 825.1 ± 42.7 bers of instances and features for each dataset are listed in Alg magic04 svmguide3 wbc Table 1. In all experiments, we measure the performance in ns. 6154 ± 47.1 367.4 ± 11.6 25 ± 1.4 terms of the average prediction accuracy on 20 random per- s. 6621.6 ± 46.6 361.4 ± 20.7 32.3 ± 2.6 mutations of each dataset. The parameters C and C¯ are cho- Alg wpbc wdbc a8a sen using grid search. ns. 90.2 ± 5.1 39.6 ± 5.2 8933.2 ± 28.5 s. 97.4 ± 4.6 131 ± 7.9 8588.6 ± 575.3 Experiments on Varying Feature Spaces We simulate data with varying feature spaces by randomly removing features from every training instance. We denote used by the non-sparse and sparse versions of the OLVF the ratio of features removed from each training instance algorithm, along with their parameter settings. We can see as removing ratio (Rem.). At each round, we apply a ran- that with datasets german, ionosphere, spambase, magic04, dom permutation to the complete dataset and randomly re- svmguide3, wbc and a8a sparse and non-sparse versions of move the features. We conduct two different types of ex- the proposed algorithm perform similarly, while the sparse periments for varying features. The first set of experiments version of the algorithm uses a smaller subset of the pro- measure the performance of the proposed algorithm in vary- vided feature space. With the wpbc dataset, the sparse ver- ing feature spaces using removing ratios of 0.25, 0.5 and sion of the algorithm demonstrates a more consistent perfor- 0.75. The second set of experiments show the trade-off be- mance because the sparse strategy helps to reduce the noise tween the accuracy and the sparsity of the classifier, in data and avoid overfitting. In the wdbc dataset, the sparse version with varying feature spaces using 0.25 as the removing ra- of the proposed algorithm performs similarly to the non- tio. 1) Experiments using different removing ratios: We test sparse version. However, it follows a lower accuracy trend the performance of the proposed algorithm using 3 differ- by ∼ 10% because wdbc has a relatively less number of in- ent removing ratios: 0.25, 0.5 and 0.75. If the removing ra- stances and less redundant features after randomly removing tio for a varying feature space is 0.25, then 1/4 of the fea- 1/4 of the features to simulate varying feature spaces. tures are randomly removed from each training instance, and so on. Table 2 shows the average number of wrong predic- Table 4: Parameters used by the sparse OLVF algorithm. tions made by OLVF with different removing ratios on 9 UCI datasets. For all datasets, OLVF achieves higher accu- Dataset B C C racy on the experiments with smaller removing ratio. This is wbc 0.6 1 0.01 expected since a small removing ratio means a high number svmguide3 0.4 0.1 0.001 of features sent to the classifier. 2) Experiments on sparse wpbc 0.7 0.0001 0.01 and non-sparse OLVF: We observe the trade-off between us- ionosphere 0.1 0.01 0.1 ing sparse classifiers to decrease the time and space usage, magic04 0.6 0.1 0.0001 and using the full feature space to increase the accuracy. To german 0.5 0.01 0.001 demonstrate the effect of the sparsity step, we set the remov- spambase 0.3 0.01 0.001 ing ratio to 0.25 while simulating the varying feature spaces. wdbc 0.9 1 0.1 Table 3 shows the average number of errors made by OLVF a8a 0.05 0.001 0.00001 and sparse OLVF. Table 4 shows the number of features

3237 Table 5: Average number of errors made by OLSF and OLV F on 9 UCI datasets in simulated trapezoidal data streams, and on the IMDB dataset. Algorithm german ionosphere OLSF 385.5 ± 10.2 57.9 ± 4.7 OLV F 329.2 ± 9.8 51.8 ± 3.1 Algorithm magic04 svmguide3 OLSF 6147.4 ± 65.3 361.7 ± 29.7 OLV F 5784.0 ± 52.7 351.6 ± 25.9 Algorithm wpbc wdbc OLSF 87.9 ± 5.6 52.9 ± 4.5 OLV F 78.2 ± 5.5 45.4 ± 2.9 Algorithm spambase wbc

OLSF 993.5 ± 25 48.1 ± 12.6 Figure 1: Mean error rates of OLV F and OLSF algorithms OLV F 825.8 ± 20.2 31.1 ± 2.8 as the data streams in (left), and projection confidences esti- Algorithm a8a imdb mated by the feature space classifier (right). OLSF 9420.4 ± 549.9 7851.0 ± 51.4 OLV F 8649.8 ± 526.7 4474.2 ± 39.1 data. We evaluate these two approaches by comparing the prediction accuracies of OLVF and OLSF algorithms. For Experiments on Trapezoidal Data Streams both algorithms, we set B’s to 0.1 and find their best setting We compare the prediction accuracy of OLVF with OLSF for the C and C¯ parameters by using grid search. We set the algorithms (Zhang et al. 2016) on trapezoidal data streams. C to 0.1, and C¯ to 10−5. For each dataset, we use the version of OLSF that performs Table 5 shows that average number of wrong predictions the best to compare with our algorithm. For the sparsity step, made by OLVF is lower than OLSF . Moreover, Figure 1 we use the same λ and B parameters as in (Zhang et al. (left) shows that OLVF has a faster learning trend and higher ¯ 2016). We used grid search to find the best C and C for accuracy than OLSF because of its feature space adaptive each dataset. To simulate trapezoidal data streams, we split constraint weighting strategy. Figure 1 (left) also shows that each dataset into 10 chunks where the number of features the loss suffered by the feature space classifier of OLVF de- included by each chunk increases as the data flows in. For creases as the data streams in. This indicates that the model example, instances in the first chunk have the first 10 per- learns how to classify feature spaces as it receives more in- cent of features, instances in the second chunk have the first stances. Figure 1 (right) shows the change of projection con- 20 percent of features and so on. Table 5 shows the average fidence predictions made by the feature space classifier. Note numbers of errors with variances of the OLVF and the OLSF that the projection confidences randomly fluctuate because algorithms in trapezoidal streams. Note that because the ex- each instance comes from an arbitrary feature space. Addi- isting and shared feature spaces in trapezoidal data streams tional to the fluctuations, the projection confidence of the never change, the only component that is effective while re- instance classifier has an increasing trend. This is because weighting the objective function is the projection confidence the instance classifier learns new features and new informa- of the current training instance. From these experiments, we tion relevant to its existing features, therefore becomes more observe that our OLVF achieves higher prediction accuracy confident against unknown features as the data streams in. while learning classifiers with the same sparsity as OLSF, because of its feature space adaptive constraint re-weighting strategy. In Table 5 we observe that in addition to lower num- Conclusion bers of errors in 9 UCI datasets, the standard deviation of In this paper we explored a new problem of online learn- the errors made by OLVF in 20 rounds is also lower than the ing from data with varying feature spaces, where the feature OLSF algorithms. This shows that OLVF has consistently space of each training instance can be arbitrarily different better performance than the OLSF algorithms on the 9 UCI from other instances. We defined the concept of projection datasets. confidence for learning from varying feature spaces and de- rived an update rule for a projection confidence estimator. Application to Real-World Varying Feature Spaces Then, we presented a new algorithm OLVF that learns fea- In IMDB Movie Reviews dataset, the task is to classify each ture space level and instance level classifiers from data with movie review, provided in raw text, into positive or negative varying features by adaptively reweighting the constraints of sentiment. Each new movie review can include words that the objective function with projection confidences. We eval- the learner have never seen, or exclude the words exist in the uated the proposed OLVF algorithms on 9 UCI datasets and learners feature space. Hence, we can formulate the prob- a real life dataset. Additionally, we compared the sparse and lem as learning from varying feature spaces and use OLVF. non-sparse versions of OLVF algorithm. Finally, we showed The problem can also be seen as learning from trapezoidal that our algorithm outperforms the state-of-the-art OLSF al- data streams by assuming non-existing features are missing gorithms.

3238 Acknowledgments Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; This research is supported by the US National Science Foun- and Potts, C. 2011. Learning word vectors for sentiment dation (NSF) under grants 1652107 and 1763620. analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Lan- guage Technologies, 142–150. Portland, Oregon, USA: As- References sociation for Computational Linguistics. Agarwal, S.; Saradhi, V. V.; and Karnick, H. 2008. Kernel- Orabona, F., and Crammer, K. 2010. New adaptive algo- based online and support vector reduction. rithms for online classification. In Advances in Neural In- Neurocomputing 71(7-9):1230–1237. formation Processing Systems, 1840–1848. Bartlett, P. L., and Wegkamp, M. H. 2008. Classification Rosenblatt, F. 1958. The : A probabilistic model with a reject option using a hinge loss. Journal of Machine for information storage and organization in the brain. Psy- Learning Research 9(Aug):1823–1840. chological Review 65(6):386. Bolon-Canedo, V.; Sanchez-Marono, N.; and Alonso- Roughgarden, T., and Schrijvers, O. 2017. Online predic- Betanzos, A. 2011. Feature selection and classification in tion with selfish experts. In Advances in Neural Information multiple class datasets: An application to kdd cup 99 dataset. Processing Systems, 1300–1310. Expert Systems with Applications 38(5):5947–5957. Saez,´ J. A.; Galar, M.; Luengo, J.; and Herrera, F. 2014. Cesa-Bianchi, N.; Conconi, A.; and Gentile, C. 2005. A Analyzing the presence of noise in multi-class problems: al- second-order perceptron algorithm. SIAM Journal on Com- leviating its influence with the one-vs-one decomposition. puting 34(3):640–668. Knowledge and Information Systems 38(1):179–206. Crammer, K.; Dekel, O.; Keshet, J.; Shalev-Shwartz, S.; and Shalev-Shwartz, S., et al. 2012. Online learning and online Singer, Y. 2006. Online passive-aggressive algorithms. convex optimization. Foundations and Trends in Machine Journal of Machine Learning Research 7(Mar):551–585. Learning 4(2):107–194. Crammer, K.; Dredze, M.; and Pereira, F. 2009. Exact con- Shawe-Taylor, J., and Cristianini, N. 2002. On the gener- vex confidence-weighted learning. In Advances in Neural alization of soft margin algorithms. IEEE Transactions on Information Processing Systems, 345–352. Information Theory 48(10):2721–2735. Crammer, K.; Kulesza, A.; and Dredze, M. 2009. Adap- Tsang, I. W.; Kocsor, A.; and Kwok, J. T. 2007. Simpler tive regularization of weight vectors. In Advances in Neural core vector machines with enclosing balls. In Proceedings Information Processing Systems, 414–422. of the 24th International Conference on Machine Learning, 911–918. ACM. Domingos, P., and Hulten, G. 2000. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD Wang, J.; Zhao, P.; and Hoi, S. C. 2016. Soft confidence- international conference on Knowledge Discovery and Data weighted learning. ACM Transactions on Intelligent Systems Mining, 71–80. ACM. and Technology (TIST) 8(1):15. Wu, Y., and Liu, Y. 2007. Robust truncated hinge loss sup- Gentile, C., and Warmuth, M. K. 1999. Linear hinge loss port vector machines. Journal of the American Statistical and average margin. In Advances in Neural Information Pro- Association 102(479):974–983. cessing Systems, 225–231. Wu, X.; Yu, K.; Ding, W.; Wang, H.; and Zhu, X. 2013. Gentile, C. 2001. A new approximate maximal margin clas- Online feature selection with streaming features. IEEE sification algorithm. Journal of Machine Learning Research Transactions on Pattern Analysis and Machine Intelligence 2(Dec):213–242. 35(5):1178–1192. Hou, B.-J.; Zhang, L.; and Zhou, Z.-H. 2017. Learning with Xu, J. 2011. An extended one-versus-rest support vec- feature evolvable streams. In Advances in Neural Informa- tor machine for multi-label classification. Neurocomputing tion Processing Systems, 1417–1427. 74(17):3114–3124. Kivinen, J.; Smola, A. J.; and Williamson, R. C. 2004. On- Ying, Y., and Pontil, M. 2008. Online gradient descent learn- line learning with kernels. IEEE Transactions on Signal Pro- ing algorithms. Foundations of Computational Mathematics cessing 52(8):2165–2176. 8(5):561–596. Kuzborskij, I., and Cesa-Bianchi, N. 2017. Nonparametric Yu, H.; Neely, M.; and Wei, X. 2017. Online convex opti- online regression while learning the metric. In Advances in mization with stochastic constraints. In Advances in Neural Neural Information Processing Systems, 667–676. Information Processing Systems, 1428–1438. Leite, D.; Costa, P.; and Gomide, F. 2010. Evolving granular Zhang, Q.; Zhang, P.; Long, G.; Ding, W.; Zhang, C.; and neural network for semi-supervised data stream classifica- Wu, X. 2015. Towards mining trapezoidal data streams. In tion. In Neural Networks (IJCNN), The 2010 International (ICDM), 2015 IEEE International Conference Joint Conference on, 1–8. IEEE. on, 1111–1116. IEEE. Luo, Z.-Q., and Yu, W. 2006. An introduction to convex op- Zhang, Q.; Zhang, P.; Long, G.; Ding, W.; Zhang, C.; and timization for communications and signal processing. IEEE Wu, X. 2016. Online learning from trapezoidal data streams. Journal on Selected Areas in Communications 24(8):1426– IEEE Transactions on Knowledge and Data Engineering 1438. 28(10):2709–2723.

3239