Online Incremental Feature Learning with Denoising Autoencoders

Guanyu Zhou, Kihyuk Sohn, Honglak Lee
Department of EECS, University of Michigan, Ann Arbor, MI 48109

Abstract

While determining model complexity is an important problem in machine learning, many feature learning algorithms rely on cross-validation to choose an optimal number of features, which is usually challenging for online learning from a massive stream of data. In this paper, we propose an incremental feature learning algorithm to determine the optimal model complexity for large-scale, online datasets based on the denoising autoencoder. This algorithm is composed of two processes: adding features and merging features. Specifically, it adds new features to minimize the residual of the objective function and merges similar features to obtain a compact feature representation and prevent over-fitting. Our experiments show that the proposed model quickly converges to the optimal number of features in a large-scale online setting. In classification tasks, our model outperforms the (non-incremental) denoising autoencoder, and deep networks constructed from our algorithm perform favorably compared to deep belief networks and stacked denoising autoencoders. Further, the algorithm is effective in recognizing new patterns when the data distribution changes over time in a massive online data stream.

1 Introduction

In recent years, there has been much interest in online learning algorithms [1, 9, 29, 8], with the purpose of developing an intelligent agent that can perform life-long learning in complex real environments. To that end, many large-scale learning algorithms have also been proposed. In particular, Support Vector Machines (SVMs) [5] have been one of the most popular discriminative methods for large-scale learning; approaches include selective sampling to reduce the number of support vectors [4, 24] and incremental support vector learning to train an SVM with a small number of examples in the early phase of training [7, 23]. Typically, linear SVMs are used in large-scale settings due to their efficient training and their scalability to large datasets. However, this approach is limited in that the feature mapping has to be fixed during training; therefore, it cannot be adapted to the training data. Alternatively, there are several methods for jointly training the feature mapping and the classifier, such as multi-task learning [31, 6], transfer learning [2, 22, 17], non-parametric Bayesian learning [1, 10], and deep learning [16, 3, 28, 27, 20]. Among these, we are interested in deep learning approaches, which have shown promise in learning features from complex, high-dimensional unlabeled and labeled data. Specifically, we present a large-scale feature learning algorithm based on the denoising autoencoder (DAE) [32]. The DAE is a variant of the autoencoder [3] that extracts features by adding perturbations to the input data and attempting to reconstruct the original data. This process learns features that are robust to input noise and useful for classification.

Despite this promise, determining the optimal model complexity, i.e., the number of features for the DAE, remains a nontrivial question. When there are too many features, for example, the model may overfit the data or converge very slowly. On the other hand, when there are too few features, the model may underfit due to the lack of relevant features. Further, finding an optimal feature set size becomes even more difficult for large-scale or online datasets whose distribution may change over time, since cross-validation may be challenging given a limited amount of time or computational resources.

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.

To address this problem, we propose an incremental algorithm to learn features from large-scale online data by adaptively incrementing the features depending on the data and the existing features, using the DAE as a basic building block. Specifically, new features are added and trained to minimize the residual of the objective function in generative tasks (e.g., minimizing reconstruction error) or discriminative tasks (e.g., minimizing a supervised loss). At the same time, similar features are merged to avoid redundant feature representations. Experimental results show that our incremental feature learning algorithms perform favorably compared to non-incremental feature learning algorithms, including the standard DAE, the deep belief network (DBN), and the stacked denoising autoencoder (SDAE), on classification tasks with large datasets. Moreover, we show that incremental feature learning is more effective in quickly recognizing new patterns than the non-incremental algorithms when the data distribution (e.g., the ratio of labels) is highly non-stationary.

2 Preliminaries

In this paper, we use the denoising autoencoder [32] as a building block for incremental feature learning. Based on the idea that good features should be robust to input corruption, the DAE tries to recover the input data x from the encoded representation f(x̃) of corrupted data x̃ via a decoding function g(h). More precisely, there are four components in the DAE:

• A conditional distribution q(x̃|x), which stochastically perturbs the input x to a corrupted version x̃.
• An encoding function f(x̃) ≡ h ∈ R^N, which gives a representation of the input data x.
• A decoding function g(h) ≡ x̂ ∈ R^D, which recovers the data from the representation h.
• A differentiable cost function L(x), which computes the dissimilarity between the input and the reconstruction.

Throughout the paper, we consider the case in which the input and the hidden variables are binary (or bounded between 0 and 1), i.e., x ∈ [0, 1]^D and h ∈ [0, 1]^N. As described in [32], we consider the conditional distribution q(x̃|x) = ∏_{i=1}^D q(x̃_i|x_i) that randomly sets some of the coordinates to zeros. By corrupting the input elements, the DAE attempts to learn informative representations that can successfully recover the missing coordinates and reconstruct the original input data. In other words, the DAE is trained to fill in the missing values introduced by corruption.

The DAE perturbs the input x to x̃ and maps it to the hidden representation (or, abusing notation, the feature) h through the encoding function defined as follows:

  h = f(x̃) = sigm(W x̃ + b),

where W ∈ R^{N×D} is a weight matrix, b ∈ R^N is a hidden bias vector, and sigm(s) = 1 / (1 + exp(−s)) is the sigmoid function. Then, we reconstruct x̂ ∈ [0, 1]^D from h using the decoding function:

  x̂ = g(h) = sigm(W^T h + c),

where c ∈ R^D is an input bias vector. Here, we implicitly assumed tied weights for the encoding and decoding functions, but separate weights can be considered as well.

In this work, we use cross-entropy as the cost function:

  L(x) = H(x, x̂) = −∑_{i=1}^D [ x_i log x̂_i + (1 − x_i) log(1 − x̂_i) ].

Finally, the objective is to learn the model parameters that minimize the cost function,

  {W, b, c} = arg min_{W,b,c} ∑_j L(x^(j)),

and this can be optimized by gradient descent (or conjugate gradient, L-BFGS, etc.) with respect to all parameters of the encoding and decoding functions.
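
To make the preliminaries concrete, the following is a minimal NumPy sketch of a tied-weight DAE with masking noise, a sigmoid encoder/decoder, and the cross-entropy cost described above. It is an illustration under our own choices of corruption rate, learning rate, and initialization, not the authors' released code.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

class DAE:
    """Tied-weight denoising autoencoder: h = sigm(W x~ + b), x^ = sigm(W^T h + c)."""
    def __init__(self, n_visible, n_hidden, corruption=0.2, lr=0.1):
        self.W = rng.normal(0, 0.01, (n_hidden, n_visible))  # N x D weight matrix
        self.b = np.zeros(n_hidden)                           # hidden bias
        self.c = np.zeros(n_visible)                          # input (decoding) bias
        self.corruption, self.lr = corruption, lr

    def corrupt(self, x):
        # q(x~|x): randomly set a fraction of the coordinates to zero
        mask = rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        return x * mask

    def loss_and_grad_step(self, x):
        x_tilde = self.corrupt(x)
        h = sigm(self.W @ x_tilde + self.b)          # encode
        x_hat = sigm(self.W.T @ h + self.c)          # decode (tied weights)
        eps = 1e-8
        loss = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
        # gradients of the cross-entropy cost
        d_out = x_hat - x                            # error at the decoder pre-activation
        d_h = (self.W @ d_out) * h * (1 - h)         # backprop through the encoder
        grad_W = np.outer(d_h, x_tilde) + np.outer(h, d_out)  # encoder + decoder paths
        self.W -= self.lr * grad_W
        self.b -= self.lr * d_h
        self.c -= self.lr * d_out
        return loss

# usage on a random binary vector (stand-in for a 28x28 image)
dae = DAE(n_visible=784, n_hidden=500)
x = (rng.rand(784) > 0.5).astype(float)
print(dae.loss_and_grad_step(x))
```
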
3 Incremental Feature Learning

In many cases, the number of hidden units N is fixed during the training of a DAE, and the optimal value is determined by cross-validation. However, cross-validation is computationally prohibitive and often infeasible for large datasets. Furthermore, it may not perform well when the distribution of the training data significantly changes over time.

To address these issues, we present an adaptive feature learning algorithm that can handle large-scale datasets without explicitly performing cross-validation over the number of features on the whole dataset. In detail, our incremental feature learning algorithm is composed of two key processes: (1) adding new feature mappings to the existing feature set, and (2) merging parts of the existing feature mappings that are considered redundant. We describe the details of the proposed algorithm in this section.

3.1 Overview

The incremental feature learning algorithm addresses the following two issues: (1) when to add and merge features, and (2) how to add and merge features. To deal with these problems systematically, we define an objective function to determine when to add or merge features and how to learn the new features.

Specifically, we form a set B composed of hard training examples whose objective function values are greater than a certain threshold µ, and we use these examples as input data to greedily initialize the new features. However, some of the collected hard examples may be irrelevant to the task of interest. Therefore, we only begin adding features when there are enough examples collected in B (i.e., |B| > τ). In our implementation, we set µ to the average of the objective function values over the most recent 10,000 training examples; τ was also set to 10,000. These values were not tuned. Note that it is not atypical to use 10,000 examples for training when using conjugate gradient or L-BFGS (e.g., [21]).

After incrementing new features (i.e., initializing new features and adding them to the feature set), we jointly train all features with the upcoming training examples. Note that our feature learning algorithm still optimizes the original objective function on the training data, and the subset B is used only for initializing the feature increments. The overall algorithm is schematically shown in Figure 1, and pseudo-code is provided in Algorithm 1.

Algorithm 1 Incremental feature learning
  repeat
    Compute the objective L(x) for input x.
    Collect hard examples into a subset B (i.e., B ← B ∪ {x} if L(x) > µ).
    if |B| > τ then
      Select 2∆M candidate features and merge them into ∆M features (Section 3.3).
      Add ∆N new features by greedily optimizing with respect to the subset B (Section 3.2).
      Set B = ∅ and update ∆N and ∆M.
    end if
    Fine-tune all the features jointly on the current batch of data (i.e., optimize via gradient descent with respect to all parameters).
  until convergence

In the following sections, we describe adding new features and merging similar existing features based on both the generative and discriminative objective functions.

Notations. In this paper, we denote by D the input dimension, by N the number of (existing) features, and by K the number of class labels. Based on these notations, the parameters for generative training are defined as the (tied) weight matrix W ∈ R^{N×D}, the hidden bias b ∈ R^N, and the input bias c ∈ R^D. Similarly, the parameters for discriminative training are composed of the weight matrix Γ ∈ R^{K×N} between hidden and output units and the output bias ν ∈ R^K, as well as the weight matrix W and the hidden bias b. We use θ to denote all parameters {W, b, c, Γ, ν}. For incremental feature learning, we use the subscript N to denote new features and O to denote existing (old) features. For example, f_O(x̃) ≡ h_O ∈ [0, 1]^N represents an encoding function with existing features, and f_N(x̃) ≡ h_N ∈ [0, 1]^{∆N} denotes an encoding function with newly added features. A combined encoding function is then written as f_{O∪N}(x̃) = [h_O; h_N] ∈ [0, 1]^{N+∆N}. Likewise, θ_O refers to the existing parameters, i.e., {W_O, b_O, c, Γ_O, ν_O}, and θ_N = {W_N, b_N, c, Γ_N, ν_N} denotes the parameters for the new features.
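
For concreteness, the following is a schematic sketch of the outer loop of Algorithm 1. The callables `add_features`, `merge_features`, `finetune_all`, and `update_increments`, the `model.objective` interface, and the initial values of ∆N and ∆M are placeholders for the procedures of Sections 3.2, 3.3, and 5.1; they are assumptions of this sketch rather than the authors' implementation.

```python
from collections import deque

def incremental_feature_learning(model, data_stream, add_features, merge_features,
                                 finetune_all, update_increments,
                                 tau=10_000, window=10_000):
    """Schematic outer loop of Algorithm 1; the callable arguments stand in for
    the procedures of Sections 3.2, 3.3, and 5.1 and are not defined here."""
    recent = deque(maxlen=window)       # losses used to set the threshold mu
    B = []                              # hard-example subset
    delta_N, delta_M = 20, 5            # assumed initial increment sizes

    for batch in data_stream:
        for x, y in batch:
            loss = model.objective(x, y)            # generative, discriminative, or hybrid
            recent.append(loss)
            mu = sum(recent) / len(recent)          # average over recent training examples
            if loss > mu:                           # collect hard examples into B
                B.append((x, y))

        if len(B) > tau:
            merge_features(model, delta_M, B)       # Section 3.3
            add_features(model, delta_N, B)         # Section 3.2
            B = []
            delta_N, delta_M = update_increments(delta_N, delta_M)  # Section 5.1 heuristics

        finetune_all(model, batch)                  # joint gradient update on the batch
```
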
3.2 Adding features

In this section, we describe an efficient learning algorithm for adding features under different types of objective functions. The basic idea for efficient training is that only the new features and the corresponding parameters θ_N are trained to minimize the objective function, while keeping θ_O fixed. In the following subsections, we expand on the different training criteria: generative, discriminative, and hybrid objectives.

3.2.1 Generative training

A generative objective function measures the average reconstruction error between the input x and the reconstruction x̂. The cross-entropy function, assuming that the input and output values are both within the interval [0, 1], is used as the objective function. The optimization problem is posed as:

  min_{W_N, b_N} (1/|B|) ∑_{i∈B} L_gen(x^(i)),   (1)

where L_gen(x) = H(x, x̂). For incremental feature learning, the DAE optimizes the new features to reconstruct the data x from the corrupted data x̃. The encoding and decoding functions for the new features can be written as:

  h_N = sigm(W_N x̃ + b_N),   (2)
  x̂ = sigm(W_N^T h_N + W_O^T h_O + c).   (3)

The key idea here is that, although the output x̂ depends on both the new features h_N and the old features h_O as described in equation (3), training the incremental feature parameters θ_N to minimize the residual of the objective function is still highly efficient because θ_O is fixed during training. From another point of view, we can interpret W_O^T h_O as a part of the bias. Thus, we can rewrite equation (3) as:

  x̂ = sigm(W_N^T h_N + c_d(h_O)),   (4)

where c_d(h_O) ≡ W_O^T h_O + c is viewed as a dynamic decoding bias. Since the parameters for the existing features and the corresponding activations do not change during the training of new features, we compute h_O once and reuse c_d(h_O) for each training example without additional computational cost. In this way, we can efficiently optimize the new parameters greedily via stochastic gradient descent.
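
The sketch below illustrates this trick, assuming the `DAE` class from the earlier example: the dynamic decoding bias c_d(h_O) is precomputed once per example in B, and only W_N and b_N are updated. It is an illustration of equations (1)-(4) under assumed hyperparameters, not the reference implementation.

```python
import numpy as np

rng = np.random.RandomState(1)

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def add_features_generative(dae, B, delta_N, lr=0.1, epochs=5, corruption=0.2):
    """Greedily train delta_N new features on the hard-example set B (eqs. 1-4).

    `dae` is assumed to expose W (N x D), b, and c as in the earlier DAE sketch.
    Returns the new parameters (W_N, b_N) to be appended to the model.
    """
    D = dae.W.shape[1]
    W_N = rng.normal(0, 0.01, (delta_N, D))
    b_N = np.zeros(delta_N)

    # Precompute the dynamic decoding bias c_d(h_O) once per example in B,
    # since the old parameters are held fixed while the new features are trained.
    cache = []
    for x in B:
        x_tilde = x * rng.binomial(1, 1.0 - corruption, size=x.shape)
        h_O = sigm(dae.W @ x_tilde + dae.b)
        cache.append((x, x_tilde, dae.W.T @ h_O + dae.c))   # (x, x~, c_d)

    for _ in range(epochs):
        for x, x_tilde, c_d in cache:
            h_N = sigm(W_N @ x_tilde + b_N)                 # eq. (2)
            x_hat = sigm(W_N.T @ h_N + c_d)                 # eq. (4)
            d_out = x_hat - x                               # gradient of cross-entropy
            d_hN = (W_N @ d_out) * h_N * (1 - h_N)
            W_N -= lr * (np.outer(d_hN, x_tilde) + np.outer(h_N, d_out))
            b_N -= lr * d_hN
    return W_N, b_N
```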


Figure 1: Illustration of single-layer incremental feature learning. During training, our incremental feature learning algorithm optimizes the parameters of the new features (the orange units and the blue unit on the right), while holding the other features fixed (the yellow units outside the dotted blue box on the left). The orange units are incremental features, and the blue unit is the feature merged from the similar existing features depicted as yellow units inside the dotted blue box. See text for details.

3.2.2 Discriminative training

A discriminative objective function computes the average classification loss between the actual label y ∈ [0, 1]^K and the predicted label ŷ ∈ [0, 1]^K. More precisely, cross-entropy is used as the performance measure, and we pose the optimization problem as follows:

  min_{W_N, b_N, Γ_N} (1/|B|) ∑_{i∈B} L_disc(x^(i), y^(i)),   (5)

where L_disc(x, y) = H(y, ŷ(x)). Note that the label y is a binary vector with a softmax unit that allows one element to be 1 out of K dimensions for the K-way classification problem (i.e., y_k = 1 if and only if the example is in the k-th class). Formally speaking, the discriminative model predicts ŷ as a posterior probability of class labels via the softmax activation function

  ŷ = softmax(ν + Γ_O f_O(x̃) + Γ_N f_N(x̃)),   (6)

where softmax(a)_k = exp(a_k) / ∑_{k'} exp(a_{k'}), k = 1, ..., K, for a ∈ R^K. A similar interpretation of ν + Γ_O f_O(x̃) as in the generative training is possible, and therefore we can efficiently train the new parameters {W_N, b_N, Γ_N} using gradient descent.

3.2.3 Hybrid training

Considering the discriminative model as a single objective function carries a risk of overfitting. As a remedy, we can use a hybrid objective function that combines the discriminative and generative loss functions as follows:

  L_hybrid(x, y) = L_disc(x, y) + λ L_gen(x).   (7)

In the hybrid objective function, we further normalize both loss functions with respect to their target dimensions. In our experiments, we found that any value of λ ∈ [0.1, 0.4] gave roughly the best performance.
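
A minimal sketch of the hybrid objective in equation (7), with the per-dimension normalization mentioned above, might look as follows. The feature split into (h_O, h_N) and the normalization by target dimension follow the text; the shapes, numerical-stability constants, and argument layout are assumptions of this sketch.

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def binary_cross_entropy(target, pred, eps=1e-8):
    return -np.sum(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

def hybrid_loss(x, x_hat, y, h_O, h_N, Gamma_O, Gamma_N, nu, lam=0.2):
    """L_hybrid = L_disc + lam * L_gen, each normalized by its target dimension (eq. 7)."""
    y_hat = softmax(nu + Gamma_O @ h_O + Gamma_N @ h_N)          # eq. (6)
    L_disc = -np.sum(y * np.log(y_hat + 1e-8)) / y.size          # K-way classification loss
    L_gen = binary_cross_entropy(x, x_hat) / x.size              # reconstruction loss
    return L_disc + lam * L_gen
```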

3.3 Merging features

As described in the previous section, incremental feature mappings can be efficiently trained to improve the generative and discriminative objectives on online datasets. However, monotonically adding features could potentially result in many redundant features and overfitting. To deal with this problem, we consider merging similar features to produce more compact feature representations, which can mitigate the problem of overfitting as well.

Our merging process is done in two steps: we select a pair of candidate features and merge them into a single feature. Detailed descriptions are given below:

• Select a set of candidate features to be merged, M = {m1, m2} ⊂ O, and replace f_O by f_{O\M} (i.e., remove the m1-th and m2-th features from O).
• Add a new feature mapping f_N that replaces the merged features (see also Section 3.2).

The candidate feature pair and the new feature mapping can be determined by solving the following optimization problem:

  min_{θ_N, M} (1/|B|) ∑_{i∈B} L_hybrid(x^(i), y^(i)).   (8)

However, optimizing (8) jointly with respect to θ_N and M is computationally expensive because the complexity of an exhaustive search over all pairs of candidate features grows quadratically in the number of features. As an approximation, we decompose the original optimization problem into a two-step process, where we first find the most similar feature pair as merging candidates, and then solve the optimization problem with respect to θ_N only. In addition to being efficient in training, this approximate optimization process also minimizes an upper bound of the objective function in equation (8). The process is described below:

• Find the pair of features whose distance is minimal: M = arg min_{m1,m2} d(W_{m1}, W_{m2}).
• Add new features to θ_{O\M} by solving θ_N = arg min_{θ_N} (1/|B|) ∑_{i∈B} L_hybrid(x^(i), y^(i)).

Given the candidate features to merge, the newly added feature can be trained via gradient descent as described in the previous section. Unlike Section 3.2, however, we initialize the new feature parameters as a weighted average (i.e., linear combination) of the two candidate feature parameters for faster convergence:

  θ_N = [ ∑_{x∈B} ( P(h_{m1}|x̃; θ_{m1}) θ_{m1} + P(h_{m2}|x̃; θ_{m2}) θ_{m2} ) ] / [ ∑_{x∈B} ( P(h_{m1}|x̃; θ_{m1}) + P(h_{m2}|x̃; θ_{m2}) ) ].
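
A sketch of this merge step is given below: candidate selection by cosine distance between weight rows (the distance used in our experiments; see Section 3.4), and initialization of the merged feature as the activation-weighted average above. It is shown for W and b only (the classifier weights Γ can be averaged the same way), and the surrounding model bookkeeping (how rows are removed and appended) is an assumption of the sketch.

```python
import numpy as np

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def closest_pair_cosine(W):
    """Return the pair of hidden units whose weight vectors have minimal cosine distance."""
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-8)
    sim = Wn @ Wn.T
    np.fill_diagonal(sim, -np.inf)        # ignore self-similarity
    m1, m2 = np.unravel_index(np.argmax(sim), sim.shape)
    return m1, m2

def merged_init(W, b, m1, m2, B_tilde):
    """Activation-weighted average of the two candidate features over corrupted examples B_tilde."""
    num_W = np.zeros_like(W[m1])
    num_b = 0.0
    denom = 0.0
    for x_tilde in B_tilde:
        p1 = sigm(W[m1] @ x_tilde + b[m1])   # P(h_m1 = 1 | x~)
        p2 = sigm(W[m2] @ x_tilde + b[m2])   # P(h_m2 = 1 | x~)
        num_W += p1 * W[m1] + p2 * W[m2]
        num_b += p1 * b[m1] + p2 * b[m2]
        denom += p1 + p2
    return num_W / denom, num_b / denom
```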

3.4 General formulations

In this section, we discuss variations of our algorithm along four axes: (1) encoding and decoding functions of the DAE depending on the type of input data, (2) discriminative objectives, (3) distance functions for determining merge candidates, and (4) an extension to a deep incremental network.

Encoding and decoding functions. In the previous sections, we developed our algorithm based on sigmoid functions for both the encoder and the decoder. However, other linear and non-linear functions (e.g., tanh, linear, or rectified-linear) can be adopted. For example, we can use a linear decoding function with a mean squared error loss for real-valued input.

Discriminative training. In Section 3.2.2, we used a softmax classifier with cross-entropy for discriminative tasks. However, this can easily be generalized to other classifiers. For example, an SVM can be adopted in our algorithm by defining a hinge loss (the primal objective of the SVM) at the output layer of the network.

Merging process. We used cosine distance (normalized inner product) for all the experiments. Alternatively, we can consider other similarity functions, such as KL divergence, the intersection kernel, and the Chi-square kernel, to select candidate features to merge. Furthermore, we note that in addition to the merging technique described in Section 3.3, other merging methods or pruning techniques, such as "forward search," "backward search," or pruning infrequently activated features, can be used to reduce model complexity.

Deep incremental network. The proposed incremental learning method can be readily adapted to deep architectures. For example, we can train a deep incremental network that determines the optimal feature set size of multiple layers in a greedy way, by fixing the number of features in the lower layers and incrementing the features only at the top layer sequentially.
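
The following sketch illustrates one way to realize this greedy layer-wise extension. The helper `train_incremental_layer` (a wrapper around the single-layer incremental training loop sketched in Section 3.1) and the `transform` method it is assumed to return are hypothetical names for this illustration, not part of the paper.

```python
def train_deep_incremental(data_stream_fn, num_layers=3):
    """Greedy layer-wise extension: lower layers stay fixed; feature increments
    happen only at the current top layer (Section 3.4)."""
    layers = []
    for _ in range(num_layers):
        def encoded_stream():
            for batch in data_stream_fn():                     # online stream of (x, y) pairs
                for lower in layers:                           # frozen lower layers
                    batch = [(lower.transform(x), y) for x, y in batch]
                yield batch
        # train_incremental_layer: hypothetical wrapper around the single-layer
        # incremental DAE training loop; assumed to return an object with .transform()
        layers.append(train_incremental_layer(encoded_stream()))
    return layers
```
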
4 Connection to Boosting and Related Work

In this section, we compare incremental feature learning with boosting algorithms, since they share similar motivations and properties. Boosting (e.g., [18, 13, 30, 25, 15, 14, 33]) is a learning framework that trains a single strong learner from a set of weak learners. Incremental feature learning and boosting algorithms are similar in that they add weak learners (or, in the context of feature learning, the feature mappings corresponding to newly added hidden units) to improve the existing classifiers. However, incremental feature learning determines the number of feature increments based on performance, while typical boosting methods introduce a pre-determined number of weak learners at each stage. Further, the merging process makes our algorithm robust to overfitting and more adaptive to an online stream of data whose distribution (of the data or the class labels) may fluctuate over time, since it effectively removes redundant weak learners.

Importantly, the weak learners in boosting algorithms are typically used as classifiers (for predicting class labels), whereas in our setting they can be feature mappings used to improve classification performance or to learn representations from input data, which makes our algorithm flexibly applicable to unsupervised, semi-supervised, or self-taught learning [26] settings. Similar to our approach, Chen et al. [11] presented an incremental feature learning algorithm, called boosted sparse coding, that adds a single feature at a time using sparse coding objectives. Our incremental learning algorithm is different in that (1) we add multiple features at a time based on several performance criteria (i.e., generative/discriminative/hybrid objectives); (2) we have a merging process that helps avoid overfitting; and (3) we optimize the whole network (i.e., via fine-tuning) after initializing new features.

In addition to boosting, our model is also related to cascade correlation [12]. A cascade-correlation neural network adds a new hidden unit as a new layer at each step, and each new unit depends on all previously added units. Thus, a cascade-correlation neural network cannot effectively learn a large number of hidden units (which would mean a very deep network). In our proposed model, however, all hidden units (assuming a one-hidden-layer model) can be computed independently given the input data, which makes training more efficient. In addition, instead of simply adding hidden units, our model is more adaptive to the input data through adding and merging features. In our experiments, the cascade-correlation neural network performed well with dozens of hidden units; however, on more complicated tasks, our model always outperformed cascade correlation by a large margin.

5 Experimental Results

5.1 Experimental setup

We evaluate the performance of the proposed learning algorithm on large-scale extended datasets based on the MNIST variation and rect-images datasets (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations). The MNIST variation datasets are composed of digit images with different variation types, such as rotation, random background, or image background, and their combinations. In the rect-images dataset, each image contains a single rectangle with a varying aspect ratio, filled with an image that is different from its background image. Since the original datasets have a limited number of training examples, we extended the dataset sizes to 1 million; samples from these datasets are shown in Figure 2. We note that our extended datasets are more challenging, since we used more diverse background images (randomly selected from 2,000 images from CIFAR-10 [19]), whereas the original MNIST variation datasets used background patches extracted from 20 images downloaded from the internet. (See the supplementary material for further discussion of the datasets.)

Figure 2: Samples from the extended MNIST datasets (bg-img-1M, bg-rand-1M, rot-1M, rot-img-1M) and the extended rect dataset (rect-img-1M).

In all experiments, we used a 20% corruption rate for the DAE, and we implemented all algorithms with the Jacket GPU library running on an nVidia GTX 470. For optimization, we used L-BFGS (implementation from http://www.di.ens.fr/~mschmidt/Software/) with a minibatch size of 1,000, since a similar setting has shown good performance in training autoencoders without tuning the learning rate [21]. For all feature learning algorithms, we trained features by pre-training on the first 12,000 training examples, followed by supervised fine-tuning in an online setting.

For incremental feature learning, the number of feature increments ∆N_t and the number of merged features ∆M_t at time t are adaptively determined by monitoring performance. In this study, we mainly considered two types of update rules. With the first type of update rule, when the performance on the current online data batch is far from convergence, we increase ∆N_t to accelerate; on the other hand, if the performance difference between two consecutive batches is very small, we regard this as convergence and reduce ∆N_t to decelerate the pace. To balance the model, we keep ∆M_t proportional to ∆N_t during updates. Alternatively, with the second type of update rule, we considered a combination of AIC and BIC to control the number of features. In this way, a model starting with a large number of features can still recover from potential overfitting by penalizing the model complexity. Note that all results reported in this section are based on the latter update rule. We also evaluated the classification accuracy with other variants of the update rules (for ∆N and ∆M) and found that the performance was fairly robust to the choice of update rule. For more details about selecting the number of incremental and merged features, see the supplementary material.

5.2 Evaluation on generative criteria

In this experiment, we compare our algorithm to the non-incremental DAE on generative criteria. Specifically, we tested on the bg-img-1M dataset and show the online reconstruction error and the number of features over training time in Figures 3(a) and 3(b), respectively. We report results with 400, 500, 600, and 800 initial features for both incremental and non-incremental learning. In addition, we report the result of non-incremental learning with 1,200 features, which is approximately the optimal feature set size obtained from incremental feature learning. In Figures 3(b) and 3(c), we observe that the incremental feature learning algorithm could find an approximately optimal number of features regardless of its initial feature set size, and that it led to smaller or comparable reconstruction error compared to its non-incremental counterpart.

5.3 Evaluation on discriminative criteria

To evaluate the discriminative performance of our algorithm, we ran experiments using a hybrid objective function, a combination of the generative and discriminative objective functions (with λ = 0.2). First, we compare the classification performance of the incremental feature learning algorithm with that of its non-incremental counterpart on the bg-img-1M dataset.

Figure 3: (a,d) Test (online) reconstruction error and classification error (on the bg-img-1M dataset) of incremental (IncMDAE) and non-incremental DAE. (b,e) The number of features for incremental learning with the generative and hybrid criteria. (c,f) Test reconstruction error and classification error at convergence. The numbers in parentheses refer to the initial number of features. (Panels plot reconstruction/classification error and number of features against training time in minutes; the plotted data are omitted here.)

Similar to Section 5.2, we report the classification errors in Figure 3(d) using 400, 500, 600, and 800 initial features for both the incremental and non-incremental algorithms, and we additionally plot the result of non-incremental learning with 1,000 features. Overall, as Figures 3(d), 3(e), and 3(f) indicate, we observed trends similar to the results under the generative criteria. Figures 3(d) and 3(f) show that incremental DAEs find an approximately optimal number of features (regardless of the initial feature set sizes) while achieving consistently lower classification errors than their non-incremental counterparts.

For a more extensive evaluation, we tested the incremental feature learning algorithms with merging (IncMDAE) and without merging (IncDAE) on the extended datasets. In addition, we evaluated the (non-incremental) 1-layer DAE, a linear SVM, and online multiclass LPBoost [30]. All algorithms were tested in an online manner (i.e., evaluated on the next mini-batch of examples that had not previously been seen during training). As Table 1 shows, both the 1-layer incremental feature learning algorithm with merging (IncMDAE^l1) and without merging (IncDAE^l1) outperformed the linear SVM, DAE^l1, and online multiclass LPBoost. The results suggest that, although our method is similar to boosting, it can effectively use both unlabeled and labeled data (via pre-training followed by supervised/hybrid fine-tuning). In addition, IncMDAE consistently achieved classification performance similar to or better than IncDAE, suggesting that the merging process is effective at maintaining the balance between specificity and robustness by learning less redundant features.

We also evaluated deep network extensions of our algorithm and compared against the deep belief network (referred to as "DBN^{l1-3}") and the stacked denoising autoencoder (referred to as "SDAE^{l1-3}"), both containing 3 layers. For both the DBN and the SDAE, the number of hidden units in each layer was selected from {500, 1000, 2000} via cross-validation. As shown in Table 1, the deep incremental networks (IncMDAE^{l1,2} and IncMDAE^{l1-3}) performed favorably compared to these standard deep networks. The results suggest that our incremental DAE can be a useful building block for learning deep networks.

5.4 Evaluating on non-stationary distributions

One challenge in online learning is to fit the model to data whose distribution fluctuates over time. In this section, we perform an experiment to see how well the incremental learning algorithm can adapt to such an online stream of data compared to its non-incremental counterpart.

  Method \ Dataset | rot-bg-img-1M | rot-1M       | bg-rand-1M   | bg-img-1M    | rect-img-1M
  linear SVM       | 55.17 ± 0.45  | 24.02 ± 0.33 | 19.17 ± 0.33 | 29.24 ± 0.46 | 33.18 ± 0.38
  DAE^l1           | 38.51 ± 0.45  | 10.23 ± 0.35 |  7.94 ± 0.24 |  7.65 ± 0.30 | 24.27 ± 0.31
  OMCLPBoost       | 42.11 ± 0.04  | 12.02 ± 0.02 |  7.22 ± 0.03 | 10.29 ± 0.09 | 29.51 ± 0.13
  DBN^{l1-3}       | 30.01 ± 0.31  |  8.00 ± 0.31 |  6.27 ± 0.35 |  9.20 ± 0.19 | 16.36 ± 0.31
  SDAE^{l1-3}      | 30.13 ± 0.25  |  7.76 ± 0.35 |  6.22 ± 0.33 |  7.56 ± 0.31 | 13.20 ± 0.30
  IncDAE^l1        | 35.50 ± 0.41  | 10.28 ± 0.23 |  7.00 ± 0.10 |  7.20 ± 0.17 | 18.50 ± 0.34
  IncMDAE^l1       | 35.41 ± 0.49  |  8.20 ± 0.38 |  6.90 ± 0.33 |  6.37 ± 0.28 | 14.15 ± 0.32
  IncMDAE^{l1,2}   | 29.04 ± 0.45  |  6.60 ± 0.33 |  6.14 ± 0.30 |  6.20 ± 0.31 |  9.28 ± 0.44
  IncMDAE^{l1-3}   | 27.60 ± 0.40  |  6.80 ± 0.44 |  6.30 ± 0.26 |  5.24 ± 0.34 |  9.89 ± 0.54

Table 1: Test (online) error (%) for classification tasks on the large-scale datasets. The best performing methods for each dataset (within the standard errors) are shown in bold.

Figure 4: Effects of a non-stationary data distribution on the bg-img-1M dataset. (a) An example non-stationary distribution (of class labels) generated from a Gaussian process. (b) Average test (online) classification error. (c) Standard deviation of the test (online) classification error. (Panels plot the per-class ratios, the mean classification error, and the standard deviation of the classification error against training time in minutes; the plotted data are omitted here.)

We generated training data using the bg-img-1M dataset with a label distribution that changes over time; the ratio of labels at time t was randomly assigned for each class k via

  ratio_k(t) = exp{a_k(t)} / ∑_{j=1}^K exp{a_j(t)},  k = 1, ..., K,

where a_k(t) is a random curve generated by a Gaussian process (an example class distribution is shown in Figure 4(a)). We report results averaged over 5 random trials.

Figure 4 shows the mean and the standard deviation of the test classification error on the fluctuating bg-img-1M dataset. Our incremental learning model shows a significantly lower mean classification error. Furthermore, in Figure 4(c), the low standard deviations of the errors achieved by the IncMDAE models imply that our algorithm is more robust and can rapidly adapt to the fluctuating distribution. This result suggests that incremental feature learning can be an effective training method for online learning with challenging non-stationary data distributions.
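
As an illustration of how such a fluctuating label distribution can be produced, the sketch below draws one smooth random curve a_k(t) per class from a Gaussian process with an RBF covariance and passes the curves through the softmax above. The kernel, length scale, and time grid are assumptions of this sketch; the paper does not specify them.

```python
import numpy as np

def sample_label_ratios(num_classes=10, num_steps=200, length_scale=20.0, seed=0):
    """Draw a_k(t) ~ GP(0, RBF) for each class and softmax them into ratios (K x T)."""
    rng = np.random.RandomState(seed)
    t = np.arange(num_steps)
    cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / length_scale) ** 2)
    cov += 1e-6 * np.eye(num_steps)                       # jitter for numerical stability
    a = rng.multivariate_normal(np.zeros(num_steps), cov, size=num_classes)  # K x T curves
    e = np.exp(a - a.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)               # ratio_k(t); each column sums to 1

ratios = sample_label_ratios()
print(ratios.shape, ratios[:, 0].sum())                   # (10, 200), ~1.0
```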

6 Conclusion

In this paper, we proposed an incremental feature learning algorithm based on denoising autoencoders. The proposed algorithm can learn good feature representations from large datasets by incrementing and merging features. Further, this algorithm can be used to avoid expensive cross-validation in selecting the number of features for large datasets. Our experimental results suggest that (1) incremental feature learning provides an efficient way to learn from large datasets by starting with a small set of initial features and automatically adjusting the number of features as training proceeds; (2) the proposed algorithm leads to comparable or lower reconstruction and classification errors than its non-incremental counterparts; (3) the incremental learning algorithm outperforms other state-of-the-art methods on large online datasets; and (4) incremental feature learning is more adaptive and robust to highly non-stationary input distributions. We believe our approach holds promise for developing scalable feature learning algorithms for large-scale datasets.

Acknowledgments

This work was supported in part by a Google Faculty Research Award.

References

[1] R. P. Adams, H. M. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS, 2010.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2007.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.

[4] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579–1619, 2005.

[5] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, 1992.

[6] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[7] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In NIPS, 2001.

[8] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[9] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. An online algorithm for large scale image similarity learning. In NIPS, 2009.

[10] B. Chen, G. Polatkan, G. Sapiro, D. Dunson, and L. Carin. The hierarchical beta process for convolutional factor analysis and deep learning. In ICML, 2011.

[11] B. Chen, K. Swersky, B. Marlin, and N. de Freitas. Sparsity priors and boosting for learning localized distributed feature representations. Technical report, University of British Columbia, 2010.

[12] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In NIPS, 1990.

[13] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, 1996.

[14] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

[15] A. Grubb and J. A. Bagnell. Boosted backpropagation learning for training deep modular networks. In ICML, 2010.

[16] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[17] T. Jebara. Multi-task feature and kernel selection for SVMs. In ICML, 2004.

[18] M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.

[19] A. Krizhevsky. Learning multiple layers of features from Tiny Images. Master's thesis, University of Toronto, 2009.

[20] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.

[21] Q. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In ICML, 2011.

[22] S.-I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In ICML, 2007.

[23] X. Liu, G. Zhang, Y. Zhan, and E. Zhu. An incremental feature learning algorithm based on least square support vector machine. In Frontiers in Algorithmics, pages 330–338, 2008.

[24] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, 2007.

[25] N. C. Oza. Online Ensemble Learning. Ph.D. thesis, University of California, Berkeley, 2001.

[26] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.

[27] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In NIPS, 2007.

[28] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.

[29] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Subgradient methods for maximum margin structured learning. In ICML, 2006.

[30] A. Saffari, M. Godec, T. Pock, C. Leistner, and H. Bischof. Online multi-class LPBoost. In CVPR, 2010.

[31] S. Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, 1996.

[32] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

[33] R. S. Zemel and T. Pitassi. A gradient-based boosting algorithm for regression problems. In NIPS, 2001.