Online Incremental Feature Learning with Denoising Autoencoders

Guanyu Zhou, Kihyuk Sohn, Honglak Lee
Department of EECS, University of Michigan, Ann Arbor, MI 48109

Abstract

While determining model complexity is an important problem in machine learning, many feature learning algorithms rely on cross-validation to choose an optimal number of features, which is usually challenging for online learning from a massive stream of data. In this paper, we propose an incremental feature learning algorithm to determine the optimal model complexity for large-scale, online datasets based on the denoising autoencoder. This algorithm is composed of two processes: adding features and merging features. Specifically, it adds new features to minimize the residual of the objective function and merges similar features to obtain a compact feature representation and prevent over-fitting. Our experiments show that the proposed model quickly converges to the optimal number of features in a large-scale online setting. In classification tasks, our model outperforms the (non-incremental) denoising autoencoder, and deep networks constructed from our algorithm perform favorably compared to deep belief networks and stacked denoising autoencoders. Further, the algorithm is effective in recognizing new patterns when the data distribution changes over time in a massive online data stream.

1 Introduction

In recent years, there has been much interest in online learning algorithms [1, 9, 29, 8], with the purpose of developing an intelligent agent that can perform life-long learning in complex real environments. To that end, many large-scale learning algorithms have also been proposed. In particular, Support Vector Machines (SVMs) [5] have been one of the most popular discriminative methods for large-scale learning; approaches include selective sampling to reduce the number of support vectors [4, 24] and incremental support vector learning to train an SVM with a small number of examples in the early phase of training [7, 23]. Typically, linear SVMs are used in large-scale settings due to their efficient training and their scalability to large datasets. However, this approach is limited in that the feature mapping has to be fixed during training; therefore, it cannot be adapted to the training data. Alternatively, there are several methods for jointly training the feature mapping and the classifier, such as multi-task learning [31, 6], transfer learning [2, 22, 17], non-parametric Bayesian learning [1, 10], and deep learning [16, 3, 28, 27, 20]. Among these, we are interested in deep learning approaches, which have shown promise in learning features from complex, high-dimensional unlabeled and labeled data. Specifically, we present a large-scale feature learning algorithm based on the denoising autoencoder (DAE) [32]. The DAE is a variant of the autoencoder [3] that extracts features by adding perturbations to the input data and attempting to reconstruct the original data. This process learns features that are robust to input noise and useful for classification.

Despite this promise, determining the optimal model complexity, i.e., the number of features for the DAE, remains a nontrivial question. When there are too many features, for example, the model may overfit the data or converge very slowly. On the other hand, when there are too few features, the model may underfit due to the lack of relevant features. Further, finding an optimal feature set size becomes even more difficult for large-scale or online datasets whose distribution may change over time, since cross-validation may be challenging given a limited amount of time or computational resources.

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.

To address this problem, we propose an incremental algorithm to learn features from large-scale online data by adaptively incrementing the features depending on the data and the existing features, using the DAE as a basic building block. Specifically, new features are added and trained to minimize the residual of the objective function in generative tasks (e.g., minimizing reconstruction error) or discriminative tasks (e.g., minimizing a supervised loss). At the same time, similar features are merged to avoid redundant feature representations. Experimental results show that our incremental feature learning algorithms perform favorably compared to non-incremental feature learning algorithms, including the standard DAE, the deep belief network (DBN), and the stacked denoising autoencoder (SDAE), on classification tasks with large datasets. Moreover, we show that incremental feature learning is more effective in quickly recognizing new patterns than the non-incremental algorithms when the data distribution (e.g., the ratio of labels) is highly non-stationary.

2 Preliminaries

In this paper, we use the denoising autoencoder [32] as a building block for incremental feature learning. Based on the idea that good features should be robust to input corruption, the DAE tries to recover the input data x from the encoded representation f(x̃) of corrupted data x̃ via a decoding function g(h). More precisely, there are four components in the DAE:

• A conditional distribution q(x̃|x), which stochastically perturbs the input x to a corrupted version x̃.
• An encoding function f(x̃) ≡ h ∈ R^N, which gives a representation of the input data x.
• A decoding function g(h) ≡ x̂ ∈ R^D, which recovers the data from the representation h.
• A differentiable cost function L(x), which computes the dissimilarity between the input and the reconstruction.

Throughout the paper, we consider the case in which the input and the hidden variables are binary (or bounded between 0 and 1), i.e., x ∈ [0, 1]^D and h ∈ [0, 1]^N. As described in [32], we consider the conditional distribution q(x̃|x) = ∏_{i=1}^D q(x̃_i|x_i) that randomly sets some of the coordinates to zeros. By corrupting the input elements, the DAE attempts to learn informative representations that can successfully recover the missing coordinates and reconstruct the original input data. In other words, the DAE is trained to fill in the missing values introduced by corruption.

The DAE perturbs the input x to x̃ and maps it to the hidden representation (or, abusing notation, the feature) h through the encoding function defined as follows:

  h = f(x̃) = sigm(W x̃ + b),

where W ∈ R^{N×D} is a weight matrix, b ∈ R^N is a hidden bias vector, and sigm(s) = 1 / (1 + exp(−s)) is the sigmoid function. Then, we reconstruct x̂ ∈ [0, 1]^D from h using the decoding function:

  x̂ = g(h) = sigm(W^T h + c),

where c ∈ R^D is an input bias vector. Here, we implicitly assumed tied weights for the encoding and decoding functions, but separate weights can be considered as well.

In this work, we use cross-entropy as the cost function:

  L(x) = H(x, x̂) = −∑_{i=1}^D [ x_i log x̂_i + (1 − x_i) log(1 − x̂_i) ].

Finally, the objective is to learn the model parameters that minimize the cost function,

  {W, b, c} = arg min_{W,b,c} ∑_j L(x^(j)),

and this can be optimized by gradient descent (or conjugate gradient, L-BFGS, etc.) with respect to all parameters of the encoding and decoding functions.
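
To make the preliminaries concrete, the following is a minimal NumPy sketch of a tied-weight DAE with masking noise, a sigmoid encoder/decoder, and the cross-entropy cost described above. It is an illustration under our own choices of corruption rate, learning rate, and initialization, not the authors' released code.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

class DAE:
    """Tied-weight denoising autoencoder: h = sigm(W x~ + b), x^ = sigm(W^T h + c)."""
    def __init__(self, n_visible, n_hidden, corruption=0.2, lr=0.1):
        self.W = rng.normal(0, 0.01, (n_hidden, n_visible))  # N x D weight matrix
        self.b = np.zeros(n_hidden)                           # hidden bias
        self.c = np.zeros(n_visible)                          # input (decoding) bias
        self.corruption, self.lr = corruption, lr

    def corrupt(self, x):
        # q(x~|x): randomly set a fraction of the coordinates to zero
        mask = rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        return x * mask

    def loss_and_grad_step(self, x):
        x_tilde = self.corrupt(x)
        h = sigm(self.W @ x_tilde + self.b)          # encode
        x_hat = sigm(self.W.T @ h + self.c)          # decode (tied weights)
        eps = 1e-8
        loss = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
        # gradients of the cross-entropy cost
        d_out = x_hat - x                            # error at the decoder pre-activation
        d_h = (self.W @ d_out) * h * (1 - h)         # backprop through the encoder
        grad_W = np.outer(d_h, x_tilde) + np.outer(h, d_out)  # encoder + decoder paths
        self.W -= self.lr * grad_W
        self.b -= self.lr * d_h
        self.c -= self.lr * d_out
        return loss

# usage on a random binary vector (stand-in for a 28x28 image)
dae = DAE(n_visible=784, n_hidden=500)
x = (rng.rand(784) > 0.5).astype(float)
print(dae.loss_and_grad_step(x))
```
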
3 Incremental Feature Learning

In many cases, the number of hidden units N is fixed during the training of a DAE, and the optimal value is determined by cross-validation. However, cross-validation is computationally prohibitive and often infeasible for large datasets. Furthermore, it may not perform well when the distribution of the training data significantly changes over time.

To address these issues, we present an adaptive feature learning algorithm that can handle large-scale datasets without explicitly performing cross-validation over the number of features on the whole dataset. In detail, our incremental feature learning algorithm is composed of two key processes: (1) adding new feature mappings to the existing feature set, and (2) merging parts of the existing feature mappings that are considered redundant. We describe the details of the proposed algorithm in this section.

3.1 Overview

The incremental feature learning algorithm addresses the following two issues: (1) when to add and merge features, and (2) how to add and merge features. To deal with these problems systematically, we define an objective function to determine when to add or merge features and how to learn the new features.

Specifically, we form a set B composed of hard training examples whose objective function values are greater than a certain threshold µ, and we use these examples as input data to greedily initialize the new features. However, some of the collected hard examples may be irrelevant to the task of interest. Therefore, we only begin adding features when there are enough examples collected in B (i.e., |B| > τ). In our implementation, we set µ to the average of the objective function values over the most recent 10,000 training examples; τ was also set to 10,000. These values were not tuned. Note that it is not atypical to use 10,000 examples for training when using conjugate gradient or L-BFGS (e.g., [21]).

After incrementing new features (i.e., initializing new features and adding them to the feature set), we jointly train all features with the upcoming training examples. Note that our feature learning algorithm still optimizes the original objective function on the training data, and the subset B is used only for initializing the feature increments. The overall algorithm is schematically shown in Figure 1, and pseudo-code is provided in Algorithm 1.

Algorithm 1 Incremental feature learning
  repeat
    Compute the objective L(x) for input x.
    Collect hard examples into a subset B (i.e., B ← B ∪ {x} if L(x) > µ).
    if |B| > τ then
      Select 2∆M candidate features and merge them into ∆M features (Section 3.3).
      Add ∆N new features by greedily optimizing with respect to the subset B (Section 3.2).
      Set B = ∅ and update ∆N and ∆M.
    end if
    Fine-tune all the features jointly on the current batch of data (i.e., optimize via gradient descent with respect to all parameters).
  until convergence

In the following sections, we describe adding new features and merging similar existing features based on both the generative and discriminative objective functions.

Notations. In this paper, we denote by D the input dimension, by N the number of (existing) features, and by K the number of class labels. Based on these notations, the parameters for generative training are defined as the (tied) weight matrix W ∈ R^{N×D}, the hidden bias b ∈ R^N, and the input bias c ∈ R^D. Similarly, the parameters for discriminative training are composed of the weight matrix Γ ∈ R^{K×N} between hidden and output units and the output bias ν ∈ R^K, as well as the weight matrix W and the hidden bias b. We use θ to denote all parameters {W, b, c, Γ, ν}. For incremental feature learning, we use the subscript N to denote new features and O to denote existing (old) features. For example, f_O(x̃) ≡ h_O ∈ [0, 1]^N represents an encoding function with existing features, and f_N(x̃) ≡ h_N ∈ [0, 1]^{∆N} denotes an encoding function with newly added features. A combined encoding function is then written as f_{O∪N}(x̃) = [h_O; h_N] ∈ [0, 1]^{N+∆N}. Likewise, θ_O refers to the existing parameters, i.e., {W_O, b_O, c, Γ_O, ν_O}, and θ_N = {W_N, b_N, c, Γ_N, ν_N} denotes the parameters for the new features.
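
For concreteness, the following is a schematic sketch of the outer loop of Algorithm 1. The callables `add_features`, `merge_features`, `finetune_all`, and `update_increments`, the `model.objective` interface, and the initial values of ∆N and ∆M are placeholders for the procedures of Sections 3.2, 3.3, and 5.1; they are assumptions of this sketch rather than the authors' implementation.

```python
from collections import deque

def incremental_feature_learning(model, data_stream, add_features, merge_features,
                                 finetune_all, update_increments,
                                 tau=10_000, window=10_000):
    """Schematic outer loop of Algorithm 1; the callable arguments stand in for
    the procedures of Sections 3.2, 3.3, and 5.1 and are not defined here."""
    recent = deque(maxlen=window)       # losses used to set the threshold mu
    B = []                              # hard-example subset
    delta_N, delta_M = 20, 5            # assumed initial increment sizes

    for batch in data_stream:
        for x, y in batch:
            loss = model.objective(x, y)            # generative, discriminative, or hybrid
            recent.append(loss)
            mu = sum(recent) / len(recent)          # average over recent training examples
            if loss > mu:                           # collect hard examples into B
                B.append((x, y))

        if len(B) > tau:
            merge_features(model, delta_M, B)       # Section 3.3
            add_features(model, delta_N, B)         # Section 3.2
            B = []
            delta_N, delta_M = update_increments(delta_N, delta_M)  # Section 5.1 heuristics

        finetune_all(model, batch)                  # joint gradient update on the batch
```
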
3.2 Adding features

In this section, we describe an efficient learning algorithm for adding features under different types of objective functions. The basic idea for efficient training is that only the new features and the corresponding parameters θ_N are trained to minimize the objective function, while keeping θ_O fixed. In the following subsections, we expand on the different training criteria: generative, discriminative, and hybrid objectives.

3.2.1 Generative training

A generative objective function measures the average reconstruction error between the input x and the reconstruction x̂. The cross-entropy function, assuming that the input and output values are both within the interval [0, 1], is used as the objective function. The optimization problem is posed as:

  min_{W_N, b_N} (1/|B|) ∑_{i∈B} L_gen(x^(i)),   (1)

where L_gen(x) = H(x, x̂). For incremental feature learning, the DAE optimizes the new features to reconstruct the data x from the corrupted data x̃. The encoding and decoding functions for the new features can be written as:

  h_N = sigm(W_N x̃ + b_N),   (2)
  x̂ = sigm(W_N^T h_N + W_O^T h_O + c).   (3)

The key idea here is that, although the output x̂ depends on both the new features h_N and the old features h_O as described in equation (3), training the incremental feature parameters θ_N to minimize the residual of the objective function is still highly efficient because θ_O is fixed during training. From another point of view, we can interpret W_O^T h_O as a part of the bias. Thus, we can rewrite equation (3) as:

  x̂ = sigm(W_N^T h_N + c_d(h_O)),   (4)

where c_d(h_O) ≡ W_O^T h_O + c is viewed as a dynamic decoding bias. Since the parameters for the existing features and the corresponding activations do not change during the training of new features, we compute h_O once and reuse c_d(h_O) for each training example without additional computational cost. In this way, we can efficiently optimize the new parameters greedily via stochastic gradient descent.
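
The sketch below illustrates this trick, assuming the `DAE` class from the earlier example: the dynamic decoding bias c_d(h_O) is precomputed once per example in B, and only W_N and b_N are updated. It is an illustration of equations (1)-(4) under assumed hyperparameters, not the reference implementation.

```python
import numpy as np

rng = np.random.RandomState(1)

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def add_features_generative(dae, B, delta_N, lr=0.1, epochs=5, corruption=0.2):
    """Greedily train delta_N new features on the hard-example set B (eqs. 1-4).

    `dae` is assumed to expose W (N x D), b, and c as in the earlier DAE sketch.
    Returns the new parameters (W_N, b_N) to be appended to the model.
    """
    D = dae.W.shape[1]
    W_N = rng.normal(0, 0.01, (delta_N, D))
    b_N = np.zeros(delta_N)

    # Precompute the dynamic decoding bias c_d(h_O) once per example in B,
    # since the old parameters are held fixed while the new features are trained.
    cache = []
    for x in B:
        x_tilde = x * rng.binomial(1, 1.0 - corruption, size=x.shape)
        h_O = sigm(dae.W @ x_tilde + dae.b)
        cache.append((x, x_tilde, dae.W.T @ h_O + dae.c))   # (x, x~, c_d)

    for _ in range(epochs):
        for x, x_tilde, c_d in cache:
            h_N = sigm(W_N @ x_tilde + b_N)                 # eq. (2)
            x_hat = sigm(W_N.T @ h_N + c_d)                 # eq. (4)
            d_out = x_hat - x                               # gradient of cross-entropy
            d_hN = (W_N @ d_out) * h_N * (1 - h_N)
            W_N -= lr * (np.outer(d_hN, x_tilde) + np.outer(h_N, d_out))
            b_N -= lr * d_hN
    return W_N, b_N
```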


Figure 1: Illustration of single-layer incremental feature learning. During training, our incremental feature learning algorithm optimizes the parameters of the new features (the orange units and the blue unit on the right), while holding the other features fixed (the yellow units outside the dotted blue box on the left). The orange units are incremental features, and the blue unit is the feature merged from the similar existing features depicted as yellow units inside the dotted blue box. See text for details.

3.2.2 Discriminative training

A discriminative objective function computes the average classification loss between the actual label y ∈ [0, 1]^K and the predicted label ŷ ∈ [0, 1]^K. More precisely, cross-entropy is used as the performance measure, and we pose the optimization problem as follows:

  min_{W_N, b_N, Γ_N} (1/|B|) ∑_{i∈B} L_disc(x^(i), y^(i)),   (5)

where L_disc(x, y) = H(y, ŷ(x)). Note that the label y is a binary vector with a softmax unit that allows one element to be 1 out of K dimensions for the K-way classification problem (i.e., y_k = 1 if and only if the example is in the k-th class). Formally speaking, the discriminative model predicts ŷ as a posterior probability of class labels via the softmax activation function

  ŷ = softmax(ν + Γ_O f_O(x̃) + Γ_N f_N(x̃)),   (6)

where softmax(a)_k = exp(a_k) / ∑_{k'} exp(a_{k'}), k = 1, ..., K, for a ∈ R^K. A similar interpretation of ν + Γ_O f_O(x̃) as in the generative training is possible, and therefore we can efficiently train the new parameters {W_N, b_N, Γ_N} using gradient descent.

3.2.3 Hybrid training

Considering the discriminative model as a single objective function carries a risk of overfitting. As a remedy, we can use a hybrid objective function that combines the discriminative and generative loss functions as follows:

  L_hybrid(x, y) = L_disc(x, y) + λ L_gen(x).   (7)

In the hybrid objective function, we further normalize both loss functions with respect to their target dimensions. In our experiments, we found that any value of λ ∈ [0.1, 0.4] gave roughly the best performance.
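
A minimal sketch of the hybrid objective in equation (7), with the per-dimension normalization mentioned above, might look as follows. The feature split into (h_O, h_N) and the normalization by target dimension follow the text; the shapes, numerical-stability constants, and argument layout are assumptions of this sketch.

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def binary_cross_entropy(target, pred, eps=1e-8):
    return -np.sum(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

def hybrid_loss(x, x_hat, y, h_O, h_N, Gamma_O, Gamma_N, nu, lam=0.2):
    """L_hybrid = L_disc + lam * L_gen, each normalized by its target dimension (eq. 7)."""
    y_hat = softmax(nu + Gamma_O @ h_O + Gamma_N @ h_N)          # eq. (6)
    L_disc = -np.sum(y * np.log(y_hat + 1e-8)) / y.size          # K-way classification loss
    L_gen = binary_cross_entropy(x, x_hat) / x.size              # reconstruction loss
    return L_disc + lam * L_gen
```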

3.3 Merging features

As described in the previous section, incremental feature mappings can be efficiently trained to improve the generative and discriminative objectives on online datasets. However, monotonically adding features could potentially result in many redundant features and overfitting. To deal with this problem, we consider merging similar features to produce more compact feature representations, which can mitigate the problem of overfitting as well.

Our merging process is done in two steps: we select a pair of candidate features and merge them into a single feature. Detailed descriptions are given below:

• Select a set of candidate features to be merged, M = {m1, m2} ⊂ O, and replace f_O by f_{O\M} (i.e., remove the m1-th and m2-th features from O).
• Add a new feature mapping f_N that replaces the merged features (see also Section 3.2).

The candidate feature pair and the new feature mapping can be determined by solving the following optimization problem:

  min_{θ_N, M} (1/|B|) ∑_{i∈B} L_hybrid(x^(i), y^(i)).   (8)

However, optimizing (8) jointly with respect to θ_N and M is computationally expensive because the complexity of an exhaustive search over all pairs of candidate features grows quadratically in the number of features. As an approximation, we decompose the original optimization problem into a two-step process, where we first find the most similar feature pair as merging candidates, and then solve the optimization problem with respect to θ_N only. In addition to being efficient in training, this approximate optimization process also minimizes an upper bound of the objective function in equation (8). The process is described below:

• Find the pair of features whose distance is minimal: M = arg min_{m1,m2} d(W_{m1}, W_{m2}).
• Add new features to θ_{O\M} by solving θ_N = arg min_{θ_N} (1/|B|) ∑_{i∈B} L_hybrid(x^(i), y^(i)).

Given the candidate features to merge, the newly added feature can be trained via gradient descent as described in the previous section. Unlike Section 3.2, however, we initialize the new feature parameters as a weighted average (i.e., linear combination) of the two candidate feature parameters for faster convergence:

  θ_N = [ ∑_{x∈B} ( P(h_{m1}|x̃; θ_{m1}) θ_{m1} + P(h_{m2}|x̃; θ_{m2}) θ_{m2} ) ] / [ ∑_{x∈B} ( P(h_{m1}|x̃; θ_{m1}) + P(h_{m2}|x̃; θ_{m2}) ) ].
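
A sketch of this merge step is given below: candidate selection by cosine distance between weight rows (the distance used in our experiments; see Section 3.4), and initialization of the merged feature as the activation-weighted average above. It is shown for W and b only (the classifier weights Γ can be averaged the same way), and the surrounding model bookkeeping (how rows are removed and appended) is an assumption of the sketch.

```python
import numpy as np

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def closest_pair_cosine(W):
    """Return the pair of hidden units whose weight vectors have minimal cosine distance."""
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-8)
    sim = Wn @ Wn.T
    np.fill_diagonal(sim, -np.inf)        # ignore self-similarity
    m1, m2 = np.unravel_index(np.argmax(sim), sim.shape)
    return m1, m2

def merged_init(W, b, m1, m2, B_tilde):
    """Activation-weighted average of the two candidate features over corrupted examples B_tilde."""
    num_W = np.zeros_like(W[m1])
    num_b = 0.0
    denom = 0.0
    for x_tilde in B_tilde:
        p1 = sigm(W[m1] @ x_tilde + b[m1])   # P(h_m1 = 1 | x~)
        p2 = sigm(W[m2] @ x_tilde + b[m2])   # P(h_m2 = 1 | x~)
        num_W += p1 * W[m1] + p2 * W[m2]
        num_b += p1 * b[m1] + p2 * b[m2]
        denom += p1 + p2
    return num_W / denom, num_b / denom
```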

3.4 General formulations

In this section, we discuss variations of our algorithm along four axes: (1) encoding and decoding functions of the DAE depending on the type of input data, (2) discriminative objectives, (3) distance functions for determining merge candidates, and (4) an extension to a deep incremental network.

Encoding and decoding functions. In the previous sections, we developed our algorithm based on sigmoid functions for both the encoder and the decoder. However, other linear and non-linear functions (e.g., tanh, linear, or rectified-linear) can be adopted. For example, we can use a linear decoding function with a mean squared error loss for real-valued input.

Discriminative training. In Section 3.2.2, we used a softmax classifier with cross-entropy for discriminative tasks. However, this can easily be generalized to other classifiers. For example, an SVM can be adopted in our algorithm by defining a hinge loss (the primal objective of the SVM) at the output layer of the network.

Merging process. We used cosine distance (normalized inner product) for all the experiments. Alternatively, we can consider other similarity functions, such as KL divergence, the intersection kernel, and the Chi-square kernel, to select candidate features to merge. Furthermore, we note that in addition to the merging technique described in Section 3.3, other merging methods or pruning techniques, such as "forward search," "backward search," or pruning infrequently activated features, can be used to reduce model complexity.

Deep incremental network. The proposed incremental learning method can be readily adapted to deep architectures. For example, we can train a deep incremental network that determines the optimal feature set size of multiple layers in a greedy way, by fixing the number of features in the lower layers and incrementing the features only at the top layer sequentially.
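
The following sketch illustrates one way to realize this greedy layer-wise extension. The helper `train_incremental_layer` (a wrapper around the single-layer incremental training loop sketched in Section 3.1) and the `transform` method it is assumed to return are hypothetical names for this illustration, not part of the paper.

```python
def train_deep_incremental(data_stream_fn, num_layers=3):
    """Greedy layer-wise extension: lower layers stay fixed; feature increments
    happen only at the current top layer (Section 3.4)."""
    layers = []
    for _ in range(num_layers):
        def encoded_stream():
            for batch in data_stream_fn():                     # online stream of (x, y) pairs
                for lower in layers:                           # frozen lower layers
                    batch = [(lower.transform(x), y) for x, y in batch]
                yield batch
        # train_incremental_layer: hypothetical wrapper around the single-layer
        # incremental DAE training loop; assumed to return an object with .transform()
        layers.append(train_incremental_layer(encoded_stream()))
    return layers
```
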
4 Connection to Boosting and Related Work

In this section, we compare incremental feature learning with boosting algorithms, since they share similar motivations and properties. Boosting (e.g., [18, 13, 30, 25, 15, 14, 33]) is a learning framework that trains a single strong learner from a set of weak learners. Incremental feature learning and boosting algorithms are similar in that they add weak learners (or, in the context of feature learning, the feature mappings corresponding to newly added hidden units) to improve the existing classifiers. However, incremental feature learning determines the number of feature increments based on performance, while typical boosting methods introduce a pre-determined number of weak learners at each stage. Further, the merging process makes our algorithm robust to overfitting and more adaptive to an online stream of data whose distribution (of the data or the class labels) may fluctuate over time, since it effectively removes redundant weak learners.

Importantly, the weak learners in boosting algorithms are typically used as classifiers (for predicting class labels), whereas in our setting they can be feature mappings used to improve classification performance or to learn representations from input data, which makes our algorithm flexibly applicable to unsupervised, semi-supervised, or self-taught learning [26] settings. Similar to our approach, Chen et al. [11] presented an incremental feature learning algorithm, called boosted sparse coding, that adds a single feature at a time using sparse coding objectives. Our incremental learning algorithm is different in that (1) we add multiple features at a time based on several performance criteria (i.e., generative/discriminative/hybrid objectives); (2) we have a merging process that helps avoid overfitting; and (3) we optimize the whole network (i.e., via fine-tuning) after initializing new features.

In addition to boosting, our model is also related to cascade correlation [12]. A cascade-correlation neural network adds a new hidden unit as a new layer at each step, and each new unit depends on all previously added units. Thus, a cascade-correlation neural network cannot effectively learn a large number of hidden units (which would mean a very deep network). In our proposed model, however, all hidden units (assuming a one-hidden-layer model) can be computed independently given the input data, which makes training more efficient. In addition, instead of simply adding hidden units, our model is more adaptive to the input data through adding and merging features. In our experiments, the cascade-correlation neural network performed well with dozens of hidden units; however, on more complicated tasks, our model always outperformed cascade correlation by a large margin.

5 Experimental Results

5.1 Experimental setup

We evaluate the performance of the proposed learning algorithm on large-scale extended datasets based on the MNIST variation and rect-images datasets (http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations). The MNIST variation datasets are composed of digit images with different variation types, such as rotation, random background, or image background, and their combinations. In the rect-images dataset, each image contains a single rectangle with a varying aspect ratio, filled with an image that is different from its background image. Since the original datasets have a limited number of training examples, we extended the dataset sizes to 1 million; samples from these datasets are shown in Figure 2. We note that our extended datasets are more challenging, since we used more diverse background images (randomly selected from 2,000 images from CIFAR-10 [19]), whereas the original MNIST variation datasets used background patches extracted from 20 images downloaded from the internet. (See the supplementary material for further discussion of the datasets.)

Figure 2: Samples from the extended MNIST datasets (bg-img-1M, bg-rand-1M, rot-1M, rot-img-1M) and the extended rect dataset (rect-img-1M).

In all experiments, we used a 20% corruption rate for the DAE, and we implemented all algorithms with the Jacket GPU library running on an nVidia GTX 470. For optimization, we used L-BFGS (implementation from http://www.di.ens.fr/~mschmidt/Software/) with a minibatch size of 1,000, since a similar setting has shown good performance in training autoencoders without tuning the learning rate [21]. For all feature learning algorithms, we trained features by pre-training on the first 12,000 training examples, followed by supervised fine-tuning in an online setting.

For incremental feature learning, the number of feature increments ∆N_t and the number of merged features ∆M_t at time t are adaptively determined by monitoring performance. In this study, we mainly considered two types of update rules. With the first type of update rule, when the performance on the current online data batch is far from convergence, we increase ∆N_t to accelerate; on the other hand, if the performance difference between two consecutive batches is very small, we regard this as convergence and reduce ∆N_t to decelerate the pace. To balance the model, we keep ∆M_t proportional to ∆N_t during updates. Alternatively, with the second type of update rule, we considered a combination of AIC and BIC to control the number of features. In this way, a model starting with a large number of features can still recover from potential overfitting by penalizing the model complexity. Note that all results reported in this section are based on the latter update rule. We also evaluated the classification accuracy with other variants of the update rules (for ∆N and ∆M) and found that the performance was fairly robust to the choice of update rule. For more details about selecting the number of incremental and merged features, see the supplementary material.

5.2 Evaluation on generative criteria

In this experiment, we compare our algorithm to the non-incremental DAE on generative criteria. Specifically, we tested on the bg-img-1M dataset and show the online reconstruction error and the number of features over training time in Figures 3(a) and 3(b), respectively. We report results with 400, 500, 600, and 800 initial features for both incremental and non-incremental learning. In addition, we report the result of non-incremental learning with 1,200 features, which is approximately the optimal feature set size obtained from incremental feature learning. In Figures 3(b) and 3(c), we observe that the incremental feature learning algorithm could find an approximately optimal number of features regardless of its initial feature set size, and that it led to smaller or comparable reconstruction error compared to its non-incremental counterpart.

5.3 Evaluation on discriminative criteria

To evaluate the discriminative performance of our algorithm, we ran experiments using a hybrid objective function, a combination of the generative and discriminative objective functions (with λ = 0.2). First, we compare the classification performance of the incremental feature learning algorithm with that of its non-incremental counterpart on the bg-img-1M dataset.

Figure 3: (a,d) Test (online) reconstruction error and classification error (on the bg-img-1M dataset) of incremental (IncMDAE) and non-incremental DAE. (b,e) The number of features for incremental learning with the generative and hybrid criteria. (c,f) Test reconstruction error and classification error at convergence. The numbers in parentheses refer to the initial number of features. (Panels plot reconstruction/classification error and number of features against training time in minutes; the plotted data are omitted here.)

Similar to Section 5.2, we report the classification errors in Figure 3(d) using 400, 500, 600, and 800 initial features for both the incremental and non-incremental algorithms, and we additionally plot the result of non-incremental learning with 1,000 features. Overall, as Figures 3(d), 3(e), and 3(f) indicate, we observed trends similar to the results under the generative criteria. Figures 3(d) and 3(f) show that incremental DAEs find an approximately optimal number of features (regardless of the initial feature set sizes) while achieving consistently lower classification errors than their non-incremental counterparts.

For a more extensive evaluation, we tested the incremental feature learning algorithms with merging (IncMDAE) and without merging (IncDAE) on the extended datasets. In addition, we evaluated the (non-incremental) 1-layer DAE, a linear SVM, and online multiclass LPBoost [30]. All algorithms were tested in an online manner (i.e., evaluated on the next mini-batch of examples that had not previously been seen during training). As Table 1 shows, both the 1-layer incremental feature learning algorithm with merging (IncMDAE^l1) and without merging (IncDAE^l1) outperformed the linear SVM, DAE^l1, and online multiclass LPBoost. The results suggest that, although our method is similar to boosting, it can effectively use both unlabeled and labeled data (via pre-training followed by supervised/hybrid fine-tuning). In addition, IncMDAE consistently achieved classification performance similar to or better than IncDAE, suggesting that the merging process is effective at maintaining the balance between specificity and robustness by learning less redundant features.

We also evaluated deep network extensions of our algorithm and compared against the deep belief network (referred to as "DBN^{l1-3}") and the stacked denoising autoencoder (referred to as "SDAE^{l1-3}"), both containing 3 layers. For both the DBN and the SDAE, the number of hidden units in each layer was selected from {500, 1000, 2000} via cross-validation. As shown in Table 1, the deep incremental networks (IncMDAE^{l1,2} and IncMDAE^{l1-3}) performed favorably compared to these standard deep networks. The results suggest that our incremental DAE can be a useful building block for learning deep networks.

5.4 Evaluating on non-stationary distributions

One challenge in online learning is to fit the model to data whose distribution fluctuates over time. In this section, we perform an experiment to see how well the incremental learning algorithm can adapt to such an online stream of data compared to its non-incremental counterpart.

  Method \ Dataset | rot-bg-img-1M | rot-1M       | bg-rand-1M   | bg-img-1M    | rect-img-1M
  linear SVM       | 55.17 ± 0.45  | 24.02 ± 0.33 | 19.17 ± 0.33 | 29.24 ± 0.46 | 33.18 ± 0.38
  DAE^l1           | 38.51 ± 0.45  | 10.23 ± 0.35 |  7.94 ± 0.24 |  7.65 ± 0.30 | 24.27 ± 0.31
  OMCLPBoost       | 42.11 ± 0.04  | 12.02 ± 0.02 |  7.22 ± 0.03 | 10.29 ± 0.09 | 29.51 ± 0.13
  DBN^{l1-3}       | 30.01 ± 0.31  |  8.00 ± 0.31 |  6.27 ± 0.35 |  9.20 ± 0.19 | 16.36 ± 0.31
  SDAE^{l1-3}      | 30.13 ± 0.25  |  7.76 ± 0.35 |  6.22 ± 0.33 |  7.56 ± 0.31 | 13.20 ± 0.30
  IncDAE^l1        | 35.50 ± 0.41  | 10.28 ± 0.23 |  7.00 ± 0.10 |  7.20 ± 0.17 | 18.50 ± 0.34
  IncMDAE^l1       | 35.41 ± 0.49  |  8.20 ± 0.38 |  6.90 ± 0.33 |  6.37 ± 0.28 | 14.15 ± 0.32
  IncMDAE^{l1,2}   | 29.04 ± 0.45  |  6.60 ± 0.33 |  6.14 ± 0.30 |  6.20 ± 0.31 |  9.28 ± 0.44
  IncMDAE^{l1-3}   | 27.60 ± 0.40  |  6.80 ± 0.44 |  6.30 ± 0.26 |  5.24 ± 0.34 |  9.89 ± 0.54

Table 1: Test (online) error (%) for classification tasks on the large-scale datasets. The best performing methods for each dataset (within the standard errors) are shown in bold.

Figure 4: Effects of a non-stationary data distribution on the bg-img-1M dataset. (a) An example non-stationary distribution (of class labels) generated from a Gaussian process. (b) Average test (online) classification error. (c) Standard deviation of the test (online) classification error. (Panels plot the per-class ratios, the mean classification error, and the standard deviation of the classification error against training time in minutes; the plotted data are omitted here.)

We generated training data using the bg-img-1M dataset with a label distribution that changes over time; the ratio of labels at time t was randomly assigned for each class k via

  ratio_k(t) = exp{a_k(t)} / ∑_{j=1}^K exp{a_j(t)},  k = 1, ..., K,

where a_k(t) is a random curve generated by a Gaussian process (an example class distribution is shown in Figure 4(a)). We report results averaged over 5 random trials.

Figure 4 shows the mean and the standard deviation of the test classification error on the fluctuating bg-img-1M dataset. Our incremental learning model shows a significantly lower mean classification error. Furthermore, in Figure 4(c), the low standard deviations of the errors achieved by the IncMDAE models imply that our algorithm is more robust and can rapidly adapt to the fluctuating distribution. This result suggests that incremental feature learning can be an effective training method for online learning with challenging non-stationary data distributions.
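
As an illustration of how such a fluctuating label distribution can be produced, the sketch below draws one smooth random curve a_k(t) per class from a Gaussian process with an RBF covariance and passes the curves through the softmax above. The kernel, length scale, and time grid are assumptions of this sketch; the paper does not specify them.

```python
import numpy as np

def sample_label_ratios(num_classes=10, num_steps=200, length_scale=20.0, seed=0):
    """Draw a_k(t) ~ GP(0, RBF) for each class and softmax them into ratios (K x T)."""
    rng = np.random.RandomState(seed)
    t = np.arange(num_steps)
    cov = np.exp(-0.5 * ((t[:, None] - t[None, :]) / length_scale) ** 2)
    cov += 1e-6 * np.eye(num_steps)                       # jitter for numerical stability
    a = rng.multivariate_normal(np.zeros(num_steps), cov, size=num_classes)  # K x T curves
    e = np.exp(a - a.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)               # ratio_k(t); each column sums to 1

ratios = sample_label_ratios()
print(ratios.shape, ratios[:, 0].sum())                   # (10, 200), ~1.0
```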

6 Conclusion

In this paper, we proposed an incremental feature learning algorithm based on denoising autoencoders. The proposed algorithm can learn good feature representations from large datasets by incrementing and merging features. Further, this algorithm can be used to avoid expensive cross-validation in selecting the number of features for large datasets. Our experimental results suggest that (1) incremental feature learning provides an efficient way to learn from large datasets by starting with a small set of initial features and automatically adjusting the number of features as training proceeds; (2) the proposed algorithm leads to comparable or lower reconstruction and classification errors than its non-incremental counterparts; (3) the incremental learning algorithm outperforms other state-of-the-art methods on large online datasets; and (4) incremental feature learning is more adaptive and robust to highly non-stationary input distributions. We believe our approach holds promise for developing scalable feature learning algorithms for large-scale datasets.

Acknowledgments

This work was supported in part by a Google Faculty Research Award.

References

[1] R. P. Adams, H. M. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS, 2010.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2007.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.

[4] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579–1619, 2005.

[5] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, 1992.

[6] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[7] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In NIPS, 2001.

[8] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[9] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. An online algorithm for large scale image similarity learning. In NIPS, 2009.

[10] B. Chen, G. Polatkan, G. Sapiro, D. Dunson, and L. Carin. The hierarchical beta process for convolutional factor analysis and deep learning. In ICML, 2011.

[11] B. Chen, K. Swersky, B. Marlin, and N. de Freitas. Sparsity priors and boosting for learning localized distributed feature representations. Technical report, University of British Columbia, 2010.

[12] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In NIPS, 1990.

[13] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, 1996.

[14] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

[15] A. Grubb and J. A. Bagnell. Boosted backpropagation learning for training deep modular networks. In ICML, 2010.

[16] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[17] T. Jebara. Multi-task feature and kernel selection for SVMs. In ICML, 2004.

[18] M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.

[19] A. Krizhevsky. Learning multiple layers of features from Tiny Images. Master's thesis, University of Toronto, 2009.

[20] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.

[21] Q. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In ICML, 2011.

[22] S.-I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In ICML, 2007.

[23] X. Liu, G. Zhang, Y. Zhan, and E. Zhu. An incremental feature learning algorithm based on least square support vector machine. In Frontiers in Algorithmics, pages 330–338, 2008.

[24] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, 2007.

[25] N. C. Oza. Online Ensemble Learning. Ph.D. thesis, University of California, Berkeley, 2001.

[26] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.

[27] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In NIPS, 2007.

[28] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.

[29] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Subgradient methods for maximum margin structured learning. In ICML, 2006.

[30] A. Saffari, M. Godec, T. Pock, C. Leistner, and H. Bischof. Online multi-class LPBoost. In CVPR, 2010.

[31] S. Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, 1996.

[32] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

[33] R. S. Zemel and T. Pitassi. A gradient-based boosting algorithm for regression problems. In NIPS, 2001.