Feature Selection Guided Auto-Encoder
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Shuyang Wang,¹ Zhengming Ding,¹ Yun Fu¹,²
¹Department of Electrical & Computer Engineering, ²College of Computer & Information Science, Northeastern University, Boston, MA, USA
{shuyangwang, allanding, yunfu}@ece.neu.edu

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Recently, the auto-encoder and its variants have demonstrated promising results in extracting effective features. Specifically, the basic idea of encouraging the output to be as similar to the input as possible ensures that the learned representation can faithfully reconstruct the input data. However, one problem arises: not all hidden units are useful for compressing discriminative information, and many units mainly contribute to representing task-irrelevant patterns. In this paper, we propose a novel algorithm, Feature Selection Guided Auto-Encoder, a unified generative model that integrates feature selection and the auto-encoder. To this end, our proposed algorithm distinguishes the task-relevant units from the task-irrelevant ones to obtain the most effective features for future classification tasks. Our model not only performs feature selection on the learned high-level features, but also dynamically drives the auto-encoder to produce more discriminative units. Experiments on several benchmarks demonstrate our method's superiority over state-of-the-art approaches.

Figure 1: Feature selection is adopted in the hidden layer to distinguish discerning units from task-irrelevant units, which in turn constrains the encoder to focus on compressing important patterns into the selected units. All of the units contribute to reconstructing the input, while only the selected units are used for future tasks. (Labels in the figure: input X; selected discerning units P^T f(X), kept for future tasks; task-irrelevant units; reconstruction g(f(X)) as the output.)

Introduction

When dealing with high-dimensional data, the curse of dimensionality is a fundamental difficulty in many practical machine learning problems (Duda, Hart, and Stork 2001). For many real-world data (e.g., video analysis, bioinformatics), the dimensionality is usually very high, which significantly increases computation time and memory. In practice, not all features are equally important and discriminative, since most of them are often highly correlated or even redundant (Guyon and Elisseeff 2003). Redundant features generally make learning methods over-fit and less interpretable. Consequently, it is necessary to reduce the data dimensionality and select the most important features.

Recently, the auto-encoder and its variants have drawn increasing attention as nonlinear dimensionality reduction methods (Hinton and Salakhutdinov 2006; Wang, Ding, and Fu 2016). The conventional auto-encoder tries to learn an approximation to the identity by encouraging the output to be as similar to the input as possible. This architecture forces the network to seek a compressed representation of the data while preserving the most important information. However, this scheme leads to one problem: the majority of the learned high-level features may be blindly used to represent irrelevant patterns in the training data. Although efforts to incorporate supervision have been made (Socher et al. 2011), it is still challenging to learn a task-relevant hidden-layer representation, since some hidden units are mainly used to faithfully reconstruct the irrelevant or noisy parts of the input. It is unreasonable to endow this kind of task-irrelevant unit with discriminability. Take object recognition as an example: many hidden units are mainly used to reconstruct background clutter, so performance could be improved significantly if we could distinguish the important hidden units (e.g., those encoding the foreground) from the large number of distracting hidden units (e.g., those encoding the background).
To address this issue, a unified framework is proposed to integrate feature selection and the auto-encoder (Fig. 1). Intuitively, feature selection is applied to the learned hidden layer to separate the discriminative features from the irrelevant ones. Simultaneously, the task-relevant hidden units feed back to optimize the encoding layer, so that discriminability is enhanced only on the selected hidden units. Therefore, our model not only performs dynamic feature selection on high-level features, but also separates important and irrelevant information into different groups of hidden units through a joint learning mechanism with the auto-encoder. We highlight our main contributions as follows:

• We propose the Feature Selection Guided Auto-Encoder (FSAE), which jointly performs feature selection and auto-encoding in a unified framework. The framework selects the discerning high-level features and simultaneously enhances the discriminability of the selected units.
• Our proposed method can be extended to different scenarios (e.g., classification, clustering) by changing the feature selection criterion (e.g., Fisher score, Laplacian score) applied to the hidden layer.
• The proposed FSAE can be adopted as a building block to form a stacked deep network. We deploy several experiments to demonstrate the effectiveness of our algorithm by comparing with state-of-the-art approaches.

Related Work

Two lines of related work, feature selection and the auto-encoder, are introduced in this section.

Feature selection. The past decade has witnessed a number of proposed feature selection criteria, such as the Fisher score (Gu, Li, and Han 2012), Relief (Liu and Motoda 2007), the Laplacian score (He, Cai, and Niyogi 2005), and the Trace Ratio criterion (Nie et al. 2008). In detail, suppose the original set of features is denoted as S; the goal is to find a subset T that maximizes a performance criterion C:

T* = argmax_{T ⊆ S} C(T),  s.t. |T| = m,  m ≪ d,

where m and d are the numbers of selected and original features, respectively. This combinatorial optimization problem often incurs prohibitively expensive computational cost. Therefore, instead of subset-level selection, a common traditional approach first calculates the score of each feature independently and then selects the top-m ranked features (feature-level selection). However, features selected one by one in this way are suboptimal: neglecting the subset-level score results in discarding good combinations of features or preserving redundant ones. To address this problem, (Nie et al. 2008) and (Gu, Li, and Han 2012) proposed globally optimal solutions based on the Trace Ratio criterion and the Fisher score, respectively.
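To make the feature-level selection just described concrete, the following NumPy sketch scores each feature independently with a standard Fisher-style criterion and keeps the top-m ranked features. It is only an illustration of the generic procedure; the function names, the score formula, and the toy data are not taken from the paper.

    import numpy as np

    def fisher_scores(X, y):
        """Per-feature Fisher score: between-class scatter over within-class scatter."""
        mu = X.mean(axis=0)
        num = np.zeros(X.shape[1])
        den = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
            den += len(Xc) * Xc.var(axis=0)
        return num / (den + 1e-12)

    def select_top_m(X, y, m):
        """Feature-level selection: score features independently, keep the top-m ranked."""
        keep = np.argsort(fisher_scores(X, y))[::-1][:m]
        return keep, X[:, keep]

    # Toy usage: 200 samples, 50 original features, keep m = 10 of them.
    X = np.random.randn(200, 50)
    y = np.random.randint(0, 3, size=200)
    keep, X_selected = select_top_m(X, y, m=10)

As the paragraph notes, ranking features independently in this way ignores how features interact as a subset, which is exactly the weakness the subset-level, globally optimal criteria cited above try to overcome.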
ber of proposed feature selection criterions, such as Fisher The auto-encoder with single hidden layer is generally a score (Gu, Li, and Han 2012), Relief (Liu and Motoda neural network with identical input and target, namely, 2007), Laplacian score (He, Cai, and Niyogi 2005), and Trace Ratio criterion (Nie et al. 2008). In detail, suppose n 1 2 the original set of features denoted as S, the goal is to find a min xi − xˆi2, (2) W1,W2,b1,b2 2n subset T to maximize the above performance criterion C, i=1 T =argmaxC(T), s.t. |T| = m, m d, where n is sample size of the data, xˆi is the reconstructed T⊆S output and xi is the target. A good representation thus can where m and d are the feature dimension of selected and be obtained with the ability to well reconstruct the data. original, respectively. It often requires prohibitively expen- As we mentioned before, all the high-level hidden units sive computational cost in this combinatorial optimization contribute to capture the intrinsic information of input data problem. Therefore, instead of subset-level selection, one during data reconstruction, however, these units are not common traditional method first calculates the score of each equally important in terms of our classification task. For feature independently and then select the top-m ranked fea- example, some units play an essential role to reconstruct tures (feature-level selection). However, such features se- the background in an object image, but they have nothing lected one by one are suboptimal, which neglects the subset- to do with our final object classification task. We consider level score and results in discarding good combination of these units as task-irrelevant units, which are undesirable in features or preserving redundant features. To address this our learned new features. On the other hand, since the tra- problem, (Nie et al. 2008) and (Gu, Li, and Han 2012) pro- ditional unsupervised model has limited capacity to model posed globally optimal solution based on Trace Ratio crite- the marginal input distribution for the goal supervised task, rion and Fisher score respectively. some existing works exploited the label information on hid- Auto-encoder is usually adopted as a basic building den units using a softmax layer (Socher et al. 2011). Consid- block to construct a deep structure (Hinton and Salakhut- ering the previous assumption about task-irrelevant units, it dinov 2006; Ding, Shao, and Fu 2016). To encourage struc- is inappropriate or even counterproductive to endow all the tural feature learning, further constraints have been imposed hidden units with discriminability. on parameters during model training. Sparse Auto-Encoder Therefore, we have two conclusions: 1) feature selec- (SAE) was proposed to constrain the average response of tion is essential to distinguish discerning units out of task- each hidden unit to a small value (Coates, Ng, and Lee irrelevant units, and 2) the discriminative information should 2011). Yu et al. proposed a graph regularized auto-encoder, be only applied on the selected task-relevant units. Based on aiming to adopt graph to guide the encoding and decoding above discussion, we propose our joint feature selection and (Yu et al. 2013). However, it is still challenging to learn auto-encoder model in a unified framework. 
As we mentioned before, all of the high-level hidden units contribute to capturing the intrinsic information of the input data during reconstruction; however, these units are not equally important for the classification task. For example, some units play an essential role in reconstructing the background of an object image, but they have nothing to do with the final object classification. We consider these units task-irrelevant, and they are undesirable in the learned new features. On the other hand, since the traditional unsupervised model has limited capacity to model the marginal input distribution for the target supervised task, some existing works exploit label information on the hidden units using a softmax layer (Socher et al. 2011). Considering the previous assumption about task-irrelevant units, it is inappropriate or even counterproductive to endow all the hidden units with discriminability.

Therefore, we draw two conclusions: 1) feature selection is essential to distinguish the discerning units from the task-irrelevant units, and 2) the discriminative information should be applied only to the selected task-relevant units. Based on the above discussion, we propose our joint feature selection and auto-encoder model in a unified framework.

Feature Selection Guided Auto-Encoder

In this section, we propose our joint learning framework with feature selection on the hidden layer of the auto-encoder, which aims at guiding the encoder to compress task-relevant and irrelevant information into two groups of hidden units.
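The joint objective itself is developed in the remainder of the paper; purely to fix the intuition from Figure 1, the sketch below shows the intended data flow: all hidden units participate in reconstructing the input, a selection criterion scores the hidden units, and only the m highest-scoring (discerning) units are kept as features for the future task. The selection matrix P, the Fisher-style score, and all names are assumptions made for illustration, not the paper's formulation, which learns the selection and the encoder jointly rather than in a single forward pass.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fsae_forward(X, y, W1, b1, W2, b2, m):
        """Illustrative FSAE data flow (Figure 1): every hidden unit helps reconstruct
        the input, but only the m most discerning units feed the downstream task."""
        H = sigmoid(X @ W1.T + b1)       # f(X): all hidden units
        X_hat = sigmoid(H @ W2.T + b2)   # g(f(X)): reconstruction from ALL units

        # Hypothetical Fisher-style score per hidden unit (between / within class scatter).
        mu = H.mean(axis=0)
        num = np.zeros(H.shape[1])
        den = np.zeros(H.shape[1])
        for c in np.unique(y):
            Hc = H[y == c]
            num += len(Hc) * (Hc.mean(axis=0) - mu) ** 2
            den += len(Hc) * Hc.var(axis=0)
        scores = num / (den + 1e-12)

        # Selection matrix P (z x m): the task features are P^T f(X), i.e. H @ P row-wise.
        keep = np.argsort(scores)[::-1][:m]
        P = np.zeros((H.shape[1], m))
        P[keep, np.arange(m)] = 1.0
        return H @ P, X_hat

In the full model, the selected units also feed back to reshape the encoder during training, which is what lets the selection guide the auto-encoder rather than merely post-process its hidden layer.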