Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Joint Feature Selection and Subspace Learning

Quanquan Gu, Zhenhui Li and Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign
{qgu3,zli28,hanj}@illinois.edu

Abstract

Dimensionality reduction is a very important topic in machine learning. It can be generally classified into two categories: feature selection and subspace learning. In the past decades, many methods have been proposed for dimensionality reduction. However, most of these works study feature selection and subspace learning independently. In this paper, we present a framework for joint feature selection and subspace learning. We reformulate the subspace learning problem and use the L2,1-norm on the projection matrix to achieve row-sparsity, which leads to selecting relevant features and learning the transformation simultaneously. We discuss two situations of the proposed framework and present their optimization algorithms. Experiments on benchmark face recognition data sets illustrate that the proposed framework outperforms the state of the art methods overwhelmingly.

1 Introduction

High-dimensional data in the input space is usually not good for classification due to the curse of dimensionality. A common way to resolve this problem is dimensionality reduction, which has attracted much attention in the machine learning community in the past decades. Generally speaking, dimensionality reduction techniques can be classified into two categories: (1) feature selection [Guyon and Elisseeff, 2003]: to select a subset of the most representative or discriminative features from the input feature set; and (2) subspace learning [Belhumeur et al., 1997] [He and Niyogi, 2003] [He et al., 2005] [Yan et al., 2007] (a.k.a. feature transformation): to transform the original input features to a lower dimensional subspace.

The most popular subspace learning methods include Principal Component Analysis (PCA) [Belhumeur et al., 1997], Linear Discriminant Analysis (LDA) [Belhumeur et al., 1997], Locality Preserving Projection (LPP) [He and Niyogi, 2003] and Neighborhood Preserving Embedding (NPE) [He et al., 2005]. Despite the different motivations of these methods, they can all be interpreted in a unified Graph Embedding framework [Yan et al., 2007].

One major disadvantage of the above methods is that the learned projection is a linear combination of all the original features, so it is often difficult to interpret the results. Sparse subspace learning methods attempt to solve this problem. For example, [Zou et al., 2004] proposed a sparse PCA algorithm based on L2-norm and L1-norm regularization. [Moghaddam et al., 2006] proposed both exact and greedy algorithms for binary class sparse LDA as well as its spectral bound. [Cai et al., 2007] proposed a unified Sparse Subspace Learning (SSL) framework based on L1-norm regularized Spectral Regression.

However, the features selected by sparse subspace methods are independent and generally different for each dimension of the subspace. See Figure 1 (a) for an illustrative toy example of the projection matrix learned by SSL. Each row of the projection matrix corresponds to a feature, while each column corresponds to a dimension of the subspace. We can see that for the first dimension of the subspace, the 3rd and 6th features are not selected, while for the second dimension of the subspace, all features except the 1st and 4th are selected. Hence it is still unclear which features are really useful as a whole. Our goal is to learn a projection matrix like that in Figure 1 (b), which has row-sparsity (all elements of an entire row are zero). Hence it is able to discard the irrelevant features (e.g., the 1st, 5th and 7th features) and transform the relevant ones simultaneously. One intuitive way is to perform feature selection [Guyon and Elisseeff, 2003] before subspace learning. However, since these two sub-processes are conducted individually, the whole process is likely suboptimal.

Figure 1: An illustrative toy example of the projection matrices learned by (a) sparse subspace learning; and (b) joint feature selection and subspace learning.

Based on the above motivation, in this paper we aim to jointly perform feature selection and subspace learning. To achieve this goal, we reformulate subspace learning as solving a linear system of equations, during which we use the L2,1-norm on the projection matrix to encourage row-sparsity. It is worth noting that the L2,1-norm has already been successfully applied in the Group Lasso [Yuan et al., 2006], multi-task feature learning [Argyriou et al., 2008], and joint covariate selection and joint subspace selection [Obozinski et al., 2010]. The resulting optimization problem covers two situations, and for each of them we present a very simple algorithm that is theoretically guaranteed to converge. Experiments on benchmark face recognition data sets demonstrate the effectiveness of the proposed framework.

The remainder of this paper is organized as follows. In Section 2, we briefly introduce the graph embedding view of subspace learning. In Section 3, we present a framework for joint feature selection and subspace learning. In Section 4, we review some related works. Experiments on benchmark face recognition data sets are presented in Section 5. Finally, we draw a conclusion and point out future work in Section 6.

1.1 Notations

Given a data matrix X = [x_1, ..., x_n] ∈ R^{d×n}, we aim to learn a projection matrix A ∈ R^{d×m}, projecting the input data into an m-dimensional subspace. For a matrix A ∈ R^{d×m}, we denote the i-th row of A by a^i and the j-th column of A by a_j. The Frobenius norm of A is defined as ||A||_F = sqrt( Σ_{i=1}^d ||a^i||_2^2 ), and the L2,1-norm of A is defined as ||A||_{2,1} = Σ_{i=1}^d ||a^i||_2.

2 Graph Embedding View of Subspace Learning

Many dimensionality reduction methods have been proposed to find a low-dimensional representation of x_i. Despite the different motivations of these methods, they can be nicely interpreted in a general graph embedding framework [Yan et al., 2007]. In graph embedding, we construct a data graph G whose vertices correspond to {x_1, ..., x_n}. Let W ∈ R^{n×n} be a symmetric adjacency matrix, where W_ij characterizes the affinity between x_i and x_j. The purpose of graph embedding is to find the optimal low-dimensional vector representation for the vertices of graph G that best preserves the relationship between the data points. In this paper, we focus on linear dimensionality reduction, i.e., the low-dimensional representation is X^T A. The optimal A is given by the following optimization problem,

    min_A tr(A^T X L X^T A)   s.t.   A^T X D X^T A = I,                  (1)

where D is a diagonal matrix with D_ii = Σ_j W_ij, L = D − W is called the graph Laplacian [Chung, 1997], and I is the identity matrix of proper size. The above problem can be solved by the generalized eigen-decomposition X L X^T A = Λ X D X^T A, where Λ is a diagonal matrix whose diagonal elements are eigenvalues.

With different choices of W, the linear graph embedding framework leads to many popular linear dimensionality reduction methods, e.g. PCA [Belhumeur et al., 1997], LDA [Belhumeur et al., 1997], LPP [He and Niyogi, 2003] and NPE [He et al., 2005]. We briefly give two examples below.

LDA: Suppose we have c classes and the k-th class has n_k samples, with n_1 + ... + n_c = n. Define

    W_ij = 1/n_k, if x_i and x_j belong to the k-th class;  0, otherwise.                  (2)

LPP [He and Niyogi, 2003]: Define

    W_ij = d(x_i, x_j), if x_j ∈ N_k(x_i) or x_i ∈ N_k(x_j);  0, otherwise,                  (3)

where N_k(x_i) denotes the set of k nearest neighbors of x_i, and d(x_i, x_j) measures the similarity between x_i and x_j, which can be chosen as the Gaussian kernel exp(−||x_i − x_j||^2 / (2σ^2)) or the cosine distance x_i^T x_j / (||x_i|| ||x_j||). For more examples and other extensions of graph embedding, please refer to [Yan et al., 2007].

3 Joint Feature Selection and Subspace Learning

Since each row of the projection matrix corresponds to a feature in the original space, in order to do feature selection it is desirable to have some rows of the projection matrix be all zeros. This motivates us to use the L2,1-norm on the projection matrix, which leads to row-sparsity of the projection matrix. As a result, based on Eq. (1), we formulate joint feature selection and subspace learning as follows,

    min_A ||A||_{2,1} + μ tr(A^T X L X^T A)   s.t.   A^T X D X^T A = I,                  (4)

where μ is a regularization parameter. Although the objective function is convex, the constraint is not. Hence the problem is difficult to optimize. In the following, we reformulate the problem to make it easier to solve.

Theorem 3.1. Let Y ∈ R^{n×m} be a matrix each column of which is an eigenvector of the eigen-problem Wy = λDy. If there exists a matrix A ∈ R^{d×m} such that X^T A = Y, then each column of A is an eigenvector of the eigen-problem X W X^T a = λ X D X^T a with the same eigenvalue λ.

Proof. This is a corollary of Theorem 1 in [Cai et al., 2007].

Theorem 3.1 shows that instead of solving the eigen-problem X W X^T A = Λ X D X^T A, A can be obtained by the following two steps:

1. Solve the eigen-problem WY = ΛDY to get Y;
2. Find A which satisfies X^T A = Y.

Note that only the second step involves A. X^T A = Y is a linear system, which may behave in any one of three possible ways: (1) the system has infinitely many solutions; (2) the system has a single unique solution; and (3) the system has no solution. In the rest of this section, we discuss case (1) in one situation, and cases (2) and (3) in another situation.
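To make step 1 and the notation above concrete, here is a minimal Python/NumPy sketch (our own illustration, not the authors' code) that builds the LDA-type adjacency matrix of Eq. (2), solves the generalized eigen-problem WY = ΛDY with SciPy, and computes the L2,1-norm used throughout the paper. The function names and the choice of NumPy/SciPy are assumptions for illustration only.

import numpy as np
from scipy.linalg import eigh

def lda_adjacency(labels):
    """LDA-type adjacency matrix of Eq. (2): W_ij = 1/n_k if x_i, x_j share class k."""
    labels = np.asarray(labels)
    n = labels.shape[0]
    W = np.zeros((n, n))
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)
    return W

def l21_norm(A):
    """L2,1-norm of A: sum of the Euclidean norms of the rows of A."""
    return np.sum(np.linalg.norm(A, axis=1))

def embedding_targets(W, m):
    """Step 1 of Theorem 3.1: the m eigenvectors of W y = lambda D y
    with the largest eigenvalues (assumes D is positive definite)."""
    D = np.diag(W.sum(axis=1))
    eigvals, eigvecs = eigh(W, D)            # generalized symmetric eigen-problem
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    return eigvecs[:, order[:m]]             # Y in R^{n x m}

For the LPP-type graph of Eq. (3), one would instead fill W with k-nearest-neighbor similarities; the remaining steps are unchanged.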

3.1 Situation 1

When the linear system has infinitely many solutions, we formulate the proposed method as

    min_A ||A||_{2,1}   s.t.   X^T A = Y.                  (5)

The optimization problem in Eq. (5) is similar to the problem that appears in the Multiple Measurement Vector model in signal processing [Sun et al., 2009]. We derive a very simple algorithm in the sequel.

The Lagrangian function of the problem in Eq. (5) is

    L(A) = ||A||_{2,1} − tr(Γ^T (X^T A − Y)).                  (6)

Taking the derivative of L(A) with respect to A (for ||A||_{2,1} we compute its sub-gradient, because it is not smooth) and setting the derivative to zero, we get

    ∂L(A)/∂A = GA − XΓ = 0,                  (7)

where G is a diagonal matrix whose i-th diagonal element is

    g_ii = 0, if a^i = 0;   1/||a^i||_2, otherwise.                  (8)

Left multiplying both sides of Eq. (7) by X^T G^{-1}, and using the constraint X^T A = Y, we have

    Γ = (X^T G^{-1} X)^{-1} Y.                  (9)

Note that when g_ii = 0, G^{-1} cannot be computed. This is handled by adding a smoothing term ε to g_ii. It can be justified that the algorithm converges to the global solution if a decreasing value of ε is used.

Substituting Eq. (9) into Eq. (7), we get

    A = G^{-1} X (X^T G^{-1} X)^{-1} Y.                  (10)

In summary, we present the algorithm for optimizing Eq. (5) in Algorithm 1.

Algorithm 1 Joint Feature Selection and Subspace Learning (Situation 1)
  Initialize: G_0 = I, t = 0;
  Compute Y based on WY = ΛDY;
  repeat
    Compute A_{t+1} = G_t^{-1} X (X^T G_t^{-1} X)^{-1} Y;
    Compute G_{t+1} based on A_{t+1};
    t = t + 1;
  until convergence

The convergence of this algorithm was proved in [Nie et al., 2010]. In addition, no additional parameter is introduced besides the parameters needed to construct the graph as in traditional linear graph embedding. This is a very appealing property.

3.2 Situation 2

When the linear system has a single unique solution, that solution is also the solution of Eq. (5); however, it does not have row-sparsity. When the linear system has no solution, the formulation in Eq. (5) has no solution either. In both of these cases, we turn to solving the constrained problem

    min_A ||A||_{2,1}   s.t.   ||X^T A − Y||_F^2 ≤ δ,                  (11)

or, equivalently, the regularized problem

    min_A ||A||_{2,1} + μ ||X^T A − Y||_F^2.                  (12)

This is a Group Lasso problem [Yuan et al., 2006]. When μ = ∞, the optimization problem in Eq. (12) degenerates to that in Eq. (5). Note that it is difficult to give an analytical relationship between δ and μ; however, such a relationship is not crucial for our problem. The objective function in Eq. (12) is non-smooth but convex. In the following, we derive an algorithm, similar to Algorithm 1, for solving Eq. (12).

Taking the derivative of Eq. (12) with respect to A and setting it to zero, we get

    ∂L(A)/∂A = GA + 2μ(X X^T A − X Y) = 0,                  (13)

which leads to

    A = 2μ (G + 2μ X X^T)^{-1} X Y,                  (14)

where G is defined in Eq. (8). Since G also depends on A, the above closed-form expression of the optimal A is fundamentally a fixed-point iteration.

According to the Woodbury matrix identity [Golub and Loan, 1996]

    (A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},                  (15)

we can further get

    A = 2μ G^{-1} X (I − (X^T G^{-1} X + (1/(2μ)) I)^{-1} X^T G^{-1} X) Y
      = G^{-1} X (X^T G^{-1} X + (1/(2μ)) I)^{-1} Y.                  (16)

It is worth noting that if μ = ∞, Eq. (16) reduces to Eq. (10), which again shows the relation between Eq. (5) and Eq. (12). In summary, we present the algorithm for optimizing Eq. (12) in Algorithm 2. The convergence of this algorithm can be proved similarly to [Nie et al., 2010].

Algorithm 2 Joint Feature Selection and Subspace Learning (Situation 2)
  Initialize: G_0 = I, t = 0, and μ;
  Compute Y based on WY = ΛDY;
  repeat
    Compute A_{t+1} = G_t^{-1} X (X^T G_t^{-1} X + (1/(2μ)) I)^{-1} Y;
    Compute G_{t+1} based on A_{t+1};
    t = t + 1;
  until convergence
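For concreteness, the following Python sketch implements the fixed-point updates of Algorithms 1 and 2 in one routine (Eq. (10) when mu is None, Eq. (16) otherwise). It is a minimal sketch under the assumptions stated in its comments; the way the smoothing term ε enters G, the convergence test, and the function name are illustrative choices, not the paper's reference implementation.

import numpy as np

def fssl(X, Y, mu=None, eps=1e-2, max_iter=100, tol=1e-6):
    """Fixed-point iteration for Eq. (5) (mu=None, Algorithm 1)
    or Eq. (12) (finite mu, Algorithm 2).

    X : d x n data matrix, Y : n x m targets from W Y = Lambda D Y.
    For Algorithm 1 this assumes d >= n so that X^T G^{-1} X is invertible.
    Returns the d x m projection matrix A with (approximately) row-sparse rows.
    """
    d, n = X.shape
    m = Y.shape[1]
    G_inv = np.eye(d)                        # G_0 = I, so G_0^{-1} = I
    A = np.zeros((d, m))
    for _ in range(max_iter):
        M = X.T @ G_inv @ X                  # X^T G^{-1} X  (n x n)
        if mu is not None:
            M = M + np.eye(n) / (2.0 * mu)   # + (1/(2 mu)) I, Eq. (16)
        A_new = G_inv @ X @ np.linalg.solve(M, Y)
        # Update G from the row norms of A as in Eq. (8); adding eps is one way
        # to realize the smoothing of g_ii described in the text.
        row_norms = np.linalg.norm(A_new, axis=1)
        G_inv = np.diag(row_norms + eps)     # G^{-1}_ii = ||a^i||_2 + eps
        if np.linalg.norm(A_new - A) < tol:
            A = A_new
            break
        A = A_new
    return A

Under the settings reported in Section 5.2, one would call this with mu=None for p = 10 and with mu=0.1, eps=0.01 for p = 20, 30. Each iteration solves only an n x n linear system, so the cost stays modest for the training-set sizes used in Section 5.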

4 Related Work

In this section, we discuss some approaches which are closely related to our method.

In order to pursue interpretability, [Cai et al., 2007] proposed Sparse Subspace Learning (SSL), which is based on L1-norm regularization on each column of the projection, i.e.,

    min_a ||X^T a − y||_2^2 + μ ||a||_1,                  (17)

where y is an eigenvector of Wy = λDy. Due to the nature of the L1 penalty, some entries of a are shrunk to exactly zero if μ is large enough, which results in a sparse projection. However, SSL does not lead to feature selection, because each column of the projection matrix is optimized one by one, and their sparsity patterns are independent.

Most recently, [Masaeli et al., 2010] proposed Linear Discriminant Feature Selection, which modifies LDA to admit feature selection as follows,

    min_A tr((A^T S_w A)^{-1} (A^T S_b A)) + μ Σ_{i=1}^d ||a^i||_∞,                  (18)

where S_b is the between-class scatter matrix, S_w is the within-class scatter matrix [Belhumeur et al., 1997], and Σ_{i=1}^d ||a^i||_∞ is the L1/∞-norm of A. Note that the L1/∞-norm has a similar nature to the L2,1-norm, which leads to row-sparsity. The optimization problem is convex and is solved by a quasi-Newton method [Boyd and Vandenberghe, 2004]. The main disadvantage of this method is that evaluating the gradient of tr((A^T S_w A)^{-1} (A^T S_b A)) is computationally very expensive. Hence it is limited to small-scale data.

5 Experiments

In this section, we evaluate two instances of our framework: FSSL with the LDA-type adjacency matrix defined in Eq. (2), referred to as FSSL(LDA), and FSSL with the LPP-type adjacency matrix defined in Eq. (3), referred to as FSSL(LPP). We compare them with state of the art subspace learning methods, e.g. PCA, LDA, LPP [He and Niyogi, 2003], and sparse subspace learning approaches, e.g. SSL with the LDA-type adjacency matrix, denoted by SSL(LDA) [Cai et al., 2007], and SSL with the LPP-type adjacency matrix, denoted by SSL(LPP) [Cai et al., 2007]. We also compare with a feature selection method, Fisher Score (FS), and with feature selection followed by subspace learning (FS+SL). In detail, we use Fisher Score to do feature selection and LDA (or LPP) to do subspace learning, referred to as FS+SL(LDA) and FS+SL(LPP) respectively. We use the 1-Nearest Neighbor classifier as the baseline. All the experiments were performed in Matlab on an Intel Core2 Duo 2.8GHz Windows 7 machine.

5.1 Data Sets

We use two standard face recognition databases which are used in [Cai et al., 2007].

Extended Yale-B database contains 16128 face images of 38 human subjects under 9 poses and 64 illumination conditions. In our experiment, we choose the frontal pose and use all the images under different illuminations, which gives 2414 images in total. All the face images are manually aligned and cropped. They are resized to 32 × 32 pixels, with 256 gray levels per pixel. Thus each face image is represented as a 1024-dimensional vector.

CMU PIE face database [Sim et al., 2003] contains 68 individuals with 41368 face images in total. The face images were captured by 13 synchronized cameras and 21 flashes, under varying pose, illumination and expression. In our experiment, one near frontal pose (C27) is selected under different illuminations, lightings and expressions, which leaves us 49 near frontal face images for each individual.

5.2 Parameter Settings

For both data sets, p = 10, 20, 30 images were randomly selected as training samples for each person, and the remaining images were used for testing. The training set was used to learn a subspace, and recognition was performed in the subspace by the 1-Nearest Neighbor classifier. Since the training set was randomly chosen, we repeated each experiment 20 times and calculated the average recognition accuracy. In general, the recognition rate varies with the dimensionality of the subspace. The best average performance obtained, as well as the corresponding dimensionality, is reported in Table 1 and Table 2. We also report the computational time (in seconds) of subspace learning, not including the testing time of the 1-Nearest Neighbor classifier.

For LDA, as in [Belhumeur et al., 1997], we first use PCA to reduce the dimensionality to n − c and then perform LDA to reduce the dimensionality to c − 1. This is also known as Fisher Face [Belhumeur et al., 1997]. For LPP, we use the cosine distance to compute the similarity between x_i and x_j. For FS+SL(LDA) and FS+SL(LPP), we first use Fisher Score to select 50% of the features and then perform LDA (or LPP) to reduce the dimensionality. For SSL(LDA) and SSL(LPP), we tune μ by searching the grid {10, 20, ..., 100} according to [Cai et al., 2007]. For FSSL(LDA) and FSSL(LPP), when p = 10 the linear system has infinitely many solutions, so we run Algorithm 1; when p = 20, 30, we run Algorithm 2, where we simply set μ = 0.1. The smoothing term ε is set to 0.01.

5.3 Results

The experimental results are shown in Table 1 and Table 2. We can observe that: (1) SSL is better than the corresponding linear subspace learning method (e.g., SSL(LDA) > LDA), which implies sparse subspace learning is able to improve the classification performance of subspace learning methods; (2) FSSL outperforms the corresponding SSL method overwhelmingly (e.g., FSSL(LDA) > SSL(LDA)), which indicates feature selection can improve the corresponding linear subspace learning method greatly; (3) feature selection before subspace learning (FS+SL) sometimes achieves better results than subspace learning, and even better than SSL, which implies the potential performance gain of doing feature selection during subspace learning. However, in more cases FS+SL(LDA) and FS+SL(LPP) are worse than LDA and LPP.
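As a concrete illustration of the evaluation protocol in Sections 5.2 and 5.3, the sketch below projects training and test faces with a learned projection A and scores recognition accuracy with a 1-Nearest Neighbor classifier. It is only an assumed reconstruction of the protocol in Python/NumPy (the paper's experiments were run in Matlab); the function name is ours.

import numpy as np

def one_nn_accuracy(A, X_train, y_train, X_test, y_test):
    """Project with A (d x m, samples are columns of X as in the paper's notation),
    then classify each test sample by its nearest training neighbor in the subspace."""
    Z_train = X_train.T @ A        # n_train x m
    Z_test = X_test.T @ A          # n_test x m
    # Pairwise Euclidean distances between test and training embeddings
    dists = np.linalg.norm(Z_test[:, None, :] - Z_train[None, :, :], axis=2)
    predictions = np.asarray(y_train)[np.argmin(dists, axis=1)]
    return np.mean(predictions == np.asarray(y_test))

Averaging this accuracy over 20 random train/test splits, as described above, corresponds to the protocol behind Tables 1 and 2.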

Table 1: Face recognition accuracy on the Yale-B data set (Acc in %, Time in seconds)

Method      |       10 training       |       20 training        |       30 training
            |  Acc         Dim  Time  |  Acc         Dim   Time  |  Acc         Dim   Time
------------+-------------------------+--------------------------+--------------------------
Baseline    |  53.44±0.82   –    –    |  69.24±1.19   –     –    |  77.39±0.98   –     –
PCA         |  52.41±0.89  200  0.53  |  67.04±1.18  200   2.13  |  74.57±1.07  200   3.74
FS          |  64.34±1.40  200  2.92  |  76.53±1.19  200   3.02  |  82.15±1.14  200   3.59
LDA         |  78.33±1.31   37  0.38  |  85.75±0.84   37   2.44  |  81.19±2.05   37   1.18
LPP         |  79.70±2.96   76  0.66  |  80.24±5.49   75   4.7   |  86.40±1.45   78  11.13
FS+SL(LDA)  |  77.89±1.82   37  2.79  |  87.89±0.88   37   3.65  |  93.91±0.69   37   3.71
FS+SL(LPP)  |  66.15±5.63   77  2.87  |  88.43±1.11   74   3.03  |  93.85±0.69   38   3.08
SSL(LDA)    |  81.56±1.38   37 17.38  |  89.68±0.85   37  28.99  |  92.88±0.68   37  36.53
SSL(LPP)    |  80.73±1.27   43 75.57  |  89.69±0.82   37 123.76  |  92.97±0.66   37 175.67
FSSL(LDA)   |  86.64±1.04   37  7.96  |  96.66±0.75   37   7.14  |  98.77±0.33   36  12.17
FSSL(LPP)   |  87.97±1.02   74  3.23  |  95.57±0.66   74   7.99  |  97.97±0.40   37  14.26

Table 2: Face recognition accuracy on the PIE data set (Acc in %, Time in seconds)

Method      |       10 training        |       20 training        |       30 training
            |  Acc         Dim   Time  |  Acc         Dim   Time  |  Acc         Dim   Time
------------+--------------------------+--------------------------+--------------------------
Baseline    |  75.84±1.21   –     –    |  90.35±1.18   –     –    |  91.68±0.95   –     –
PCA         |  75.34±1.25  198   1.78  |  89.99±1.07  200   3.76  |  95.07±0.69  197   3.58
FS          |  82.66±1.00  170   3.78  |  91.74±1.07  200   5.23  |  94.66±0.63  197   4.84
LDA         |  90.80±0.87   67   1.70  |  94.14±0.54   67   1.68  |  96.52±0.53   67   1.87
LPP         |  92.35±0.47  171   3.26  |  94.42±0.51  145  10.61  |  96.51±0.53   67  11.29
FS+SL(LDA)  |  87.64±0.96   67   4.77  |  94.45±0.60   67   5.31  |  96.08±0.53   66   5.21
FS+SL(LPP)  |  88.75±0.81  150   4.94  |  94.49±0.60   67   4.03  |  96.18±0.57   94   5.14
SSL(LDA)    |  93.33±0.47   67  58.51  |  96.83±0.56   67  67.63  |  97.85±0.38   66  95.15
SSL(LPP)    |  92.90±0.47   67 288.34  |  96.63±0.62   67 632.30  |  97.94±0.39   70 644.26
FSSL(LDA)   |  94.31±0.42   67   6.46  |  98.02±0.33   64  27.60  |  98.44±0.27   64  52.60
FSSL(LPP)   |  96.41±0.39   67  11.44  |  97.84±0.38   67  26.86  |  98.38±0.39  100  42.15

This is because feature selection and subspace learning are conducted individually in FS+SL, so the selected features are not necessarily helpful for subspace learning. In contrast, FSSL performs feature selection and subspace learning simultaneously, so the selected features are generally beneficial for subspace learning. Hence the proposed framework improves subspace learning consistently; and (4) the computational time of FSSL is much less than that of SSL.

5.4 Study on the Dimensionality of the Subspace

In this subsection, we take a closer look at the recognition accuracy with respect to the dimensionality of the learned subspace. Figure 2 shows the performance of all the subspace learning methods on the two databases with 20 training samples per person, where the horizontal axis represents the dimensionality of the subspace and the vertical axis denotes the average recognition accuracy over 20 independent runs.

Figure 2: Face recognition with 20 selected images for training on (a) YaleB; and (b) PIE database. For better viewing, please see the color pdf file.

It is shown that the improvement of our methods over sparse subspace learning methods and subspace learning methods is consistent over a wide range of subspace dimensionalities, which strengthens the superiority of our approaches. A similar phenomenon can be observed when we use 10 and 30 training samples; for space reasons, we do not show those results.

5.5 Projection Matrices & Selected Features

To get a better understanding of our approach, we plot the projection matrices of our method and related methods in Figure 3 (a), (b) and (c). Clearly, the projection matrix of LDA is not sparse, while each column of the projection matrix of SSL(LDA) is sparse. However, the sparsity patterns of the columns are not coherent. In contrast, the entries in each row of the projection matrix of FSSL(LDA) tend to be zero or nonzero simultaneously, which benefits from the nature of the L2,1-norm and leads to feature selection.

From Figure 3 (d), we can see that the selected features (pixels) are asymmetric. In other words, if one pixel is selected, its axis-symmetric counterpart will not be selected. This is because a face image is roughly axis-symmetric, so one pixel in an axis-symmetric pair is redundant given that the other one is selected. Moreover, the selected pixels are mostly around the eyebrows, the corners of the eyes, the nose and the cheeks, which are discriminative for distinguishing face images of different people. This is consistent with our real-life experience.
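The row-sparsity pattern described above suggests a simple way to read off the selected features from a learned projection: rank the rows of A by their Euclidean norms and keep the features whose row norms are non-negligible. The snippet below is an illustrative sketch of that post-processing step; the relative threshold and the function name are our own assumptions, not part of the paper.

import numpy as np

def selected_features(A, rel_threshold=1e-3):
    """Return indices of features whose rows of A have non-negligible L2 norm.

    With the L2,1 penalty, rows of A corresponding to irrelevant features are
    driven (approximately) to zero, so thresholding the row norms recovers the
    jointly selected features (e.g., the retained pixels in Figure 3 (d))."""
    row_norms = np.linalg.norm(A, axis=1)
    return np.where(row_norms > rel_threshold * row_norms.max())[0]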

1298 # ‘    

"     ! ! ! !

" " " $  # # #

  $  $ $

% % %  & $ & & ! ' ' '  ( (  ( "

       ! " # $ %  ! " # $ %  ! " # $ %      

Figure 3: (a), (b), (c) show the projection matrices learned by LDA, SSL(LDA) and FSSL(LDA) with 10 training samples per person and μ = 10 for SSL(LDA); (d) shows the pixels selected by the proposed method, marked by red points.

6 Conclusion and Future Work

In this paper, we propose to do feature selection and subspace learning simultaneously in a joint framework, which is based on using the L2,1-norm on the projection matrix and thereby achieves the goal of feature selection. Experiments on benchmark face recognition data sets illustrate the efficacy of the proposed framework. In future work, we will study joint feature selection and nonlinear subspace learning in kernel space.

Acknowledgments

The work was supported in part by NSF IIS-09-05215, U.S. Air Force Office of Scientific Research MURI award FA9550-08-1-0265, and the U.S. Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (NS-CTA).

References

[Argyriou et al., 2008] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[Belhumeur et al., 1997] Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell., 19(7):711–720, 1997.

[Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

[Cai et al., 2007] Deng Cai, Xiaofei He, and Jiawei Han. Spectral regression: A unified approach for sparse subspace learning. In ICDM, pages 73–82, 2007.

[Chung, 1997] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, February 1997.

[Golub and Loan, 1996] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.

[Guyon and Elisseeff, 2003] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[He and Niyogi, 2003] Xiaofei He and Partha Niyogi. Locality preserving projections. In NIPS, 2003.

[He et al., 2005] Xiaofei He, Deng Cai, Shuicheng Yan, and HongJiang Zhang. Neighborhood preserving embedding. In ICCV, pages 1208–1213, 2005.

[Masaeli et al., 2010] Mahdokht Masaeli, Glenn Fung, and Jennifer G. Dy. From transformation-based dimensionality reduction to feature selection. In ICML, pages 751–758, 2010.

[Moghaddam et al., 2006] Baback Moghaddam, Yair Weiss, and Shai Avidan. Generalized spectral bounds for sparse LDA. In ICML, pages 641–648, 2006.

[Nie et al., 2010] Feiping Nie, Heng Huang, Xiao Cai, and Chris Ding. Efficient and robust feature selection via joint l2,1-norms minimization. In Advances in Neural Information Processing Systems 23, pages 1813–1821, 2010.

[Obozinski et al., 2010] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20:231–252, April 2010.

[Sim et al., 2003] Terence Sim, Simon Baker, and Maan Bsat. The CMU pose, illumination, and expression database. IEEE Trans. Pattern Anal. Mach. Intell., 25(12):1615–1618, 2003.

[Sun et al., 2009] Liang Sun, Jun Liu, Jianhui Chen, and Jieping Ye. Efficient recovery of jointly sparse vectors. In Advances in Neural Information Processing Systems 22, pages 1812–1820, 2009.

[Yan et al., 2007] Shuicheng Yan, Dong Xu, Benyu Zhang, HongJiang Zhang, Qiang Yang, and Stephen Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intell., 29(1):40–51, 2007.

[Yuan et al., 2006] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.

[Zou et al., 2004] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:2006, 2004.
