
ACM Multimedia, Augsburg, Germany, September 23-29, 2007.

Feature Selection Using Principal Feature Analysis

Yijuan Lu, University of Texas at San Antonio, One UTSA Circle, San Antonio, TX 78249, [email protected]
Ira Cohen, Hewlett-Packard Labs, 1501 Page Mill Road, Palo Alto, CA 94304, [email protected]
Xiang Sean Zhou, Siemens Medical Solutions USA, Inc., 51 Valley Stream Parkway, Malvern, PA 19355, xiang.zhou@siemens.com
Qi Tian, University of Texas at San Antonio, One UTSA Circle, San Antonio, TX 78249, [email protected]

ABSTRACT
Dimensionality reduction of a feature set is a common preprocessing step used for pattern recognition and classification applications. Principal Component Analysis (PCA) is one of the most popular methods used, and can be shown to be optimal under several different optimality criteria. However, it has the disadvantage that measurements from all of the original features are used in the projection to the lower dimensional space. This paper proposes a novel method for dimensionality reduction of a feature set by choosing a subset of the original features that contains most of the essential information, using the same criteria as PCA. We call this method Principal Feature Analysis (PFA). The proposed method is successfully applied to choosing the principal features in face tracking and content-based image retrieval (CBIR) problems.

Categories and Subject Descriptors
I.4.7 [Feature Measurement]

General Terms
Algorithms, Theory, Performance, Experimentation

Keywords
Feature Extraction, Feature Selection, Principal Component Analysis, Discriminant Analysis

1. INTRODUCTION
In many real world problems, reducing dimension is an essential step before any analysis of the data can be performed. The general criterion for reducing the dimension is the desire to preserve most of the relevant information of the original data according to some optimality criterion. In pattern recognition and general classification problems, methods such as Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Fisher Linear Discriminant Analysis (LDA) have been extensively used. These methods find a mapping from the original feature space to a lower dimensional feature space.

In some applications it might be desirable to pick a subset of the original features rather than find a mapping that uses all of the original features. The benefits of finding this subset of features include saving the cost of computing unnecessary features, saving the cost of sensors (in physical measurement systems), and excluding noisy features while keeping their information through "clean" features (for example, tracking points on a face using only the easy-to-track points and inferring the other points from those few measurements).

Variable selection procedures have been used in different settings. Among them, the regression area has been investigated extensively. In [1], a multilayer perceptron is used for variable selection. In [2], stepwise discriminant analysis is used to select the variables that are fed to a neural network performing pattern recognition of circuitry faults. Other regression techniques for variable selection are described in [3]. In contrast to the regression methods, which lack unified optimality criteria, the optimality properties of PCA have attracted research on PCA-based variable selection methods [4, 5, 6, 7]. As will be shown, these methods have the disadvantage of either being too computationally expensive or choosing a subset of features with redundant information. This work proposes a computationally efficient method that exploits the structure of the principal components of a feature set to find a subset of the original feature vector. The chosen subset of features is shown empirically to maintain some of the optimal properties of PCA.

The rest of the paper is organized as follows. The existing PCA-based feature selection methods are reviewed in Section 2. The proposed method, Principal Feature Analysis (PFA), is described in Section 3. We apply PFA to face tracking and content-based image retrieval problems in Section 4. Discussion is given in Section 5.

2. BACKGROUND AND NOTATION
We consider a linear transformation of a random vector $X \in \Re^n$ with zero mean and covariance matrix $\Sigma_x$ to a lower dimensional random vector $Y \in \Re^q$, $q < n$:

$Y = A_q^T X$    (1)

with $A_q^T A_q = I_q$, where $I_q$ is the $q \times q$ identity matrix.

In PCA, $A_q$ is an $n \times q$ matrix whose columns are the $q$ orthonormal eigenvectors corresponding to the first $q$ largest eigenvalues of the covariance matrix $\Sigma_x$. There are ten optimality properties associated with this choice of linear transformation [4]. One important property is the maximization of the "spread" of the points in the lower dimensional space, which means that the points in the transformed space are kept as far apart as possible, thereby retaining the variation of the original space. Another important property is the minimization of the mean square error between the predicted data and the original data.
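As a concrete illustration of the projection in Eq. (1), the sketch below estimates $\Sigma_x$ from a sample of X, takes the q leading eigenvectors as the columns of $A_q$, and projects the centered samples. The function name and the samples-by-features data layout are choices made for this sketch, not notation from the paper.

```python
import numpy as np

def pca_projection(X, q):
    """Project zero-mean data onto the first q principal axes (Eq. 1).

    X : (N, n) array holding N samples of the n-dimensional feature vector.
    Returns (Y, A_q), where A_q holds the q leading orthonormal eigenvectors
    of the sample covariance as columns and Y = X @ A_q.
    """
    Xc = X - X.mean(axis=0)                 # enforce the zero-mean assumption
    cov = np.cov(Xc, rowvar=False)          # sample estimate of Sigma_x
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues, orthonormal vectors
    order = np.argsort(eigvals)[::-1]       # reorder so the largest eigenvalues come first
    A_q = eigvecs[:, order[:q]]             # n x q matrix A_q
    Y = Xc @ A_q                            # sample-wise version of Y = A_q^T X
    return Y, A_q
```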
Now, suppose we want to choose a subset of the original variables/features of the random vector X. This can be viewed as a linear transformation of X using the transformation matrix

$A_q = \begin{bmatrix} I_q \\ [0]_{(n-q) \times q} \end{bmatrix}$    (2)

or any matrix obtained by permuting the rows of $A_q$. Without loss of generality, let us consider the transformation matrix $A_q$ as given above and rewrite the corresponding covariance matrix of X as

$\Sigma = \begin{bmatrix} \{\Sigma_{11}\}_{q \times q} & \{\Sigma_{12}\}_{q \times (n-q)} \\ \{\Sigma_{21}\}_{(n-q) \times q} & \{\Sigma_{22}\}_{(n-q) \times (n-q)} \end{bmatrix}$    (3)

In [4] it is shown that it is not possible to satisfy all of the optimality properties of PCA using the same subset. Finding the subset which maximizes $|\Sigma_Y| = |\Sigma_{11}|$ is equivalent to maximizing the "spread" of the points in the lower dimensional space, thus retaining the variation of the original data.

Minimizing the mean square prediction error is equivalent to minimizing the trace of

$\Sigma_{22|1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$.    (4)

This can be seen since the retained variability of a subset can be measured using

Retained Variability $= \left(1 - \frac{\mathrm{trace}(\Sigma_{22|1})}{\sum_{i=1}^{n} \sigma_i^2}\right) \cdot 100\%$    (5)

where $\sigma_i^2$ is the variance of the i'th feature of X.
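To make Eqs. (3)-(5) concrete, the retained variability of a candidate subset can be evaluated as in the sketch below. The blocks $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{21}$, $\Sigma_{22}$ follow Eq. (3) after reordering the features so the kept ones come first; the function name and argument layout are illustrative, not from the paper.

```python
import numpy as np

def retained_variability(cov, subset):
    """Retained variability (Eq. 5) when only the features in `subset` are kept.

    cov    : (n, n) covariance matrix of X.
    subset : indices of the q features that are kept (the Sigma_11 block).
    """
    n = cov.shape[0]
    keep = np.asarray(subset)
    drop = np.setdiff1d(np.arange(n), keep)
    S11 = cov[np.ix_(keep, keep)]
    S12 = cov[np.ix_(keep, drop)]
    S21 = cov[np.ix_(drop, keep)]
    S22 = cov[np.ix_(drop, drop)]
    # Conditional covariance of the discarded features given the kept ones (Eq. 4)
    S22_1 = S22 - S21 @ np.linalg.solve(S11, S12)
    # trace(cov) equals the sum of the per-feature variances in the denominator of Eq. (5)
    return (1.0 - np.trace(S22_1) / np.trace(cov)) * 100.0
```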
The method proposed in [5] selects variables according to the coefficients of the principal components: a high absolute value of the i'th coefficient of one of the principal components (PCs) implies that the $x_i$ element of X is very dominant in that axis/PC. By choosing the variables corresponding to the highest coefficients of each of the first q PCs, the same projection as that computed by PCA is approximated. This method is very intuitive and computationally feasible. However, because it considers each PC independently, variables with similar information content might be chosen.

The method proposed in [6] and [7] chooses a subset of size q by computing its PC projection to a smaller dimensional space and minimizing a measure based on Procrustes analysis [8]. This method helps reduce the redundancy of information, but is computationally expensive because many combinations of subsets are explored.

In our method, we exploit the information that can be inferred from the PC coefficients to obtain the optimal subset of features. Unlike the method in [5], we use all of the PCs together to gain better insight into the structure of our original features, so we can choose variables without redundancy of information.

3. PRINCIPAL FEATURE ANALYSIS
Let X be a zero mean n-dimensional random feature vector. Let $\Sigma$ be the covariance matrix of X. Let A be a matrix whose columns are the orthonormal eigenvectors of the matrix $\Sigma$:

$\Sigma = A \Lambda A^T$    (6)

where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ and $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ are the eigenvalues of $\Sigma$, with $A^T A = I_n$. Let $A_q$ be the first q columns of A and let $V_1, V_2, \ldots, V_n \in \Re^q$ be the rows of the matrix $A_q$.

Each vector $V_i$ represents the projection of the i'th feature (variable) of the vector X onto the lower dimensional space; that is, the q elements of $V_i$ correspond to the weights of the i'th feature on each axis of the subspace. The key observation is that features that are highly correlated or have high mutual information will have similar absolute-value weight vectors $V_i$ (changing the sign has no statistical significance [5]). At the two extremes, two independent variables have maximally separated weight vectors, while two fully correlated variables have identical weight vectors (up to a change of sign). To find the best subset, we use the structure of the rows $V_i$ to first find the subsets of features that are highly correlated and then choose one feature from each subset.
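A minimal end-to-end sketch of this selection idea is given below. The grouping step is an assumption of the sketch: it clusters the absolute-value rows $|V_i|$ into q groups with k-means and keeps, from each group, the feature whose row is closest to the group mean. Function names and the samples-by-features layout are illustrative, not prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def principal_feature_analysis(X, q):
    """Select q of the original features from samples X (rows = observations).

    Sketch of the idea in Section 3: compute the rows V_i of A_q, group
    features with similar weight vectors, and keep one feature per group.
    The k-means grouping and the closest-to-centre rule are assumptions of
    this sketch.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    A_q = eigvecs[:, order[:q]]          # n x q: row i is V_i, the weights of feature i
    V = np.abs(A_q)                      # sign changes carry no statistical meaning [5]
    km = KMeans(n_clusters=q, n_init=10, random_state=0).fit(V)
    selected = []
    for c in range(q):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(V[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dist)]))  # feature closest to the group mean
    return sorted(selected)
```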